[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50114441
  
I agree it is safer to put the magic byte in front of every record.
However, this is not a public API where users can throw in an arbitrary RDD and
ask the serializer to create an `RDD[Double]`. Either solution is fine, but it
would be good if you could check the overhead in computation and in storage, for
example, by calling `RDD.sum` on a cached RDD.
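
A minimal spark-shell sketch of such a check (the RDD size and the timing approach are illustrative assumptions, not from the PR):

// Hypothetical micro-benchmark: time RDD.sum on a cached RDD[Double]
val doubles = sc.parallelize(1 to 1000000).map(_.toDouble).cache()
doubles.count()  // materialize the cache before timing
val start = System.nanoTime()
val total = doubles.sum()
println(s"sum = $total, took ${(System.nanoTime() - start) / 1e6} ms")
// Storage overhead can be compared on the web UI's Storage tab.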


---


[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50114457
  
Jenkins, retest this please.


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50114354
  
QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull


---


[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50114262
  
QA results for PR 1555:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17171/consoleFull


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1554#issuecomment-50113405
  
Merged. Thanks!


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1554


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50112894
  
Matei, what InputFormats did you have problems with when cloning by
default?  I'd love to figure out what it would take to solve the one
element/one value problem.


On Thu, Jul 24, 2014 at 11:26 PM, Matei Zaharia wrote:

> BTW you can try WritableUtils.clone. At some point we tried cloning data
> by default in hadoopRDD, or having a flag for it, and we gave up because it
> didn't seem to work for every InputFormat. But it's probably worth a shot
> here if the object extends Writable.


---


[GitHub] spark pull request: SPARK-2686 Add Length support to Spark SQL and...

2014-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50112581
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request: SPARK-2686 Add Length support to Spark SQL and...

2014-07-24 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/1586

SPARK-2686 Add Length support to Spark SQL and HQL and Strlen support to SQL

Syntactic, parsing, and operational support have been added for the LEN(GTH)
and STRLEN functions.

Examples:

SQL:

import org.apache.spark.sql._
case class TestData(key: Int, value: String)
val sqlc = new SQLContext(sc)
import sqlc._
val testData: SchemaRDD = sqlc.sparkContext.parallelize(
  (1 to 100).map(i => TestData(i, i.toString)))
testData.registerAsTable("testData")
sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect
res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])

HQL:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
import hc._
hql("select length(grp) from simplex").collect
res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])

As far as codebase changes: they have been purposefully made similar to the
ones made for adding SUBSTR(ING) from July 17:
SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the
main classes changed. The testing suites affected are ConstantFolding and
ExpressionEvaluation.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javadba/spark strlen

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1586.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1586


commit bb252380399c4146bb63b5d6cbc66234609bab11
Author: Stephen Boesch 
Date:   2014-07-12T12:34:58Z

Support hbase-0.96-1.1 in SparkBuild

commit 947007305cb03515daa8738d3ad2063bcd226a3d
Author: Stephen Boesch 
Date:   2014-07-12T12:56:38Z

overwrote sparkbuild

commit 9b6a6471e3c1f087c186a7597c63c7ef2707eaa3
Author: Stephen Boesch 
Date:   2014-07-16T13:24:32Z

update pom.xml for hadoop-2.3-cdh50.0 and hbase 0.96.1.1

commit b04c4cbef3ecb5a6f13297391b55a36317ce957a
Author: Stephen Boesch 
Date:   2014-07-16T13:24:40Z

Merge branch 'master' of https://github.com/apache/spark

commit 5d1cb0a449bbf1ea95272a45f2d030d5cad0195c
Author: Stephen Boesch 
Date:   2014-07-23T04:33:25Z

SPARK-2638 MapOutputTracker concurrency improvement

commit 483479ac8ccb0c937da5d306fc4591aa974ed37b
Author: Stephen Boesch 
Date:   2014-07-23T16:09:26Z

Mesos workaround

commit 30910b2daac974cd2dac82e8a1b20cd60348a632
Author: Stephen Boesch 
Date:   2014-07-23T19:43:59Z

Merge remote-tracking branch 'upstream/master'

commit 7c675f8d8fc63c5f602c5a767e1215118e0f768c
Author: Stephen Boesch 
Date:   2014-07-23T20:03:18Z

Merge branch 'master' of https://github.com/javadba/spark

commit d646a2e1113252d1955185e355da06ddb690b75f
Author: Stephen Boesch 
Date:   2014-07-25T06:26:11Z

SPARK-2686 Add Length support to Spark SQL and HQL and Strlen support to SQL




---


[GitHub] spark pull request: [SPARK-2682] Javadoc generated from Scala sour...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1584#issuecomment-50112325
  
QA results for PR 1584:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17169/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50112244
  
BTW you can try WritableUtils.clone. At some point we tried cloning data by 
default in hadoopRDD, or having a flag for it, and we gave up because it didn't 
seem to work for every InputFormat. But it's probably worth a shot here if the 
object extends Writable.


---


[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50112144
  
Yeah weird, it must've been an oversight while editing. Unfortunately the 
apache/incubator-spark repo is gone so we can't see the old PRs and comments on 
them...


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1338#discussion_r15387719
  
--- Diff: python/pyspark/rdd.py ---
@@ -964,6 +964,106 @@ def first(self):
         """
         return self.take(1)[0]
 
+    def saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None):
+        """
+        Output a Python RDD of key-value pairs (of form C{RDD[(K, V)]}) to any Hadoop file
+        system, using the new Hadoop OutputFormat API (mapreduce package). Keys/values are
+        converted for output using either user specified converters or, by default,
+        L{org.apache.spark.api.python.JavaToWritableConverter}.
+
+        @param conf: Hadoop job configuration, passed in as a dict
+        @param keyConverter: (None by default)
+        @param valueConverter: (None by default)
+        """
+        jconf = self.ctx._dictToJavaMap(conf)
+        reserialized = self._reserialize(BatchedSerializer(PickleSerializer(), 10))
--- End diff --

Since this 10 appears in a lot of places, maybe factor it out into a 
constant?


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50111962
  
QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50111934
  
The same thing happens in normal Spark if you create a hadoopRDD or 
sequenceFile with Writables inside it, and then call cache(). There will be 
only one key element and one value, so all the data will look identical.


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50111889
  
@kanzhang it might mean that we're reusing the Bean object on the Java side 
when we read from the InputFormat. Hadoop's RecordReaders actually reuse the 
same object as you read data, so if you want to hold onto multiple data items, 
you need to clone each one. This may be a fair bit of trouble unfortunately.
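
A hedged sketch of that per-record cloning (the RDD name `seqFileRDD` and the Text key/value types are illustrative assumptions):

import org.apache.hadoop.io.{Text, WritableUtils}

val conf = sc.hadoopConfiguration
val cloned = seqFileRDD.map { case (k: Text, v: Text) =>
  // Hadoop RecordReaders reuse the same Writable objects, so copy each
  // record before caching or collecting multiple items
  (WritableUtils.clone(k, conf), WritableUtils.clone(v, conf))
}.cache()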


---


[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50111861
  
QA results for PR 1346:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17160/consoleFull


---


[GitHub] spark pull request: SPARK-1416: PySpark support for SequenceFile a...

2014-07-24 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/455#issuecomment-50111706
  
You'd need to run the loading code off the master branch. It should be in the 1.1
release in a few weeks.

On Fri, Jul 25, 2014 at 4:14 AM, Russell Jurney wrote:

> I got this to run and I'm able to get work done!
> Does this code have to be run on the latest Spark code? Would it run on 1.0?
> On Tuesday, July 22, 2014, Eric Garcia wrote:
>> @MLnick , I made a PR here: #1536
>>
>> @rjurney , the updated code works for the
>> .avro file you posted though it is still not fully implemented for *all*
>> data types. Note that any null values in your data will show up as an empty
>> string "". For some reason I could not get Java null to convert to Python
>> None.


---


[GitHub] spark pull request: [SPARK-2683] unidoc failed because org.apache....

2014-07-24 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/1585

[SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java 
keywords as value names

Renaming `short` to `shortForm` and `long` to `longForm`.

JIRA: https://issues.apache.org/jira/browse/SPARK-2683
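
For reference, a sketch of the rename (the String field types are an assumption about the utility class, not quoted from the patch):

// Before: field names collide with Java keywords in generated javadoc
// case class CallSite(short: String, long: String)
// After the rename proposed here:
case class CallSite(shortForm: String, longForm: String)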

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-2683

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1585.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1585


commit 70bec0ea3c960efa21bdf940e22e2c0608a701a0
Author: Yin Huai 
Date:   2014-07-25T06:10:57Z

"short" and "long" are Java keyworks. In order to generate javadoc, 
renaming "short" to "shortForm" and "long" to "longForm".




---


[GitHub] spark pull request: [WIP] SPARK-2157 Ability to write tight firewa...

2014-07-24 Thread ash211
Github user ash211 commented on the pull request:

https://github.com/apache/spark/pull/1107#issuecomment-50111683
  
Hi @pwendell I had a minor conflict with the fix for SPARK-2392 in #1335 
but it's rebased now and merges cleanly.


---


[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-24 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1538#discussion_r15387566
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala ---
@@ -45,7 +45,7 @@ private[spark] class SparkDeploySchedulerBackend(
       conf.get("spark.driver.host"), conf.get("spark.driver.port"),
       CoarseGrainedSchedulerBackend.ACTOR_NAME)
     val args = Seq(driverUrl, "{{EXECUTOR_ID}}", "{{HOSTNAME}}", "{{CORES}}", "{{WORKER_URL}}")
-    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions")
+    val extraJavaOpts = sc.conf.getOption("spark.executor.extraJavaOptions").toSeq
--- End diff --

Since this no longer goes through `Utils.splitCommandString`, I don't think 
it will work with options that are quoted.
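
A rough illustration of the concern (the option values are hypothetical; `Utils.splitCommandString` is the existing Spark helper being bypassed):

val opts = """-Dfoo="one two" -Dbar=three"""
// Utils.splitCommandString(opts) honors the quoting:
//   Seq("-Dfoo=one two", "-Dbar=three")
// whereas getOption(...).toSeq keeps the whole string as one element:
//   Seq("""-Dfoo="one two" -Dbar=three""")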


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50111307
  
QA tests have started for PR 1582. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17170/consoleFull


---


[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50111308
  
QA tests have started for PR 1555. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17171/consoleFull


---


[GitHub] spark pull request: [SPARK-2682] Javadoc generated from Scala sour...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1584#issuecomment-50111042
  
QA tests have started for PR 1584. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17169/consoleFull


---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50110964
  
QA results for PR 1578:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17166/consoleFull


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50110901
  
QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull


---


[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1460


---


[GitHub] spark pull request: [SPARK-2682] Javadoc generated from Scala sour...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1584#issuecomment-50110600
  
QA results for PR 1584:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17168/consoleFull


---


[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-50110575
  
Thanks Davies. I've merged this in.


---


[GitHub] spark pull request: [SPARK-2682] Javadoc generated from Scala sour...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1584#issuecomment-50110559
  
QA tests have started for PR 1584. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17168/consoleFull


---


[GitHub] spark pull request: SPARK-2651: Add maven scalastyle plugin

2014-07-24 Thread rahulsinghaliitd
Github user rahulsinghaliitd commented on a diff in the pull request:

https://github.com/apache/spark/pull/1550#discussion_r15387188
  
--- Diff: pom.xml ---
@@ -957,6 +957,30 @@
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-source-plugin</artifactId>
       </plugin>
+      <plugin>
+        <groupId>org.scalastyle</groupId>
+        <artifactId>scalastyle-maven-plugin</artifactId>
+        <version>0.4.0</version>
+        <configuration>
+          <verbose>false</verbose>
+          <failOnViolation>true</failOnViolation>
+          <includeTestSourceDirectory>false</includeTestSourceDirectory>
+          <failOnWarning>false</failOnWarning>
+          <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
+          <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
+          <configLocation>scalastyle-config.xml</configLocation>
+          <outputFile>scalastyle-output.xml</outputFile>
+          <outputEncoding>UTF-8</outputEncoding>
+        </configuration>
+        <executions>
+          <execution>
--- End diff --

Yes, `mvn package` is supposed to run these checks. I am surprised to hear
that it didn't work for you. I had updated the PR yesterday; do you by chance
happen to have the older version (it was missing the
`<phase>package</phase>` line)? Here is the snippet from my machine:

[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ spark-core_2.10 ---
[INFO] Building jar: 
/mnt/devel/rahuls/code/apache-spark/core/target/spark-core_2.10-1.1.0-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ 
spark-core_2.10 ---
[INFO] 
[INFO] --- maven-source-plugin:2.2.1:jar-no-fork (create-source-jar) @ 
spark-core_2.10 ---
[INFO] Building jar: 
/mnt/devel/rahuls/code/apache-spark/core/target/spark-core_2.10-1.1.0-SNAPSHOT-sources.jar
[INFO] 
[INFO] --- scalastyle-maven-plugin:0.4.0:check (default) @ spark-core_2.10 
---
Saving to 
outputFile=/mnt/devel/rahuls/code/apache-spark/core/scalastyle-output.xml
Processed 361 file(s)
Found 0 errors
Found 0 warnings
Found 0 infos
Finished in 8234 ms

Another way to run these checks is through `mvn scalastyle:check`, but that
requires that the inter-module dependencies be satisfied via the M2 repo.


---


[GitHub] spark pull request: [SPARK-2682] Javadoc generated from Scala sour...

2014-07-24 Thread yhuai
GitHub user yhuai opened a pull request:

https://github.com/apache/spark/pull/1584

[SPARK-2682] Javadoc generated from Scala source code is not in javadoc's 
index

Add genjavadocSettings back to SparkBuild and resolve Java compiling errors 
caused by using Java keywords as value names.

https://issues.apache.org/jira/browse/SPARK-2682

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yhuai/spark SPARK-2682

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1584.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1584


commit 0ef6c0d602aa236e0f1a901044161fcc00f1ad79
Author: Yin Huai 
Date:   2014-07-25T05:47:00Z

Add genjavadocSettings back.

commit 02c3e503275b51975c873356024ab0fba33e0b3b
Author: Yin Huai 
Date:   2014-07-25T05:47:10Z

`short` and `long` are Java keywords. In order to generate javadoc,
renaming `short` to `shortForm` and `long` to `longForm`.




---


[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50110354
  
Let me make one other change, actually: I'll decrease the initial size of
CompactBuffer's array to 8.
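
A toy sketch of the CompactBuffer idea being tuned here (an illustrative reimplementation, not Spark's actual class):

// Keeps the first two elements in fields; only allocates a backing array
// (initial size 8, per the change above) once a third element arrives.
class TinyCompactBuffer[T] {
  private var element0: T = _
  private var element1: T = _
  private var otherElements: Array[AnyRef] = null
  private var curSize = 0

  def +=(value: T): this.type = {
    if (curSize == 0) element0 = value
    else if (curSize == 1) element1 = value
    else {
      if (otherElements == null) otherElements = new Array[AnyRef](8)
      else if (curSize - 2 == otherElements.length)
        otherElements = java.util.Arrays.copyOf(otherElements, otherElements.length * 2)
      otherElements(curSize - 2) = value.asInstanceOf[AnyRef]
    }
    curSize += 1
    this
  }

  def size: Int = curSize
}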


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50110269
  
QA results for PR 1338:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17164/consoleFull


---


[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1538#issuecomment-50110061
  
QA results for PR 1538:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17163/consoleFull


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50109829
  
QA tests have started for PR 1582. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17167/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50109757
  
More on the JavaBean test failure: it seems that for an RDD[(key, Bean)], all the
keys in a partition (or batch; I didn't test partition size > batch size) are
paired with the last Bean in that partition (or batch) on the Python side. Does
that ring any bells?


---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50109743
  
QA results for PR 1165:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  case class Sample(size: Long, numUpdates: Long)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17162/consoleFull


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1554#issuecomment-50109623
  
QA results for PR 1554:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class RDDSamplerBase(object):
  class RDDSampler(RDDSamplerBase):
  class RDDStratifiedSampler(RDDSamplerBase):
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17161/consoleFull


---


[GitHub] spark pull request: [SPARK-2671] BlockObjectWriter should create p...

2014-07-24 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/1580#issuecomment-50109613
  
Usually the File is obtained through `DiskBlockManager.getFile()`, so the parent
directory will be created in `getFile()`. I think you needn't worry about the
parent directory if you follow Spark's convention.
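
A hedged sketch of that convention (the `diskBlockManager` handle and `blockId` are illustrative; `getFile` is Spark's internal API):

// getFile hashes the block into a subdirectory and creates it if needed,
// so the returned File already has an existing parent directory
val file = diskBlockManager.getFile(blockId)
assert(file.getParentFile.exists())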


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50109448
  
It looks like the Jenkins failures are MIMA issues; I'll work on fixing 
them.


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50109411
  
@manishamde  I do realize it's a big change, and I hope it does not cause 
too much trouble for the other methods!  The functionality should be the same, 
and the internals are almost identical (mostly moving around code, with no 
major duplication), so performance should not change much.  (I do have some 
ideas for future optimizations, but we will push the API update through first.) 
 I appreciate your thoughts on the update!


---


[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1025#issuecomment-50109108
  
Well, the other "sample" functions are already approximate anyway. I kind 
of like this here because it conveys that it's more expensive. The other thing 
is that if we want the Exact one to be experimental, we can't just make it a 
parameter.


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50108750
  
QA tests have started for PR 1582. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17165/consoleFull


---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50108754
  
QA tests have started for PR 1578. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17166/consoleFull


---


[GitHub] spark pull request: [SPARK-2670] FetchFailedException should be th...

2014-07-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1578#issuecomment-50108622
  
Thanks - this is a good idea. Two questions: (a) what type of exception 
have you seen here? (b) could you add a unit test for this? Jenkins, test this 
please.


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50108508
  
QA tests have started for PR 1338. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17164/consoleFull


---


[GitHub] spark pull request: SPARK-2651: Add maven scalastyle plugin

2014-07-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1550#issuecomment-50108448
  
Thanks for adding this to maven - just had one question based on running 
this locally.


---


[GitHub] spark pull request: SPARK-2651: Add maven scalastyle plugin

2014-07-24 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1550#discussion_r15386400
  
--- Diff: pom.xml ---
@@ -957,6 +957,30 @@
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-source-plugin</artifactId>
       </plugin>
+      <plugin>
+        <groupId>org.scalastyle</groupId>
+        <artifactId>scalastyle-maven-plugin</artifactId>
+        <version>0.4.0</version>
+        <configuration>
+          <verbose>false</verbose>
+          <failOnViolation>true</failOnViolation>
+          <includeTestSourceDirectory>false</includeTestSourceDirectory>
+          <failOnWarning>false</failOnWarning>
+          <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
+          <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
+          <configLocation>scalastyle-config.xml</configLocation>
+          <outputFile>scalastyle-output.xml</outputFile>
+          <outputEncoding>UTF-8</outputEncoding>
+        </configuration>
+        <executions>
+          <execution>
--- End diff --

Is the goal here to make this run when someone runs `mvn package`? It
didn't seem to do that when I ran it (or maybe it runs them all at the end; I
just looked after core was compiled?).


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50108287
  
Uploaded a new patch adding batch serialization to Python when reading 
sequence files. For some reason, the test on custom class (Java Bean) no longer 
works and I disabled it. I suspect it's a bug in Pyrolite or Pickle.

Also addressed Josh's comments. 


---


[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1538#issuecomment-50108127
  
QA tests have started for PR 1538. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17163/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread kanzhang
Github user kanzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/1338#discussion_r15386313
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala ---
@@ -92,6 +104,46 @@ private[python] class DefaultConverter extends Converter[Any, Any] {
   }
 }
 
+/**
+ * A converter that converts common types to [[org.apache.hadoop.io.Writable]]. Note that array
+ * types are not supported since the user needs to subclass [[org.apache.hadoop.io.ArrayWritable]]
+ * to set the type properly. See [[org.apache.spark.api.python.DoubleArrayWritable]] and
+ * [[org.apache.spark.api.python.DoubleArrayToWritableConverter]] for an example. They are used in
+ * PySpark RDD `saveAsNewAPIHadoopFile` doctest.
+ */
+private[python] class JavaToWritableConverter extends Converter[Any, Writable] {
+
+  /**
+   * Converts common data types to [[org.apache.hadoop.io.Writable]]. Note that array types are not
+   * supported out-of-the-box.
+   */
+  private def convertToWritable(obj: Any): Writable = {
+    import collection.JavaConversions._
+    obj match {
+      case i: java.lang.Integer => new IntWritable(i)
+      case d: java.lang.Double => new DoubleWritable(d)
+      case l: java.lang.Long => new LongWritable(l)
+      case f: java.lang.Float => new FloatWritable(f)
+      case s: java.lang.String => new Text(s)
+      case b: java.lang.Boolean => new BooleanWritable(b)
+      case aob: Array[Byte] => new BytesWritable(aob)
+      case null => NullWritable.get()
+      case map: java.util.Map[_, _] =>
+        val mapWritable = new MapWritable()
+        map.foreach { case (k, v) =>
+          mapWritable.put(convertToWritable(k), convertToWritable(v))
+        }
+        mapWritable
+      case other => throw new SparkException(s"Data of type $other cannot be used")
--- End diff --

@JoshRosen thanks!


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50108039
  
@jkbradley You might want to create a JIRA for this one and ask Matei to 
assign to you. It's a big enough change to require one. :-)


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50107957
  
@jkbradley Awesome! 

A couple of quick thoughts:
+ I am not completely convinced about the strategy for 1a (I was expecting
thin wrappers for the regression and classification trees), but I guess that was
expected considering I am very familiar with the existing code. I will sleep
on it and get back. :-) To give a historical perspective, we had similarly
split implementations for regression and classification in the beginning that
we decided to combine into one. Perhaps it's the right time to split them
again. @etrain was also hinting at that in the multiclass review.
+ I have ensemble RF and Boosting implementations close to ready which will
need major refactoring or rewriting from scratch considering the magnitude of
this PR. That's fine, but we should try to get it accepted ASAP. I promise
prompt piecemeal reviews.
+ We should perform regression testing and compare with the 1.0 release.



---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50107921
  
QA tests have started for PR 1165. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17162/consoleFull


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1554#issuecomment-50107918
  
QA tests have started for PR 1554. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17161/consoleFull


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1554#issuecomment-50107704
  
LGTM. Waiting for Jenkins ...


---


[GitHub] spark pull request: [SPARK-2656] Python version of stratified samp...

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1554#issuecomment-50107711
  
Jenkins, test this please.


---


[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50107533
  
QA tests have started for PR 1346. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17160/consoleFull


---


[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50107549
  
QA results for PR 1555:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17159/consoleFull


---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1338#discussion_r15386068
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala ---
@@ -92,6 +104,46 @@ private[python] class DefaultConverter extends Converter[Any, Any] {
   }
 }
 
+/**
+ * A converter that converts common types to [[org.apache.hadoop.io.Writable]]. Note that array
+ * types are not supported since the user needs to subclass [[org.apache.hadoop.io.ArrayWritable]]
+ * to set the type properly. See [[org.apache.spark.api.python.DoubleArrayWritable]] and
+ * [[org.apache.spark.api.python.DoubleArrayToWritableConverter]] for an example. They are used in
+ * PySpark RDD `saveAsNewAPIHadoopFile` doctest.
+ */
+private[python] class JavaToWritableConverter extends Converter[Any, Writable] {
+
+  /**
+   * Converts common data types to [[org.apache.hadoop.io.Writable]]. Note that array types are not
+   * supported out-of-the-box.
+   */
+  private def convertToWritable(obj: Any): Writable = {
+    import collection.JavaConversions._
+    obj match {
+      case i: java.lang.Integer => new IntWritable(i)
+      case d: java.lang.Double => new DoubleWritable(d)
+      case l: java.lang.Long => new LongWritable(l)
+      case f: java.lang.Float => new FloatWritable(f)
+      case s: java.lang.String => new Text(s)
+      case b: java.lang.Boolean => new BooleanWritable(b)
+      case aob: Array[Byte] => new BytesWritable(aob)
+      case null => NullWritable.get()
+      case map: java.util.Map[_, _] =>
+        val mapWritable = new MapWritable()
+        map.foreach { case (k, v) =>
+          mapWritable.put(convertToWritable(k), convertToWritable(v))
+        }
+        mapWritable
+      case other => throw new SparkException(s"Data of type $other cannot be used")
--- End diff --

This comment also applies to the other unsupported type messages added in 
this PR.


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50107456
  
  @mengxr This does not have a specific JIRA.  (It does not solve the 
Python API JIRA [SPARK-2478] yet.)


---


[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50107147
  
@jkbradley Could you add the JIRA number to the PR title like 
`[SPARK-][MLLIB]`?




[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50106558
  
LGTM




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50106435
  
Unfortunately I can't think of why I did that. 




[GitHub] spark pull request: use config spark.scheduler.priority for specif...

2014-07-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1528#issuecomment-50106263
  
We shouldn't expose these types of hooks into the scheduler internals. The TaskSet, for instance, is an implementation detail we don't want to be part of a public API, and priority is an internal concept.

The public API of Spark for scheduling policies is the Fair Scheduler. Many different types of policies can be achieved within fair scheduling, including having a high-priority pool to which jobs are submitted, as sketched below.
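
For example (a sketch only; the pool name and its weight are made-up values, and the pool itself would be declared in the fair scheduler allocation file referenced by `spark.scheduler.allocation.file`):

```scala
// Sketch: route urgent jobs to a high-priority Fair Scheduler pool.
// Assumes a pool named "highPriority" with a large weight is defined
// in fairscheduler.xml.
val conf = new org.apache.spark.SparkConf()
  .setAppName("priority-demo")
  .set("spark.scheduler.mode", "FAIR")
val sc = new org.apache.spark.SparkContext(conf)

// Jobs submitted from this thread now go to the high-priority pool.
sc.setLocalProperty("spark.scheduler.pool", "highPriority")
sc.parallelize(1 to 100).count()

// Reset to the default pool for subsequent jobs from this thread.
sc.setLocalProperty("spark.scheduler.pool", null)
```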




[GitHub] spark pull request: SPARK-2657 Use more compact data structures th...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1555#issuecomment-50105999
  
QA tests have started for PR 1555. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17159/consoleFull




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50105915
  
Does anyone recall why we lost the closure cleaning in 
https://github.com/apache/spark/commit/6b288b75d4c05f42ad3612813dc77ff824bb6203 
?




[GitHub] spark pull request: [SPARK-2648] through shuffling blocksByAddress...

2014-07-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1549#issuecomment-50105526
  
@shivaram ya - we should back port it, but I think TD is cutting an RC for 1.0.2 right now. I'd prefer to back port it after that release (which is focused on just fixing a regression in 1.0.1) so it has a few weeks for people to run it before we ship it in a release, just in case this has unforeseen consequences; it looks simple, but patches like this can sometimes break assumptions downstream in other parts of the code.




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50105413
  
LGTM - probably good to backport as well




[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-24 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1338#discussion_r15385187
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala ---
@@ -92,6 +104,46 @@ private[python] class DefaultConverter extends Converter[Any, Any] {
   }
 }
 
+/**
+ * A converter that converts common types to [[org.apache.hadoop.io.Writable]]. Note that array
+ * types are not supported since the user needs to subclass [[org.apache.hadoop.io.ArrayWritable]]
+ * to set the type properly. See [[org.apache.spark.api.python.DoubleArrayWritable]] and
+ * [[org.apache.spark.api.python.DoubleArrayToWritableConverter]] for an example. They are used in
+ * PySpark RDD `saveAsNewAPIHadoopFile` doctest.
+ */
+private[python] class JavaToWritableConverter extends Converter[Any, Writable] {
+
+  /**
+   * Converts common data types to [[org.apache.hadoop.io.Writable]]. Note that array types are not
+   * supported out-of-the-box.
+   */
+  private def convertToWritable(obj: Any): Writable = {
+    import collection.JavaConversions._
+    obj match {
+      case i: java.lang.Integer => new IntWritable(i)
+      case d: java.lang.Double => new DoubleWritable(d)
+      case l: java.lang.Long => new LongWritable(l)
+      case f: java.lang.Float => new FloatWritable(f)
+      case s: java.lang.String => new Text(s)
+      case b: java.lang.Boolean => new BooleanWritable(b)
+      case aob: Array[Byte] => new BytesWritable(aob)
+      case null => NullWritable.get()
+      case map: java.util.Map[_, _] =>
+        val mapWritable = new MapWritable()
+        map.foreach { case (k, v) =>
+          mapWritable.put(convertToWritable(k), convertToWritable(v))
+        }
+        mapWritable
+      case other => throw new SparkException(s"Data of type $other cannot be used")
--- End diff --

It's probably better to log `${other.getClass.getName}` instead of 
`$other`, since the string representation of the object may not identify its 
class.
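
i.e., the last arm of the match above would become something like:

```scala
case other => throw new SparkException(
  s"Data of type ${other.getClass.getName} cannot be used")
```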




[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-24 Thread falaki
Github user falaki commented on the pull request:

https://github.com/apache/spark/pull/1025#issuecomment-50104259
  
This is the first place we introduce 'exact' to our API. We already have 
'approx' in function names. I think having both of them is confusing to users. 




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50103450
  
QA results for PR 1583:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17158/consoleFull




[GitHub] spark pull request: SPARK-2101: import unittest2 when using Python...

2014-07-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1042#issuecomment-50103281
  
Actually, scratch that: `pyqver` won't be able to check whether usage of an 
object as a context manager is supported because it won't be able to statically 
determine the type of the object.




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50103068
  
QA results for PR 1561:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17157/consoleFull




[GitHub] spark pull request: SPARK-2101: import unittest2 when using Python...

2014-07-24 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1042#issuecomment-50102861
  
It looks like `unittest2` isn't the only change required for the unit tests 
to pass on Python 2.6.

In `tests.py`, `createFileInZip()` uses `ZipFile` as a context manager, 
which is only supported in Python 2.7+ 
(https://docs.python.org/2.7/library/zipfile.html?highlight=zipfile#zipfile.ZipFile):

```python
with zipfile.ZipFile(path, 'w') as zip:
    zip.writestr(name, content)
return path
```

We should probably replace this with a try-finally block.
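
A 2.6-compatible drop-in replacement for the body above might look roughly like this (sketch; variable names follow the snippet quoted earlier):

```python
zip = zipfile.ZipFile(path, 'w')
try:
    zip.writestr(name, content)
finally:
    # ZipFile.close() flushes buffers and writes the central directory,
    # so it must run even if writestr raises.
    zip.close()
return path
```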

There's a neat tool called [pyqver](https://github.com/ghewgill/pyqver) 
that aims to identify the minimum required Python version for a particular 
script.  Unfortunately, it doesn't detect this ZipFile issue; I'll look into 
opening a pull request to add a check for it.




[GitHub] spark pull request: [SPARK-2665] [SQL] Add EqualNS & Unit Tests

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1570#issuecomment-50102831
  
QA results for PR 1570:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class EqualNullSafe(left: Expression, right: Expression) extends BinaryComparison {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17149/consoleFull




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-50102382
  
QA results for PR 1460:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class AutoSerializer(FramedSerializer):
  class Aggregator(object):
  class SimpleAggregator(Aggregator):
  class Merger(object):
  class InMemoryMerger(Merger):
  class ExternalMerger(Merger):
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17153/consoleFull




[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1538#issuecomment-50102363
  
QA results for PR 1538:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17156/consoleFull




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50102333
  
QA results for PR 1165:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  case class Sample(size: Long, numUpdates: Long)
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17155/consoleFull




[GitHub] spark pull request: [SQL]Update HiveMetastoreCatalog.scala

2014-07-24 Thread baishuo
Github user baishuo commented on the pull request:

https://github.com/apache/spark/pull/1569#issuecomment-50101782
  
Modified the title to add [SQL].




[GitHub] spark pull request: [SPARK-2082] stratified sampling in PairRDDFun...

2014-07-24 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1025#issuecomment-50101767
  
Sorry, how was the API changed? Was it making `sampleByKeyExact` a separate method and marking it experimental? That actually seems okay to me; the algorithm there is quite a bit more involved.
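
For reference, the split being discussed would be used roughly like this (illustrative fractions; method names as discussed in this PR, assuming the PairRDDFunctions implicits are in scope):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val fractions = Map("a" -> 0.5, "b" -> 0.5)  // per-key sampling rates

// Single pass; per-key sample sizes are only correct in expectation.
val approx = pairs.sampleByKey(withReplacement = false, fractions)

// Extra pass(es) over the data, but exact per-key sample sizes.
val exact = pairs.sampleByKeyExact(withReplacement = false, fractions)
```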




[GitHub] spark pull request: [SPARK-2648] through shuffling blocksByAddress...

2014-07-24 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1549#issuecomment-50101605
  
@pwendell Can this also be backported to 1.0 branch ?




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1583#issuecomment-50101552
  
QA tests have started for PR 1583. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17158/consoleFull




[GitHub] spark pull request: [SPARK-2410][SQL] Cherry picked Hive Thrift/JD...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-50101547
  
QA results for PR 1399:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17150/consoleFull




[GitHub] spark pull request: SPARK-1416: PySpark support for SequenceFile a...

2014-07-24 Thread rjurney
Github user rjurney commented on the pull request:

https://github.com/apache/spark/pull/455#issuecomment-50101315
  
I got this to run and I'm able to get work done!

Does this code have to be run on the latest Spark code? Would it run on 1.0?

On Tuesday, July 22, 2014, Eric Garcia  wrote:

> @MLnick , I made a PR here: #1536
> 
> @rjurney , the updated code works for the
> .avro file you posted though it is still not fully implemented for *all*
> data types. Note that any null values in your data will show up as an 
empty
> string "". For some reason I could not get Java null to convert to Python
> None.


-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com




[GitHub] spark pull request: [SPARK-2529] Clean closures in foreach and for...

2014-07-24 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1583

[SPARK-2529] Clean closures in foreach and foreachPartition.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark closureClean

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1583.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1583


commit 8982fe649e1d2ff669846462677281eed26676c3
Author: Reynold Xin 
Date:   2014-07-25T02:11:09Z

[SPARK-2529] Clean closures in foreach and foreachPartition.
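
Presumably the change is along these lines (a sketch of RDD.foreach with closure cleaning; foreachPartition would be analogous, and the real patch may differ):

```scala
def foreach(f: T => Unit) {
  // Run the closure through the ClosureCleaner before shipping it to
  // executors, as other RDD actions already do.
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
```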






[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1499#discussion_r15383498
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
@@ -0,0 +1,156 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.shuffle.sort
+
+import java.io.{BufferedOutputStream, File, FileOutputStream, DataOutputStream}
+
+import org.apache.spark.{MapOutputTracker, SparkEnv, Logging, TaskContext}
+import org.apache.spark.executor.ShuffleWriteMetrics
+import org.apache.spark.scheduler.MapStatus
+import org.apache.spark.serializer.Serializer
+import org.apache.spark.shuffle.{ShuffleWriter, BaseShuffleHandle}
+import org.apache.spark.storage.ShuffleBlockId
+import org.apache.spark.util.collection.ExternalSorter
+
+private[spark] class SortShuffleWriter[K, V, C](
+    handle: BaseShuffleHandle[K, V, C],
+    mapId: Int,
+    context: TaskContext)
+  extends ShuffleWriter[K, V] with Logging {
+
+  private val dep = handle.dependency
+  private val numPartitions = dep.partitioner.numPartitions
+
+  private val blockManager = SparkEnv.get.blockManager
+  private val ser = Serializer.getSerializer(dep.serializer.getOrElse(null))
+
+  private val conf = SparkEnv.get.conf
+  private val fileBufferSize = conf.getInt("spark.shuffle.file.buffer.kb", 100) * 1024
+
+  private var sorter: ExternalSorter[K, V, _] = null
+  private var outputFile: File = null
+
+  private var stopping = false
+  private var mapStatus: MapStatus = null
+
+  /** Write a bunch of records to this task's output */
+  override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
+    val partitions: Iterator[(Int, Iterator[Product2[K, _]])] = {
+      if (dep.mapSideCombine) {
+        if (!dep.aggregator.isDefined) {
+          throw new IllegalStateException("Aggregator is empty for map-side combine")
+        }
+        sorter = new ExternalSorter[K, V, C](
+          dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
+        sorter.write(records)
+        sorter.partitionedIterator
+      } else {
+        // In this case we pass neither an aggregator nor an ordering to the sorter, because we
+        // don't care whether the keys get sorted in each partition; that will be done on the
+        // reduce side if the operation being run is sortByKey.
+        sorter = new ExternalSorter[K, V, V](
+          None, Some(dep.partitioner), None, dep.serializer)
+        sorter.write(records)
+        sorter.partitionedIterator
+      }
+    }
+
+    // Create a single shuffle file with reduce ID 0 that we'll write all results to. We'll later
+    // serve different ranges of this file using an index file that we create at the end.
+    val blockId = ShuffleBlockId(dep.shuffleId, mapId, 0)
+    outputFile = blockManager.diskBlockManager.getFile(blockId)
+
+    // Track location of each range in the output file
+    val offsets = new Array[Long](numPartitions + 1)
+    val lengths = new Array[Long](numPartitions)
+
+    // Statistics
+    var totalBytes = 0L
+    var totalTime = 0L
+
+    for ((id, elements) <- partitions) {
+      if (elements.hasNext) {
+        val writer = blockManager.getDiskWriter(blockId, outputFile, ser, fileBufferSize)
+        for (elem <- elements) {
+          writer.write(elem)
+        }
+        writer.commit()
+        writer.close()
+        val segment = writer.fileSegment()
+        offsets(id + 1) = segment.offset + segment.length
+        lengths(id) = segment.length
+        totalTime += writer.timeWriting()
+        totalBytes += segment.length
+      } else {
+        // Don't create a new writer to avoid writing any headers and things like that
+        offsets(id + 1) = offsets(id)
+      }
+    }
+
+    val

[GitHub] spark pull request: [SPARK-2679] [MLLib] Ser/De for Double

2014-07-24 Thread dorx
Github user dorx commented on the pull request:

https://github.com/apache/spark/pull/1581#issuecomment-50101155
  
The issue is what other things we could reasonably serialize into 8 bytes. I'm not sure how other types of doubles are relevant here, since their size would be different and cause problems right away. Longs are also 8 bytes, though, as would be some scheme for serializing a small array of shorts/chars. It's a tradeoff between efficiency and safety. We can remove the magic byte if we assume no one is ever going to serialize an RDD of Doubles into 8-byte arrays and then use a Long deserializer on those arrays. A compromise would be embedding the type metadata in the RDD of byte arrays itself, so we don't incur the cost of per-point blow-up.
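
A minimal sketch of the magic-byte variant, assuming a made-up tag value, just to make the per-record overhead concrete (each 8-byte double becomes 9 bytes on the wire):

```scala
import java.nio.ByteBuffer

val DOUBLE_MAGIC: Byte = 0x1  // hypothetical tag value

def serializeDouble(d: Double): Array[Byte] =
  ByteBuffer.allocate(9).put(DOUBLE_MAGIC).putDouble(d).array()

def deserializeDouble(bytes: Array[Byte]): Double = {
  val buf = ByteBuffer.wrap(bytes)
  require(buf.get() == DOUBLE_MAGIC, "Not a serialized Double")
  buf.getDouble
}
```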




[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-50101034
  
QA results for PR 1460:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class AutoSerializer(FramedSerializer):
  class Aggregator(object):
  class SimpleAggregator(Aggregator):
  class Merger(object):
  class InMemoryMerger(Merger):
  class ExternalMerger(Merger):
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17152/consoleFull




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1561#issuecomment-50100968
  
QA tests have started for PR 1561. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17157/consoleFull




[GitHub] spark pull request: [mllib] Decision Tree API update and multiclas...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1582#issuecomment-50100883
  
QA results for PR 1582:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17151/consoleFull




[GitHub] spark pull request: Part of [SPARK-2456] Removed some HashMaps fro...

2014-07-24 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1561#discussion_r15383355
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -341,8 +336,9 @@ class DAGScheduler(
     if (registeredStages.isEmpty || registeredStages.get.isEmpty) {
       logError("No stages registered for job " + job.jobId)
     } else {
-      stageIdToJobIds.filterKeys(stageId => registeredStages.get.contains(stageId)).foreach {
-        case (stageId, jobSet) =>
+      stageIdToStage.filter(s => registeredStages.get.contains(s._1)).foreach {
--- End diff --

Actually, that's pretty cool. I looked into the source code some more; I didn't realize the whole FilteredKeys thing is "lazy".
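
A quick illustration of that laziness (Scala 2.10 collections; filterKeys returns a view over the underlying map rather than a copy):

```scala
import scala.collection.mutable

val m = mutable.Map(1 -> "a", 2 -> "b")
val filtered = m.filterKeys(_ >= 2)  // wraps m; no elements are copied

m(3) = "c"
filtered.contains(3)  // true: the predicate is re-evaluated on access
filtered.contains(1)  // false: filtered out lazily
```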




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15383252
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -275,8 +426,36 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
   override def contains(blockId: BlockId): Boolean = {
     entries.synchronized { entries.containsKey(blockId) }
   }
+
+  /**
+   * Reserve memory for unrolling blocks used by this thread.
+   */
+  private def reserveUnrollMemory(memory: Long): Unit = putLock.synchronized {
+    unrollMemoryMap(Thread.currentThread().getId) = memory
+  }
+
+  /**
+   * Release memory used by this thread for unrolling blocks.
+   */
+  private[spark] def releaseUnrollMemory(): Unit = putLock.synchronized {
--- End diff --

That's a little long. How about `releaseThreadUnrollMemory`?




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15383242
  
--- Diff: core/src/test/scala/org/apache/spark/CacheManagerSuite.scala ---
@@ -52,22 +50,21 @@ class CacheManagerSuite extends FunSuite with BeforeAndAfter with EasyMockSugar
   }
 
   test("get uncached rdd") {
-    expecting {
-      blockManager.get(RDDBlockId(0, 0)).andReturn(None)
-      blockManager.put(RDDBlockId(0, 0), ArrayBuffer[Any](1, 2, 3, 4), StorageLevel.MEMORY_ONLY,
-        true).andStubReturn(Seq[(BlockId, BlockStatus)]())
-    }
-
-    whenExecuting(blockManager) {
-      val context = new TaskContext(0, 0, 0)
-      val value = cacheManager.getOrCompute(rdd, split, context, StorageLevel.MEMORY_ONLY)
-      assert(value.toList === List(1, 2, 3, 4))
-    }
+    // Do not mock this test, because attempting to match Array[Any], which is not covariant,
+    // in blockManager.put is a losing battle. You have been warned.
+    blockManager = sc.env.blockManager
+    cacheManager = sc.env.cacheManager
+    val context = new TaskContext(0, 0, 0)
+    val computeValue = cacheManager.getOrCompute(rdd, split, context, StorageLevel.MEMORY_ONLY)
+    val getValue = blockManager.get(RDDBlockId(rdd.id, split.index))
+    assert(computeValue.toList === List(1, 2, 3, 4))
+    assert(getValue.isDefined, "Block cached from getOrCompute is not found!")
+    assert(getValue.get.data.toArray === List(1, 2, 3, 4))
--- End diff --

Kind of weird: how can the result of toArray be equal to a List? I guess it compares Seqs? Maybe it's better to say toList here.
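
For reference, plain Scala equality behaves like this; ScalaTest's === adds its own structural handling for arrays, which is presumably what makes the original assertion pass:

```scala
val arr = Array(1, 2, 3, 4)

arr == List(1, 2, 3, 4)             // false: Array does not use element equality
arr.toList == List(1, 2, 3, 4)      // true: element-wise Seq equality
arr.sameElements(List(1, 2, 3, 4))  // true: explicit element-wise comparison
```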




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15383171
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -275,8 +426,36 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
   override def contains(blockId: BlockId): Boolean = {
     entries.synchronized { entries.containsKey(blockId) }
   }
+
+  /**
+   * Reserve memory for unrolling blocks used by this thread.
+   */
+  private def reserveUnrollMemory(memory: Long): Unit = putLock.synchronized {
+    unrollMemoryMap(Thread.currentThread().getId) = memory
+  }
+
+  /**
+   * Release memory used by this thread for unrolling blocks.
+   */
+  private[spark] def releaseUnrollMemory(): Unit = putLock.synchronized {
+    unrollMemoryMap.remove(Thread.currentThread().getId)
+  }
+
+  /**
+   * Return the amount of memory currently occupied for unrolling blocks across all threads.
+   */
+  private def currentUnrollMemory: Long = putLock.synchronized {
+    unrollMemoryMap.values.sum
+  }
+
+  /**
+   * Return the amount of memory currently occupied for unrolling blocks by this thread.
+   */
+  private def threadCurrentUnrollMemory: Long = putLock.synchronized {
--- End diff --

Maybe call it currentUnrollMemoryForThisThread or unrollMemoryForThisThread.




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15383154
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -275,8 +426,36 @@ private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
   override def contains(blockId: BlockId): Boolean = {
     entries.synchronized { entries.containsKey(blockId) }
   }
+
+  /**
+   * Reserve memory for unrolling blocks used by this thread.
+   */
+  private def reserveUnrollMemory(memory: Long): Unit = putLock.synchronized {
+    unrollMemoryMap(Thread.currentThread().getId) = memory
+  }
+
+  /**
+   * Release memory used by this thread for unrolling blocks.
+   */
+  private[spark] def releaseUnrollMemory(): Unit = putLock.synchronized {
--- End diff --

For clarity, call it releaseUnrollMemoryForThisThread.




[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1538#issuecomment-50100079
  
QA tests have started for PR 1538. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17156/consoleFull




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15383132
  
--- Diff: core/src/main/scala/org/apache/spark/storage/MemoryStore.scala ---
@@ -20,25 +20,43 @@ package org.apache.spark.storage
 import java.nio.ByteBuffer
 import java.util.LinkedHashMap
 
+import scala.collection.mutable
 import scala.collection.mutable.ArrayBuffer
 
 import org.apache.spark.util.{SizeEstimator, Utils}
+import org.apache.spark.util.collection.SizeTrackingVector
 
 private case class MemoryEntry(value: Any, size: Long, deserialized: Boolean)
 
 /**
- * Stores blocks in memory, either as ArrayBuffers of deserialized Java objects or as
+ * Stores blocks in memory, either as Arrays of deserialized Java objects or as
  * serialized ByteBuffers.
  */
-private class MemoryStore(blockManager: BlockManager, maxMemory: Long)
+private[spark] class MemoryStore(blockManager: BlockManager, maxMemory: Long)
   extends BlockStore(blockManager) {
 
+  private val conf = blockManager.conf
   private val entries = new LinkedHashMap[BlockId, MemoryEntry](32, 0.75f, true)
+
   @volatile private var currentMemory = 0L
+
   // Object used to ensure that only one thread is putting blocks and if necessary, dropping
   // blocks from the memory store.
   private val putLock = new Object()
--- End diff --

Maybe rename this to accountingLock, since we also use it to guard access to unrollMemoryMap.




[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1165#issuecomment-50099588
  
QA tests have started for PR 1165. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17155/consoleFull



