[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50580086
  
QA results for PR 1338:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...

2014-07-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1538#issuecomment-50579956
  
LGTM - thanks andrew!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-50579747
  
QA tests have started for PR 1498. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17428/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-29 Thread naftaliharris
Github user naftaliharris commented on a diff in the pull request:

https://github.com/apache/spark/pull/1493#discussion_r15568446
  
--- Diff: python/pyspark/mllib/classification.py ---
@@ -63,7 +63,10 @@ class LogisticRegressionModel(LinearModel):
 def predict(self, x):
 _linear_predictor_typecheck(x, self._coeff)
 margin = _dot(x, self._coeff) + self._intercept
-prob = 1/(1 + exp(-margin))
+if margin > 0:
+prob = 1 / (1 + exp(-margin))
+else:
+prob = 1 - 1 / (1 + exp(margin))
--- End diff --

Sure, pull request here! #1652 

Thanks a lot! :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Avoid numerical instability

2014-07-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1652#issuecomment-50579636
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-50579659
  
Hey @nkronenfeld - I traced through the exact function call more closely 
and I actually think it's fine. The issue I pointed out in the JIRA is 
orthogonal. So I'm fine to just revert this back to always showing the status. 
However, we should not mark this as a developer API. This is a stable API we 
are happy to support forever.

Still, this will cause a significant amount of object allocation due to the 
way other internal function calls happen (it is basically O(all blocks)) for an 
application. It might be nice to add a note to the docs that the operation 
might be expensive and should not be called inside of a critical code path. 
Thought we could likely optimize those things down the road.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Avoid numerical instability

2014-07-29 Thread naftaliharris
GitHub user naftaliharris opened a pull request:

https://github.com/apache/spark/pull/1652

Avoid numerical instability

This avoids basically doing 1 - 1, for example:

```python
>>> from math import exp
>>> margin = -40
>>> 1 - 1 / (1 + exp(margin))
0.0
>>> exp(margin) / (1 + exp(margin))
4.248354255291589e-18
>>>
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/naftaliharris/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1652.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1652


commit 0d55a9fae74edf990a087463a52b81ef196862a2
Author: Naftali Harris 
Date:   2014-07-30T06:46:30Z

Avoid numerical instability

This avoids basically doing 1 - 1, for example:

>>> from math import exp
>>> margin = -40
>>> 1 - 1 / (1 + exp(margin))
0.0
>>> exp(margin) / (1 + exp(margin))
4.248354255291589e-18
>>>




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1650#issuecomment-50579517
  
QA results for PR 1650:- This patch FAILED unit tests.- This patch 
merges cleanly- This patch adds the following public classes 
(experimental):case class DropTable(tableName: String) extends LeafNode 
with Command {For more information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17420/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50579245
  
QA results for PR 929:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17422/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

2014-07-29 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/1551#discussion_r15568273
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
   throw new SparkException("Unexpected Tuple2 element type " + 
pair._1.getClass)
   }
 case other =>
-  throw new SparkException("Unexpected element type " + 
first.getClass)
+  if (other == null) {
+dataOut.writeInt(SpecialLengths.NULL)
+writeIteratorToStream(iter, dataOut)
--- End diff --

If users want to call UDF in Java/Scala from PySpark, they have to use this 
private API to do it, so it's possible to have null in RDD[string] or 
RDD[Array[Byte]].

BTW, it will be helpful if we can skip some BAD rows during map/reduce, 
which was mentioned in MapReduce paper. This is not MUST have feature, but it 
really improve the robustness of whole framework, very useful for large scale 
jobs.

This PR try to improve the stability of PySpark, let users feel safer and 
happier to hack in PySpark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15568250
  
--- Diff: 
extras/spark-kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCount.scala
 ---
@@ -0,0 +1,345 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.streaming
+
+import java.nio.ByteBuffer
+import org.apache.log4j.Level
+import org.apache.log4j.Logger
+import org.apache.spark.Logging
+import org.apache.spark.SparkConf
+import org.apache.spark.SparkContext._
+import org.apache.spark.streaming.Milliseconds
+import org.apache.spark.streaming.StreamingContext
+import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer
+import org.apache.spark.streaming.kinesis.KinesisUtils
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
+import com.amazonaws.services.kinesis.AmazonKinesisClient
+import com.amazonaws.services.kinesis.model.PutRecordRequest
+import scala.util.Random
+import 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * Kinesis Spark Streaming WordCount example.
+ *
+ * See 
http://spark.apache.org/docs/latest/streaming-programming-guide.html for more 
details on the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per 
shard of the given stream.
+ * It then starts pulling from the tip of the given  and 
 at the given .
+ * Because we're pulling from the tip (InitialPositionInStream.LATEST), 
only new stream data will be picked up after the KinesisReceiver starts.
+ * This could lead to missed records if data is added to the stream while 
no KinesisReceivers are running.
+ * In production, you'll want to switch to 
InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis 
limit) of previous stream data 
+ *  depending on the checkpoint frequency.
+ *
+ * InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing 
of records depending on the checkpoint frequency.
+ * Record processing should be idempotent when possible.
+ *
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for 
credentials in the following order of precedence:
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location (~/.aws/credentials) shared 
by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 
metadata service
+ *
+ * Usage: KinesisWordCount   
+ *is the name of the Kinesis stream (ie. mySparkStream)
+ *is the endpoint of the Kinesis service (ie. 
https://kinesis.us-east-1.amazonaws.com)
+ *is the batch interval in millis (ie. 1000ms)
+ *
+ * Example:
+ *  $ export AWS_ACCESS_KEY_ID=
+ *  $ export AWS_SECRET_KEY=
+ *$ bin/run-kinesis-example \
+ *org.apache.spark.examples.streaming.KinesisWordCount 
mySparkStream https://kinesis.us-east-1.amazonaws.com 100
+ *
+ * There is a companion helper class below called KinesisWordCountProducer 
which puts dummy data onto the Kinesis stream.
+ * Usage instructions for KinesisWordCountProducer are provided in that 
class definition.
+ */
+object KinesisWordCount extends Logging {
+  val WordSeparator = " "
+
+  def main(args: Array[String]) {
+/**
+ * Check that all required args were passed in.
+ */
+if (args.length < 3) {
+  System.err.println("Usage: KinesisWordCount  
 ")
+  System.exit(1)
+}
+
+/**
+ * (This was lifted from the StreamingExamples.scala in order to avoid 
the depend

[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...

2014-07-29 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/1650#discussion_r15568205
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -377,6 +378,10 @@ private[hive] object HiveQl {
   }
 
   protected def nodeToPlan(node: Node): LogicalPlan = node match {
+// Special drop table that also uncaches.
+case Token("TOK_DROPTABLE",
+   Token("TOK_TABNAME",
+  Token(tableName, Nil) :: Nil) :: Nil) => DropTable(tableName)
--- End diff --

Seems we also need to support refer to a table with the format of 
`dbName.tableName` and `IF EXISTS`.

An example AST:
The AST tree of `drop table if exists default.src` is 
```
TOK_DROPTABLE
  TOK_TABNAME
default
src
  TOK_IFEXISTS
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2054][SQL] Code Generation for Expressi...

2014-07-29 Thread ueshin
Github user ueshin commented on the pull request:

https://github.com/apache/spark/pull/993#issuecomment-50578802
  
Hi @marmbrus, thanks for great work!
But it seems to break build.

I got the following result when I run `sbt assembly` or `sbt publish-local`:

```
[error] (catalyst/compile:doc) Scaladoc generation failed
```

and I found a lot of error messages in the build log saying `value q is not 
a member of StringContext`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1651#issuecomment-50578510
  
QA results for PR 1651:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17421/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-29 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50577874
  
Sure, added the notes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SQL] Handle null values in debug()

2014-07-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1646


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1562


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50577707
  
QA tests have started for PR 1635. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17427/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50577608
  
Yeah, I'm hoping to merge #1346 as soon as it passes Jenkins, so I'd wait 
for that.

> I also thought I’d add a couple of notes on what I had in mind with 
this patch: ...

Can you add these notes to the PR description so that they get included in 
the commit message?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...

2014-07-29 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/1635#issuecomment-50577431
  
Jenkins retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1648#issuecomment-50577383
  
QA tests have started for PR 1648. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17425/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15567741
  
--- Diff: 
extras/spark-kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCount.java
 ---
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.streaming;
+
+import java.util.List;
+import java.util.regex.Pattern;
+
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+import org.apache.spark.streaming.Duration;
+import org.apache.spark.streaming.Milliseconds;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+import org.apache.spark.streaming.dstream.DStream;
+import org.apache.spark.streaming.kinesis.KinesisRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisUtils;
+
+import scala.Tuple2;
+
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
+import com.amazonaws.services.kinesis.AmazonKinesisClient;
+import 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
+import com.google.common.base.Optional;
+import com.google.common.collect.Lists;
+
+/**
+ * Java-friendly Kinesis Spark Streaming WordCount example
+ *
+ * See 
http://spark.apache.org/docs/latest/streaming-programming-guide.html for more 
details on the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per 
shard of the given stream.
+ * It then starts pulling from the tip of the given  and 
 at the given .
+ * Because we're pulling from the tip (InitialPositionInStream.LATEST), 
only new stream data will be picked up after the KinesisReceiver starts.
+ * This could lead to missed records if data is added to the stream while 
no KinesisReceivers are running.
+ * In production, you'll want to switch to 
InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis 
limit) of previous stream data 
+ *  depending on the checkpoint frequency.
+ * InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing 
of records depending on the checkpoint frequency.
+ * Record processing should be idempotent when possible.
+ *
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for 
credentials in the following order of precedence: 
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location 
(~/.aws/credentials) shared by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 
metadata service
+ *
+ * Usage: JavaKinesisWordCount   

+ *  is the name of the Kinesis stream (ie. 
mySparkStream)
+ *  is the endpoint of the Kinesis service (ie. 
https://kinesis.us-east-1.amazonaws.com)
+ *  is the batch interval in milliseconds (ie. 
1000ms)
+ *
+ * Example:
+ *  $ export AWS_ACCESS_KEY_ID=
+ *  $ export AWS_SECRET_KEY=
+ *$ bin/run-kinesis-example  \
+ *org.apache.spark.examples.streaming.JavaKinesisWordCount 
mySparkStream https://kinesis.us-east-1.amazonaws.com 1000
+ *
+ * There is a companion helper class called KinesisWordCountProducer which 
puts dummy data onto the Kinesis stream. 
+ * Usage instructions for KinesisWordCountProducer ar

[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50577408
  
QA tests have started for PR 1639. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17426/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-29 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50577338
  
Sure, I’m fine with reworking based on other changes (it seems that some 
merge conflicts have already cropped up in master since I submitted my PR last 
week). I think my change set is a little simpler than the one you linked to, so 
would it make sense for me to wait until that one goes in?

I also thought I’d add a couple of notes on what I had in mind with this 
patch:

1) I added a new Row serialization pathway between python and java, based 
on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t 
overjoyed about doing this, but I noticed that some QueryPlans implement 
optimizations in executeCollect(), which outputs an Array[Row] rather than the 
typical RDD[Row] that can be shipped to python using the existing serialization 
code. To me it made sense to ship the Array[Row] over to python directly 
instead of converting it back to an RDD[Row] just for the purpose of sending 
the Rows to python using the existing serialization code. But let me know if 
you have any thoughts about this.

2) I moved JavaStackTrace from rdd.py to context.py. This made sense to me 
since JavaStackTrace is all about configuring a context attribute, and the 
_extract_concise_traceback function it depends on was already being called 
separately from context.py (as a ‘private’ function of rdd.py).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1648#issuecomment-50577278
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1648#issuecomment-50577269
  
Jenkins, what are you doing ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50577245
  
QA results for PR 1338:- This patch FAILED unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50577126
  
QA results for PR 1639:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17418/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15567656
  
--- Diff: 
extras/spark-kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCount.java
 ---
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.streaming;
+
+import java.util.List;
+import java.util.regex.Pattern;
+
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+import org.apache.spark.streaming.Duration;
+import org.apache.spark.streaming.Milliseconds;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+import org.apache.spark.streaming.dstream.DStream;
+import org.apache.spark.streaming.kinesis.KinesisRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisUtils;
+
+import scala.Tuple2;
+
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
+import com.amazonaws.services.kinesis.AmazonKinesisClient;
+import 
com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
+import com.google.common.base.Optional;
+import com.google.common.collect.Lists;
+
+/**
+ * Java-friendly Kinesis Spark Streaming WordCount example
+ *
+ * See 
http://spark.apache.org/docs/latest/streaming-programming-guide.html for more 
details on the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per 
shard of the given stream.
+ * It then starts pulling from the tip of the given  and 
 at the given .
+ * Because we're pulling from the tip (InitialPositionInStream.LATEST), 
only new stream data will be picked up after the KinesisReceiver starts.
+ * This could lead to missed records if data is added to the stream while 
no KinesisReceivers are running.
+ * In production, you'll want to switch to 
InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis 
limit) of previous stream data 
+ *  depending on the checkpoint frequency.
+ * InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing 
of records depending on the checkpoint frequency.
+ * Record processing should be idempotent when possible.
+ *
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for 
credentials in the following order of precedence: 
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location 
(~/.aws/credentials) shared by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 
metadata service
+ *
+ * Usage: JavaKinesisWordCount   

+ *  is the name of the Kinesis stream (ie. 
mySparkStream)
+ *  is the endpoint of the Kinesis service (ie. 
https://kinesis.us-east-1.amazonaws.com)
+ *  is the batch interval in milliseconds (ie. 
1000ms)
+ *
+ * Example:
+ *  $ export AWS_ACCESS_KEY_ID=
+ *  $ export AWS_SECRET_KEY=
+ *$ bin/run-kinesis-example  \
+ *org.apache.spark.examples.streaming.JavaKinesisWordCount 
mySparkStream https://kinesis.us-east-1.amazonaws.com 1000
+ *
+ * There is a companion helper class called KinesisWordCountProducer which 
puts dummy data onto the Kinesis stream. 
+ * Usage instructions for KinesisWordCountProducer ar

[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1639#discussion_r15567618
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1239,6 +1239,28 @@ abstract class RDD[T: ClassTag](
   /** The [[org.apache.spark.SparkContext]] that this RDD was created on. 
*/
   def context = sc
 
+  /**
+   * Private API for changing an RDD's ClassTag.
+   * Used for internal Java <-> Scala API compatibility.
+   */
+  private[spark] def retag(cls: Class[T]): RDD[T] = {
+val classTag: ClassTag[T] = ClassTag.apply(cls)
+this.retag(classTag)
+  }
+
+  /**
+   * Private API for changing an RDD's ClassTag.
+   * Used for internal Java <-> Scala API compatibility.
+   */
+  private[spark] def retag(classTag: ClassTag[T]): RDD[T] = {
+val oldRDD = this
+new RDD[T](sc, Seq(new OneToOneDependency(this)))(classTag) {
+  override protected def getPartitions: Array[Partition] = 
oldRDD.getPartitions
+  override def compute(split: Partition, context: TaskContext): 
Iterator[T] =
+oldRDD.compute(split, context)
+}
--- End diff --

Would there be any performance impact of running `mapPartitions(identity, 
preservesPartitioning = true)(classTag)`?  If we have an RDD that's persisted 
in a serialized format, wouldn't this extra map force an unnecessary 
deserialization?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50576705
  
Sure, sounds good. Did you see my comments on preserving partitions too 
though?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50576604
  
QA tests have started for PR 1338. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50576414
  
My last commit made `classTag` implicit in the retag() method, so in many 
cases the Scala code can be written as `someJavaRDD.rdd.retag.[...].collect()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50576339
  
QA tests have started for PR 1346. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50576330
  
This method is intended to be called by Scala classes that implement 
Java-friendly wrappers for the Spark Scala API.  For instance, MLlib has APIs 
that accept RDD[LabelledPoint].  Ideally, the Java wrapper code can simply call 
the underlying Scala methods without having to worry about how they're 
implemented.  Therefore, I think we should prefer the `retag()`-based approach, 
since  `collectSeq` would require us to modify the Scala consumer of the RDD.

Since this is a private, internal API, we should be able to revisit this 
decision if we change our minds later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-29 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50576308
  
@chenghao-intel `containsNull` and `valueContainsNull` can be used for 
further optimization. For example, let's say we have an `ArrayType` column and 
the element type is `IntegerType`. If elements of those arrays do not have 
`null` values, we can use a primitive array internal. Since we will expose data 
types to users, we need to introduce these two booleans with this PR. It can be 
hard to add them once users start to use these APIs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-07-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1592#issuecomment-50576177
  
Thanks for working on this!  We'll need to coordinate merging with #1346 
and related PRs. (cc @yhuai)

@JoshRosen can you look at the other pyspark changes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support

2014-07-29 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/1434#discussion_r15567310
  
--- Diff: pom.xml ---
@@ -970,6 +972,14 @@
   
 
 
+
+
+  spark-kinesis-asl
--- End diff --

Profile can be named `kinesis-asl`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50576082
  
QA tests have started for PR 929. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17422/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50575931
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1628#issuecomment-50575901
  
We pinged Davies today. It seems to be a well-known problem with Python. 
There are ways to force import a standard module in Python 2, but they are all 
very messy:

1. 
https://www.inkling.com/read/learning-python-mark-lutz-4th/chapter-23/package-relative-imports
2. https://hkn.eecs.berkeley.edu/~dyoo/python/__std__/
3. http://www.ianbicking.org/py-std.html


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1551#discussion_r15567208
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
   throw new SparkException("Unexpected Tuple2 element type " + 
pair._1.getClass)
   }
 case other =>
-  throw new SparkException("Unexpected element type " + 
first.getClass)
+  if (other == null) {
+dataOut.writeInt(SpecialLengths.NULL)
+writeIteratorToStream(iter, dataOut)
--- End diff --

Right, but that's a private API, it doesn't matter. Does our own code do it?

Basically I'm worried that this significantly complicates our code for 
something that shouldn't happen. I'd rather have an NPE if our own code later 
passes nulls here (cause it really shouldn't be doing that since we control 
everything we pass in).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1651#issuecomment-50575608
  
QA tests have started for PR 1651. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17421/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-29 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-50575597
  
That could be a bug, not something due to the timeout. It complains of the 
test taking 20s. Can you look at what that test was doing? As far as I can tell 
it's not even running in cluster mode.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1628#issuecomment-50575485
  
If you can't figure out whether this is possible, consider pinging Josh or 
Davies too. I'd be surprised if there's no way around this because there are a 
*lot* of top-level packages in Python. There's gotta be a way to import our own 
vs importing theirs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1645#issuecomment-50575430
  
QA results for PR 1645:- This patch FAILED unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17416/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...

2014-07-29 Thread haoyuan
GitHub user haoyuan opened a pull request:

https://github.com/apache/spark/pull/1651

[SPARK-2702][Core] Upgrade Tachyon dependency to 0.5.0



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/haoyuan/spark upgrade-tachyon

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1651.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1651


commit 6f3f98fb08a9be6b582672bb0bb7756b98ae6193
Author: Haoyuan Li 
Date:   2014-07-30T05:33:58Z

upgrade tachyon to 0.5.0




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-50575398
  
I did a pass through this -- looks pretty good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567109
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@
 
 package org.apache.spark.scheduler
 
-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer
 
-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

Ah dunno, I just thought we were doing that, but I guess it's not in 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports.
 No need to do it then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1650#issuecomment-50575367
  
QA tests have started for PR 1650. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17420/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567087
  
--- Diff: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala ---
@@ -89,7 +91,7 @@ class ContextCleanerSuite extends FunSuite with 
BeforeAndAfter with LocalSparkCo
   }
 
   test("automatically cleanup RDD") {
-var rdd = newRDD.persist()
+var rdd = newRDD().persist()
--- End diff --

Does assertCleanup below also check that this RDD's broadcast was cleaned? 
It seems like it doesn't, since you only pass in the RDD ID. Maybe we can also 
grab its broadcast ID somehow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567061
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -57,6 +57,12 @@ private[spark] object Utils extends Logging {
 new File(sparkHome + File.separator + "bin", which + suffix)
   }
 
+  /** Serialize an object using the closure serializer. */
+  def serializeTaskClosure[T: ClassTag](o: T): Array[Byte] = {
+val ser = SparkEnv.get.closureSerializer.newInstance()
--- End diff --

that's a good idea actually.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567051
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -691,47 +689,81 @@ class DAGScheduler(
 }
   }
 
-
   /** Called when stage's parents are available and we can now do its 
task. */
   private def submitMissingTasks(stage: Stage, jobId: Int) {
 logDebug("submitMissingTasks(" + stage + ")")
 // Get our pending tasks and remember them in our pendingTasks entry
 stage.pendingTasks.clear()
 var tasks = ArrayBuffer[Task[_]]()
+
+val properties = if (jobIdToActiveJob.contains(jobId)) {
+  jobIdToActiveJob(stage.jobId).properties
+} else {
+  // this stage will be assigned to "default" pool
+  null
+}
+
+runningStages += stage
+// SparkListenerStageSubmitted should be posted before testing whether 
tasks are
+// serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
+// will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
+// event.
+listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
+
+// TODO: Maybe we can keep the taskBinary in Stage to avoid 
serializing it multiple times.
+// Broadcasted binary for the task, used to dispatch tasks to 
executors. Note that we broadcast
+// the serialized copy of the RDD and for each task we will 
deserialize it, which means each
+// task gets a different copy of the RDD. This provides stronger 
isolation between tasks that
+// might modify state of objects referenced in their closures. This is 
necessary in Hadoop
+// where the JobConf/Configuration object is not thread-safe.
+var taskBinary: Broadcast[Array[Byte]] = null
+try {
+  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
+  // For ResultTask, serialize and broadcast (rdd, func).
+  val taskBinaryBytes: Array[Byte] =
+if (stage.isShuffleMap) {
+  Utils.serializeTaskClosure((stage.rdd, stage.shuffleDep.get) : 
AnyRef)
+} else {
+  Utils.serializeTaskClosure((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef)
+}
+  taskBinary = sc.broadcast(taskBinaryBytes)
+} catch {
+  // In the case of a failure during serialization, abort the stage.
+  case e: NotSerializableException =>
+abortStage(stage, "Task not serializable: " + e.toString)
+runningStages -= stage
+return
+  case NonFatal(e) =>
+abortStage(stage, s"Task serialization failed: 
$e\n${e.getStackTraceString}")
+runningStages -= stage
+return
+}
+
 if (stage.isShuffleMap) {
   for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) 
{
 val locs = getPreferredLocs(stage.rdd, p)
-tasks += new ShuffleMapTask(stage.id, stage.rdd, 
stage.shuffleDep.get, p, locs)
+val part = stage.rdd.partitions(p)
+tasks += new ShuffleMapTask(stage.id, taskBinary, part, locs)
   }
 } else {
   // This is a final stage; figure out its job's missing partitions
   val job = stage.resultOfJob.get
   for (id <- 0 until job.numPartitions if !job.finished(id)) {
-val partition = job.partitions(id)
-val locs = getPreferredLocs(stage.rdd, partition)
-tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, 
locs, id)
+val p: Int = job.partitions(id)
+val part = stage.rdd.partitions(p)
+val locs = getPreferredLocs(stage.rdd, p)
+tasks += new ResultTask(stage.id, taskBinary, part, locs, id)
   }
 }
 
-val properties = if (jobIdToActiveJob.contains(jobId)) {
-  jobIdToActiveJob(stage.jobId).properties
-} else {
-  // this stage will be assigned to "default" pool
-  null
-}
-
 if (tasks.size > 0) {
-  runningStages += stage
-  // SparkListenerStageSubmitted should be posted before testing 
whether tasks are
-  // serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
-  // will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
-  // event.
-  listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
-
   // Preemptively serialize a task to make sure it can be serialized. 
We are catching this
   // exception here because it would be fairly hard to catch the 
non-serializable exception
   // down the road, where we have several different implementations 
for local sched

[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567033
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -691,47 +689,81 @@ class DAGScheduler(
 }
   }
 
-
   /** Called when stage's parents are available and we can now do its 
task. */
   private def submitMissingTasks(stage: Stage, jobId: Int) {
 logDebug("submitMissingTasks(" + stage + ")")
 // Get our pending tasks and remember them in our pendingTasks entry
 stage.pendingTasks.clear()
 var tasks = ArrayBuffer[Task[_]]()
+
+val properties = if (jobIdToActiveJob.contains(jobId)) {
+  jobIdToActiveJob(stage.jobId).properties
+} else {
+  // this stage will be assigned to "default" pool
+  null
+}
+
+runningStages += stage
+// SparkListenerStageSubmitted should be posted before testing whether 
tasks are
+// serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
+// will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
+// event.
+listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
+
+// TODO: Maybe we can keep the taskBinary in Stage to avoid 
serializing it multiple times.
+// Broadcasted binary for the task, used to dispatch tasks to 
executors. Note that we broadcast
+// the serialized copy of the RDD and for each task we will 
deserialize it, which means each
+// task gets a different copy of the RDD. This provides stronger 
isolation between tasks that
+// might modify state of objects referenced in their closures. This is 
necessary in Hadoop
+// where the JobConf/Configuration object is not thread-safe.
+var taskBinary: Broadcast[Array[Byte]] = null
+try {
+  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
+  // For ResultTask, serialize and broadcast (rdd, func).
+  val taskBinaryBytes: Array[Byte] =
+if (stage.isShuffleMap) {
+  Utils.serializeTaskClosure((stage.rdd, stage.shuffleDep.get) : 
AnyRef)
+} else {
+  Utils.serializeTaskClosure((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef)
+}
+  taskBinary = sc.broadcast(taskBinaryBytes)
+} catch {
+  // In the case of a failure during serialization, abort the stage.
+  case e: NotSerializableException =>
+abortStage(stage, "Task not serializable: " + e.toString)
+runningStages -= stage
+return
+  case NonFatal(e) =>
+abortStage(stage, s"Task serialization failed: 
$e\n${e.getStackTraceString}")
+runningStages -= stage
+return
+}
+
 if (stage.isShuffleMap) {
   for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) 
{
 val locs = getPreferredLocs(stage.rdd, p)
-tasks += new ShuffleMapTask(stage.id, stage.rdd, 
stage.shuffleDep.get, p, locs)
+val part = stage.rdd.partitions(p)
+tasks += new ShuffleMapTask(stage.id, taskBinary, part, locs)
   }
 } else {
   // This is a final stage; figure out its job's missing partitions
   val job = stage.resultOfJob.get
   for (id <- 0 until job.numPartitions if !job.finished(id)) {
-val partition = job.partitions(id)
-val locs = getPreferredLocs(stage.rdd, partition)
-tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, 
locs, id)
+val p: Int = job.partitions(id)
+val part = stage.rdd.partitions(p)
+val locs = getPreferredLocs(stage.rdd, p)
+tasks += new ResultTask(stage.id, taskBinary, part, locs, id)
   }
 }
 
-val properties = if (jobIdToActiveJob.contains(jobId)) {
-  jobIdToActiveJob(stage.jobId).properties
-} else {
-  // this stage will be assigned to "default" pool
-  null
-}
-
 if (tasks.size > 0) {
-  runningStages += stage
-  // SparkListenerStageSubmitted should be posted before testing 
whether tasks are
-  // serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
-  // will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
-  // event.
-  listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
-
   // Preemptively serialize a task to make sure it can be serialized. 
We are catching this
   // exception here because it would be fairly hard to catch the 
non-serializable exception
   // down the road, where we have several different implementations 
for local sched

[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567030
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -57,6 +57,12 @@ private[spark] object Utils extends Logging {
 new File(sparkHome + File.separator + "bin", which + suffix)
   }
 
+  /** Serialize an object using the closure serializer. */
+  def serializeTaskClosure[T: ClassTag](o: T): Array[Byte] = {
+val ser = SparkEnv.get.closureSerializer.newInstance()
--- End diff --

We could also save an instance of the closure serializer in DAGScheduler 
instead, since it executes everything in one thread. Probably not a big deal 
though but it's something to consider if you're refactoring this code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567022
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@
 
 package org.apache.spark.scheduler
 
-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer
 
-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

I didn't know about it. Where is it discussed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1643#issuecomment-50575076
  
QA results for PR 1643:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15567013
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -691,47 +689,81 @@ class DAGScheduler(
 }
   }
 
-
   /** Called when stage's parents are available and we can now do its 
task. */
   private def submitMissingTasks(stage: Stage, jobId: Int) {
 logDebug("submitMissingTasks(" + stage + ")")
 // Get our pending tasks and remember them in our pendingTasks entry
 stage.pendingTasks.clear()
 var tasks = ArrayBuffer[Task[_]]()
+
+val properties = if (jobIdToActiveJob.contains(jobId)) {
+  jobIdToActiveJob(stage.jobId).properties
+} else {
+  // this stage will be assigned to "default" pool
+  null
+}
+
+runningStages += stage
+// SparkListenerStageSubmitted should be posted before testing whether 
tasks are
+// serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
+// will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
+// event.
+listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
+
+// TODO: Maybe we can keep the taskBinary in Stage to avoid 
serializing it multiple times.
+// Broadcasted binary for the task, used to dispatch tasks to 
executors. Note that we broadcast
+// the serialized copy of the RDD and for each task we will 
deserialize it, which means each
+// task gets a different copy of the RDD. This provides stronger 
isolation between tasks that
+// might modify state of objects referenced in their closures. This is 
necessary in Hadoop
+// where the JobConf/Configuration object is not thread-safe.
+var taskBinary: Broadcast[Array[Byte]] = null
+try {
+  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
+  // For ResultTask, serialize and broadcast (rdd, func).
+  val taskBinaryBytes: Array[Byte] =
+if (stage.isShuffleMap) {
+  Utils.serializeTaskClosure((stage.rdd, stage.shuffleDep.get) : 
AnyRef)
+} else {
+  Utils.serializeTaskClosure((stage.rdd, 
stage.resultOfJob.get.func) : AnyRef)
+}
+  taskBinary = sc.broadcast(taskBinaryBytes)
+} catch {
+  // In the case of a failure during serialization, abort the stage.
+  case e: NotSerializableException =>
+abortStage(stage, "Task not serializable: " + e.toString)
+runningStages -= stage
+return
+  case NonFatal(e) =>
+abortStage(stage, s"Task serialization failed: 
$e\n${e.getStackTraceString}")
+runningStages -= stage
+return
+}
+
 if (stage.isShuffleMap) {
   for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) 
{
 val locs = getPreferredLocs(stage.rdd, p)
-tasks += new ShuffleMapTask(stage.id, stage.rdd, 
stage.shuffleDep.get, p, locs)
+val part = stage.rdd.partitions(p)
+tasks += new ShuffleMapTask(stage.id, taskBinary, part, locs)
   }
 } else {
   // This is a final stage; figure out its job's missing partitions
   val job = stage.resultOfJob.get
   for (id <- 0 until job.numPartitions if !job.finished(id)) {
-val partition = job.partitions(id)
-val locs = getPreferredLocs(stage.rdd, partition)
-tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, 
locs, id)
+val p: Int = job.partitions(id)
+val part = stage.rdd.partitions(p)
+val locs = getPreferredLocs(stage.rdd, p)
+tasks += new ResultTask(stage.id, taskBinary, part, locs, id)
   }
 }
 
-val properties = if (jobIdToActiveJob.contains(jobId)) {
-  jobIdToActiveJob(stage.jobId).properties
-} else {
-  // this stage will be assigned to "default" pool
-  null
-}
-
 if (tasks.size > 0) {
-  runningStages += stage
-  // SparkListenerStageSubmitted should be posted before testing 
whether tasks are
-  // serializable. If tasks are not serializable, a 
SparkListenerStageCompleted event
-  // will be posted, which should always come after a corresponding 
SparkListenerStageSubmitted
-  // event.
-  listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
-
   // Preemptively serialize a task to make sure it can be serialized. 
We are catching this
   // exception here because it would be fairly hard to catch the 
non-serializable exception
   // down the road, where we have several different implementations 
for local sch

[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support

2014-07-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1649#issuecomment-50575027
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1493#discussion_r15567017
  
--- Diff: python/pyspark/mllib/classification.py ---
@@ -63,7 +63,10 @@ class LogisticRegressionModel(LinearModel):
 def predict(self, x):
 _linear_predictor_typecheck(x, self._coeff)
 margin = _dot(x, self._coeff) + self._intercept
-prob = 1/(1 + exp(-margin))
+if margin > 0:
+prob = 1 / (1 + exp(-margin))
+else:
+prob = 1 - 1 / (1 + exp(margin))
--- End diff --

Yes, that is definitely better. Could you submit a PR? We don't need a JIRA 
for small changes. Btw, please cache `exp(margin)` instead of computing it 
twice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...

2014-07-29 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/1650

[SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark dropCached

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1650.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1650


commit c3f535d751c177d2f608e2299b8543d3c72dae5f
Author: Michael Armbrust 
Date:   2014-07-30T05:25:19Z

Remove tables from cache when DROP TABLE is run.t p




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15566990
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -17,134 +17,55 @@
 
 package org.apache.spark.scheduler
 
-import scala.language.existentials
+import java.nio.ByteBuffer
 
 import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
-
-import scala.collection.mutable.HashMap
 
 import org.apache.spark._
-import org.apache.spark.rdd.{RDD, RDDCheckpointData}
-
-private[spark] object ResultTask {
-
-  // A simple map between the stage id to the serialized byte array of a 
task.
-  // Served as a cache for task serialization because serialization can be
-  // expensive on the master node if it needs to launch thousands of tasks.
-  private val serializedInfoCache = new HashMap[Int, Array[Byte]]
-
-  def serializeInfo(stageId: Int, rdd: RDD[_], func: (TaskContext, 
Iterator[_]) => _): Array[Byte] =
-  {
-synchronized {
-  val old = serializedInfoCache.get(stageId).orNull
-  if (old != null) {
-old
-  } else {
-val out = new ByteArrayOutputStream
-val ser = SparkEnv.get.closureSerializer.newInstance()
-val objOut = ser.serializeStream(new GZIPOutputStream(out))
-objOut.writeObject(rdd)
-objOut.writeObject(func)
-objOut.close()
-val bytes = out.toByteArray
-serializedInfoCache.put(stageId, bytes)
-bytes
-  }
-}
-  }
-
-  def deserializeInfo(stageId: Int, bytes: Array[Byte]): (RDD[_], 
(TaskContext, Iterator[_]) => _) =
-  {
-val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
-val ser = SparkEnv.get.closureSerializer.newInstance()
-val objIn = ser.deserializeStream(in)
-val rdd = objIn.readObject().asInstanceOf[RDD[_]]
-val func = objIn.readObject().asInstanceOf[(TaskContext, Iterator[_]) 
=> _]
-(rdd, func)
-  }
-
-  def removeStage(stageId: Int) {
-serializedInfoCache.remove(stageId)
-  }
-
-  def clearCache() {
-synchronized {
-  serializedInfoCache.clear()
-}
-  }
-}
-
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD
 
 /**
  * A task that sends back the output to the driver application.
  *
- * See [[org.apache.spark.scheduler.Task]] for more information.
+ * See [[Task]] for more information.
  *
  * @param stageId id of the stage this task belongs to
- * @param rdd input to func
- * @param func a function to apply on a partition of the RDD
- * @param _partitionId index of the number in the RDD
+ * @param taskBinary broadcasted version of the serialized RDD and the 
function to apply on each
+ *   partition of the given RDD.
--- End diff --

Maybe give its type here rather than below to have it all in one place. 
Same for ShuffleMapTask


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-29 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1498#discussion_r15566973
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@
 
 package org.apache.spark.scheduler
 
-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer
 
-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

Didn't we decide to put these language features at the very top of the 
import list?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50574926
  
Ok, I will try it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-50574875
  
QA tests have started for PR 1338. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support

2014-07-29 Thread avati
GitHub user avati opened a pull request:

https://github.com/apache/spark/pull/1649

[SPARK-1812] [WIP] Scala 2.11 support



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/avati/spark scala-2.11

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1649.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1649


commit f8b5e96fca20c13308cb2a9a6c18049bcdd0a7ba
Author: Anand Avati 
Date:   2014-07-30T05:12:30Z

SPARK-1812: use consistent protobuf version across build

Moving to akka-2.3 reveals issues where differing protobuf
versions within the same build results in runtime incompatibility
and exceptions like:

  java.lang.VerifyError: class akka.remote.WireFormats$AkkaControlMessage 
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)

Signed-off-by: Anand Avati 

commit c6e0c6ea3b0f5b3316f01f1a743f4de0a6cdf816
Author: Anand Avati 
Date:   2014-07-26T07:48:49Z

SPARK-1812: sql/catalyst - upgrade to scala-logging-slf4j 2.1.2

Signed-off-by: Anand Avati 

commit 722aee26399b9bf4b725d17f5cfcfad99464af35
Author: Anand Avati 
Date:   2014-07-27T02:20:34Z

SPARK-1812: core - upgrade to akka 2.3.4

Signed-off-by: Anand Avati 

commit 0d3e37df46c6ce03fe94e10ec26d5bcb96ee86c5
Author: Anand Avati 
Date:   2014-07-27T03:39:15Z

SPARK-1812: core - upgrade to json4s 3.2.10

Signed-off-by: Anand Avati 

commit 1d2600d41caf6bc47977a5aee9ded7a7aaf85818
Author: Anand Avati 
Date:   2014-07-27T03:43:27Z

SPARK-1812: core - upgrade to unofficial chill 1.2 (tv.cntt)

Signed-off-by: Anand Avati 

commit 0d44a59234f94246dd414ef136647fc4483a9180
Author: Anand Avati 
Date:   2014-07-26T00:19:17Z

SPARK-1812: core - Fix overloaded methods with default arguments

Signed-off-by: Anand Avati 

commit 96f449416a13cba033ff17806a9c56e7aea2692f
Author: Anand Avati 
Date:   2014-07-26T00:21:39Z

SPARK-1812: Fix @transient annotation errors

- convert to @transient val when unsure
- remove @transient annotation elsewhere

Signed-off-by: Anand Avati 

commit d39f8678fd4b2aff96d7d6fe15b2878ef085ce0b
Author: Anand Avati 
Date:   2014-07-26T04:06:48Z

SPARK-1812: mllib - upgrade to breeze 0.8.1

Signed-off-by: Anand Avati 

commit d7cf8c8c0476d3c6e7adcea434bb87a2aa665bcf
Author: Anand Avati 
Date:   2014-07-26T04:16:52Z

SPARK-1812: streaming - Fix overloaded methods with default arguments

Signed-off-by: Anand Avati 

commit caef846be53f83498b0603e569404389c792c0b1
Author: Anand Avati 
Date:   2014-07-26T00:20:03Z

SPARK-1812: core - [FIXWARNING] try { } without catch { } warning

Signed-off-by: Anand Avati 

commit 0f68c91d6001f3db516a2e9d6d936d65122bf997
Author: Anand Avati 
Date:   2014-07-26T08:02:49Z

SPARK-1812: temporarily disable

- remove spark-repl from assembly dependency
- temporarily disable external/kafka module (dependency unavailable)
- examples module
  - kafka_2.11 (transitive)
  - algebird_2.11 (com.twitter)

Signed-off-by: Anand Avati 

commit c4f92b6f9bb798dafae69ad8432292be8c3a58d4
Author: Anand Avati 
Date:   2014-07-26T08:15:22Z

SPARK-1812: move to akka-zeromq-2.11.0-M3 version 2.2.0

.. until akka-zeromq-2.11 gets published

Signed-off-by: Anand Avati 

commit 7f4d34bf499d2d238925ca777c6f6a78a31904af
Author: Anand Avati 
Date:   2014-07-26T23:35:51Z

SPARK-1812: core - [FIXWARNING] replace scala.concurrent.{future,promise} 
with {Future,Promise}

Signed-off-by: Anand Avati 

commit c50ff00c97d3de41576a7d61ee4d214cafc9b089
Author: Anand Avati 
Date:   2014-07-26T08:01:32Z

SPARK-1812: update all artifactId to spark-${module}_2.11

Signed-off-by: Anand Avati 

commit 8e621813d77266a51a09d4a217352d61b7e1861e
Author: Anand Avati 
Date:   2014-07-27T04:33:42Z

SPARK-1812: sql/core - [FIXWARNING] replace scala.concurrent.future with 
Future

Signed-off-by: Anand Avati 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled bu

[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-29 Thread miccagiann
Github user miccagiann commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15566894
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4
 
+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+val L2 : Int = 0
+val L1 : Int = 1
+val NONE : Int = 2
+  }
--- End diff --

Ok! I will do it with strings both in python and in scala. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1639#issuecomment-50574640
  
QA tests have started for PR 1639. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17418/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50574626
  
LGTM. Waiting for Jenkins. Btw, @witgo if you have a big dataset to test, 
could you try to set the storage level of ratings and user/product in/out links 
to `MEMORY_AND_DISK_SER` and enable `spark.rdd.compress`. It will save a lot of 
memory with a little overhead on the speed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-29 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1562#discussion_r15566850
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -103,26 +107,49 @@ class RangePartitioner[K : Ordering : ClassTag, V](
 private var ascending: Boolean = true)
--- End diff --

It'd be great to update the documentation on when this results in two 
passes vs one pass. We should probably update the documentation for sortByKey 
and various other sorts that use this too. Let's do that in another PR.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1562#issuecomment-50574526
  
LGTM. Merging in master.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15566835
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4
 
+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+val L2 : Int = 0
+val L1 : Int = 1
+val NONE : Int = 2
+  }
--- End diff --

Using strings with a clear doc should be sufficient. Then you can map the 
string to `L1Updater` or `SquaredUpdater` inside `PythonMLLibAPI`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1628#issuecomment-50574345
  
@dorx I tried `import pyspark.mllib.random` and it failed. It has to be 
`from pyspark.mllib import random`. And to use `RandomRDDGenerators`, I need to 
call `random.RandomRDDGenerators`. Ideally, it should be `from 
pyspark.mllib.random import RandomRDDGenerators`. If we know how to handle the 
name `random` now, maybe we can create `random.py` under `mllib` and define 
class `RandomRDDGenerators` there. If it is not easy to do that because of 
python's own `random` package, it should be fine to rename the package name to 
`rand` in both Python and Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/929#discussion_r15566754
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
   rank, lambda, alpha, YtY)
 previousProducts.unpersist()
 logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
+if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

It's a good idea. Done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-29 Thread miccagiann
Github user miccagiann commented on the pull request:

https://github.com/apache/spark/pull/1624#issuecomment-50574026
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-29 Thread miccagiann
Github user miccagiann commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15566690
  
--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,35 @@ class 
LinearRegressionModel(LinearRegressionModelBase):
 True
 """
 
+class RegularizerType(object):
+L2 = 0
+L1 = 1
+NONE = 2
--- End diff --

The same enumeration is provided and through the `regression.py` file 
purely in python so as users to use it directly and avoid providing integer 
values to parameter `regType` in function `train` of `LinearRegressionWithSGD`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...

2014-07-29 Thread miccagiann
Github user miccagiann commented on a diff in the pull request:

https://github.com/apache/spark/pull/1624#discussion_r15566654
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4
 
+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+val L2 : Int = 0
+val L1 : Int = 1
+val NONE : Int = 2
+  }
--- End diff --

I used a type of Enumeration in order to separate between the different 
types of Update Methods [Regularizers] with which the user wants to create the 
model from training data. I tried to extend this object from Enumeration but 
from what I have seen it uses reflection heavily and it does not work well with 
serialized objects and with py4j...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Example pyspark-inputformat for Avro file form...

2014-07-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1536#issuecomment-50573759
  
Great!  If you don't plan to work on this anytime soon, could you close 
this PR?  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/929#discussion_r15566629
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
   rank, lambda, alpha, YtY)
 previousProducts.unpersist()
 logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
+if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

Do we need to checkpoint the first RDD? If `iter` starts at `1`, we can use 
`iter % 3 == 0` and hence the checkpoint RDDs are `product-3`, `product-6`, 
`product-9`, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread witgo
Github user witgo commented on a diff in the pull request:

https://github.com/apache/spark/pull/929#discussion_r15566577
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
   rank, lambda, alpha, YtY)
 previousProducts.unpersist()
 logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, 
iterations))
+if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

`iter` from 1 to `iterations` . The checkpoint RDD is `products-1`, 
`products-4`,`products-7`,`products-10`  ...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1645#issuecomment-50573567
  
QA tests have started for PR 1645. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17416/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...

2014-07-29 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1645#issuecomment-50573389
  
Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...

2014-07-29 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-50573353
  
@mengxr  The test pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2729] [SQL] Forgot to match Timestamp t...

2014-07-29 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1636#issuecomment-50573263
  
Will you have time to add a test case before Friday (the merge deadline for 
1.1) or should I?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-07-29 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1648

[SPARK-2585] Remove special handling of Hadoop JobConf.

This is based on #1498. Diff here: 
https://github.com/rxin/spark/commit/5904cb6649b03d48e3465ccab664b506cc69327b

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark jobconf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1648.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1648


commit cae0af33b535a7772fd2861851dca056e0c2186c
Author: Reynold Xin 
Date:   2014-07-19T06:52:47Z

[SPARK-2521] Broadcast RDD object (instead of sending it along with every 
task).

Currently (as of Spark 1.0.1), Spark sends RDD object (which contains 
closures) using Akka along with the task itself to the executors. This is 
inefficient because all tasks in the same stage use the same RDD object, but we 
have to send RDD object multiple times to the executors. This is especially bad 
when a closure references some variable that is very large. The current design 
led to users having to explicitly broadcast large variables.

The patch uses broadcast to send RDD objects and the closures to executors, 
and use Akka to only send a reference to the broadcast RDD/closure along with 
the partition specific information for the task. For those of you who know more 
about the internals, Spark already relies on broadcast to send the Hadoop 
JobConf every time it uses the Hadoop input, because the JobConf is large.

The user-facing impact of the change include:

1. Users won't need to decide what to broadcast anymore, unless they would 
want to use a large object multiple times in different operations
2. Task size will get smaller, resulting in faster scheduling and higher 
task dispatch throughput.

In addition, the change will simplify some internals of Spark, eliminating 
the need to maintain task caches and the complex logic to broadcast JobConf 
(which also led to a deadlock recently).

A simple way to test this:
```scala
val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x 
}.count
```

Numbers on 3 r3.8xlarge instances on EC2
```
master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
```

Author: Reynold Xin 

Closes #1452 from rxin/broadcast-task and squashes the following commits:

762e0be [Reynold Xin] Warn large broadcasts.
ade6eac [Reynold Xin] Log broadcast size.
c3b6f11 [Reynold Xin] Added a unit test for clean up.
754085f [Reynold Xin] Explain why broadcasting serialized copy of the task.
04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet 
(instead of sending it for every task).

(cherry picked from commit 7b8cd175254d42c8e82f0aa8eb4b7f3508d8fde2)
Signed-off-by: Reynold Xin 

commit d256b456b8450706ecacd98033c3f4d40b37814c
Author: Reynold Xin 
Date:   2014-07-20T07:00:12Z

Fixed unit test failures. One more to go.

commit cc152fcd14bb13104f391da0fb703a1d2203e3a6
Author: Reynold Xin 
Date:   2014-07-21T01:48:18Z

Don't cache the RDD broadcast variable.

commit de779f8704a7f586190dc0e25642836e06136cbb
Author: Reynold Xin 
Date:   2014-07-21T07:21:13Z

Fix TaskContextSuite.

commit 991c002fc4238308108c07fb40b3400a3d448e2f
Author: Reynold Xin 
Date:   2014-07-23T05:41:53Z

Use HttpBroadcast.

commit cf384501c1874284e8412439466eb6e22d5fe6d6
Author: Reynold Xin 
Date:   2014-07-25T07:10:18Z

Use TorrentBroadcastFactory.

commit bab1d8b601d88b946e3770c611c33d2040472492
Author: Reynold Xin 
Date:   2014-07-28T22:38:57Z

Check for NotSerializableException in submitMissingTasks.

commit 797c247ba12dd3eeaa5dda621b9db5a8419732f0
Author: Reynold Xin 
Date:   2014-07-29T01:20:13Z

Properly send SparkListenerStageSubmitted and SparkListenerStageCompleted.

commit 111007d719e9c23e5baf4fc3dc374d01115c0e1f
Author: Reynold Xin 
Date:   2014-07-29T01:29:33Z

Fix broadcast tests.

commit 252238da16fe3a3dfd4a067ca4a9ac47d4fac025
Author: Reynold Xin 
Date:   2014-07-29T04:31:20Z

Serialize the final task closure as well as ShuffleDependency in taskBinary.

commit f8535dc959b6d3b733fd46adbfa07708557a1d05
Author: Reynold Xin 
Date:   2014-07-29T04:35:23Z

Fixed the style violation.

commit 5904cb6649b03d48e3465ccab664b506cc69327b
Author: Reynold Xin 
Date:   2014-07-30T04:39:14Z

[SPARK-2585] Remove special handling of Hadoop JobConf.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes 

[GitHub] spark pull request: [SQL] Handle null values in debug()

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1646#issuecomment-50572929
  
QA results for PR 1646:- This patch PASSES unit tests.- This patch 
merges cleanly- This patch adds no public classesFor more 
information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17409/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1643#issuecomment-50572879
  
QA tests have started for PR 1643. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2743][SQL] Resolve original attributes ...

2014-07-29 Thread marmbrus
GitHub user marmbrus opened a pull request:

https://github.com/apache/spark/pull/1647

[SPARK-2743][SQL] Resolve original attributes in ParquetTableScan



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/marmbrus/spark parquetCase

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1647.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1647


commit 539a2e1f6c94782d916e7eac12ed1614f0ebfc35
Author: Michael Armbrust 
Date:   2014-07-30T04:37:23Z

Resolve original attributes in ParquetTableScan




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50572425
  
QA results for PR 1290:- This patch FAILED unit tests.- This patch 
merges cleanly- This patch adds the following public classes 
(experimental):abstract class GeneralizedSteepestDescendModel(val weights: 
Vector )trait ANN {class LeastSquaresGradientANN(class ANNUpdater 
extends Updater {class ParallelANN (For more information see test 
ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50572266
  
QA tests have started for PR 1290. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50572128
  
@bgreeven Jenkins will be automatically triggered for future updates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50572111
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-07-29 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-50572101
  
Jenkins, add to whitelist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15566077
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * :: DeveloperApi ::
+ * StreamingRegression implements methods for training
+ * a linear regression model on streaming data, and using it
+ * for prediction on streaming data.
+ *
+ * This class takes as type parameters a GeneralizedLinearModel,
+ * and a GeneralizedLinearAlgorithm, making it easy to extend to construct
+ * streaming versions of arbitrary regression analyses. For example usage,
+ * see StreamingLinearRegressionWithSGD.
+ *
+ */
+@DeveloperApi
+@Experimental
+abstract class StreamingRegression[
+M <: GeneralizedLinearModel,
+A <: GeneralizedLinearAlgorithm[M]] extends Logging {
+
+  /** The model to be updated and used for prediction. */
+  var model: M
+
+  /** The algorithm to use for updating. */
+  val algorithm: A
+
+  /** Return the latest model. */
+  def latest(): M = {
+model
+  }
+
+  /**
+   * Update the model by training on batches of data from a DStream.
+   * This operation registers a DStream for training the model,
+   * and updates the model based on every subsequent non-empty
+   * batch of data from the stream.
+   *
+   * @param data DStream containing labeled data
+   */
+  def trainOn(data: DStream[LabeledPoint]) {
+data.foreachRDD{
+  rdd =>
+if (rdd.count() > 0) {
+  model = algorithm.run(rdd, model.weights)
+  logInfo("Model updated")
--- End diff --

Maybe we can add more information to it, for example, the current time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15566064
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala
 ---
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, DeveloperApi}
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * :: DeveloperApi ::
+ * StreamingRegression implements methods for training
+ * a linear regression model on streaming data, and using it
+ * for prediction on streaming data.
+ *
+ * This class takes as type parameters a GeneralizedLinearModel,
+ * and a GeneralizedLinearAlgorithm, making it easy to extend to construct
+ * streaming versions of arbitrary regression analyses. For example usage,
+ * see StreamingLinearRegressionWithSGD.
+ *
+ */
+@DeveloperApi
+@Experimental
+abstract class StreamingRegression[
+M <: GeneralizedLinearModel,
+A <: GeneralizedLinearAlgorithm[M]] extends Logging {
+
+  /** The model to be updated and used for prediction. */
+  var model: M
+
+  /** The algorithm to use for updating. */
+  val algorithm: A
+
+  /** Return the latest model. */
+  def latest(): M = {
+model
+  }
+
+  /**
+   * Update the model by training on batches of data from a DStream.
+   * This operation registers a DStream for training the model,
+   * and updates the model based on every subsequent non-empty
+   * batch of data from the stream.
+   *
+   * @param data DStream containing labeled data
+   */
+  def trainOn(data: DStream[LabeledPoint]) {
+data.foreachRDD{
+  rdd =>
--- End diff --

Spark's code style prefers the following:

~~~
data.foreachRDD { rdd =>
  ...
~~~


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15566050
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegression.scala
 ---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.annotation.Experimental
+
+/**
+ * Train or predict a linear regression model on streaming data. Training 
uses
+ * Stochastic Gradient Descent to update the model based on each new batch 
of
+ * incoming data from a DStream (see LinearRegressionWithSGD for model 
equation)
+ *
+ * Each batch of data is assumed to be an RDD of LabeledPoints.
+ * The number of data points per batch can vary, but the number
+ * of features must be constant.
+ */
+@Experimental
+class StreamingLinearRegressionWithSGD private (
+private var stepSize: Double,
+private var numIterations: Int,
+private var miniBatchFraction: Double,
+private var numFeatures: Int)
--- End diff --

If we ask the users to provide the initial weight, we don't need this 
argument.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15566035
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegression.scala
 ---
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.annotation.Experimental
+
+/**
+ * Train or predict a linear regression model on streaming data. Training 
uses
+ * Stochastic Gradient Descent to update the model based on each new batch 
of
+ * incoming data from a DStream (see LinearRegressionWithSGD for model 
equation)
+ *
+ * Each batch of data is assumed to be an RDD of LabeledPoints.
+ * The number of data points per batch can vary, but the number
+ * of features must be constant.
+ */
+@Experimental
+class StreamingLinearRegressionWithSGD private (
+private var stepSize: Double,
+private var numIterations: Int,
+private var miniBatchFraction: Double,
--- End diff --

For streaming updates, the RDDs are usually small. Maybe it is not 
necessary to use `miniBatchFraction`. But it is fine to keep this option.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15565983
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala
 ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import java.io.File
+
+import com.google.common.io.Files
+import org.apache.commons.io.FileUtils
+import org.scalatest.FunSuite
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
+import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, 
LocalSparkContext}
+
+import scala.collection.mutable.ArrayBuffer
--- End diff --

move scala imports before 3rd party imports. usually the imports are 
organized into 4 groups in the following order:

1. java imports (java.*)
2. scala imports (scala.*)
3. 3rd-party imports
4. spark import (org.apache.spark.*)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]

2014-07-29 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1361#discussion_r15565956
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala
 ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.regression
+
+import java.io.File
+
+import com.google.common.io.Files
+import org.apache.commons.io.FileUtils
+import org.scalatest.FunSuite
--- End diff --

add an empty line to separate 3rd party imports from spark imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2054][SQL] Code Generation for Expressi...

2014-07-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/993


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


  1   2   3   4   5   6   7   >