[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/4055#issuecomment-71458822
  
Last week and this week my work has not been on Spark, so my replies have been slower.





[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3975#issuecomment-71460663
  
  [Test build #26094 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26094/consoleFull)
 for   PR 3975 at commit 
[`62a150c`](https://github.com/apache/spark/commit/62a150c5207a4a2679a068589abff10c24337824).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/4055#issuecomment-71457733
  
@cloud-fan 
According to the current code, it is not easy to change this so that tasks already in pendingTasks are not re-submitted. To be honest, the current DAGScheduler is complicated, yet in some situations it is still simplistic; there is a lot that could be improved.

A re-submit happens when a stage has failed due to a fetch failure. A fetch failure means the currently running TaskSet is dead (marked `zombie`), so only the tasks already scheduled on executors keep running; the remaining tasks in stage.pendingTasks will never be scheduled by the previous TaskSet. What we could do is resubmit only the tasks that were never scheduled.
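
To make the selection step concrete, here is a minimal, self-contained Scala sketch of the idea described above. It is purely illustrative, does not reflect the actual DAGScheduler code, and every name in it is hypothetical:

```
// Illustrative sketch only: after a fetch failure marks the running TaskSet as
// zombie, resubmit only the partitions that were pending but never handed to an
// executor. None of these names come from the real DAGScheduler.
object ResubmitSketch {
  def partitionsToResubmit(pending: Set[Int], alreadyScheduled: Set[Int]): Set[Int] =
    pending -- alreadyScheduled

  def main(args: Array[String]): Unit = {
    val pending = Set(0, 1, 2, 3, 4)  // partitions still in the stage's pending set
    val scheduled = Set(0, 1)         // partitions the zombie TaskSet already launched
    println(partitionsToResubmit(pending, scheduled))  // prints the set {2, 3, 4}
  }
}
```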






[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3975#issuecomment-71470260
  
  [Test build #26094 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26094/consoleFull)
 for   PR 3975 at commit 
[`62a150c`](https://github.com/apache/spark/commit/62a150c5207a4a2679a068589abff10c24337824).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3975#issuecomment-71470269
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26094/
Test PASSed.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/4055#issuecomment-71460355
  
@suyanNone Thanks for the explanation of re-submit!
What's the Chinese name of HarryZhang? We don't use English names in the lab……





[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...

2015-01-26 Thread hellertime
Github user hellertime commented on the pull request:

https://github.com/apache/spark/pull/3074#issuecomment-71454692
  
@andrewor14 your suggestions all look quite reasonable; I'll have a closer 
look at them tonight and make the appropriate changes.

@mateiz adding a Dockerfile will be no trouble. One typically just creates a 
Docker image with the Spark distribution tarball pre-expanded inside. An example 
Dockerfile could either be set up to refer directly to the output of 
`make-distribution.sh`, or to some other pre-existing Spark Docker image. Is 
there an official Spark Docker image I might reference? Looking at Docker Hub, 
there doesn't appear to be one.





[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...

2015-01-26 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/3920#issuecomment-71476341
  
@GenTang ok, thanks for updating the example data. I checked out the PR and 
tested it quickly locally; the new example and data work for me. Looks good.

@davies  +1 to merge.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread yanbohappy
Github user yanbohappy commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71482088
  
https://issues.apache.org/jira/browse/SPARK-5324





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71482309
  
  [Test build #26095 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26095/consoleFull)
 for   PR 4207 at commit 
[`7fb2e81`](https://github.com/apache/spark/commit/7fb2e81d682d8f641275077b94dfb6d7de466de8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71488018
  
  [Test build #26096 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26096/consoleFull)
 for   PR 4207 at commit 
[`34508a3`](https://github.com/apache/spark/commit/34508a3dcd8d3d9b212c15767ac096845e76b44d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71488029
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26096/
Test FAILed.





[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...

2015-01-26 Thread rdhyee
Github user rdhyee commented on the pull request:

https://github.com/apache/spark/pull/3074#issuecomment-71475481
  
I [wrote in November of my interest in this 
PR](https://github.com/apache/spark/pull/3074#issuecomment-64227322). I concur 
with the 
[suggestion](https://github.com/apache/spark/pull/3074#issuecomment-71414280) 
by @mateiz to include a Dockerfile. It'd be great if that Dockerfile could 
serve two purposes: 1) as a basic demonstration of how to run Spark via Docker 
on Mesos, and 2) as the argument to 
[FROM](https://docs.docker.com/reference/builder/#from) in other people's 
Dockerfiles.

I've been experimenting with a Dockerfile to combine Spark with 
[ipython/scipyserver](https://github.com/ipython/docker-notebook/blob/master/scipyserver/Dockerfile)
 functionality to enable me to run pyspark in the [IPython 
Notebook](http://ipython.org/notebook.html) interface:  
[rdhyee/ipython-spark](https://github.com/rdhyee/ipython-spark/blob/master/Dockerfile).
  I started off with the [suggested 
Dockerfile](https://issues.apache.org/jira/browse/SPARK-2691?focusedCommentId=14194679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14194679)
 by @hellertime.  

I've not had a chance to test this PR myself, primarily because I'm still 
learning how to get a stable, standard Mesos cluster to work with a stable 
release of Spark. But I'm definitely keen to try out the functionality of this 
PR when it gets merged into master -- or even sooner if I figure stuff out.





[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4208#issuecomment-71492577
  
  [Test build #26097 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26097/consoleFull)
 for   PR 4208 at commit 
[`da14ced`](https://github.com/apache/spark/commit/da14ced75e1413e8c107b58388f99a3cbcae1acd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71482915
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26095/
Test FAILed.





[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71486036
  
  [Test build #26096 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26096/consoleFull)
 for   PR 4207 at commit 
[`34508a3`](https://github.com/apache/spark/commit/34508a3dcd8d3d9b212c15767ac096845e76b44d).
 * This patch merges cleanly.





[GitHub] spark pull request: Implement Describe Table for SQLContext

2015-01-26 Thread yanbohappy
GitHub user yanbohappy opened a pull request:

https://github.com/apache/spark/pull/4207

Implement Describe Table for SQLContext

Initial code snippet for the DESCRIBE TABLE command.
#1 SQL parser and logical plan generation: add a DESCRIBE [FORMATTED] 
[db_name.]table_name command parser to SparkSQLParser and generate the same 
logical plan for both SQLContext and HiveContext.
(Note: HiveContext also supports DESCRIBE [FORMATTED] [db_name.]table_name 
PARTITION partition_column_name, and DESCRIBE [FORMATTED] [db_name.]table_name 
column_name is implemented by a Hive native command.)
#2 Implement DescribeCommand, which is invoked via RunnableCommand.
#3 For SQLContext the code is cleanly structured, but for HiveContext the 
output of the describe command needs to stay the same as Hive's. So for 
HiveContext we still translate the logical command into DescribeHiveTableCommand, 
which is already implemented in HiveStrategies.HiveCommandStrategy.
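
As a quick illustration (a hedged sketch only; the table name below is hypothetical and not part of this patch), the command could be exercised from a SQLContext like this:

```
// Hypothetical usage sketch, assuming `sc` is an existing SparkContext and a
// table named "people" has already been registered as a temporary table.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("DESCRIBE people").collect().foreach(println)
// Each returned Row is expected to describe one column of the table.
```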

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanbohappy/spark spark-5324

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4207.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4207


commit 7fb2e81d682d8f641275077b94dfb6d7de466de8
Author: Yanbo Liang yanboha...@gmail.com
Date:   2015-01-26T15:29:10Z

Implement Describe Table for SQLContext







[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4207#issuecomment-71482912
  
  [Test build #26095 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26095/consoleFull)
 for   PR 4207 at commit 
[`7fb2e81`](https://github.com/apache/spark/commit/7fb2e81d682d8f641275077b94dfb6d7de466de8).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...

2015-01-26 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/3074#issuecomment-71487679
  
Yeah, so IMO just provide a basic Dockerfile that lets people put in their 
compiled build of Spark, and we can go from there. They should also have the 
option of specifying their own image, as they do now.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-26 Thread kennethmyers
Github user kennethmyers commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-71515443
  
Hello,

I made a fork of the main Spark repo and pulled this pull request into it. 
I was able to build it successfully, but I get the following error when calling 
KafkaUtils.createStream( ... )

The line: 
```
kvs = KafkaUtils.createStream(streaming_context, ZOOKEEPER, GROUP, {TOPIC: 
1})
```
...gives the following error message:

```
No kafka package, please put the assembly jar into classpath:
 $ bin/submit --driver-class-path 
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
```

When I try to run that submit command with the specified parameters, I get:

```
$ bin/spark-submit --driver-class-path 
external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.3.0-SNAPSHOT.jar

Error: Must specify a primary resource (JAR or Python file)
Run with --help for usage help or --verbose for debug output
Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/myerske/spark/external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.3.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/myerske/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
```

I've verified that the JAR file is there.

Has anybody had the same issue, or can anyone offer any help?

Thanks





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555909
  
--- Diff: graphx/pom.xml ---
@@ -50,6 +50,12 @@
   <artifactId>scalacheck_${scala.binary.version}</artifactId>
   <scope>test</scope>
 </dependency>
+<dependency>
+  <groupId>junit</groupId>
+  <artifactId>junit</artifactId>
+  <version>4.11</version>
--- End diff --

I don't think this version should be specified here; it should be inherited 
from the parent POM or set via a configuration or property.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555962
  
--- Diff: bin/spark-class ---
@@ -170,6 +170,9 @@ export CLASSPATH
 # the driver JVM itself. Instead of handling this complexity in Bash, we 
launch a separate JVM
 # to prepare the launch environment of this driver JVM.
 
+export JAVA_OPTS+=" -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
--- End diff --

Yes, deleted.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555971
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.java.function.{Function => JFunction}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+/**
+ * EdgeRDD['ED'] is a column-oriented edge partition RDD created from 
RDD[Edge[ED]].
+ * JavaEdgeRDD class provides a Java API to access implementations of the 
EdgeRDD class
+ *
+ * @param targetStorageLevel
+ * @tparam ED
+ */
+class JavaEdgeRDD[ED](
--- End diff --

Should this class be public?





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555947
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.java
+
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDDLike
+import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3}
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{Logging, Partition, TaskContext}
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
--- End diff --

Got it. I will change this.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555998
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.java.function.{Function => JFunction}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+/**
+ * EdgeRDD['ED'] is a column-oriented edge partition RDD created from 
RDD[Edge[ED]].
+ * JavaEdgeRDD class provides a Java API to access implementations of the 
EdgeRDD class
+ *
+ * @param targetStorageLevel
+ * @tparam ED
+ */
+class JavaEdgeRDD[ED](
+val edges: RDD[Edge[ED]],
+val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+(implicit val classTag: ClassTag[ED])
+  extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, 
VertexId, ED)]] {
+
+//  /**
+//   * To create JavaEdgeRDD from JavaRDDs of tuples
+//   * (source vertex id, destination vertex id and edge property class).
+//   * The edge property class can be Array[Byte]
+//   * @param jEdges
+//   */
+//  def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = {
+//this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3)))
+//  }
+
+  /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] 
*/
+  override def edgeRDD = EdgeRDD.fromEdges(edges)
+
+  /**
+   * Java Wrapper for RDD of Edges
+   *
+   * @param edgeRDD
--- End diff --

If you're not going to document parameters, just omit them.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557180
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.java.function.{Function => JFunction}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+/**
+ * EdgeRDD['ED'] is a column-oriented edge partition RDD created from 
RDD[Edge[ED]].
+ * JavaEdgeRDD class provides a Java API to access implementations of the 
EdgeRDD class
+ *
+ * @param targetStorageLevel
+ * @tparam ED
+ */
+class JavaEdgeRDD[ED](
+val edges: RDD[Edge[ED]],
+val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+(implicit val classTag: ClassTag[ED])
+  extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, 
VertexId, ED)]] {
+
+//  /**
+//   * To create JavaEdgeRDD from JavaRDDs of tuples
+//   * (source vertex id, destination vertex id and edge property class).
+//   * The edge property class can be Array[Byte]
+//   * @param jEdges
+//   */
+//  def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = {
+//this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3)))
+//  }
+
+  /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] 
*/
+  override def edgeRDD = EdgeRDD.fromEdges(edges)
+
+  /**
+   * Java Wrapper for RDD of Edges
+   *
+   * @param edgeRDD
+   * @return
+   */
+  def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = {
+JavaRDD.fromRDD(edgeRDD)
+  }
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def cache(): this.type = {
+edges.cache()
+this
+  }
+
+  def collect(): JList[Edge[ED]] = {
+import scala.collection.JavaConversions._
+val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq
+new java.util.ArrayList(arr)
+  }
+
+  /**
+   * Return a new single long element generated by counting all elements 
in the vertex RDD
+   */
+  override def count(): JLong = edges.count()
+
+  /** Return a new VertexRDD containing only the elements that satisfy a 
predicate. */
+  def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] =
+JavaEdgeRDD(edgeRDD.filter(x => f.call(x).booleanValue()))
+
+  def id: JLong = edges.id.toLong
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def persist(): this.type = {
+edges.persist()
+this
+  }
+
+  /** Persist the RDDs of this EdgeRDD with the given storage level */
+  def persist(storageLevel: StorageLevel): this.type = {
+edges.persist(storageLevel)
+this
+  }
+
+  def unpersist(blocking: Boolean = true) : this.type = {
+edgeRDD.unpersist(blocking)
+this
+  }
+
+  override def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): JavaEdgeRDD[ED2] = {
+JavaEdgeRDD(edgeRDD.mapValues(f))
+  }
+
+  override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse)
+
+  def innerJoin[ED2: ClassTag, ED3: ClassTag]
+(other: EdgeRDD[ED2])
+(f: (VertexId, VertexId, ED, ED2) => ED3): JavaEdgeRDD[ED3] = {
+JavaEdgeRDD(edgeRDD.innerJoin(other)(f))
+  }
+
+  def toRDD : RDD[Edge[ED]] = edges
+}
+
+object JavaEdgeRDD {
+
+  implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : 
JavaEdgeRDD[ED] = {
+JavaEdgeRDD(EdgeRDD.fromEdges(edges.rdd))
+  }
+
+  def toEdgeRDD[ED: 

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557129
  
--- Diff: graphx/pom.xml ---
@@ -50,6 +50,12 @@
   <artifactId>scalacheck_${scala.binary.version}</artifactId>
   <scope>test</scope>
 </dependency>
+<dependency>
+  <groupId>junit</groupId>
+  <artifactId>junit</artifactId>
+  <version>4.11</version>
--- End diff --

Removed





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-71523796
  
  [Test build #26099 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26099/consoleFull)
 for   PR 4087 at commit 
[`76e5b0f`](https://github.com/apache/spark/commit/76e5b0f90e370e2cda20e1348bf40ff890f51782).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-71523807
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26099/
Test FAILed.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on the pull request:

https://github.com/apache/spark/pull/4205#issuecomment-71527948
  
I like the idea of extending the Scala APIs for Java instead of having a
separate Java API. It's a better model, and I think we can remove a lot of
code duplication and avoid creating layers of wrapper classes. So, does this mean
that all Java-friendly methods in GraphX will return objects that Java can
work with, e.g. Iterators, JList, and so on?

On Mon, Jan 26, 2015 at 11:58 AM, Josh Rosen notificati...@github.com
wrote:

 I left an initial pass of comments. I haven't really dug into the details
 very much yet, but a couple of high-level comments:

- There's a lot of code duplication in the Python code that creates
the Java RDDs, so it would be nice to see if there's a way to refactor 
the
code to remove this duplication. My concern here is largely around 
future
maintainability, since I'm worried that we'll see the copies of the 
code
diverge when people make changes without being aware of the duplicate
copies.
- I'd like to avoid repeating the Java*Like pattern, since it doesn't
look necessary here and it has caused problems in the past: see
https://issues.scala-lang.org/browse/SI-8905 and
https://issues.apache.org/jira/browse/SPARK-3266.

 Now that we're increasingly seeing Spark libraries being written in one
 JVM language and used from another (e.g. a Spark library written against
 the Java API and called from Scala), it might be nice to try to extend
 GraphX's Scala API to expose Java-friendly methods instead of adding a new
 Java API. This is a major departure from how we've handled Java APIs up
 until now, but it might be a better long-term decision for new code. I
 think @rxin https://github.com/rxin may be able to chime in here with
 more details. GraphX might be a nice context to explore this idea since
 it's a much smaller API than Spark as a whole.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/4205#issuecomment-71526760.
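
To make the idea concrete, here is a hedged Scala sketch (hypothetical names, not actual GraphX code) of a Scala class exposing a Java-friendly method directly, instead of hiding behind a separate Java* wrapper:

```
import java.util.{List => JList}
import scala.collection.JavaConverters._

// Hypothetical example: one class serves both Scala and Java callers.
class VertexIds(private val ids: Seq[Long]) {
  // Scala-friendly accessor
  def toSeq: Seq[Long] = ids
  // Java-friendly accessor returning a java.util.List of boxed longs
  def toJavaList: JList[java.lang.Long] = ids.map(java.lang.Long.valueOf).asJava
}
```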






[GitHub] spark pull request: [SPARK-5019] Update GMM API to use Multivariat...

2015-01-26 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3911#issuecomment-71529072
  
@Lewuathe  Could you please close this PR since it was fixed by 
[https://github.com/apache/spark/pull/4088]?  Thanks!





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-26 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-71531465
  
@kennethmyers fixed, thanks!





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-71531745
  
  [Test build #26106 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26106/consoleFull)
 for   PR 3715 at commit 
[`370ba61`](https://github.com/apache/spark/commit/370ba61571b98e9bdfb6636852d4404687143853).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

2015-01-26 Thread Lewuathe
Github user Lewuathe commented on the pull request:

https://github.com/apache/spark/pull/3975#issuecomment-71455512
  
@mengxr @jkbradley Thank you for the helpful advice. `loadLibSVMFile` 
should not be changed; in this case the problem is caused by a negative label 
reaching `Impurity`. So I updated `Impurity` to throw an exception.
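
For illustration, a minimal sketch of the kind of check described here (hedged; this is not the actual `Impurity` code and the method name is hypothetical):

```
// Hypothetical sketch: fail fast with a clear message when a label is negative,
// instead of letting it surface later as an ArrayIndexOutOfBoundsException.
def validateLabel(label: Double): Unit =
  require(label >= 0.0, s"Classification labels must be non-negative, but got $label")
```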





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-26 Thread suyanNone
Github user suyanNone commented on the pull request:

https://github.com/apache/spark/pull/4055#issuecomment-71458406
  
@cloud-fan btw, do you know HarryZhang?





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23553843
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.java
+
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDDLike
+import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3}
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{Logging, Partition, TaskContext}
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
--- End diff --

I'd remove this trait and fold all of this code directly into the 
`JavaVertexRDD` class.  The `*RDDLike` pattern in the Java API wasn't a great 
design and I'd like to avoid mimicking it in new code.





[GitHub] spark pull request: [MLLIB] SPARK-5362 (4526, 2372) Gradient and O...

2015-01-26 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/4152#issuecomment-71515894
  
@jkbradley +1 for a more generic `Datum` type. There are two links to Scala 
type conversion benchmarks in the answer to 
http://stackoverflow.com/questions/18083696/generic-type-class-instances-performance.
 As far as I understand, it will not add much overhead, especially compared to 
the code that contains the algorithm's implementation.
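
For concreteness, a minimal, self-contained Scala sketch of the type-class style of abstraction being discussed; every name is hypothetical and this is not the MLlib API:

```
object DatumSketch {
  // A type class lets one routine accept different data representations
  // without forcing them to share a base class.
  trait DatumOps[D] {
    def label(d: D): Double
    def features(d: D): Array[Double]
  }

  final case class DensePoint(label: Double, features: Array[Double])

  implicit val densePointOps: DatumOps[DensePoint] = new DatumOps[DensePoint] {
    def label(d: DensePoint): Double = d.label
    def features(d: DensePoint): Array[Double] = d.features
  }

  // Written once against the type class; works for any D with a DatumOps instance.
  def weightedLabel[D](d: D)(implicit ops: DatumOps[D]): Double =
    ops.label(d) * ops.features(d).sum
}
```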





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555791
  
--- Diff: bin/spark-class ---
@@ -170,6 +170,9 @@ export CLASSPATH
 # the driver JVM itself. Instead of handling this complexity in Bash, we 
launch a separate JVM
 # to prepare the launch environment of this driver JVM.
 
+export JAVA_OPTS+=" -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
--- End diff --

It looks like this was just a debugging change?





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555762
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.java
+
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDDLike
+import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3}
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{Logging, Partition, TaskContext}
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
--- End diff --

What do you mean by "handle type bounds"?  In this PR's current code, it 
looks like `JavaVertexRDDLike` is only extended by a single class and isn't 
used as part of any method signatures, return types, etc., so unless I'm 
overlooking something I don't see why it can't be removed.  Inheriting 
implementations from generic traits has bitten us in the past via 
https://issues.scala-lang.org/browse/SI-8905 (see 
https://issues.apache.org/jira/browse/SPARK-3266), so if this trait isn't 
necessary then we shouldn't have it.

The JavaRDDLike traits in the Spark Core Java API are an unfortunate 
holdover from an earlier design and exist primarily for code re-use purposes.  
We can't remove them now because that would break binary compatibility.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556471
  
--- Diff: python/pyspark/__init__.py ---
@@ -46,10 +46,12 @@
 from pyspark.broadcast import Broadcast
 from pyspark.serializers import MarshalSerializer, PickleSerializer
 
-# for back compatibility
-from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row
--- End diff --

I don't think we can remove this without breaking backwards-compatibility.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556426
  
--- Diff: pom.xml ---
@@ -86,9 +86,9 @@
   </mailingLists>
 
   <modules>
+<module>graphx</module>
--- End diff --

Mind reverting these POM changes if they're unnecessary?





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556520
  
--- Diff: python/pyspark/__init__.py ---
@@ -46,10 +46,12 @@
 from pyspark.broadcast import Broadcast
 from pyspark.serializers import MarshalSerializer, PickleSerializer
 
-# for back compatibility
-from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row
+from pyspark.graphx.vertex import VertexRDD
+from pyspark.graphx.edge import EdgeRDD, Edge
+from pyspark.graphx.graph import Graph
 
 __all__ = [
     "SparkConf", "SparkContext", "SparkFiles", "RDD", "StorageLevel", "Broadcast",
     "Accumulator", "AccumulatorParam", "MarshalSerializer", "PickleSerializer",
-]
+    "VertexRDD", "EdgeRDD", "Edge", "Graph"]
--- End diff --

Similarly, I'm not sure that we want to add the GraphX methods to `__all__` 
here since it doesn't include the SQL ones.





[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556818
  
--- Diff: python/pyspark/graphx/vertex.py ---
@@ -0,0 +1,330 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python bindings for VertexRDD in GraphX
+"""
+
+import itertools
+import os
+from tempfile import NamedTemporaryFile
+from numpy.numarray.numerictypes import Long
+
+from py4j.java_collections import MapConverter, ListConverter
+import operator
+from pyspark.accumulators import PStatsParam
+from pyspark.rdd import PipelinedRDD
+from pyspark.serializers import CloudPickleSerializer, NoOpSerializer, 
AutoBatchedSerializer
+from pyspark import RDD, PickleSerializer, StorageLevel, SparkContext
+from pyspark.traceback_utils import SCCallSiteSync
+
+
+__all__ = ["VertexRDD", "VertexId"]
+
+
+"""
+The default type of vertex id is Long
+A separate VertexId type is defined
--- End diff --

Minor formatting nit, but these lines can be longer (up to 100 characters).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23557733
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the License); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an AS IS BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile(...)
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
--- End diff --

Yea, don't pay too much attention to the documentation yet. We will work on 
it after the PR's merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557720
  
--- Diff: python/pyspark/__init__.py ---
@@ -46,10 +46,12 @@
 from pyspark.broadcast import Broadcast
 from pyspark.serializers import MarshalSerializer, PickleSerializer
 
-# for back compatibility
-from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row
--- End diff --

code creep from an older version of this file. fixed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4208#issuecomment-71523042
  
  [Test build #26105 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26105/consoleFull)
 for   PR 4208 at commit 
[`c2182dc`](https://github.com/apache/spark/commit/c2182dc987d63b4860bcb59c504678456bdc6555).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...

2015-01-26 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-71523615
  
@tomerk  Please do say if you think of simplifications!  Are there 
particular pain points/complications which stand out?  E.g., are there 
particular details which developers will have to get right and may mess up?  
That would be helpful to see (from another person's perspective).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71524664
  
  [Test build #26101 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26101/consoleFull)
 for   PR 4048 at commit 
[`9f125ee`](https://github.com/apache/spark/commit/9f125ee80e974d4b2cf56818efb60dfb40f1d564).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4808] Remove Spillable minimum threshol...

2015-01-26 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/3656#issuecomment-71526176
  
Seeing some problems that this PR could address, so reviving this thread.

@lawlerd the configurable count would help because, if the individual objects are known 
to be large, the sampling can be set to happen more frequently. So if sampling every 32 
insertions is too passive, a more aggressive option can be configured, say sampling every 5.
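
Roughly, the idea is something like this (a toy Python sketch with made-up names such as 
`SizeTrackingBuffer` and `sample_interval`; the real Spillable/SizeTracker code is Scala and 
differs in detail):

```
import sys

class SizeTrackingBuffer(object):
    """Toy collection that re-estimates its size every N appends, where N is
    configurable instead of hard-coded."""

    def __init__(self, sample_interval=32):
        self.sample_interval = sample_interval  # e.g. 5 when elements are large
        self.items = []
        self.appends_since_sample = 0
        self.estimated_bytes = 0

    def append(self, item):
        self.items.append(item)
        self.appends_since_sample += 1
        if self.appends_since_sample >= self.sample_interval:
            # Crude stand-in for SizeEstimator: sample the most recent elements
            # and extrapolate to the whole collection.
            sampled = self.items[-10:]
            per_item = sum(sys.getsizeof(x) for x in sampled) / float(len(sampled))
            self.estimated_bytes = int(per_item * len(self.items))
            self.appends_since_sample = 0
```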


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...

2015-01-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2847#issuecomment-71526153
  
> 1 I mean in step 1 (which is equivalent to creating the FPTree and the conditional 
FPTrees) we have already reduced the data size: the conditional FPTree only includes 
frequent items, not the transaction data, so mining frequent itemsets from it works on a 
small candidate set.

The advantage of FP-Growth over Apriori is the tree structure used to represent the 
candidate set. Both algorithms take advantage of the fact that the candidate set is small. 
I'm asking whether the current implementation uses the tree structure to save communication.

> 2 I have tested it and compared it with Mahout PFP; the performance is good, about 10x.

I'm not surprised by the 10x speed-up, but that by itself doesn't mean the current 
implementation is correct and high-performance. I believe that we can be much faster.

> 3 After using groupByKey, the frequent itemsets for each key are mined on the node that 
holds that key, so there is no extra network communication overhead.

`groupByKey` collects everything to reducers. `aggregateByKey` does part of 
the aggregation on mappers. There is definitely space for improvement.
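
For reference, a small PySpark sketch of the difference between the two operators (purely 
illustrative; the FP-Growth implementation under discussion is Scala):

```
from pyspark import SparkContext

sc = SparkContext("local[2]", "agg-vs-group")
pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2), ("a", 5)])

# groupByKey ships every value for a key to the reducer before anything is combined.
grouped = pairs.groupByKey().mapValues(sum).collect()

# aggregateByKey combines map-side first (seqFunc), then merges the partial
# results per key (combFunc), so much less data crosses the network.
aggregated = pairs.aggregateByKey(0,
                                  lambda acc, v: acc + v,
                                  lambda a, b: a + b).collect()

print(sorted(grouped), sorted(aggregated))  # both give [('a', 9), ('b', 2)]
sc.stop()
```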

> 4 Is there an aggregateByKey operator in the new Spark version?


http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2015-01-26 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-71531266
  
@dbtsai I did batching for artificial neural networks and the performance 
improved ~5x https://github.com/apache/spark/pull/1290#issuecomment-70313952


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23555143
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala 
---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.java
+
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDDLike
+import org.apache.spark.api.java.function.{Function => JFunction, 
Function2 => JFunction2, Function3 => JFunction3}
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{Logging, Partition, TaskContext}
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
--- End diff --

Josh, then how will we handle type bounds? Is there a new design for the Java 
API?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...

2015-01-26 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4050#discussion_r23555820
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -218,13 +219,14 @@ class HadoopRDD[K, V](
 
   // Find a function that will return the FileSystem bytes read by 
this thread. Do this before
   // creating RecordReader, because RecordReader's constructor might 
read some bytes
-  val bytesReadCallback = inputMetrics.bytesReadCallback.orElse(
-split.inputSplit.value match {
-  case split: FileSplit =>
-    SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.getPath, jobConf)
-  case _ => None
+  val bytesReadCallback = inputMetrics.bytesReadCallback.orElse {
+val inputSplit = split.inputSplit.value
+if (inputSplit.isInstanceOf[FileSplit] || 
inputSplit.isInstanceOf[CombineFileSplit]) {
--- End diff --

this is fine as is, but fyi you can do the same thing in a pattern match:

```
split.inputSplit.value match {
  case _: FileSplit | _: CombineFileSplit =>
    SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(jobConf)
  case _ => None
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556181
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/python/PythonVertexRDD.scala 
---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.python
+
+import java.io.{DataOutputStream, FileOutputStream}
+import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
+
+import org.apache.spark.Accumulator
+import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
+import org.apache.spark.api.python.{PythonBroadcast, PythonRDD}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.graphx.VertexId
+import org.apache.spark.graphx.api.java.JavaVertexRDD
+import org.apache.spark.storage.StorageLevel
+
+private[graphx] class PythonVertexRDD(
+@transient parent: JavaRDD[_],
+command: Array[Byte],
+envVars: JMap[String, String],
+pythonIncludes: JList[String],
+preservePartitioning: Boolean,
+pythonExec: String,
+broadcastVars: JList[Broadcast[PythonBroadcast]],
+accumulator: Accumulator[JList[Array[Byte]]],
+targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+  extends PythonRDD (parent, command, envVars,
+ pythonIncludes, preservePartitioning,
+ pythonExec, broadcastVars, accumulator) {
+
+  def this(@transient parent: JavaVertexRDD[_],
--- End diff --

Do we need both constructors, or can we just make the first one 
Py4J-friendly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556368
  
--- Diff: graphx/src/test/java/org/apache/spark/graphx/JavaAPISuite.java ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx;
+
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.graphx.api.java.JavaEdgeRDD;
+import org.apache.spark.graphx.api.java.JavaVertexRDD;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import scala.Tuple2;
+import scala.reflect.ClassTag;
+import scala.reflect.ClassTag$;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class JavaAPISuite implements Serializable {
+
+private transient JavaSparkContext ssc;
+private List<Tuple2<Object, VertexProperty<String, String>>> myList;
+private ClassTag<VertexProperty<String, String>> classTag;
+
+
+@Before
+public void initialize() {
--- End diff --

Mind calling these `before` and `after` instead?  I find `finalize` to be 
confusing since it sounds like finalizers.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556761
  
--- Diff: python/pyspark/graphx/partitionstrategy.py ---
@@ -0,0 +1,9 @@
+__author__ = 'kdatta1'
--- End diff --

We don't use author tags, so I'd remove this.

Also, this file and a few others need the Spark license header.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556724
  
--- Diff: python/pyspark/graphx/graph.py ---
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python bindings for Graph[VertexRDD, EdgeRDD] in GraphX
+"""
+
+import itertools
+from pyspark import PickleSerializer, RDD, StorageLevel, SparkContext
+from pyspark.graphx import VertexRDD, EdgeRDD
+
+from pyspark.graphx.partitionstrategy import PartitionStrategy
+from pyspark.rdd import PipelinedRDD
+from pyspark.serializers import BatchedSerializer
+
+__all__ = ["Graph"]
+
+class Graph(object):
+def __init__(self, vertex_jrdd, edge_jrdd,
+ partition_strategy=PartitionStrategy.EdgePartition1D):
+self._vertex_jrdd = VertexRDD(vertex_jrdd, vertex_jrdd.context,
+  
BatchedSerializer(PickleSerializer()))
+self._edge_jrdd = EdgeRDD(edge_jrdd, edge_jrdd.context,
+  BatchedSerializer(PickleSerializer()))
+self._partition_strategy = partition_strategy
+self._jsc = vertex_jrdd.context
+
+def persist(self, storageLevel):
+self._vertex_jrdd.persist(storageLevel)
+self._edge_jrdd.persist(storageLevel)
+return
+
+def cache(self):
+self._vertex_jrdd.cache()
+self._edge_jrdd.cache()
+return
+
+def vertices(self):
+return self._vertex_jrdd
+
+def edges(self):
+return self._edge_jrdd
+
+def numEdges(self):
+return self._edge_jrdd.count()
+
+def numVertices(self):
+return self._vertex_jrdd.count()
+
+# TODO
+def partitionBy(self, partitionStrategy):
+return
+
+# TODO
+def inDegrees(self):
+return
+
+# TODO
+def outDegrees(self):
+return
+
+# TODO
+def degrees(self):
+return
+
+def triplets(self):
+if (isinstance(self._jsc, SparkContext)):
--- End diff --

What happens if this is false?
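
One way the binding could make that case explicit is to fail loudly, e.g. (a hypothetical 
sketch, not what the PR currently does):

```
from pyspark import SparkContext

class Graph(object):
    def __init__(self, jsc):
        self._jsc = jsc

    def triplets(self):
        # Hypothetical guard: fail loudly instead of silently returning None
        # when the wrapped context is not a real SparkContext.
        if not isinstance(self._jsc, SparkContext):
            raise TypeError("Graph expected a SparkContext, got %r" % type(self._jsc))
        # ... build the triplet RDD through the JVM gateway here (omitted) ...
        raise NotImplementedError("sketch only")
```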


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557297
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/python/PythonVertexRDD.scala 
---
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx.api.python
+
+import java.io.{DataOutputStream, FileOutputStream}
+import java.util.{ArrayList => JArrayList, List => JList, Map => JMap}
+
+import org.apache.spark.Accumulator
+import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
+import org.apache.spark.api.python.{PythonBroadcast, PythonRDD}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.graphx.VertexId
+import org.apache.spark.graphx.api.java.JavaVertexRDD
+import org.apache.spark.storage.StorageLevel
+
+private[graphx] class PythonVertexRDD(
+@transient parent: JavaRDD[_],
+command: Array[Byte],
+envVars: JMap[String, String],
+pythonIncludes: JList[String],
+preservePartitioning: Boolean,
+pythonExec: String,
+broadcastVars: JList[Broadcast[PythonBroadcast]],
+accumulator: Accumulator[JList[Array[Byte]]],
+targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+  extends PythonRDD (parent, command, envVars,
+ pythonIncludes, preservePartitioning,
+ pythonExec, broadcastVars, accumulator) {
+
+  def this(@transient parent: JavaVertexRDD[_],
--- End diff --

I'll try to change this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread kdatta
Github user kdatta commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557318
  
--- Diff: graphx/src/test/java/org/apache/spark/graphx/JavaAPISuite.java ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx;
+
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.graphx.api.java.JavaEdgeRDD;
+import org.apache.spark.graphx.api.java.JavaVertexRDD;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import scala.Tuple2;
+import scala.reflect.ClassTag;
+import scala.reflect.ClassTag$;
+
+import java.io.Serializable;
+import java.util.ArrayList;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+
+public class JavaAPISuite implements Serializable {
+
+private transient JavaSparkContext ssc;
+private List<Tuple2<Object, VertexProperty<String, String>>> myList;
+private ClassTag<VertexProperty<String, String>> classTag;
+
+
+@Before
+public void initialize() {
--- End diff --

yes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23557373
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the License); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an AS IS BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile(...)
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
--- End diff --

for this one - do I need to assign it to a new column for the result of the 
expression to be usable?
```
people("birthYear4") = people("birthYear2" + 1900) 
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4205#issuecomment-71526760
  
I left an initial pass of comments.  I haven't really dug into the details 
very much yet, but a couple of high-level comments:

- There's a lot of code duplication in the Python code that creates the 
Java RDDs, so it would be nice to see if there's a way to refactor the code to 
remove this duplication.  My concern here is largely around future 
maintainability, since I'm worried that we'll see the copies of the code 
diverge when people make changes without being aware of the duplicate copies.
- I'd like to avoid repeating the `Java*Like` pattern, since it doesn't 
look necessary here and it has caused problems in the past: see 
https://issues.scala-lang.org/browse/SI-8905 and 
https://issues.apache.org/jira/browse/SPARK-3266.

Now that we're increasingly seeing Spark libraries being written in one JVM 
language and used from another (e.g. a Spark library written against the Java 
API and called from Scala), it might be nice to try to extend GraphX's Scala 
API to expose Java-friendly methods instead of adding a new Java API.  This is 
a major departure from how we've handled Java APIs up until now, but it might 
be a better long-term decision for new code.  I think @rxin may be able to 
chime in here with more details.  GraphX might be a nice context to explore 
this idea since it's a much smaller API than Spark as a whole.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...

2015-01-26 Thread mccheah
Github user mccheah commented on a diff in the pull request:

https://github.com/apache/spark/pull/4155#discussion_r23560945
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import java.util.concurrent.{ExecutorService, TimeUnit, Executors, 
ConcurrentHashMap}
+
+import scala.collection.{Map => ScalaImmutableMap}
+import scala.collection.concurrent.{Map => ScalaConcurrentMap}
+import scala.collection.convert.decorateAsScala._
+
+import akka.actor.{ActorRef, Actor}
+
+import org.apache.spark.{SparkConf, Logging}
+import org.apache.spark.util.{AkkaUtils, ActorLogReceive}
+
+private[spark] sealed trait OutputCommitCoordinationMessage
+
+private[spark] case class StageStarted(stage: Int, partitionIds: Seq[Int])
+  extends OutputCommitCoordinationMessage
+private[spark] case class StageEnded(stage: Int) extends 
OutputCommitCoordinationMessage
+private[spark] case object StopCoordinator extends 
OutputCommitCoordinationMessage
+
+private[spark] case class AskPermissionToCommitOutput(
+stage: Int,
+task: Long,
+partId: Int,
+taskAttempt: Long)
+  extends OutputCommitCoordinationMessage with Serializable
+
+private[spark] case class TaskCompleted(
+stage: Int,
+task: Long,
+partId: Int,
+attempt: Long,
+successful: Boolean)
+  extends OutputCommitCoordinationMessage
+
+/**
+ * Authority that decides whether tasks can commit output to HDFS.
+ *
+ * This lives on the driver, but the actor allows the tasks that commit
+ * to Hadoop to invoke it.
+ */
+private[spark] class OutputCommitCoordinator(conf: SparkConf) extends 
Logging {
+
+  private type StageId = Int
+  private type PartitionId = Int
+  private type TaskId = Long
+  private type TaskAttemptId = Long
+
+  // Wrapper for an int option that allows it to be locked via a 
synchronized block
+  // while still setting option itself to Some(...) or None.
+  private class LockableAttemptId(var value: Option[TaskAttemptId])
+
+  private type CommittersByStageHashMap =
+ConcurrentHashMap[StageId, ScalaImmutableMap[PartitionId, 
LockableAttemptId]]
+
+  // Initialized by SparkEnv
+  private var coordinatorActor: Option[ActorRef] = None
+  private val timeout = AkkaUtils.askTimeout(conf)
+  private val maxAttempts = AkkaUtils.numRetries(conf)
+  private val retryInterval = AkkaUtils.retryWaitMs(conf)
+  private val authorizedCommittersByStage = new 
CommittersByStageHashMap().asScala
+
+  private var executorRequestHandlingThreadPool: Option[ExecutorService] = 
None
+
+  def stageStart(stage: StageId, partitionIds: Seq[Int]): Unit = {
+sendToActor(StageStarted(stage, partitionIds))
+  }
+
+  def stageEnd(stage: StageId): Unit = {
+sendToActor(StageEnded(stage))
+  }
+
+  def canCommit(
+  stage: StageId,
+  task: TaskId,
+  partId: PartitionId,
+  attempt: TaskAttemptId): Boolean = {
+askActor(AskPermissionToCommitOutput(stage, task, partId, attempt))
+  }
+
+  def taskCompleted(
+  stage: StageId,
+  task: TaskId,
+  partId: PartitionId,
+  attempt: TaskAttemptId,
+  successful: Boolean): Unit = {
+sendToActor(TaskCompleted(stage, task, partId, attempt, successful))
+  }
+
+  def stop(): Unit = {
+executorRequestHandlingThreadPool.foreach { pool =>
+  pool.shutdownNow()
+  pool.awaitTermination(10, TimeUnit.SECONDS)
--- End diff --

What timeout should be used here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact 

[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-71530556
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26103/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3916#issuecomment-71530541
  
  [Test build #26103 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26103/consoleFull)
 for   PR 3916 at commit 
[`ad03c48`](https://github.com/apache/spark/commit/ad03c4859b4da74e95909ab966989d2ded8c8f72).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5345][DEPLOY] Fix unstable test case in...

2015-01-26 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4133#discussion_r23555074
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
 ---
@@ -43,7 +43,7 @@ class FsHistoryProviderSuite extends FunSuite with 
BeforeAndAfter with Matchers
 testDir = Utils.createTempDir()
 provider = new FsHistoryProvider(new SparkConf()
   .set("spark.history.fs.logDirectory", testDir.getAbsolutePath())
-  .set("spark.history.fs.updateInterval", "0"))
+  .set("spark.testing", "true"))
--- End diff --

Hmm... as Sean mentioned, this should already be defined 
[here](https://github.com/apache/spark/blob/master/pom.xml#L1127). Can you 
double check that it's really not set, and if not, what's causing it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-5408: Use -XX:MaxPermSize specified by u...

2015-01-26 Thread jacek-lewandowski
Github user jacek-lewandowski commented on the pull request:

https://github.com/apache/spark/pull/4202#issuecomment-71519044
  
I guess it should be overridden, but I couldn't find any specification which 
says that it actually works this way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556050
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.java.function.{Function => JFunction}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+/**
+ * EdgeRDD['ED'] is a column-oriented edge partition RDD created from 
RDD[Edge[ED]].
+ * JavaEdgeRDD class provides a Java API to access implementations of the 
EdgeRDD class
+ *
+ * @param targetStorageLevel
+ * @tparam ED
+ */
+class JavaEdgeRDD[ED](
+val edges: RDD[Edge[ED]],
+val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+(implicit val classTag: ClassTag[ED])
+  extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, 
VertexId, ED)]] {
+
+//  /**
+//   * To create JavaEdgeRDD from JavaRDDs of tuples
+//   * (source vertex id, destination vertex id and edge property class).
+//   * The edge property class can be Array[Byte]
+//   * @param jEdges
+//   */
+//  def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = {
+//this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3)))
+//  }
+
+  /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] 
*/
+  override def edgeRDD = EdgeRDD.fromEdges(edges)
+
+  /**
+   * Java Wrapper for RDD of Edges
+   *
+   * @param edgeRDD
+   * @return
+   */
+  def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = {
+JavaRDD.fromRDD(edgeRDD)
+  }
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def cache(): this.type = {
+edges.cache()
+this
+  }
+
+  def collect(): JList[Edge[ED]] = {
+import scala.collection.JavaConversions._
+val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq
+new java.util.ArrayList(arr)
+  }
+
+  /**
+   * Return a new single long element generated by counting all elements 
in the vertex RDD
+   */
+  override def count(): JLong = edges.count()
+
+  /** Return a new VertexRDD containing only the elements that satisfy a 
predicate. */
+  def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] =
+JavaEdgeRDD(edgeRDD.filter(x => f.call(x).booleanValue()))
+
+  def id: JLong = edges.id.toLong
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def persist(): this.type = {
+edges.persist()
+this
+  }
+
+  /** Persist the RDDs of this EdgeRDD with the given storage level */
+  def persist(storageLevel: StorageLevel): this.type = {
+edges.persist(storageLevel)
+this
+  }
+
+  def unpersist(blocking: Boolean = true) : this.type = {
+edgeRDD.unpersist(blocking)
+this
+  }
+
+  override def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): 
JavaEdgeRDD[ED2] = {
+JavaEdgeRDD(edgeRDD.mapValues(f))
+  }
+
+  override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse)
+
+  def innerJoin[ED2: ClassTag, ED3: ClassTag]
+(other: EdgeRDD[ED2])
+(f: (VertexId, VertexId, ED, ED2) => ED3): JavaEdgeRDD[ED3] = {
+JavaEdgeRDD(edgeRDD.innerJoin(other)(f))
+  }
+
+  def toRDD : RDD[Edge[ED]] = edges
+}
+
+object JavaEdgeRDD {
+
+  implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : 
JavaEdgeRDD[ED] = {
+JavaEdgeRDD(EdgeRDD.fromEdges(edges.rdd))
+  }
+
+  def toEdgeRDD[ED: 

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556071
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List => JList}
+
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.api.java.function.{Function => JFunction}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+/**
+ * EdgeRDD['ED'] is a column-oriented edge partition RDD created from 
RDD[Edge[ED]].
+ * JavaEdgeRDD class provides a Java API to access implementations of the 
EdgeRDD class
+ *
+ * @param targetStorageLevel
+ * @tparam ED
+ */
+class JavaEdgeRDD[ED](
+val edges: RDD[Edge[ED]],
+val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+(implicit val classTag: ClassTag[ED])
+  extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, 
VertexId, ED)]] {
+
+//  /**
+//   * To create JavaEdgeRDD from JavaRDDs of tuples
+//   * (source vertex id, destination vertex id and edge property class).
+//   * The edge property class can be Array[Byte]
+//   * @param jEdges
+//   */
+//  def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = {
+//this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3)))
+//  }
+
+  /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] 
*/
+  override def edgeRDD = EdgeRDD.fromEdges(edges)
+
+  /**
+   * Java Wrapper for RDD of Edges
+   *
+   * @param edgeRDD
+   * @return
+   */
+  def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = {
+JavaRDD.fromRDD(edgeRDD)
+  }
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def cache(): this.type = {
+edges.cache()
+this
+  }
+
+  def collect(): JList[Edge[ED]] = {
+import scala.collection.JavaConversions._
+val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq
+new java.util.ArrayList(arr)
+  }
+
+  /**
+   * Return a new single long element generated by counting all elements 
in the vertex RDD
+   */
+  override def count(): JLong = edges.count()
+
+  /** Return a new VertexRDD containing only the elements that satisfy a 
predicate. */
+  def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] =
+JavaEdgeRDD(edgeRDD.filter(x => f.call(x).booleanValue()))
+
+  def id: JLong = edges.id.toLong
+
+  /** Persist RDDs of this JavaEdgeRDD with the default storage level 
(MEMORY_ONLY_SER) */
+  def persist(): this.type = {
+edges.persist()
+this
+  }
+
+  /** Persist the RDDs of this EdgeRDD with the given storage level */
+  def persist(storageLevel: StorageLevel): this.type = {
+edges.persist(storageLevel)
+this
+  }
+
+  def unpersist(blocking: Boolean = true) : this.type = {
+edgeRDD.unpersist(blocking)
+this
+  }
+
+  override def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): 
JavaEdgeRDD[ED2] = {
+JavaEdgeRDD(edgeRDD.mapValues(f))
+  }
+
+  override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse)
+
+  def innerJoin[ED2: ClassTag, ED3: ClassTag]
+(other: EdgeRDD[ED2])
+(f: (VertexId, VertexId, ED, ED2) => ED3): JavaEdgeRDD[ED3] = {
+JavaEdgeRDD(edgeRDD.innerJoin(other)(f))
+  }
+
+  def toRDD : RDD[Edge[ED]] = edges
+}
+
+object JavaEdgeRDD {
+
+  implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : 
JavaEdgeRDD[ED] = {
--- End diff --

Style nit: no space before the `:`.


---
If your project is set up 

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556688
  
--- Diff: python/pyspark/graphx/graph.py ---
@@ -0,0 +1,169 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python bindings for Graph[VertexRDD, EdgeRDD] in GraphX
+"""
+
+import itertools
+from pyspark import PickleSerializer, RDD, StorageLevel, SparkContext
+from pyspark.graphx import VertexRDD, EdgeRDD
+
+from pyspark.graphx.partitionstrategy import PartitionStrategy
+from pyspark.rdd import PipelinedRDD
+from pyspark.serializers import BatchedSerializer
+
+__all__ = ["Graph"]
+
+class Graph(object):
+def __init__(self, vertex_jrdd, edge_jrdd,
+ partition_strategy=PartitionStrategy.EdgePartition1D):
+self._vertex_jrdd = VertexRDD(vertex_jrdd, vertex_jrdd.context,
+  
BatchedSerializer(PickleSerializer()))
+self._edge_jrdd = EdgeRDD(edge_jrdd, edge_jrdd.context,
+  BatchedSerializer(PickleSerializer()))
+self._partition_strategy = partition_strategy
+self._jsc = vertex_jrdd.context
+
+def persist(self, storageLevel):
+self._vertex_jrdd.persist(storageLevel)
+self._edge_jrdd.persist(storageLevel)
+return
+
+def cache(self):
+self._vertex_jrdd.cache()
+self._edge_jrdd.cache()
+return
+
+def vertices(self):
+return self._vertex_jrdd
+
+def edges(self):
+return self._edge_jrdd
+
+def numEdges(self):
+return self._edge_jrdd.count()
+
+def numVertices(self):
+return self._vertex_jrdd.count()
+
+# TODO
+def partitionBy(self, partitionStrategy):
--- End diff --

For these `TODO`s, if they don't land in the first version then we should 
comment out the methods as well.  I think we should keep the TODOs as a 
reminder of what's missing, but shouldn't have empty methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557218
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDDLike.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Long => JLong}
+import java.util.{List = JList}
+
+import org.apache.spark.api.java.JavaRDDLike
+import org.apache.spark.graphx._
+import org.apache.spark.{Partition, TaskContext}
+
+import scala.reflect.ClassTag
+
+trait JavaEdgeRDDLike [ED, This <: JavaEdgeRDDLike[ED, This, R],
+R <: JavaRDDLike[(VertexId, VertexId, ED), R]]
+  extends Serializable {
+
+  def edgeRDD: EdgeRDD[ED]
+
+  def setName() = edgeRDD.setName("JavaEdgeRDD")
+
+  def count() : JLong = edgeRDD.count()
+
+  def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = 
{
+edgeRDD.compute(part, context)
+  }
+
+  def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): JavaEdgeRDD[ED2]
+
+  def reverse: JavaEdgeRDD[ED]
+}
--- End diff --

Scalastyle is going to complain for this file (and a few others), since 
this is missing a blank line at the end of the file.  Mind running `sbt/sbt 
scalastyle` and fixing the errors?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557113
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaGraph.scala ---
@@ -0,0 +1,115 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.graphx.api.java
+
+import java.lang.{Double => JDouble, Long => JLong}
+
+import org.apache.spark.graphx._
+import org.apache.spark.graphx.lib.PageRank
+import org.apache.spark.rdd.RDD
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+class JavaGraph[@specialized VD: ClassTag, @specialized ED: ClassTag]
+  (vertexRDD : VertexRDD[VD], edgeRDD: EdgeRDD[ED]) {
+
+  def vertices: JavaVertexRDD[VD] = JavaVertexRDD(vertexRDD)
+  def edges: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD)
+  @transient lazy val graph : Graph[VD, ED] = Graph(vertexRDD, edgeRDD)
+
+  def partitionBy(partitionStrategy: PartitionStrategy, numPartitions: 
Int): JavaGraph[VD, ED] = {
+val graph = Graph(vertexRDD, edgeRDD)
+JavaGraph(graph.partitionBy(partitionStrategy, numPartitions))
+  }
+
+  /** The number of edges in the graph. */
+  def numEdges: JLong = edges.count()
+
+  /** The number of vertices in the graph. */
+  def numVertices: JLong = vertices.count()
+
+  def inDegrees: JavaVertexRDD[Int] = JavaVertexRDD[Int](graph.inDegrees)
+
+  def outDegrees: JavaVertexRDD[Int] = JavaVertexRDD[Int](graph.outDegrees)
+
+  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2) : 
JavaGraph[VD2, ED] = {
+JavaGraph(graph.mapVertices(map))
+  }
+
+  def mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): JavaGraph[VD, ED2] = {
+JavaGraph(graph.mapEdges(map))
+  }
+
+  def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2): 
JavaGraph[VD, ED2] = {
+JavaGraph(graph.mapTriplets(map))
+  }
+
+  def reverse : JavaGraph[VD, ED] = JavaGraph(graph.reverse)
+
+  def subgraph(
+epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
+vpred: (VertexId, VD) => Boolean = ((v, d) => true)) : JavaGraph[VD, 
ED] = {
+JavaGraph(graph.subgraph(epred, vpred))
+  }
+
+  def groupEdges(merge: (ED, ED) => ED): JavaGraph[VD, ED] = {
+JavaGraph(graph.groupEdges(merge))
+  }
+
+  @deprecated("use aggregateMessages", "1.2.0")
--- End diff --

If this API has already been deprecated in the other languages, should we 
still introduce it here or can we leave it out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-26 Thread kennethmyers
Github user kennethmyers commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-71524679
  
Also, I believe the warning message on line 77 of 
```python/pyspark/streaming/kafka.py``` should read

 ```$ bin/spark-submit``` 

rather than

 ```$ /bin/submit```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71524670
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26101/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5399] tree Losses strings should match ...

2015-01-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4206#discussion_r23558838
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/LossesSuite.scala ---
@@ -0,0 +1,22 @@
+package org.apache.spark.mllib.tree
--- End diff --

Huh, good point.  It must have been left over after some other change.  
Perhaps we should just remove the whole class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4130#issuecomment-71528824
  
  [Test build #567 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/567/consoleFull)
 for   PR 4130 at commit 
[`4fa0582`](https://github.com/apache/spark/commit/4fa0582e90ebcf3cbc19971e26c8cda83d5613e1).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4173#issuecomment-71529504
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26104/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4173#issuecomment-71515094
  
  [Test build #26104 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26104/consoleFull)
 for   PR 4173 at commit 
[`a47e189`](https://github.com/apache/spark/commit/a47e189e5b742444a6371da74a5bf1c94a9bc278).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23556987
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the License); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an AS IS BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile(...)
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile(...)
+ *   val department = sqlContext.parquetFile(...)
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * new DataFrame(...) everywhere.
+   */
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): 
DataFrame = {
+new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  /** Return the list of numeric columns, useful for doing aggregation. */
+  

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556968
  
--- Diff: python/pyspark/graphx/vertex.py ---
@@ -0,0 +1,330 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python bindings for VertexRDD in GraphX
+"""
+
+import itertools
+import os
+from tempfile import NamedTemporaryFile
+from numpy.numarray.numerictypes import Long
+
+from py4j.java_collections import MapConverter, ListConverter
+import operator
+from pyspark.accumulators import PStatsParam
+from pyspark.rdd import PipelinedRDD
+from pyspark.serializers import CloudPickleSerializer, NoOpSerializer, 
AutoBatchedSerializer
+from pyspark import RDD, PickleSerializer, StorageLevel, SparkContext
+from pyspark.traceback_utils import SCCallSiteSync
+
+
+__all__ = ["VertexRDD", "VertexId"]
+
+
+"""
+The default type of vertex id is Long
+A separate VertexId type is defined
+here so that other types can be used
+for vertex ids in future
+"""
+VertexId = Long
+
+
+class VertexRDD(object):
+"""
+VertexRDD class defines the vertex actions and transformations. The 
complete list of
+transformations and actions for vertices are available at
+`http://spark.apache.org/docs/latest/graphx-programming-guide.html`
+These operations are mapped to Scala functions defined
+in `org.apache.spark.graphx.impl.VertexRDDImpl`
+"""
+
+def __init__(self, jrdd, 
jrdd_deserializer=AutoBatchedSerializer(PickleSerializer())):
+"""
+Constructor
+:param jrdd:   A JavaRDD reference passed from the 
parent
+   RDD object
+:param jrdd_deserializer:  The deserializer used in Python workers
+   created from PythonRDD to execute a
+   serialized Python function and RDD
+"""
+
+self.name = "VertexRDD"
+self.jrdd = jrdd
+self.is_cached = False
+self.is_checkpointed = False
+self.ctx = SparkContext._active_spark_context
+self.jvertex_rdd_deserializer = jrdd_deserializer
+self.id = jrdd.id()
+self.partitionFunc = None
+self.bypass_serializer = False
+self.preserve_partitioning = False
+
+self.jvertex_rdd = self.getJavaVertexRDD(jrdd, jrdd_deserializer)
+
+def __repr__(self):
+return self.jvertex_rdd.toString()
+
+def cache(self):
+"""
+Persist this vertex RDD with the default storage level 
(C{MEMORY_ONLY_SER}).
+"""
+self.is_cached = True
+self.persist(StorageLevel.MEMORY_ONLY_SER)
+return self
+
+def checkpoint(self):
+self.is_checkpointed = True
+self.jvertex_rdd.checkpoint()
+
+def count(self):
+return self.jvertex_rdd.count()
+
+def diff(self, other, numPartitions=2):
+"""
+Hides vertices that are the same between `this` and `other`.
+For vertices that are different, keeps the values from `other`.
+
+TODO: give an example
+"""
+if (isinstance(other, RDD)):
+vs = self.map(lambda (k, v): (k, (1, v)))
+ws = other.map(lambda (k, v): (k, (2, v)))
+return vs.union(ws).groupByKey(numPartitions).mapValues(lambda x: 
x.diff(x.__iter__()))
+
+def isCheckpointed(self):
+"""
+Return whether this RDD has been checkpointed or not
+"""
+return self.is_checkpointed
+
+def mapValues(self, f, preserves_partitioning=False):
+"""
+Return a new vertex RDD by applying a function to each vertex 
attributes,
+preserving the index
+
+ >>> rdd = sc.parallelize([(1, "b"), (2, "a"), (3, "c")])
+ >>> vertices = VertexRDD(rdd)
+ 

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23556995
  
--- Diff: 
streaming/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java
 ---
@@ -24,7 +24,7 @@
 
 public abstract class LocalJavaStreamingContext {
 
-protected transient JavaStreamingContext ssc;
+protected transient JavaStreamingContext ssc;
--- End diff --

This is a whitespace-only change; mind reverting?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23556996
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the License); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an AS IS BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile(...)
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile(...)
+ *   val department = sqlContext.parquetFile(...)
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * new DataFrame(...) everywhere.
+   */
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): 
DataFrame = {
+new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  /** Return the list of numeric columns, useful for doing aggregation. */
+  

[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4173#discussion_r23557011
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the License); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an AS IS BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and 
can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile(...)
+ * }}}
+ *
+ * Once created, it can be manipulated using the various 
domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for 
Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its 
various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age 
by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile(...)
+ *   val department = sqlContext.parquetFile(...)
+ *
+ *   people.filter("age" > 30)
+ * .join(department, people("deptId") === department("id"))
+ * .groupby(department("name"), "gender")
+ * .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+val sqlContext: SQLContext,
+private val baseLogicalPlan: LogicalPlan,
+operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: 
Option[LogicalPlan]) =
+this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && 
plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = 
this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = 
sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan 
match {
+// For various commands (like DDL) and queries with side effects, we 
force query optimization to
+// happen right away to let these side effects take place eagerly.
+case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] |_: 
WriteToFile =>
+  LogicalRDD(queryExecution.analyzed.output, 
queryExecution.toRdd)(sqlContext)
+case _ =>
+  baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to 
avoid doing
+   * new DataFrame(...) everywhere.
+   */
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): 
DataFrame = {
+new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  /** Return the list of numeric columns, useful for doing aggregation. */
+  

[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...

2015-01-26 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4205#discussion_r23557051
  
--- Diff: examples/src/main/python/graphx/simpleGraph.py ---
@@ -0,0 +1,48 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Correlations using MLlib.
--- End diff --

This looks like the wrong description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4337. [YARN] Add ability to cancel pendi...

2015-01-26 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4141#discussion_r23557845
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -192,15 +186,32 @@ private[yarn] class YarnAllocator(
   }
 
   /**
-   * Request numExecutors additional containers from YARN. Visible for 
testing.
+   * Update the set of container requests that we will sync with the RM 
based on the number of
+   * executors we have currently running and our target number of 
executors.
+   *
+   * Visible for testing.
*/
-  def addResourceRequests(numExecutors: Int): Unit = {
-for (i <- 0 until numExecutors) {
-  val request = new ContainerRequest(resource, null, null, 
RM_REQUEST_PRIORITY)
-  amClient.addContainerRequest(request)
-  val nodes = request.getNodes
-  val hostStr = if (nodes == null || nodes.isEmpty) "Any" else 
nodes.last
-  logInfo("Container request (host: %s, capability: 
%s)".format(hostStr, resource))
+  def updateResourceRequests(): Unit = {
--- End diff --

how about making it `private[yarn]`? will still be visible in tests
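
A minimal sketch of the suggested narrowing, assuming the tests live under the same `org.apache.spark.deploy.yarn` package (the class name and body here are placeholders, not the PR's code):

```scala
package org.apache.spark.deploy.yarn

private[yarn] class AllocatorVisibilitySketch {
  // `private[yarn]` hides the method from the public API while keeping it
  // callable from tests in the same package.
  private[yarn] def updateResourceRequests(): Unit = {
    // real logic lives in YarnAllocator; omitted here
  }
}
```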


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...

2015-01-26 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3975#issuecomment-71523305
  
@Lewuathe  Sounds good.  Thanks for going through this discussion!  Looking 
at the latest updates, it looks like the 2 Impurity tests are very similar.  
Would you be able to put them in one ImpuritySuite file and combine the tests 
to shorten the code (e.g., by using a type parameter for the Impurity type)?
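
A rough sketch of that consolidation, assuming the two suites differ only in which `Impurity` object they exercise (the helper name and the specific assertion are illustrative, not taken from the PR):

```scala
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.scalatest.FunSuite

class ImpuritySuite extends FunSuite {
  // One helper, parameterized on the impurity measure under test.
  private def checkImpurity(name: String, impurity: Impurity): Unit = {
    test(s"$name impurity is non-negative on a simple count vector") {
      assert(impurity.calculate(Array(3.0, 7.0), 10.0) >= 0.0)
    }
  }

  checkImpurity("Gini", Gini)
  checkImpurity("Entropy", Entropy)
}
```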


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4337. [YARN] Add ability to cancel pendi...

2015-01-26 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/4141#issuecomment-71523335
  
just a minor comment, otherwise lgtm


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4173#issuecomment-71529489
  
  [Test build #26104 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26104/consoleFull)
 for   PR 4173 at commit 
[`a47e189`](https://github.com/apache/spark/commit/a47e189e5b742444a6371da74a5bf1c94a9bc278).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...

2015-01-26 Thread mccheah
Github user mccheah commented on the pull request:

https://github.com/apache/spark/pull/4155#issuecomment-71530343
  
@vanzin that's pretty much what I went with. The actor receives the messages, 
and commit-permission requests are farmed off to a thread pool.
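
A minimal sketch of that pattern, with hypothetical message and class names (not the PR's actual code), just to show the authorization work being handed to a pool so the actor's mailbox stays responsive:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import akka.actor.Actor

// Hypothetical message type, for illustration only.
case class AskPermissionToCommit(stage: Int, partition: Int, attempt: Long)

class CommitCoordinatorActorSketch(authorize: AskPermissionToCommit => Boolean)
  extends Actor {

  // Slow authorization decisions run on this pool instead of the actor thread.
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  def receive: Receive = {
    case req: AskPermissionToCommit =>
      val requester = sender()  // capture before crossing the async boundary
      Future { requester ! authorize(req) }
  }
}
```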


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...

2015-01-26 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/3962#discussion_r23550729
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -231,6 +231,25 @@ private[spark] class ApplicationMaster(args: 
ApplicationMasterArguments,
 reporterThread = launchReporterThread()
   }
 
+  /**
+   * Creates an actor that ApplicationMaster communicates with 
YarnScheduler of driver
+   * in Yarn deploy mode.
+   *
+   * If isDriver is set to true, AMActor and driver belong to same process.
+   * so AMActor don't monitor lifecycle of driver.
+   */
+  private def runAMActor(
+  host: String,
+  port: String,
+  isDriver: Boolean): Unit = {
--- End diff --

can you rename `isDriver` to `isClusterMode`? It might be apparent to us 
that these two mean the same thing but for newcomers it's not obvious at all.
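
A sketch of the rename being asked for (signature only; the enclosing class and body below are placeholders, not the PR's code):

```scala
private[spark] class ApplicationMasterSketch {
  private def runAMActor(
      host: String,
      port: String,
      isClusterMode: Boolean): Unit = {
    // In cluster mode the AM and the driver share one process, so the
    // actor does not need to monitor the driver's lifecycle.
  }
}
```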


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4585. Spark dynamic executor allocation ...

2015-01-26 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4051#issuecomment-71508342
  
A few comments:
- I don't have a strong opinion on new option vs. not. I think what we really 
want, eventually, is to always have dynamic allocation on, at which point that 
discussion becomes moot.
- min = 0 makes sense; this is particularly interesting for things like 
spark-shell or Hive sessions (a config sketch follows after this list). As long 
as ramp-up is reasonably quick when needed, it should be fine. (0 vs. 1 probably 
won't make a big difference for large jobs anyway.)
- I'm not so sure about `Integer.MAX_VALUE`. I get the point of letting the 
resource manager handle it, but perhaps we should be nicer here? e.g., set max 
to number of NMs in the cluster if that info is available client-side?
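
As a concrete illustration of the minExecutors = 0 point, a hedged sketch of the kind of configuration a spark-shell or Hive-style session might use (the values are only examples; the keys are the standard dynamic-allocation settings):

```scala
import org.apache.spark.SparkConf

// Scale down to zero executors when idle and ramp up on demand.
// A modest explicit max is used here instead of Integer.MAX_VALUE.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.maxExecutors", "50")
```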



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-5408: Use -XX:MaxPermSize specified by u...

2015-01-26 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4202#issuecomment-71511716
  
@vanzin I agree, I'd imagine this actually works as expected here since the 
last one wins. This still feels like a small good thing just so anyone looking 
at the command line is less likely to be confused.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4087#issuecomment-71511624
  
  [Test build #26099 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26099/consoleFull)
 for   PR 4087 at commit 
[`76e5b0f`](https://github.com/apache/spark/commit/76e5b0f90e370e2cda20e1348bf40ff890f51782).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread squito
Github user squito commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71512674
  
I figured out the magic combination to make sbt, scalatest, junit, and the 
sbt-pom-reader all play nicely together.  I had to introduce a new config (or 
scope or something, sbt terminology still baffles me ...) instead of creating a 
new task.  Now you can run `unit:test` (or any other variant you like, eg 
`~unit:testQuick`) which will exclude everything tagged as an IntegrationTest.  
There is both a tag for scalatests, and a category for junit tests.

I've still only bothered with the tagging in core.  But I think this can be 
merged as is in any case -- this does the setup so someone more familiar w/ the 
other projects can figure out what to tag.
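
A rough sketch of the ScalaTest side of that tagging, with illustrative names (the actual tag name and build wiring in the PR may differ):

```scala
import org.scalatest.{FunSuite, Tag}

// Illustrative tag object; tagged tests are what a `unit:test`-style
// configuration would exclude.
object IntegrationTest extends Tag("org.apache.spark.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("fast unit-level check") {
    assert(1 + 1 === 2)
  }

  test("slow end-to-end check", IntegrationTest) {
    // would exercise external services; skipped in the unit configuration
  }
}
```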



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...

2015-01-26 Thread gzm0
Github user gzm0 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4130#discussion_r23546974
  
--- Diff: 
repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala ---
@@ -91,7 +91,14 @@ class ExecutorClassLoader(conf: SparkConf, classUri: 
String, parent: ClassLoader
   inputStream.close()
   Some(defineClass(name, bytes, 0, bytes.length))
 } catch {
-  case e: Exception => None
+  case e: FileNotFoundException =>
+// We did not find the class
+logDebug(s"Did not load class $name from REPL class server at 
$uri", e)
--- End diff --

It is actually a remote URI: Have a look at the [usage site][1]. It can 
either point to an HTTP server or an HDFS.

[1]: 
https://github.com/gzm0/spark/blob/4fa0582e90ebcf3cbc19971e26c8cda83d5613e1/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L47


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...

2015-01-26 Thread gzm0
Github user gzm0 commented on the pull request:

https://github.com/apache/spark/pull/4130#issuecomment-71498788
  
@marmbrus could you have a look at this as well, please?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Fix command spaces issue in make-distribution....

2015-01-26 Thread dyross
Github user dyross commented on the pull request:

https://github.com/apache/spark/pull/4126#issuecomment-71509353
  
Any thoughts on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71511628
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26100/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71511623
  
  [Test build #26100 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26100/consoleFull)
 for   PR 4048 at commit 
[`2d6b733`](https://github.com/apache/spark/commit/2d6b733acce37802ac416ef14400b61f116d3aa3).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...

2015-01-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4048#issuecomment-71511610
  
  [Test build #26100 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26100/consoleFull)
 for   PR 4048 at commit 
[`2d6b733`](https://github.com/apache/spark/commit/2d6b733acce37802ac416ef14400b61f116d3aa3).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...

2015-01-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4208#discussion_r23553383
  
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -47,12 +48,12 @@ class SparkConf(loadDefaults: Boolean) extends 
Cloneable with Logging {
   /** Create a SparkConf that loads defaults from system properties and 
the classpath */
   def this() = this(true)
 
-  private[spark] val settings = new TrieMap[String, String]()
+  private[spark] val settings = new ConcurrentHashMap[String, String]()
--- End diff --

while you are at it, define the return type explicitly here 
(`java.util.Map`?) since it is a public type
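
A small sketch of what that amounts to, assuming the field otherwise stays as in the diff (the enclosing object here is just scaffolding for illustration):

```scala
import java.util.concurrent.ConcurrentHashMap

object SparkConfFieldSketch {
  // Expose the stable Java interface type rather than the concrete
  // ConcurrentHashMap type on the (package-)public field.
  val settings: java.util.Map[String, String] =
    new ConcurrentHashMap[String, String]()
}
```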


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


