[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-71458822 Last week and this week my work has not been on Spark, so my replies have been slower.
[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3975#issuecomment-71460663 [Test build #26094 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26094/consoleFull) for PR 3975 at commit [`62a150c`](https://github.com/apache/spark/commit/62a150c5207a4a2679a068589abff10c24337824). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-71457733 @cloud-fan According to the current code, it would not be easy to change it so that tasks in pendingTasks are not re-submitted. To be honest, the current DAGScheduler is complicated, yet in some situations it is overly simple; a lot could still be improved. A re-submit happens when a stage fails due to a fetch failure. A fetch failure means the currently running TaskSet is dead (marked `zombie`), so only the tasks already scheduled on executors keep running; the other tasks in stage.pendingTasks will never be scheduled by the previous TaskSet. What we could do is resubmit only the tasks that were never scheduled.
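A minimal sketch of the resubmission idea described above, using hypothetical names rather than the actual DAGScheduler types: after a fetch failure the running TaskSet goes zombie, so only pending tasks never handed to an executor need re-enqueuing, while already-running tasks may still finish.

```
// Hypothetical sketch, not the real DAGScheduler code.
case class PendingTask(partitionId: Int, scheduledOnExecutor: Boolean)

// Only tasks that were never scheduled by the zombie TaskSet need resubmission.
def partitionsToResubmit(pending: Seq[PendingTask]): Seq[Int] =
  pending.filterNot(_.scheduledOnExecutor).map(_.partitionId)

val pending = Seq(PendingTask(0, scheduledOnExecutor = true),
                  PendingTask(1, scheduledOnExecutor = false))
assert(partitionsToResubmit(pending) == Seq(1))
```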
[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3975#issuecomment-71470260 [Test build #26094 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26094/consoleFull) for PR 3975 at commit [`62a150c`](https://github.com/apache/spark/commit/62a150c5207a4a2679a068589abff10c24337824). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3975#issuecomment-71470269 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26094/
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-71460355 @suyanNone Thanks for the explanation of re-submit! What's the Chinese name of HarryZhang? We don't use English names in the lab……
[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...
Github user hellertime commented on the pull request: https://github.com/apache/spark/pull/3074#issuecomment-71454692 @andrewor14 your suggestions all look quite reasonable; I'll have a closer look at them tonight and make the appropriate changes. @mateiz adding a Dockerfile will be no trouble. Typically one just creates a Docker image with the Spark distribution tarball pre-expanded inside. An example Dockerfile could either be set up to refer directly to the output of `make-distribution.sh`, or to some other pre-existing Spark Docker image. Is there an official Spark Docker image I might reference? Looking at Docker Hub, there doesn't appear to be one.
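A minimal sketch of the kind of Dockerfile being described, assuming a tarball produced by `make-distribution.sh`; the base image, package names, tarball name, and paths are illustrative assumptions, not an official Spark image:

```
# Sketch only: base image, tarball name, and paths are assumptions.
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y openjdk-7-jre-headless
# ADD auto-extracts a local tarball, pre-expanding the distribution inside the image.
ADD spark-1.3.0-SNAPSHOT-bin.tgz /opt/
ENV SPARK_HOME /opt/spark-1.3.0-SNAPSHOT-bin
WORKDIR /opt/spark-1.3.0-SNAPSHOT-bin
```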
[GitHub] spark pull request: [SPARK-5090][examples] The improvement of pyth...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/3920#issuecomment-71476341 @GenTang ok, thanks for updating the example data. I checked out the PR and tested it quickly locally; the new example and data work for me. Looks good. @davies +1 to merge.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user yanbohappy commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71482088 https://issues.apache.org/jira/browse/SPARK-5324
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71482309 [Test build #26095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26095/consoleFull) for PR 4207 at commit [`7fb2e81`](https://github.com/apache/spark/commit/7fb2e81d682d8f641275077b94dfb6d7de466de8). * This patch merges cleanly.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71488018 [Test build #26096 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26096/consoleFull) for PR 4207 at commit [`34508a3`](https://github.com/apache/spark/commit/34508a3dcd8d3d9b212c15767ac096845e76b44d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71488029 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26096/
[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...
Github user rdhyee commented on the pull request: https://github.com/apache/spark/pull/3074#issuecomment-71475481 I [wrote in November of my interest in this PR](https://github.com/apache/spark/pull/3074#issuecomment-64227322). I concur with the [suggestion](https://github.com/apache/spark/pull/3074#issuecomment-71414280) by @mateiz to include a Dockerfile. It'd be great if that Dockerfile could serve two purposes: 1) as a basic demonstration of how to run Spark via Docker on Mesos, and 2) as the argument for the [FROM](https://docs.docker.com/reference/builder/#from) of other people's Dockerfiles. I've been experimenting with a Dockerfile that combines Spark with [ipython/scipyserver](https://github.com/ipython/docker-notebook/blob/master/scipyserver/Dockerfile) functionality to let me run pyspark in the [IPython Notebook](http://ipython.org/notebook.html) interface: [rdhyee/ipython-spark](https://github.com/rdhyee/ipython-spark/blob/master/Dockerfile). I started off with the [suggested Dockerfile](https://issues.apache.org/jira/browse/SPARK-2691?focusedCommentId=14194679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14194679) by @hellertime. I've not had a chance to test this PR myself, primarily because I'm still learning how to get a stable, standard Mesos cluster to work with a stable release of Spark. But I'm definitely keen to try out the functionality of this PR when it gets merged into master -- or even sooner if I figure things out.
[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4208#issuecomment-71492577 [Test build #26097 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26097/consoleFull) for PR 4208 at commit [`da14ced`](https://github.com/apache/spark/commit/da14ced75e1413e8c107b58388f99a3cbcae1acd). * This patch merges cleanly.
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71482915 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26095/
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71486036 [Test build #26096 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26096/consoleFull) for PR 4207 at commit [`34508a3`](https://github.com/apache/spark/commit/34508a3dcd8d3d9b212c15767ac096845e76b44d). * This patch merges cleanly.
[GitHub] spark pull request: Implement Describe Table for SQLContext
GitHub user yanbohappy opened a pull request: https://github.com/apache/spark/pull/4207 Implement Describe Table for SQLContext Initial code for the Describe Table command. #1 SQL parser and logical plan generation: add a DESCRIBE [FORMATTED] [db_name.]table_name command parser in SparkSQLParser and generate the same logical plan for both SQLContext and HiveContext. (Note: HiveContext also supports DESCRIBE [FORMATTED] [db_name.]table_name PARTITION partition_column_name, and DESCRIBE [FORMATTED] [db_name.]table_name column_name is implemented by a Hive native command.) #2 Implement DescribeCommand, which is invoked as a RunnableCommand. #3 For SQLContext the code is clearly structured, but for HiveContext the output of the describe command needs to stay the same as Hive's. So for HiveContext we still translate the logical command to DescribeHiveTableCommand, which was already implemented in HiveStrategies.HiveCommandStrategy. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanbohappy/spark spark-5324 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4207.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4207 commit 7fb2e81d682d8f641275077b94dfb6d7de466de8 Author: Yanbo Liang yanboha...@gmail.com Date: 2015-01-26T15:29:10Z Implement Describe Table for SQLContext
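A simplified sketch of the RunnableCommand pattern the PR description refers to; the real Spark SQL traits take a SQLContext and return Rows, and all names below are illustrative assumptions rather than the PR's code:

```
// Sketch only: a command that, when run, emits one output row per column.
trait RunnableCommand { def run(): Seq[Seq[String]] }

case class DescribeCommand(table: String,
                           columns: Seq[(String, String)]) extends RunnableCommand {
  // One row per column: name, data type, empty comment placeholder.
  override def run(): Seq[Seq[String]] =
    columns.map { case (name, dataType) => Seq(name, dataType, "") }
}

val describe = DescribeCommand("people", Seq("name" -> "string", "age" -> "int"))
describe.run().foreach(row => println(row.mkString("\t")))
```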
[GitHub] spark pull request: [SQL] Implement Describe Table for SQLContext
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4207#issuecomment-71482912 [Test build #26095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26095/consoleFull) for PR 4207 at commit [`7fb2e81`](https://github.com/apache/spark/commit/7fb2e81d682d8f641275077b94dfb6d7de466de8). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2691][Mesos] Support for Mesos DockerIn...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3074#issuecomment-71487679 Yeah, so IMO just provide a basic Dockerfile that lets people put in their compiled build of Spark, and we can go from there. They should also have the option of specifying their own image, as they have now.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user kennethmyers commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71515443 Hello, I made a fork of the main Spark repo and pulled this pull request into it. I was able to build it successfully, but I get the following error when calling KafkaUtils.createStream( ... ). The line:

```
kvs = KafkaUtils.createStream(streaming_context, ZOOKEEPER, GROUP, {TOPIC: 1})
```

...gives the following error message:

```
No kafka package, please put the assembly jar into classpath:
$ bin/submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
```

When I try to run that submit command with the specified parameters, I get:

```
$ bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.3.0-SNAPSHOT.jar
Error: Must specify a primary resource (JAR or Python file)
Run with --help for usage help or --verbose for debug output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/myerske/spark/external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.3.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/myerske/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
```

I've verified that the JAR file is there. Has anybody had the same issue, or can anyone offer any help? Thanks
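The `Error: Must specify a primary resource` message above indicates the command was invoked with only `--driver-class-path` and no application file; `spark-submit` always requires a primary resource (a JAR or Python script) as its positional argument. A corrected invocation would look something like the following, where `my_streaming_app.py` is a hypothetical application file, not one from this PR:

```
$ bin/spark-submit \
    --driver-class-path external/kafka-assembly/target/scala-2.10/spark-streaming-kafka-assembly-1.3.0-SNAPSHOT.jar \
    my_streaming_app.py
```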
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555909 --- Diff: graphx/pom.xml --- @@ -50,6 +50,12 @@ <artifactId>scalacheck_${scala.binary.version}</artifactId> <scope>test</scope> </dependency> + <dependency> + <groupId>junit</groupId> + <artifactId>junit</artifactId> + <version>4.11</version> --- End diff -- I don't think this version should be specified here; it should be inherited from the parent POM or set via a configuration or property.
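A sketch of the inheritance pattern being suggested: the parent POM pins the version once under `dependencyManagement`, and child modules such as `graphx/pom.xml` declare the dependency without a version. The version shown is just the one from the diff:

```
<!-- Parent pom.xml: pin the version once. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- Child module: inherit the version from the parent. -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <scope>test</scope>
</dependency>
```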
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555962 --- Diff: bin/spark-class --- @@ -170,6 +170,9 @@ export CLASSPATH # the driver JVM itself. Instead of handling this complexity in Bash, we launch a separate JVM # to prepare the launch environment of this driver JVM. +export JAVA_OPTS+=" -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" --- End diff -- Yes, deleted.
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555971 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long => JLong} +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.function.{Function => JFunction} +import org.apache.spark.graphx._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +/** + * EdgeRDD['ED'] is a column-oriented edge partition RDD created from RDD[Edge[ED]]. + * JavaEdgeRDD class provides a Java API to access implementations of the EdgeRDD class + * + * @param targetStorageLevel + * @tparam ED + */ +class JavaEdgeRDD[ED]( --- End diff -- Should this class be public?
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555947 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.java + +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDDLike +import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3} +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition} +import org.apache.spark.rdd.RDD +import org.apache.spark.{Logging, Partition, TaskContext} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R], --- End diff -- Got it. I will change this.
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555998 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long => JLong} +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.function.{Function => JFunction} +import org.apache.spark.graphx._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +/** + * EdgeRDD['ED'] is a column-oriented edge partition RDD created from RDD[Edge[ED]]. + * JavaEdgeRDD class provides a Java API to access implementations of the EdgeRDD class + * + * @param targetStorageLevel + * @tparam ED + */ +class JavaEdgeRDD[ED]( +val edges: RDD[Edge[ED]], +val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) +(implicit val classTag: ClassTag[ED]) + extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, VertexId, ED)]] { + +// /** +// * To create JavaEdgeRDD from JavaRDDs of tuples +// * (source vertex id, destination vertex id and edge property class). +// * The edge property class can be Array[Byte] +// * @param jEdges +// */ +// def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = { +// this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3))) +// } + + /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] */ + override def edgeRDD = EdgeRDD.fromEdges(edges) + + /** + * Java Wrapper for RDD of Edges + * + * @param edgeRDD --- End diff -- If you're not going to document parameters, just omit them.
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557180 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long => JLong} +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.function.{Function => JFunction} +import org.apache.spark.graphx._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +/** + * EdgeRDD['ED'] is a column-oriented edge partition RDD created from RDD[Edge[ED]]. + * JavaEdgeRDD class provides a Java API to access implementations of the EdgeRDD class + * + * @param targetStorageLevel + * @tparam ED + */ +class JavaEdgeRDD[ED]( +val edges: RDD[Edge[ED]], +val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) +(implicit val classTag: ClassTag[ED]) + extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, VertexId, ED)]] { + +// /** +// * To create JavaEdgeRDD from JavaRDDs of tuples +// * (source vertex id, destination vertex id and edge property class). +// * The edge property class can be Array[Byte] +// * @param jEdges +// */ +// def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = { +// this(jEdges.rdd.map(x => Edge[ED](x._1, x._2, x._3))) +// } + + /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] */ + override def edgeRDD = EdgeRDD.fromEdges(edges) + + /** + * Java Wrapper for RDD of Edges + * + * @param edgeRDD + * @return + */ + def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = { +JavaRDD.fromRDD(edgeRDD) + } + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def cache(): this.type = { +edges.cache() +this + } + + def collect(): JList[Edge[ED]] = { +import scala.collection.JavaConversions._ +val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq +new java.util.ArrayList(arr) + } + + /** + * Return a new single long element generated by counting all elements in the vertex RDD + */ + override def count(): JLong = edges.count() + + /** Return a new VertexRDD containing only the elements that satisfy a predicate. */ + def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] = +JavaEdgeRDD(edgeRDD.filter(x => f.call(x).booleanValue())) + + def id: JLong = edges.id.toLong + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def persist(): this.type = { +edges.persist() +this + } + + /** Persist the RDDs of this EdgeRDD with the given storage level */ + def persist(storageLevel: StorageLevel): this.type = { +edges.persist(storageLevel) +this + } + + def unpersist(blocking: Boolean = true) : this.type = { +edgeRDD.unpersist(blocking) +this + } + + override def mapValues[ED2: ClassTag](f: Edge[ED] => ED2): JavaEdgeRDD[ED2] = { +JavaEdgeRDD(edgeRDD.mapValues(f)) + } + + override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse) + + def innerJoin[ED2: ClassTag, ED3: ClassTag] +(other: EdgeRDD[ED2]) +(f: (VertexId, VertexId, ED, ED2) => ED3): JavaEdgeRDD[ED3] = { +JavaEdgeRDD(edgeRDD.innerJoin(other)(f)) + } + + def toRDD : RDD[Edge[ED]] = edges +} + +object JavaEdgeRDD { + + implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : JavaEdgeRDD[ED] = { +JavaEdgeRDD(EdgeRDD.fromEdges(edges.rdd)) + } + + def toEdgeRDD[ED:
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557129 --- Diff: graphx/pom.xml --- @@ -50,6 +50,12 @@ <artifactId>scalacheck_${scala.binary.version}</artifactId> <scope>test</scope> </dependency> + <dependency> + <groupId>junit</groupId> + <artifactId>junit</artifactId> + <version>4.11</version> --- End diff -- Removed
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-71523796 [Test build #26099 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26099/consoleFull) for PR 4087 at commit [`76e5b0f`](https://github.com/apache/spark/commit/76e5b0f90e370e2cda20e1348bf40ff890f51782). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-71523807 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26099/
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on the pull request: https://github.com/apache/spark/pull/4205#issuecomment-71527948 I like the idea of extending the Scala APIs for Java instead of having a separate Java API. It's a better model, and I think we can remove a lot of code duplication and avoid creating layers of wrapper classes. So, does this mean that all Java-friendly methods in GraphX will return objects that Java can work with, e.g. Iterators, JList and so on? On Mon, Jan 26, 2015 at 11:58 AM, Josh Rosen notificati...@github.com wrote: I left an initial pass of comments. I haven't really dug into the details very much yet, but a couple of high-level comments: - There's a lot of code duplication in the Python code that creates the Java RDDs, so it would be nice to see if there's a way to refactor the code to remove this duplication. My concern here is largely around future maintainability, since I'm worried that we'll see the copies of the code diverge when people make changes without being aware of the duplicate copies. - I'd like to avoid repeating the Java*Like pattern, since it doesn't look necessary here and it has caused problems in the past: see https://issues.scala-lang.org/browse/SI-8905 and https://issues.apache.org/jira/browse/SPARK-3266. Now that we're increasingly seeing Spark libraries being written in one JVM language and used from another (e.g. a Spark library written against the Java API and called from Scala), it might be nice to try to extend GraphX's Scala API to expose Java-friendly methods instead of adding a new Java API. This is a major departure from how we've handled Java APIs up until now, but it might be a better long-term decision for new code. I think @rxin https://github.com/rxin may be able to chime in here with more details. GraphX might be a nice context to explore this idea since it's a much smaller API than Spark as a whole. — Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/4205#issuecomment-71526760.
[GitHub] spark pull request: [SPARK-5019] Update GMM API to use Multivariat...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3911#issuecomment-71529072 @Lewuathe Could you please close this PR since it was fixed by [https://github.com/apache/spark/pull/4088]? Thanks!
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71531465 @kennethmyers fixed, thanks!
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71531745 [Test build #26106 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26106/consoleFull) for PR 3715 at commit [`370ba61`](https://github.com/apache/spark/commit/370ba61571b98e9bdfb6636852d4404687143853). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...
Github user Lewuathe commented on the pull request: https://github.com/apache/spark/pull/3975#issuecomment-71455512 @mengxr @jkbradley Thank you for the helpful advice. In any case, `loadLibSVMFile` should not be changed; here the problem is caused by a negative label reaching `Impurity`. So I updated `Impurity` to throw an exception.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on the pull request: https://github.com/apache/spark/pull/4055#issuecomment-71458406 @cloud-fan btw, do you know HarryZhang?
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23553843 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.java + +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDDLike +import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3} +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition} +import org.apache.spark.rdd.RDD +import org.apache.spark.{Logging, Partition, TaskContext} +import scala.language.implicitConversions +import scala.reflect.ClassTag + +trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R], --- End diff -- I'd remove this trait and fold all of this code directly into the `JavaVertexRDD` class. The `*RDDLike` pattern in the Java API wasn't a great design and I'd like to avoid mimicking it in new code.
[GitHub] spark pull request: [MLLIB] SPARK-5362 (4526, 2372) Gradient and O...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/4152#issuecomment-71515894 @jkbradley +1 for a more generic `Datum` type. There are two links to Scala type-conversion benchmarks in the answer to http://stackoverflow.com/questions/18083696/generic-type-class-instances-performance. As far as I understand, it will not be a big overhead, especially compared to the code that contains the algorithm's implementation.
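A hypothetical sketch of the generic `Datum` idea under discussion; the trait name and methods are assumptions, not the PR's actual API. A type class lets one gradient or loss routine accept different input representations without committing the public API to a single concrete type:

```
// Sketch only: names and signatures are illustrative.
trait Datum[T] {
  def label(t: T): Double
  def features(t: T): Array[Double]
}

// Squared-error loss written once, generically over any T with a Datum instance.
def squaredLoss[T](t: T, weights: Array[Double])(implicit d: Datum[T]): Double = {
  val prediction = d.features(t).zip(weights).map { case (x, w) => x * w }.sum
  val diff = prediction - d.label(t)
  0.5 * diff * diff
}

// Instance for a plain (label, features) pair.
implicit val pairDatum: Datum[(Double, Array[Double])] =
  new Datum[(Double, Array[Double])] {
    def label(t: (Double, Array[Double])): Double = t._1
    def features(t: (Double, Array[Double])): Array[Double] = t._2
  }

assert(squaredLoss((1.0, Array(1.0, 1.0)), Array(0.5, 0.5)) == 0.0)
```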
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555791 --- Diff: bin/spark-class --- @@ -170,6 +170,9 @@ export CLASSPATH # the driver JVM itself. Instead of handling this complexity in Bash, we launch a separate JVM # to prepare the launch environment of this driver JVM. +export JAVA_OPTS+=" -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" --- End diff -- It looks like this was just a debugging change?
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555762 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.java + +import java.util.{List => JList} + +import org.apache.spark.api.java.JavaRDDLike +import org.apache.spark.api.java.function.{Function => JFunction, Function2 => JFunction2, Function3 => JFunction3} +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition} +import org.apache.spark.rdd.RDD +import org.apache.spark.{Logging, Partition, TaskContext} +import scala.language.implicitConversions +import scala.reflect.ClassTag + +trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R], --- End diff -- What do you mean by handle type bounds? In this PR's current code, it looks like `JavaVertexRDDLike` is only extended by a single class and isn't used as part of any method signatures, return types, etc., so unless I'm overlooking something I don't see why it can't be removed. Inheriting implementations from generic traits has bitten us in the past via https://issues.scala-lang.org/browse/SI-8905 (see https://issues.apache.org/jira/browse/SPARK-3266), so if this trait isn't necessary then we shouldn't have it. The JavaRDDLike traits in the Spark Core Java API are an unfortunate holdover from an earlier design and exist primarily for code re-use purposes. We can't remove them now because that would break binary compatibility.
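A sketch of the refactoring suggested above, with placeholder types rather than the real GraphX API: the methods live directly on the concrete class, so no implementations are inherited from a generic `*Like` trait (avoiding the SI-8905 pitfall mentioned in the comment):

```
// Sketch only: a Seq stands in for the real RDD-backed implementation.
class JavaVertexRDD[VD](vertices: Seq[(Long, VD)]) {
  // Methods defined directly on the class, not inherited from a *Like trait.
  def count(): Long = vertices.size.toLong
  def filter(p: ((Long, VD)) => Boolean): JavaVertexRDD[VD] =
    new JavaVertexRDD(vertices.filter(p))
}

val v = new JavaVertexRDD(Seq(1L -> "a", 2L -> "b"))
assert(v.filter(_._2 == "a").count() == 1L)
```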
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556471 --- Diff: python/pyspark/__init__.py --- @@ -46,10 +46,12 @@ from pyspark.broadcast import Broadcast from pyspark.serializers import MarshalSerializer, PickleSerializer -# for back compatibility -from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row --- End diff -- I don't think we can remove this without breaking backwards-compatibility.
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556426 --- Diff: pom.xml --- @@ -86,9 +86,9 @@ </mailingLists> <modules> +<module>graphx</module> --- End diff -- Mind reverting these POM changes if they're unnecessary?
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556520 --- Diff: python/pyspark/__init__.py --- @@ -46,10 +46,12 @@ from pyspark.broadcast import Broadcast from pyspark.serializers import MarshalSerializer, PickleSerializer -# for back compatibility -from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row +from pyspark.graphx.vertex import VertexRDD +from pyspark.graphx.edge import EdgeRDD, Edge +from pyspark.graphx.graph import Graph __all__ = [ "SparkConf", "SparkContext", "SparkFiles", "RDD", "StorageLevel", "Broadcast", "Accumulator", "AccumulatorParam", "MarshalSerializer", "PickleSerializer", -] +"VertexRDD", "EdgeRDD", "Edge", "Graph"] --- End diff -- Similarly, I'm not sure that we want to add the GraphX methods to `__all__` here since it doesn't include the SQL ones.
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556818 --- Diff: python/pyspark/graphx/vertex.py --- @@ -0,0 +1,330 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" Python bindings for VertexRDD in GraphX """ + +import itertools +import os +from tempfile import NamedTemporaryFile +from numpy.numarray.numerictypes import Long + +from py4j.java_collections import MapConverter, ListConverter +import operator +from pyspark.accumulators import PStatsParam +from pyspark.rdd import PipelinedRDD +from pyspark.serializers import CloudPickleSerializer, NoOpSerializer, AutoBatchedSerializer +from pyspark import RDD, PickleSerializer, StorageLevel, SparkContext +from pyspark.traceback_utils import SCCallSiteSync + +__all__ = ["VertexRDD", "VertexId"] + +""" The default type of vertex id is Long +A separate VertexId type is defined --- End diff -- Minor formatting nit, but these lines can be longer (up to 100 characters).
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23557733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,563 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal = LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.LogicalRDD +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} +import org.apache.spark.util.Utils + + +/** + * A collection of rows that have the same columns. + * + * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using + * various functions in [[SQLContext]]. + * {{{ + * val people = sqlContext.parquetFile(...) + * }}} + * + * Once created, it can be manipulated using the various domain-specific-language (DSL) functions + * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL. + * + * To select a column from the data frame, use the apply method: + * {{{ + * val ageCol = people(age) // in Scala + * Column ageCol = people.apply(age) // in Java + * }}} + * + * Note that the [[Column]] type can also be manipulated through its various functions. + * {{ + * // The following creates a new column that increases everybody's age by 10. + * people(age) + 10 // in Scala --- End diff -- Yea don't pay too much attention to the documentations yet. We will work on it after the PR's merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
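To make the quoted DSL docs above concrete: column operators build unevaluated expression objects rather than computing values eagerly. A rough Python sketch of that expression-building idea (a hypothetical stand-in, not the actual Catalyst/DataFrame implementation):

```python
class Column(object):
    """Hypothetical: records an expression tree instead of a value."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # Composing columns yields a new, still-unevaluated expression,
        # mirroring `people("age") + 10` in the Scala DSL.
        return Column("(%s + %s)" % (self.expr, other))

    def __repr__(self):
        return "Column<%s>" % self.expr

age = Column("age")
print(age + 10)  # Column<(age + 10)>
```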
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557720 --- Diff: python/pyspark/__init__.py --- @@ -46,10 +46,12 @@ from pyspark.broadcast import Broadcast from pyspark.serializers import MarshalSerializer, PickleSerializer -# for back compatibility -from pyspark.sql import SQLContext, HiveContext, SchemaRDD, Row --- End diff -- This crept in from an older version of this file; fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4208#issuecomment-71523042 [Test build #26105 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26105/consoleFull) for PR 4208 at commit [`c2182dc`](https://github.com/apache/spark/commit/c2182dc987d63b4860bcb59c504678456bdc6555). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-71523615 @tomerk Please do say if you think of simplifications! Are there particular pain points/complications which stand out? E.g., are there particular details which developers will have to get right and may mess up? That would be helpful to see (from another person's perspective). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71524664 [Test build #26101 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26101/consoleFull) for PR 4048 at commit [`9f125ee`](https://github.com/apache/spark/commit/9f125ee80e974d4b2cf56818efb60dfb40f1d564). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4808] Remove Spillable minimum threshol...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/3656#issuecomment-71526176 We are seeing some problems that this PR could address, so I'm reviving this thread. @lawlerd the configurable count would help because, when the individual objects are known to be large, sampling can be configured to happen more frequently. If sampling every 32 insertions is too passive, a more aggressive interval can be chosen, say sampling every 5. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
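A sketch of the sampling trade-off under discussion, with a hypothetical `sample_interval` knob standing in for the configurable count (Spark's actual Spillable is Scala and currently hard-codes 32):

```python
import sys

class SizeTrackingBuffer(object):
    """Hypothetical sketch: estimate a collection's memory footprint by
    measuring one element every `sample_interval` inserts and
    extrapolating. A smaller interval tracks large objects more
    accurately at the cost of more frequent (expensive) measurements."""

    def __init__(self, sample_interval=32):
        self.sample_interval = sample_interval
        self.items = []
        self._avg_item_size = 0
        self._inserts_since_sample = 0

    def append(self, item):
        self.items.append(item)
        self._inserts_since_sample += 1
        if (self._avg_item_size == 0 or
                self._inserts_since_sample >= self.sample_interval):
            # The expensive step this thread wants to make tunable.
            self._avg_item_size = sys.getsizeof(item)
            self._inserts_since_sample = 0

    def estimated_size(self):
        return len(self.items) * self._avg_item_size
```

With `sample_interval=5`, the estimate reacts to a sudden run of large objects roughly six times sooner than the hard-coded 32 would.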
[GitHub] spark pull request: [SPARK-4001][MLlib] adding parallel FP-Growth ...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2847#issuecomment-71526153

> 1. I mean that in step 1 (which is equivalent to creating the FPTree and the conditional FPTrees) we have reduced the data size and created conditional FPTrees (they contain only frequent items, not the transaction data); when mining frequent itemsets with a conditional FPTree, the candidate set is small.

The advantage of FP-Growth over Apriori is the tree structure used to represent the candidate set. Both algorithms take advantage of the fact that the candidate set is small. I'm asking whether the current implementation uses the tree structure to save communication.

> 2. I have tested it against Mahout PFP, and the performance is good: about 10 times faster.

I'm not surprised by the 10x speed-up, but that is not equivalent to saying the current implementation is correct and high-performance. I believe that we can be much faster.

> 3. After using groupByKey, we mine the frequent itemsets on each node holding the specified key, so there is no network communication overhead.

`groupByKey` collects everything to reducers. `aggregateByKey` does part of the aggregation on mappers. There is definitely space for improvement.

> 4. Is there an aggregateByKey operator in the new Spark version?

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
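The shuffle difference mengxr describes, sketched in PySpark with a toy counting job (the FP-Growth code aggregates itemsets rather than counts, but the communication pattern is the same):

```python
from pyspark import SparkContext

sc = SparkContext("local", "agg-demo")
pairs = sc.parallelize([("a", 1), ("a", 1), ("b", 1)])

# groupByKey ships every individual value across the network
# before anything is combined on the reduce side.
grouped = pairs.groupByKey().mapValues(sum)

# aggregateByKey combines map-side first, so only partial
# aggregates per key and partition are shuffled.
aggregated = pairs.aggregateByKey(
    0,                       # zero value for each key's accumulator
    lambda acc, v: acc + v,  # merge a value into a map-side accumulator
    lambda a, b: a + b)      # merge accumulators across partitions

print(sorted(aggregated.collect()))  # [('a', 2), ('b', 1)]
```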
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-71531266 @dbtsai I did batching for artificial neural networks and the performance improved ~5x https://github.com/apache/spark/pull/1290#issuecomment-70313952 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
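The batching trick credited with the ~5x improvement is, at its core, replacing many small matrix-vector products with one large matrix-matrix product so optimized BLAS kernels can do the work. A minimal numpy illustration (illustrative shapes, not the actual ANN code):

```python
import numpy as np

W = np.random.randn(10, 100)   # one layer's weight matrix
X = np.random.randn(100, 512)  # a batch of 512 examples, one per column

# Per-example: 512 separate matrix-vector products.
per_example = [W.dot(X[:, i]) for i in range(X.shape[1])]

# Batched: a single matrix-matrix product over the whole batch.
batched = W.dot(X)

# Same result, but the batched form is typically much faster.
assert np.allclose(np.column_stack(per_example), batched)
```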
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23555143 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaVertexRDDLike.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.java + +import java.util.{List = JList} + +import org.apache.spark.api.java.JavaRDDLike +import org.apache.spark.api.java.function.{Function = JFunction, Function2 = JFunction2, Function3 = JFunction3} +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.{EdgeRDDImpl, ShippableVertexPartition} +import org.apache.spark.rdd.RDD +import org.apache.spark.{Logging, Partition, TaskContext} + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +trait JavaVertexRDDLike[VD, This : JavaVertexRDDLike[VD, This, R], --- End diff -- Josh, then how will we handle type bounds? Is there a new design for Java API? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5199. Input metrics should show up for I...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/4050#discussion_r23555820 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -218,13 +219,14 @@ class HadoopRDD[K, V]( // Find a function that will return the FileSystem bytes read by this thread. Do this before // creating RecordReader, because RecordReader's constructor might read some bytes - val bytesReadCallback = inputMetrics.bytesReadCallback.orElse( -split.inputSplit.value match { - case split: FileSplit => - SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.getPath, jobConf) - case _ => None + val bytesReadCallback = inputMetrics.bytesReadCallback.orElse { +val inputSplit = split.inputSplit.value +if (inputSplit.isInstanceOf[FileSplit] || inputSplit.isInstanceOf[CombineFileSplit]) { --- End diff -- this is fine as is, but fyi you can do the same thing in a pattern match:

```
split.inputSplit.value match {
  case _: FileSplit | _: CombineFileSplit =>
    SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(jobConf)
  case _ => None
}
```

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556181 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/python/PythonVertexRDD.scala --- @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.python + +import java.io.{DataOutputStream, FileOutputStream} +import java.util.{ArrayList = JArrayList, List = JList, Map = JMap} + +import org.apache.spark.Accumulator +import org.apache.spark.api.java.{JavaRDD, JavaSparkContext} +import org.apache.spark.api.python.{PythonBroadcast, PythonRDD} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.graphx.VertexId +import org.apache.spark.graphx.api.java.JavaVertexRDD +import org.apache.spark.storage.StorageLevel + +private[graphx] class PythonVertexRDD( +@transient parent: JavaRDD[_], +command: Array[Byte], +envVars: JMap[String, String], +pythonIncludes: JList[String], +preservePartitioning: Boolean, +pythonExec: String, +broadcastVars: JList[Broadcast[PythonBroadcast]], +accumulator: Accumulator[JList[Array[Byte]]], +targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) + extends PythonRDD (parent, command, envVars, + pythonIncludes, preservePartitioning, + pythonExec, broadcastVars, accumulator) { + + def this(@transient parent: JavaVertexRDD[_], --- End diff -- Do we need both constructors, or can we just make the first one Py4J-friendly? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556368 --- Diff: graphx/src/test/java/org/apache/spark/graphx/JavaAPISuite.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx; + + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.graphx.api.java.JavaEdgeRDD; +import org.apache.spark.graphx.api.java.JavaVertexRDD; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; +import scala.Tuple2; +import scala.reflect.ClassTag; +import scala.reflect.ClassTag$; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.List; + +import static org.junit.Assert.assertEquals; + +public class JavaAPISuite implements Serializable { + +private transient JavaSparkContext ssc; +private ListTuple2Object, VertexPropertyString, String myList; +private ClassTagVertexPropertyString, String classTag; + + +@Before +public void initialize() { --- End diff -- Mind calling these `before` and `after` instead? I find `finalize` to be confusing since it sounds like finalizers. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556761 --- Diff: python/pyspark/graphx/partitionstrategy.py --- @@ -0,0 +1,9 @@ +__author__ = 'kdatta1' --- End diff -- We don't use author tags, so I'd remove this. Also, this file and a few others need the Spark license header. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556724 --- Diff: python/pyspark/graphx/graph.py --- @@ -0,0 +1,169 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +Python bindings for Graph[VertexRDD, EdgeRDD] in GraphX + + +import itertools +from pyspark import PickleSerializer, RDD, StorageLevel, SparkContext +from pyspark.graphx import VertexRDD, EdgeRDD + +from pyspark.graphx.partitionstrategy import PartitionStrategy +from pyspark.rdd import PipelinedRDD +from pyspark.serializers import BatchedSerializer + +__all__ = [Graph] + +class Graph(object): +def __init__(self, vertex_jrdd, edge_jrdd, + partition_strategy=PartitionStrategy.EdgePartition1D): +self._vertex_jrdd = VertexRDD(vertex_jrdd, vertex_jrdd.context, + BatchedSerializer(PickleSerializer())) +self._edge_jrdd = EdgeRDD(edge_jrdd, edge_jrdd.context, + BatchedSerializer(PickleSerializer())) +self._partition_strategy = partition_strategy +self._jsc = vertex_jrdd.context + +def persist(self, storageLevel): +self._vertex_jrdd.persist(storageLevel) +self._edge_jrdd.persist(storageLevel) +return + +def cache(self): +self._vertex_jrdd.cache() +self._edge_jrdd.cache() +return + +def vertices(self): +return self._vertex_jrdd + +def edges(self): +return self._edge_jrdd + +def numEdges(self): +return self._edge_jrdd.count() + +def numVertices(self): +return self._vertex_jrdd.count() + +# TODO +def partitionBy(self, partitionStrategy): +return + +# TODO +def inDegrees(self): +return + +# TODO +def outDegrees(self): +return + +# TODO +def degrees(self): +return + +def triplets(self): +if (isinstance(self._jsc, SparkContext)): --- End diff -- What happens if this is false? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
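One way to resolve the question above is to fail fast instead of falling off the end of the method and implicitly returning `None`. A hypothetical sketch, not the PR author's actual fix:

```python
from pyspark import SparkContext

class Graph(object):
    def __init__(self, jsc):
        self._jsc = jsc

    def triplets(self):
        if not isinstance(self._jsc, SparkContext):
            # Make the failure mode explicit rather than silently
            # returning None when the context has the wrong type.
            raise TypeError("expected a SparkContext, got %s"
                            % type(self._jsc).__name__)
        # ... build and return the triplet RDD here ...
```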
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557297 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/python/PythonVertexRDD.scala --- @@ -0,0 +1,99 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx.api.python + +import java.io.{DataOutputStream, FileOutputStream} +import java.util.{ArrayList = JArrayList, List = JList, Map = JMap} + +import org.apache.spark.Accumulator +import org.apache.spark.api.java.{JavaRDD, JavaSparkContext} +import org.apache.spark.api.python.{PythonBroadcast, PythonRDD} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.graphx.VertexId +import org.apache.spark.graphx.api.java.JavaVertexRDD +import org.apache.spark.storage.StorageLevel + +private[graphx] class PythonVertexRDD( +@transient parent: JavaRDD[_], +command: Array[Byte], +envVars: JMap[String, String], +pythonIncludes: JList[String], +preservePartitioning: Boolean, +pythonExec: String, +broadcastVars: JList[Broadcast[PythonBroadcast]], +accumulator: Accumulator[JList[Array[Byte]]], +targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) + extends PythonRDD (parent, command, envVars, + pythonIncludes, preservePartitioning, + pythonExec, broadcastVars, accumulator) { + + def this(@transient parent: JavaVertexRDD[_], --- End diff -- I'll try to change this.. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user kdatta commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557318 --- Diff: graphx/src/test/java/org/apache/spark/graphx/JavaAPISuite.java --- @@ -0,0 +1,107 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx; + + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.graphx.api.java.JavaEdgeRDD; +import org.apache.spark.graphx.api.java.JavaVertexRDD; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; +import scala.Tuple2; +import scala.reflect.ClassTag; +import scala.reflect.ClassTag$; + +import java.io.Serializable; +import java.util.ArrayList; +import java.util.List; + +import static org.junit.Assert.assertEquals; + +public class JavaAPISuite implements Serializable { + +private transient JavaSparkContext ssc; +private ListTuple2Object, VertexPropertyString, String myList; +private ClassTagVertexPropertyString, String classTag; + + +@Before +public void initialize() { --- End diff -- yes --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23557373 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,563 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +import com.fasterxml.jackson.core.JsonFactory + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal = LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.LogicalRDD +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} +import org.apache.spark.util.Utils + + +/** + * A collection of rows that have the same columns. + * + * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using + * various functions in [[SQLContext]]. + * {{{ + * val people = sqlContext.parquetFile(...) + * }}} + * + * Once created, it can be manipulated using the various domain-specific-language (DSL) functions + * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL. + * + * To select a column from the data frame, use the apply method: + * {{{ + * val ageCol = people(age) // in Scala + * Column ageCol = people.apply(age) // in Java + * }}} + * + * Note that the [[Column]] type can also be manipulated through its various functions. + * {{ + * // The following creates a new column that increases everybody's age by 10. + * people(age) + 10 // in Scala --- End diff -- for this one - do I need to assign it to a new column for the result of the expression to be usable? ``` people(birthYear4) = people(birthYear2 + 1900) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4205#issuecomment-71526760 I left an initial pass of comments. I haven't really dug into the details very much yet, but a couple of high-level comments: - There's a lot of code duplication in the Python code that creates the Java RDDs, so it would be nice to see if there's a way to refactor the code to remove this duplication. My concern here is largely around future maintainability, since I'm worried that we'll see the copies of the code diverge when people make changes without being aware of the duplicate copies. - I'd like to avoid repeating the `Java*Like` pattern, since it doesn't look necessary here and it has caused problems in the past: see https://issues.scala-lang.org/browse/SI-8905 and https://issues.apache.org/jira/browse/SPARK-3266. Now that we're increasingly seeing Spark libraries being written in one JVM language and used from another (e.g. a Spark library written against the Java API and called from Scala), it might be nice to try to extend GraphX's Scala API to expose Java-friendly methods instead of adding a new Java API. This is a major departure from how we've handled Java APIs up until now, but it might be a better long-term decision for new code. I think @rxin may be able to chime in here with more details. GraphX might be a nice context to explore this idea since it's a much smaller API than Spark as a whole. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23560945 --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala --- @@ -0,0 +1,252 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.util.concurrent.{ExecutorService, TimeUnit, Executors, ConcurrentHashMap} + +import scala.collection.{Map = ScalaImmutableMap} +import scala.collection.concurrent.{Map = ScalaConcurrentMap} +import scala.collection.convert.decorateAsScala._ + +import akka.actor.{ActorRef, Actor} + +import org.apache.spark.{SparkConf, Logging} +import org.apache.spark.util.{AkkaUtils, ActorLogReceive} + +private[spark] sealed trait OutputCommitCoordinationMessage + +private[spark] case class StageStarted(stage: Int, partitionIds: Seq[Int]) + extends OutputCommitCoordinationMessage +private[spark] case class StageEnded(stage: Int) extends OutputCommitCoordinationMessage +private[spark] case object StopCoordinator extends OutputCommitCoordinationMessage + +private[spark] case class AskPermissionToCommitOutput( +stage: Int, +task: Long, +partId: Int, +taskAttempt: Long) + extends OutputCommitCoordinationMessage with Serializable + +private[spark] case class TaskCompleted( +stage: Int, +task: Long, +partId: Int, +attempt: Long, +successful: Boolean) + extends OutputCommitCoordinationMessage + +/** + * Authority that decides whether tasks can commit output to HDFS. + * + * This lives on the driver, but the actor allows the tasks that commit + * to Hadoop to invoke it. + */ +private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging { + + private type StageId = Int + private type PartitionId = Int + private type TaskId = Long + private type TaskAttemptId = Long + + // Wrapper for an int option that allows it to be locked via a synchronized block + // while still setting option itself to Some(...) or None. 
+ private class LockableAttemptId(var value: Option[TaskAttemptId]) + + private type CommittersByStageHashMap = +ConcurrentHashMap[StageId, ScalaImmutableMap[PartitionId, LockableAttemptId]] + + // Initialized by SparkEnv + private var coordinatorActor: Option[ActorRef] = None + private val timeout = AkkaUtils.askTimeout(conf) + private val maxAttempts = AkkaUtils.numRetries(conf) + private val retryInterval = AkkaUtils.retryWaitMs(conf) + private val authorizedCommittersByStage = new CommittersByStageHashMap().asScala + + private var executorRequestHandlingThreadPool: Option[ExecutorService] = None + + def stageStart(stage: StageId, partitionIds: Seq[Int]): Unit = { +sendToActor(StageStarted(stage, partitionIds)) + } + + def stageEnd(stage: StageId): Unit = { +sendToActor(StageEnded(stage)) + } + + def canCommit( + stage: StageId, + task: TaskId, + partId: PartitionId, + attempt: TaskAttemptId): Boolean = { +askActor(AskPermissionToCommitOutput(stage, task, partId, attempt)) + } + + def taskCompleted( + stage: StageId, + task: TaskId, + partId: PartitionId, + attempt: TaskAttemptId, + successful: Boolean): Unit = { +sendToActor(TaskCompleted(stage, task, partId, attempt, successful)) + } + + def stop(): Unit = { +executorRequestHandlingThreadPool.foreach { pool = + pool.shutdownNow() + pool.awaitTermination(10, TimeUnit.SECONDS) --- End diff -- What timeout should be used here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-71530556 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26103/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4924] Add a library for launching Spark...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3916#issuecomment-71530541 [Test build #26103 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26103/consoleFull) for PR 3916 at commit [`ad03c48`](https://github.com/apache/spark/commit/ad03c4859b4da74e95909ab966989d2ded8c8f72). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5345][DEPLOY] Fix unstable test case in...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/4133#discussion_r23555074 --- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala --- @@ -43,7 +43,7 @@ class FsHistoryProviderSuite extends FunSuite with BeforeAndAfter with Matchers testDir = Utils.createTempDir() provider = new FsHistoryProvider(new SparkConf() .set(spark.history.fs.logDirectory, testDir.getAbsolutePath()) - .set(spark.history.fs.updateInterval, 0)) + .set(spark.testing, true)) --- End diff -- Hmm... as Sean mentioned, this should already be defined [here](https://github.com/apache/spark/blob/master/pom.xml#L1127). Can you double check that it's really not set, and if not, what's causing it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-5408: Use -XX:MaxPermSize specified by u...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4202#issuecomment-71519044 I guess it should be overridden, but I couldn't find any specification which says that it actually works this way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556050 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long = JLong} +import java.util.{List = JList} + +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.function.{Function = JFunction} +import org.apache.spark.graphx._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +/** + * EdgeRDD['ED'] is a column-oriented edge partition RDD created from RDD[Edge[ED]]. + * JavaEdgeRDD class provides a Java API to access implementations of the EdgeRDD class + * + * @param targetStorageLevel + * @tparam ED + */ +class JavaEdgeRDD[ED]( +val edges: RDD[Edge[ED]], +val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) +(implicit val classTag: ClassTag[ED]) + extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, VertexId, ED)]] { + +// /** +// * To create JavaEdgeRDD from JavaRDDs of tuples +// * (source vertex id, destination vertex id and edge property class). +// * The edge property class can be Array[Byte] +// * @param jEdges +// */ +// def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = { +//this(jEdges.rdd.map(x = Edge[ED](x._1, x._2, x._3))) +// } + + /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] */ + override def edgeRDD = EdgeRDD.fromEdges(edges) + + /** + * Java Wrapper for RDD of Edges + * + * @param edgeRDD + * @return + */ + def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = { +JavaRDD.fromRDD(edgeRDD) + } + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def cache(): this.type = { +edges.cache() +this + } + + def collect(): JList[Edge[ED]] = { +import scala.collection.JavaConversions._ +val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq +new java.util.ArrayList(arr) + } + + /** + * Return a new single long element generated by counting all elements in the vertex RDD + */ + override def count(): JLong = edges.count() + + /** Return a new VertexRDD containing only the elements that satisfy a predicate. 
*/ + def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] = +JavaEdgeRDD(edgeRDD.filter(x = f.call(x).booleanValue())) + + def id: JLong = edges.id.toLong + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def persist(): this.type = { +edges.persist() +this + } + + /** Persist the RDDs of this EdgeRDD with the given storage level */ + def persist(storageLevel: StorageLevel): this.type = { +edges.persist(storageLevel) +this + } + + def unpersist(blocking: Boolean = true) : this.type = { +edgeRDD.unpersist(blocking) +this + } + + override def mapValues[ED2: ClassTag](f: Edge[ED] = ED2): JavaEdgeRDD[ED2] = { +JavaEdgeRDD(edgeRDD.mapValues(f)) + } + + override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse) + + def innerJoin[ED2: ClassTag, ED3: ClassTag] +(other: EdgeRDD[ED2]) +(f: (VertexId, VertexId, ED, ED2) = ED3): JavaEdgeRDD[ED3] = { +JavaEdgeRDD(edgeRDD.innerJoin(other)(f)) + } + + def toRDD : RDD[Edge[ED]] = edges +} + +object JavaEdgeRDD { + + implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : JavaEdgeRDD[ED] = { +JavaEdgeRDD(EdgeRDD.fromEdges(edges.rdd)) + } + + def toEdgeRDD[ED:
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556071 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDD.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long = JLong} +import java.util.{List = JList} + +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.function.{Function = JFunction} +import org.apache.spark.graphx._ +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +/** + * EdgeRDD['ED'] is a column-oriented edge partition RDD created from RDD[Edge[ED]]. + * JavaEdgeRDD class provides a Java API to access implementations of the EdgeRDD class + * + * @param targetStorageLevel + * @tparam ED + */ +class JavaEdgeRDD[ED]( +val edges: RDD[Edge[ED]], +val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY) +(implicit val classTag: ClassTag[ED]) + extends JavaEdgeRDDLike[ED, JavaEdgeRDD[ED], JavaRDD[(VertexId, VertexId, ED)]] { + +// /** +// * To create JavaEdgeRDD from JavaRDDs of tuples +// * (source vertex id, destination vertex id and edge property class). +// * The edge property class can be Array[Byte] +// * @param jEdges +// */ +// def this(jEdges: JavaRDD[(VertexId, VertexId, ED)]) = { +//this(jEdges.rdd.map(x = Edge[ED](x._1, x._2, x._3))) +// } + + /* Convert RDD[(PartitionID, EdgePartition[ED, VD])] to EdgeRDD[ED, VD] */ + override def edgeRDD = EdgeRDD.fromEdges(edges) + + /** + * Java Wrapper for RDD of Edges + * + * @param edgeRDD + * @return + */ + def wrapRDD(edgeRDD: RDD[Edge[ED]]): JavaRDD[Edge[ED]] = { +JavaRDD.fromRDD(edgeRDD) + } + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def cache(): this.type = { +edges.cache() +this + } + + def collect(): JList[Edge[ED]] = { +import scala.collection.JavaConversions._ +val arr: java.util.Collection[Edge[ED]] = edges.collect().toSeq +new java.util.ArrayList(arr) + } + + /** + * Return a new single long element generated by counting all elements in the vertex RDD + */ + override def count(): JLong = edges.count() + + /** Return a new VertexRDD containing only the elements that satisfy a predicate. 
*/ + def filter(f: JFunction[Edge[ED], Boolean]): JavaEdgeRDD[ED] = +JavaEdgeRDD(edgeRDD.filter(x = f.call(x).booleanValue())) + + def id: JLong = edges.id.toLong + + /** Persist RDDs of this JavaEdgeRDD with the default storage level (MEMORY_ONLY_SER) */ + def persist(): this.type = { +edges.persist() +this + } + + /** Persist the RDDs of this EdgeRDD with the given storage level */ + def persist(storageLevel: StorageLevel): this.type = { +edges.persist(storageLevel) +this + } + + def unpersist(blocking: Boolean = true) : this.type = { +edgeRDD.unpersist(blocking) +this + } + + override def mapValues[ED2: ClassTag](f: Edge[ED] = ED2): JavaEdgeRDD[ED2] = { +JavaEdgeRDD(edgeRDD.mapValues(f)) + } + + override def reverse: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD.reverse) + + def innerJoin[ED2: ClassTag, ED3: ClassTag] +(other: EdgeRDD[ED2]) +(f: (VertexId, VertexId, ED, ED2) = ED3): JavaEdgeRDD[ED3] = { +JavaEdgeRDD(edgeRDD.innerJoin(other)(f)) + } + + def toRDD : RDD[Edge[ED]] = edges +} + +object JavaEdgeRDD { + + implicit def apply[ED: ClassTag](edges: JavaRDD[Edge[ED]]) : JavaEdgeRDD[ED] = { --- End diff -- Style nit: no space before the `:`. --- If your project is set up
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556688 --- Diff: python/pyspark/graphx/graph.py --- @@ -0,0 +1,169 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +Python bindings for Graph[VertexRDD, EdgeRDD] in GraphX + + +import itertools +from pyspark import PickleSerializer, RDD, StorageLevel, SparkContext +from pyspark.graphx import VertexRDD, EdgeRDD + +from pyspark.graphx.partitionstrategy import PartitionStrategy +from pyspark.rdd import PipelinedRDD +from pyspark.serializers import BatchedSerializer + +__all__ = [Graph] + +class Graph(object): +def __init__(self, vertex_jrdd, edge_jrdd, + partition_strategy=PartitionStrategy.EdgePartition1D): +self._vertex_jrdd = VertexRDD(vertex_jrdd, vertex_jrdd.context, + BatchedSerializer(PickleSerializer())) +self._edge_jrdd = EdgeRDD(edge_jrdd, edge_jrdd.context, + BatchedSerializer(PickleSerializer())) +self._partition_strategy = partition_strategy +self._jsc = vertex_jrdd.context + +def persist(self, storageLevel): +self._vertex_jrdd.persist(storageLevel) +self._edge_jrdd.persist(storageLevel) +return + +def cache(self): +self._vertex_jrdd.cache() +self._edge_jrdd.cache() +return + +def vertices(self): +return self._vertex_jrdd + +def edges(self): +return self._edge_jrdd + +def numEdges(self): +return self._edge_jrdd.count() + +def numVertices(self): +return self._vertex_jrdd.count() + +# TODO +def partitionBy(self, partitionStrategy): --- End diff -- For these `TODO`s, if they don't land in the first version then we should comment out the methods as well. I think we should keep the TODOs as a reminder of what's missing, but shouldn't have empty methods. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
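If the stubs are kept visible rather than commented out, another option in the same spirit is to make each unimplemented method fail loudly, so callers cannot mistake a TODO for a working no-op. A hypothetical sketch:

```python
class Graph(object):
    # TODO: implement once the underlying Py4J plumbing lands.
    def partitionBy(self, partitionStrategy):
        raise NotImplementedError("partitionBy is not implemented yet")

    # TODO: implement.
    def inDegrees(self):
        raise NotImplementedError("inDegrees is not implemented yet")
```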
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557218 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaEdgeRDDLike.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Long = JLong} +import java.util.{List = JList} + +import org.apache.spark.api.java.JavaRDDLike +import org.apache.spark.graphx._ +import org.apache.spark.{Partition, TaskContext} + +import scala.reflect.ClassTag + +trait JavaEdgeRDDLike [ED, This : JavaEdgeRDDLike[ED, This, R], +R : JavaRDDLike[(VertexId, VertexId, ED), R]] + extends Serializable { + + def edgeRDD: EdgeRDD[ED] + + def setName() = edgeRDD.setName(JavaEdgeRDD) + + def count() : JLong = edgeRDD.count() + + def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = { +edgeRDD.compute(part, context) + } + + def mapValues[ED2: ClassTag](f: Edge[ED] = ED2): JavaEdgeRDD[ED2] + + def reverse: JavaEdgeRDD[ED] +} --- End diff -- Scalastyle is going to complain for this file (and a few others), since this is missing a blank line at the end of the file. Mind running `sbt/sbt scalastyle` and fixing the errors? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557113 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/api/java/JavaGraph.scala --- @@ -0,0 +1,115 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.graphx.api.java + +import java.lang.{Double = JDouble, Long = JLong} + +import org.apache.spark.graphx._ +import org.apache.spark.graphx.lib.PageRank +import org.apache.spark.rdd.RDD + +import scala.language.implicitConversions +import scala.reflect.ClassTag + +class JavaGraph[@specialized VD: ClassTag, @specialized ED: ClassTag] + (vertexRDD : VertexRDD[VD], edgeRDD: EdgeRDD[ED]) { + + def vertices: JavaVertexRDD[VD] = JavaVertexRDD(vertexRDD) + def edges: JavaEdgeRDD[ED] = JavaEdgeRDD(edgeRDD) + @transient lazy val graph : Graph[VD, ED] = Graph(vertexRDD, edgeRDD) + + def partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int): JavaGraph[VD, ED] = { +val graph = Graph(vertexRDD, edgeRDD) +JavaGraph(graph.partitionBy(partitionStrategy, numPartitions)) + } + + /** The number of edges in the graph. */ + def numEdges: JLong = edges.count() + + /** The number of vertices in the graph. */ + def numVertices: JLong = vertices.count() + + def inDegrees: JavaVertexRDD[Int] = JavaVertexRDD[Int](graph.inDegrees) + + def outDegrees: JavaVertexRDD[Int] = JavaVertexRDD[Int](graph.outDegrees) + + def mapVertices[VD2: ClassTag](map: (VertexId, VD) = VD2) : JavaGraph[VD2, ED] = { +JavaGraph(graph.mapVertices(map)) + } + + def mapEdges[ED2: ClassTag](map: Edge[ED] = ED2): JavaGraph[VD, ED2] = { +JavaGraph(graph.mapEdges(map)) + } + + def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] = ED2): JavaGraph[VD, ED2] = { +JavaGraph(graph.mapTriplets(map)) + } + + def reverse : JavaGraph[VD, ED] = JavaGraph(graph.reverse) + + def subgraph( +epred: EdgeTriplet[VD,ED] = Boolean = (x = true), +vpred: (VertexId, VD) = Boolean = ((v, d) = true)) : JavaGraph[VD, ED] = { +JavaGraph(graph.subgraph(epred, vpred)) + } + + def groupEdges(merge: (ED, ED) = ED): JavaGraph[VD, ED] = { +JavaGraph(graph.groupEdges(merge)) + } + + @deprecated(use aggregateMessages, 1.2.0) --- End diff -- If this API has already been deprecated in the other languages, should we still introduce it here or can we leave it out? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user kennethmyers commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71524679 Also, I believe the warning message on line 77 of ```python/pyspark/streaming/kafka.py``` should read ```$ bin/spark-submit``` rather than ```$ /bin/submit``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71524670 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26101/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5399] tree Losses strings should match ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4206#discussion_r23558838 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/LossesSuite.scala --- @@ -0,0 +1,22 @@ +package org.apache.spark.mllib.tree --- End diff -- Huh, good point. It must have been left over after some other change. Perhaps we should just remove the whole class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-71528824 [Test build #567 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/567/consoleFull) for PR 4130 at commit [`4fa0582`](https://github.com/apache/spark/commit/4fa0582e90ebcf3cbc19971e26c8cda83d5613e1). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71529504 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26104/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71515094 [Test build #26104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26104/consoleFull) for PR 4173 at commit [`a47e189`](https://github.com/apache/spark/commit/a47e189e5b742444a6371da74a5bf1c94a9bc278). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23556987
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
@@ -0,0 +1,563 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.language.implicitConversions
+import scala.reflect.ClassTag
+
+import com.fasterxml.jackson.core.JsonFactory
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.sql.catalyst.ScalaReflection
+import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr}
+import org.apache.spark.sql.catalyst.plans.{JoinType, Inner}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.LogicalRDD
+import org.apache.spark.sql.json.JsonRDD
+import org.apache.spark.sql.types.{NumericType, StructType}
+import org.apache.spark.util.Utils
+
+
+/**
+ * A collection of rows that have the same columns.
+ *
+ * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using
+ * various functions in [[SQLContext]].
+ * {{{
+ *   val people = sqlContext.parquetFile("...")
+ * }}}
+ *
+ * Once created, it can be manipulated using the various domain-specific-language (DSL) functions
+ * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL.
+ *
+ * To select a column from the data frame, use the apply method:
+ * {{{
+ *   val ageCol = people("age")  // in Scala
+ *   Column ageCol = people.apply("age")  // in Java
+ * }}}
+ *
+ * Note that the [[Column]] type can also be manipulated through its various functions.
+ * {{
+ *   // The following creates a new column that increases everybody's age by 10.
+ *   people("age") + 10  // in Scala
+ * }}
+ *
+ * A more concrete example:
+ * {{{
+ *   // To create DataFrame using SQLContext
+ *   val people = sqlContext.parquetFile("...")
+ *   val department = sqlContext.parquetFile("...")
+ *
+ *   people.filter("age" > 30)
+ *     .join(department, people("deptId") === department("id"))
+ *     .groupby(department("name"), "gender")
+ *     .agg(avg(people("salary")), max(people("age")))
+ * }}}
+ */
+// TODO: Improve documentation.
+class DataFrame protected[sql](
+    val sqlContext: SQLContext,
+    private val baseLogicalPlan: LogicalPlan,
+    operatorsEnabled: Boolean)
+  extends DataFrameSpecificApi with RDDApi[Row] {
+
+  protected[sql] def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) =
+    this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && plan.isDefined)
+
+  protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true)
+
+  @transient protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan)
+
+  @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match {
+    // For various commands (like DDL) and queries with side effects, we force query optimization to
+    // happen right away to let these side effects take place eagerly.
+    case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] | _: WriteToFile =>
+      LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext)
+    case _ =>
+      baseLogicalPlan
+  }
+
+  /**
+   * An implicit conversion function internal to this class for us to avoid doing
+   * `new DataFrame(...)` everywhere.
+   */
+  private[this] implicit def toDataFrame(logicalPlan: LogicalPlan): DataFrame = {
+    new DataFrame(sqlContext, logicalPlan, true)
+  }
+
+  /** Return the list of numeric columns, useful for doing aggregation. */
+
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556968
--- Diff: python/pyspark/graphx/vertex.py ---
@@ -0,0 +1,330 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Python bindings for VertexRDD in GraphX
+"""
+
+import itertools
+import os
+from tempfile import NamedTemporaryFile
+from numpy.numarray.numerictypes import Long
+
+from py4j.java_collections import MapConverter, ListConverter
+import operator
+from pyspark.accumulators import PStatsParam
+from pyspark.rdd import PipelinedRDD
+from pyspark.serializers import CloudPickleSerializer, NoOpSerializer, AutoBatchedSerializer
+from pyspark import RDD, PickleSerializer, StorageLevel, SparkContext
+from pyspark.traceback_utils import SCCallSiteSync
+
+
+__all__ = ["VertexRDD", "VertexId"]
+
+
+"""
+The default type of vertex id is Long
+A separate VertexId type is defined
+here so that other types can be used
+for vertex ids in future
+"""
+VertexId = Long
+
+
+class VertexRDD(object):
+    """
+    VertexRDD class defines the vertex actions and transformations. The complete list of
+    transformations and actions for vertices are available at
+    `http://spark.apache.org/docs/latest/graphx-programming-guide.html`
+    These operations are mapped to Scala functions defined
+    in `org.apache.spark.graphx.impl.VertexRDDImpl`
+    """
+
+    def __init__(self, jrdd, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer())):
+        """
+        Constructor
+        :param jrdd: A JavaRDD reference passed from the parent
+                     RDD object
+        :param jrdd_deserializer: The deserializer used in Python workers
+                                  created from PythonRDD to execute a
+                                  serialized Python function and RDD
+        """
+        self.name = "VertexRDD"
+        self.jrdd = jrdd
+        self.is_cached = False
+        self.is_checkpointed = False
+        self.ctx = SparkContext._active_spark_context
+        self.jvertex_rdd_deserializer = jrdd_deserializer
+        self.id = jrdd.id()
+        self.partitionFunc = None
+        self.bypass_serializer = False
+        self.preserve_partitioning = False
+
+        self.jvertex_rdd = self.getJavaVertexRDD(jrdd, jrdd_deserializer)
+
+    def __repr__(self):
+        return self.jvertex_rdd.toString()
+
+    def cache(self):
+        """
+        Persist this vertex RDD with the default storage level (C{MEMORY_ONLY_SER}).
+        """
+        self.is_cached = True
+        self.persist(StorageLevel.MEMORY_ONLY_SER)
+        return self
+
+    def checkpoint(self):
+        self.is_checkpointed = True
+        self.jvertex_rdd.checkpoint()
+
+    def count(self):
+        return self.jvertex_rdd.count()
+
+    def diff(self, other, numPartitions=2):
+        """
+        Hides vertices that are the same between `this` and `other`.
+        For vertices that are different, keeps the values from `other`.
+
+        TODO: give an example
+        """
+        if (isinstance(other, RDD)):
+            vs = self.map(lambda (k, v): (k, (1, v)))
+            ws = other.map(lambda (k, v): (k, (2, v)))
+            return vs.union(ws).groupByKey(numPartitions).mapValues(lambda x: x.diff(x.__iter__()))
+
+    def isCheckpointed(self):
+        """
+        Return whether this RDD has been checkpointed or not
+        """
+        return self.is_checkpointed
+
+    def mapValues(self, f, preserves_partitioning=False):
+        """
+        Return a new vertex RDD by applying a function to each vertex attributes,
+        preserving the index
+
+        >>> rdd = sc.parallelize([(1, "b"), (2, "a"), (3, "c")])
+        >>> vertices = VertexRDD(rdd)
+
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23556995 --- Diff: streaming/src/test/java/org/apache/spark/streaming/LocalJavaStreamingContext.java --- @@ -24,7 +24,7 @@ public abstract class LocalJavaStreamingContext { -protected transient JavaStreamingContext ssc; +protected transient JavaStreamingContext ssc; --- End diff -- This is a whitespace-only change; mind reverting? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23556996
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
[quotes the same DataFrame.scala hunk reproduced in full above; the archived message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23557011
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala ---
[quotes the same DataFrame.scala hunk reproduced in full above; the archived message is truncated before the review comment]
[GitHub] spark pull request: [GRAPHX] Spark 3789 - Python Bindings for Grap...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4205#discussion_r23557051
--- Diff: examples/src/main/python/graphx/simpleGraph.py ---
@@ -0,0 +1,48 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Correlations using MLlib.
--- End diff --
This looks like the wrong description.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4337. [YARN] Add ability to cancel pendi...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/4141#discussion_r23557845
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -192,15 +186,32 @@ private[yarn] class YarnAllocator(
   }
 
   /**
-   * Request numExecutors additional containers from YARN. Visible for testing.
+   * Update the set of container requests that we will sync with the RM based on the number of
+   * executors we have currently running and our target number of executors.
+   *
+   * Visible for testing.
    */
-  def addResourceRequests(numExecutors: Int): Unit = {
-    for (i <- 0 until numExecutors) {
-      val request = new ContainerRequest(resource, null, null, RM_REQUEST_PRIORITY)
-      amClient.addContainerRequest(request)
-      val nodes = request.getNodes
-      val hostStr = if (nodes == null || nodes.isEmpty) "Any" else nodes.last
-      logInfo("Container request (host: %s, capability: %s)".format(hostStr, resource))
+  def updateResourceRequests(): Unit = {
--- End diff --
how about making it `private[yarn]`? will still be visible in tests
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
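For reference, a minimal sketch of the suggested visibility, assuming the method lives in `org.apache.spark.deploy.yarn` (the class name and body below are illustrative, not the PR's actual code): `private[yarn]` hides the member from callers outside that package, while a test suite declared in the same package keeps access.

```scala
package org.apache.spark.deploy.yarn

class YarnAllocatorVisibilitySketch {
  // Package-private: invisible outside org.apache.spark.deploy.yarn, but
  // still callable from tests that live in the same package.
  private[yarn] def updateResourceRequests(): Unit = {
    // ... sync container requests with the RM ...
  }
}
```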
[GitHub] spark pull request: [SPARK-5119] java.lang.ArrayIndexOutOfBoundsEx...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3975#issuecomment-71523305 @Lewuathe Sounds good. Thanks for going through this discussion! Looking at the latest updates, it looks like the 2 Impurity tests are very similar. Would you be able to put them in one ImpuritySuite file and combine the tests to shorten the code (e.g., by using a type parameter for the Impurity type)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
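A sketch of the consolidation being suggested, with the shared assertion phrased against the `Impurity` trait (passing the impurity as a value parameter achieves the same sharing as a type parameter here); the test data and suite layout are illustrative, not taken from the PR:

```scala
import org.scalatest.FunSuite
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}

class ImpuritySuite extends FunSuite {
  // One shared assertion body, registered once per impurity implementation.
  private def registerTests(name: String, impurity: Impurity): Unit = {
    test(s"$name: empty counts yield zero impurity") {
      assert(impurity.calculate(Array(0.0, 0.0), 0.0) === 0.0)
    }
  }

  Seq("Gini" -> Gini, "Entropy" -> Entropy).foreach { case (name, impurity) =>
    registerTests(name, impurity)
  }
}
```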
[GitHub] spark pull request: SPARK-4337. [YARN] Add ability to cancel pendi...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/4141#issuecomment-71523335 just a minor comment, otherwise lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5097][WIP] DataFrame as the common abst...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71529489 [Test build #26104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26104/consoleFull) for PR 4173 at commit [`a47e189`](https://github.com/apache/spark/commit/a47e189e5b742444a6371da74a5bf1c94a9bc278). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71530343 @vanzin that's pretty much what I went with. The actor receives each message, and commit-permission requests are farmed off to a thread pool. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
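A rough sketch of that pattern with Akka (which Spark's internal RPC used at the time); every name below is hypothetical rather than the PR's actual code. The actor captures `sender()` on its own thread, then lets a dedicated pool compute the decision and pipe it back.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import akka.actor.Actor
import akka.pattern.pipe

case class AskCommitPermission(stage: Int, partition: Int, attempt: Long)

class CommitCoordinatorSketch extends Actor {
  // Dedicated pool so slow permission checks never block the actor's mailbox.
  private implicit val pool: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

  def receive = {
    case req: AskCommitPermission =>
      val requester = sender() // capture before going asynchronous
      Future(canCommit(req)) pipeTo requester
  }

  private def canCommit(req: AskCommitPermission): Boolean = {
    // ... a real implementation would consult committer state here ...
    true
  }
}
```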
[GitHub] spark pull request: [SPARK-4955]With executor dynamic scaling enab...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/3962#discussion_r23550729
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -231,6 +231,25 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
     reporterThread = launchReporterThread()
   }
 
+  /**
+   * Creates the actor through which the ApplicationMaster communicates with the driver's
+   * YarnScheduler in YARN deploy mode.
+   *
+   * If isDriver is set to true, the AMActor and the driver belong to the same process,
+   * so the AMActor does not monitor the driver's lifecycle.
+   */
+  private def runAMActor(
+      host: String,
+      port: String,
+      isDriver: Boolean): Unit = {
--- End diff --
can you rename `isDriver` to `isClusterMode`? It might be apparent to us that these two mean the same thing but for newcomers it's not obvious at all.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4585. Spark dynamic executor allocation ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4051#issuecomment-71508342
A few comments:
- I don't have a strong opinion on new option vs. not. I think eventually what we really want is to always have dynamic allocation on, so that discussion would eventually become moot.
- min = 0 makes sense; this is particularly interesting for things like spark-shell, or Hive sessions. As long as ramp up is reasonably quick when needed, it should be fine. (0 vs. 1 probably won't make a big difference for large jobs anyway.)
- I'm not so sure about `Integer.MAX_VALUE`. I get the point of letting the resource manager handle it, but perhaps we should be nicer here? e.g., set max to number of NMs in the cluster if that info is available client-side?
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
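For context, the bounds being debated are ordinary configuration knobs; a sketch of how they look on a `SparkConf`, with illustrative values only (a min of 0 and a finite max rather than `Integer.MAX_VALUE`):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "0")   // idle shells hold no executors
  .set("spark.dynamicAllocation.maxExecutors", "100") // a cap instead of Integer.MAX_VALUE
```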
[GitHub] spark pull request: SPARK-5408: Use -XX:MaxPermSize specified by u...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4202#issuecomment-71511716 @vanzin I agree, I'd imagine this actually works as expected here since the last one wins. This still feels like a small good thing just so anyone looking at the command line is less likely to be confused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4894][mllib] Added Bernoulli option to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4087#issuecomment-71511624 [Test build #26099 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26099/consoleFull) for PR 4087 at commit [`76e5b0f`](https://github.com/apache/spark/commit/76e5b0f90e370e2cda20e1348bf40ff890f51782). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71512674 I figured out the magic combination to make sbt, ScalaTest, JUnit, and the sbt-pom-reader all play nicely together. I had to introduce a new config (or scope or something -- sbt terminology still baffles me) instead of creating a new task. Now you can run `unit:test` (or any other variant you like, e.g. `~unit:testQuick`), which will exclude everything tagged as an IntegrationTest. There is both a tag for ScalaTest suites and a category for JUnit tests. I've still only bothered with the tagging in core, but I think this can be merged as-is in any case -- it does the setup so someone more familiar with the other projects can figure out what to tag. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
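A condensed sketch of the sbt arrangement described above; the tag and category names are hypothetical stand-ins for whatever the PR actually uses. A custom `unit` config reuses the `Test` sources but filters out integration-tagged suites, so `unit:test` skips them:

```scala
// In the sbt build definition (sbt 0.13 syntax).
lazy val UnitTests = config("unit") extend Test

lazy val core = project
  .configs(UnitTests)
  .settings(inConfig(UnitTests)(Defaults.testTasks): _*)
  .settings(
    // ScalaTest: exclude suites tagged with the integration-test tag.
    testOptions in UnitTests += Tests.Argument(TestFrameworks.ScalaTest,
      "-l", "org.apache.spark.IntegrationTest"),
    // JUnit via junit-interface: exclude the matching category.
    testOptions in UnitTests += Tests.Argument(TestFrameworks.JUnit,
      "--exclude-categories=org.apache.spark.IntegrationTest")
  )
```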
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user gzm0 commented on a diff in the pull request: https://github.com/apache/spark/pull/4130#discussion_r23546974
--- Diff: repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala ---
@@ -91,7 +91,14 @@ class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader
         inputStream.close()
         Some(defineClass(name, bytes, 0, bytes.length))
       } catch {
-        case e: Exception => None
+        case e: FileNotFoundException =>
+          // We did not find the class
+          logDebug(s"Did not load class $name from REPL class server at $uri", e)
--- End diff --
It is actually a remote URI: Have a look at the [usage site][1]. It can either point to an HTTP server or an HDFS. [1]: https://github.com/gzm0/spark/blob/4fa0582e90ebcf3cbc19971e26c8cda83d5613e1/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L47
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user gzm0 commented on the pull request: https://github.com/apache/spark/pull/4130#issuecomment-71498788 @marmbrus could you have a look at this as well, please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Fix command spaces issue in make-distribution....
Github user dyross commented on the pull request: https://github.com/apache/spark/pull/4126#issuecomment-71509353 Any thoughts on this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71511628 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26100/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71511623 [Test build #26100 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26100/consoleFull) for PR 4048 at commit [`2d6b733`](https://github.com/apache/spark/commit/2d6b733acce37802ac416ef14400b61f116d3aa3). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-4746 make it easy to skip IntegrationTes...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4048#issuecomment-71511610 [Test build #26100 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26100/consoleFull) for PR 4048 at commit [`2d6b733`](https://github.com/apache/spark/commit/2d6b733acce37802ac416ef14400b61f116d3aa3). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5355] use j.u.c.ConcurrentHashMap inste...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4208#discussion_r23553383
--- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
@@ -47,12 +48,12 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
 
   /** Create a SparkConf that loads defaults from system properties and the classpath */
   def this() = this(true)
 
-  private[spark] val settings = new TrieMap[String, String]()
+  private[spark] val settings = new ConcurrentHashMap[String, String]()
--- End diff --
while you are at it, define the return type explicitly here (java.util.Map?) since it is a public type
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
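What the suggestion amounts to, as a sketch (the exact interface type chosen in the PR may differ): pin the member to an explicit, stable type instead of letting Scala infer `ConcurrentHashMap`.

```scala
import java.util.concurrent.ConcurrentHashMap

// Callers see only java.util.Map; the concrete ConcurrentHashMap stays an
// implementation detail that can change without affecting the public type.
private[spark] val settings: java.util.Map[String, String] =
  new ConcurrentHashMap[String, String]()
```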