[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1338#issuecomment-50580086 QA results for PR 1338: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2260] Fix standalone-cluster mode, whic...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1538#issuecomment-50579956 LGTM - thanks, Andrew!
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-50579747 QA tests have started for PR 1498. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17428/consoleFull
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
Github user naftaliharris commented on a diff in the pull request: https://github.com/apache/spark/pull/1493#discussion_r15568446

--- Diff: python/pyspark/mllib/classification.py ---
```
@@ -63,7 +63,10 @@ class LogisticRegressionModel(LinearModel):
     def predict(self, x):
         _linear_predictor_typecheck(x, self._coeff)
         margin = _dot(x, self._coeff) + self._intercept
-        prob = 1 / (1 + exp(-margin))
+        if margin > 0:
+            prob = 1 / (1 + exp(-margin))
+        else:
+            prob = 1 - 1 / (1 + exp(margin))
```
--- End diff --

Sure, pull request here! #1652 Thanks a lot! :-)
[GitHub] spark pull request: Avoid numerical instability
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1652#issuecomment-50579636 Can one of the admins verify this patch?
[GitHub] spark pull request: Add caching information to rdd.toDebugString
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-50579659 Hey @nkronenfeld - I traced through the exact function call more closely and I actually think it's fine. The issue I pointed out in the JIRA is orthogonal. So I'm fine to just revert this back to always showing the status. However, we should not mark this as a developer API. This is a stable API we are happy to support forever. Still, this will cause a significant amount of object allocation due to the way other internal function calls happen (it is basically O(all blocks) for an application). It might be nice to add a note to the docs that the operation might be expensive and should not be called inside of a critical code path. That said, we could likely optimize those things down the road.
[GitHub] spark pull request: Avoid numerical instability
GitHub user naftaliharris opened a pull request: https://github.com/apache/spark/pull/1652

Avoid numerical instability

This avoids basically computing 1 - 1, for example:

```python
>>> from math import exp
>>> margin = -40
>>> 1 - 1 / (1 + exp(margin))
0.0
>>> exp(margin) / (1 + exp(margin))
4.248354255291589e-18
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/naftaliharris/spark patch-2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1652.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1652

commit 0d55a9fae74edf990a087463a52b81ef196862a2
Author: Naftali Harris
Date: 2014-07-30T06:46:30Z

Avoid numerical instability
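The two branches from the patch can be packaged as a single numerically stable sigmoid helper. A minimal sketch follows; the `stable_sigmoid` name is ours, not from the PR:

```python
from math import exp

def stable_sigmoid(margin):
    """Compute 1 / (1 + exp(-margin)) without overflow or cancellation.

    Branching on the sign of the margin keeps the argument to exp()
    non-positive, so exp() never overflows, and avoids the complement
    form 1 - 1/(1 + exp(margin)) that cancels to 0.0 for very negative
    margins.
    """
    if margin > 0:
        return 1 / (1 + exp(-margin))
    # margin <= 0, so z = exp(margin) <= 1 and cannot overflow
    z = exp(margin)
    return z / (1 + z)
```

The `z / (1 + z)` form is algebraically identical to `1 / (1 + exp(-margin))` but preserves tiny probabilities such as the ~4.25e-18 in the example above instead of rounding them to zero.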
[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1650#issuecomment-50579517 QA results for PR 1650: This patch FAILED unit tests. This patch merges cleanly. This patch adds the following public classes (experimental): `case class DropTable(tableName: String) extends LeafNode with Command`. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17420/consoleFull
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50579245 QA results for PR 929: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17422/consoleFull
[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1551#discussion_r15568273

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
```
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
             throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
           }
         case other =>
-          throw new SparkException("Unexpected element type " + first.getClass)
+          if (other == null) {
+            dataOut.writeInt(SpecialLengths.NULL)
+            writeIteratorToStream(iter, dataOut)
```
--- End diff --

If users want to call a UDF in Java/Scala from PySpark, they have to use this private API to do it, so it's possible to have null in an RDD[String] or RDD[Array[Byte]]. BTW, it would be helpful if we could skip some BAD rows during map/reduce, as mentioned in the MapReduce paper. This is not a MUST-have feature, but it really improves the robustness of the whole framework, and is very useful for large-scale jobs. This PR tries to improve the stability of PySpark, letting users feel safer and happier hacking in PySpark.
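The idea behind a `SpecialLengths.NULL` marker - reserving a sentinel in a length-prefixed framing protocol to stand for null - can be sketched in miniature. This is an illustration only; the sentinel value and helper names below are invented and are not PySpark's actual wire format:

```python
import struct
from io import BytesIO

NULL = -5  # hypothetical sentinel: a negative "length" no real record can have

def write_records(records, out):
    """Length-prefixed framing: each record is a 4-byte big-endian length
    followed by that many bytes; None is encoded as the sentinel alone."""
    for rec in records:
        if rec is None:
            out.write(struct.pack(">i", NULL))
        else:
            out.write(struct.pack(">i", len(rec)))
            out.write(rec)

def read_records(inp):
    """Inverse of write_records: decode until the stream is exhausted,
    mapping the sentinel back to None."""
    records = []
    while True:
        header = inp.read(4)
        if len(header) < 4:
            break
        (n,) = struct.unpack(">i", header)
        records.append(None if n == NULL else inp.read(n))
    return records
```

Because the sentinel occupies the length slot itself, null records cost four bytes on the wire and the reader needs no look-ahead to distinguish them from empty records (length 0).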
[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/1434#discussion_r15568250

--- Diff: extras/spark-kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCount.scala ---
```
@@ -0,0 +1,345 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.streaming
+
+import java.nio.ByteBuffer
+import org.apache.log4j.Level
+import org.apache.log4j.Logger
+import org.apache.spark.Logging
+import org.apache.spark.SparkConf
+import org.apache.spark.SparkContext._
+import org.apache.spark.streaming.Milliseconds
+import org.apache.spark.streaming.StreamingContext
+import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer
+import org.apache.spark.streaming.kinesis.KinesisUtils
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
+import com.amazonaws.services.kinesis.AmazonKinesisClient
+import com.amazonaws.services.kinesis.model.PutRecordRequest
+import scala.util.Random
+import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming.dstream.ReceiverInputDStream
+import org.apache.spark.streaming.dstream.DStream
+
+/**
+ * Kinesis Spark Streaming WordCount example.
+ *
+ * See http://spark.apache.org/docs/latest/streaming-programming-guide.html for more details on the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per shard of the given stream.
+ * It then starts pulling from the tip of the given and at the given .
+ * Because we're pulling from the tip (InitialPositionInStream.LATEST), only new stream data will be picked up after the KinesisReceiver starts.
+ * This could lead to missed records if data is added to the stream while no KinesisReceivers are running.
+ * In production, you'll want to switch to InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis limit) of previous stream data
+ * depending on the checkpoint frequency.
+ *
+ * InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records depending on the checkpoint frequency.
+ * Record processing should be idempotent when possible.
+ *
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for credentials in the following order of precedence:
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location (~/.aws/credentials) shared by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 metadata service
+ *
+ * Usage: KinesisWordCount
+ *   is the name of the Kinesis stream (ie. mySparkStream)
+ *   is the endpoint of the Kinesis service (ie. https://kinesis.us-east-1.amazonaws.com)
+ *   is the batch interval in millis (ie. 1000ms)
+ *
+ * Example:
+ * $ export AWS_ACCESS_KEY_ID=
+ * $ export AWS_SECRET_KEY=
+ * $ bin/run-kinesis-example \
+ *     org.apache.spark.examples.streaming.KinesisWordCount mySparkStream https://kinesis.us-east-1.amazonaws.com 100
+ *
+ * There is a companion helper class below called KinesisWordCountProducer which puts dummy data onto the Kinesis stream.
+ * Usage instructions for KinesisWordCountProducer are provided in that class definition.
+ */
+object KinesisWordCount extends Logging {
+  val WordSeparator = " "
+
+  def main(args: Array[String]) {
+    /**
+     * Check that all required args were passed in.
+     */
+    if (args.length < 3) {
+      System.err.println("Usage: KinesisWordCount ")
+      System.exit(1)
+    }
+
+    /**
+     * (This was lifted from the StreamingExamples.scala in order to avoid the depend
```
[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/1650#discussion_r15568205

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
```
@@ -377,6 +378,10 @@ private[hive] object HiveQl {
   }

   protected def nodeToPlan(node: Node): LogicalPlan = node match {
+    // Special drop table that also uncaches.
+    case Token("TOK_DROPTABLE",
+           Token("TOK_TABNAME",
+             Token(tableName, Nil) :: Nil) :: Nil) => DropTable(tableName)
```
--- End diff --

Seems we also need to support referring to a table in the `dbName.tableName` format, as well as `IF EXISTS`. An example AST: the AST tree of `drop table if exists default.src` is
```
TOK_DROPTABLE
  TOK_TABNAME
    default
    src
  TOK_IFEXISTS
```
[GitHub] spark pull request: [SPARK-2054][SQL] Code Generation for Expressi...
Github user ueshin commented on the pull request: https://github.com/apache/spark/pull/993#issuecomment-50578802 Hi @marmbrus, thanks for the great work! But it seems to break the build. I got the following result when I run `sbt assembly` or `sbt publish-local`:
```
[error] (catalyst/compile:doc) Scaladoc generation failed
```
and I found a lot of error messages in the build log saying `value q is not a member of StringContext`.
[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1651#issuecomment-50578510 QA results for PR 1651: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17421/consoleFull
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50577874 Sure, added the notes.
[GitHub] spark pull request: [SQL] Handle null values in debug()
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1646
[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1562
[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1635#issuecomment-50577707 QA tests have started for PR 1635. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17427/consoleFull
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50577608 Yeah, I'm hoping to merge #1346 as soon as it passes Jenkins, so I'd wait for that. > I also thought I'd add a couple of notes on what I had in mind with this patch: ... Can you add these notes to the PR description so that they get included in the commit message?
[GitHub] spark pull request: SPARK-2632, SPARK-2576. Fixed by only importin...
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/1635#issuecomment-50577431 Jenkins retest this please.
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1648#issuecomment-50577383 QA tests have started for PR 1648. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17425/consoleFull
[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/1434#discussion_r15567741

--- Diff: extras/spark-kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCount.java ---
```
@@ -0,0 +1,310 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.streaming;
+
+import java.util.List;
+import java.util.regex.Pattern;
+
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+import org.apache.spark.streaming.Duration;
+import org.apache.spark.streaming.Milliseconds;
+import org.apache.spark.streaming.api.java.JavaDStream;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+import org.apache.spark.streaming.api.java.JavaStreamingContext;
+import org.apache.spark.streaming.dstream.DStream;
+import org.apache.spark.streaming.kinesis.KinesisRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisStringRecordSerializer;
+import org.apache.spark.streaming.kinesis.KinesisUtils;
+
+import scala.Tuple2;
+
+import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
+import com.amazonaws.services.kinesis.AmazonKinesisClient;
+import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
+import com.google.common.base.Optional;
+import com.google.common.collect.Lists;
+
+/**
+ * Java-friendly Kinesis Spark Streaming WordCount example
+ *
+ * See http://spark.apache.org/docs/latest/streaming-programming-guide.html for more details on the Kinesis Spark Streaming integration.
+ *
+ * This example spins up 1 Kinesis Worker (Spark Streaming Receivers) per shard of the given stream.
+ * It then starts pulling from the tip of the given and at the given .
+ * Because we're pulling from the tip (InitialPositionInStream.LATEST), only new stream data will be picked up after the KinesisReceiver starts.
+ * This could lead to missed records if data is added to the stream while no KinesisReceivers are running.
+ * In production, you'll want to switch to InitialPositionInStream.TRIM_HORIZON which will read up to 24 hours (Kinesis limit) of previous stream data
+ * depending on the checkpoint frequency.
+ * InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records depending on the checkpoint frequency.
+ * Record processing should be idempotent when possible.
+ *
+ * This code uses the DefaultAWSCredentialsProviderChain and searches for credentials in the following order of precedence:
+ * Environment Variables - AWS_ACCESS_KEY_ID and AWS_SECRET_KEY
+ * Java System Properties - aws.accessKeyId and aws.secretKey
+ * Credential profiles file - default location (~/.aws/credentials) shared by all AWS SDKs
+ * Instance profile credentials - delivered through the Amazon EC2 metadata service
+ *
+ * Usage: JavaKinesisWordCount
+ *   is the name of the Kinesis stream (ie. mySparkStream)
+ *   is the endpoint of the Kinesis service (ie. https://kinesis.us-east-1.amazonaws.com)
+ *   is the batch interval in milliseconds (ie. 1000ms)
+ *
+ * Example:
+ * $ export AWS_ACCESS_KEY_ID=
+ * $ export AWS_SECRET_KEY=
+ * $ bin/run-kinesis-example \
+ *     org.apache.spark.examples.streaming.JavaKinesisWordCount mySparkStream https://kinesis.us-east-1.amazonaws.com 1000
+ *
+ * There is a companion helper class called KinesisWordCountProducer which puts dummy data onto the Kinesis stream.
+ * Usage instructions for KinesisWordCountProducer ar
```
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50577408 QA tests have started for PR 1639. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17426/consoleFull
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50577338 Sure, I'm fine with reworking based on other changes (it seems that some merge conflicts have already cropped up in master since I submitted my PR last week). I think my change set is a little simpler than the one you linked to, so would it make sense for me to wait until that one goes in? I also thought I'd add a couple of notes on what I had in mind with this patch:

1) I added a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn't overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code. But let me know if you have any thoughts about this.

2) I moved JavaStackTrace from rdd.py to context.py. This made sense to me since JavaStackTrace is all about configuring a context attribute, and the _extract_concise_traceback function it depends on was already being called separately from context.py (as a "private" function of rdd.py).
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1648#issuecomment-50577278 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1648#issuecomment-50577269 Jenkins, what are you doing ...
[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1338#issuecomment-50577245 QA results for PR 1338: This patch FAILED unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50577126 QA results for PR 1639: This patch PASSES unit tests. This patch merges cleanly. This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17418/consoleFull
[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/1434#discussion_r15567656 --- Diff: extras/spark-kinesis-asl/src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCount.java ---
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1639#discussion_r15567618

--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1239,6 +1239,28 @@ abstract class RDD[T: ClassTag](
   /** The [[org.apache.spark.SparkContext]] that this RDD was created on. */
   def context = sc

+  /**
+   * Private API for changing an RDD's ClassTag.
+   * Used for internal Java <-> Scala API compatibility.
+   */
+  private[spark] def retag(cls: Class[T]): RDD[T] = {
+    val classTag: ClassTag[T] = ClassTag.apply(cls)
+    this.retag(classTag)
+  }
+
+  /**
+   * Private API for changing an RDD's ClassTag.
+   * Used for internal Java <-> Scala API compatibility.
+   */
+  private[spark] def retag(classTag: ClassTag[T]): RDD[T] = {
+    val oldRDD = this
+    new RDD[T](sc, Seq(new OneToOneDependency(this)))(classTag) {
+      override protected def getPartitions: Array[Partition] = oldRDD.getPartitions
+      override def compute(split: Partition, context: TaskContext): Iterator[T] =
+        oldRDD.compute(split, context)
+    }
--- End diff --

Would there be any performance impact of running `mapPartitions(identity, preservesPartitioning = true)(classTag)`? If we have an RDD that's persisted in a serialized format, wouldn't this extra map force an unnecessary deserialization?
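The `retag` implementation above hinges on `ClassTag.apply(cls)` building a tag from a runtime `Class`. A minimal plain-Scala sketch of why that matters for Java interop (no Spark here; `retagged` is a hypothetical stand-in for `RDD.retag`): the `ClassTag` in scope controls what array type collection operations allocate, which is exactly what Java callers observe when they `collect()`.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-in for RDD.retag(cls: Class[T]): rebuild the element
// ClassTag from a runtime Class, as the diff above does with ClassTag.apply.
def retagged[T](elems: Seq[T], cls: Class[T]): Array[T] = {
  implicit val tag: ClassTag[T] = ClassTag(cls) // same as ClassTag.apply(cls)
  elems.toArray                                 // allocates a T[], not an Object[]
}

val arr = retagged(Seq("a", "b", "c"), classOf[String])
println(arr.getClass.getComponentType.getName) // java.lang.String
```

Without the retag, an RDD created through the Java API carries `ClassTag[AnyRef]`, so the equivalent `toArray` would yield an `Object[]` that fails to cast on the Java side.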
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50576705 Sure, sounds good. Did you see my comments on preserving partitions too though?
[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1338#issuecomment-50576604 QA tests have started for PR 1338. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50576414 My last commit made `classTag` implicit in the retag() method, so in many cases the Scala code can be written as `someJavaRDD.rdd.retag.[...].collect()`.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50576339 QA tests have started for PR 1346. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17423/consoleFull
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50576330 This method is intended to be called by Scala classes that implement Java-friendly wrappers for the Spark Scala API. For instance, MLlib has APIs that accept RDD[LabeledPoint]. Ideally, the Java wrapper code can simply call the underlying Scala methods without having to worry about how they're implemented. Therefore, I think we should prefer the `retag()`-based approach, since `collectSeq` would require us to modify the Scala consumer of the RDD. Since this is a private, internal API, we should be able to revisit this decision if we change our minds later.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50576308 @chenghao-intel `containsNull` and `valueContainsNull` can be used for further optimization. For example, let's say we have an `ArrayType` column and the element type is `IntegerType`. If elements of those arrays do not have `null` values, we can use a primitive array internally. Since we will expose data types to users, we need to introduce these two booleans with this PR. It can be hard to add them once users start to use these APIs.
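The optimization yhuai describes can be sketched in plain Scala (the column types below are hypothetical illustrations, not Spark SQL's internal classes): when a column's elements are declared null-free via `containsNull = false`, a primitive `Array[Int]` can back the column instead of an array of boxed, nullable `Integer`s.

```scala
// Hypothetical column representations, not Spark internals.
sealed trait IntArrayColumn
case class BoxedInts(values: Array[java.lang.Integer]) extends IntArrayColumn // elements may be null
case class PrimitiveInts(values: Array[Int]) extends IntArrayColumn           // containsNull = false

// With containsNull = false we can commit to the compact, unboxed layout up front.
def makeColumn(values: Seq[Option[Int]], containsNull: Boolean): IntArrayColumn =
  if (!containsNull) PrimitiveInts(values.map(_.get).toArray)
  else BoxedInts(values.map(_.map(Int.box).orNull).toArray)

val dense = makeColumn(Seq(Some(1), Some(2), Some(3)), containsNull = false)
// dense is a PrimitiveInts backed by an unboxed int[] -- no per-element boxing
```

This is also why the flag is hard to retrofit: once users construct an `ArrayType` without it, every consumer must conservatively assume the boxed, nullable layout.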
[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-50576177 Thanks for working on this! We'll need to coordinate merging with #1346 and related PRs. (cc @yhuai) @JoshRosen can you look at the other pyspark changes?
[GitHub] spark pull request: [SPARK-1981] Add AWS Kinesis streaming support
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/1434#discussion_r15567310 --- Diff: pom.xml --- @@ -970,6 +972,14 @@ + + + spark-kinesis-asl --- End diff -- Profile can be named `kinesis-asl`
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50576082 QA tests have started for PR 929. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17422/consoleFull
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50575931 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50575901 We pinged Davies today. It seems to be a well-known problem with Python. There are ways to force import a standard module in Python 2, but they are all very messy:
1. https://www.inkling.com/read/learning-python-mark-lutz-4th/chapter-23/package-relative-imports
2. https://hkn.eecs.berkeley.edu/~dyoo/python/__std__/
3. http://www.ianbicking.org/py-std.html
[GitHub] spark pull request: [SPARK-1630] Turn Null of Java/Scala into None...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1551#discussion_r15567208

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---
@@ -344,7 +345,12 @@ private[spark] object PythonRDD extends Logging {
         throw new SparkException("Unexpected Tuple2 element type " + pair._1.getClass)
       }
     case other =>
-      throw new SparkException("Unexpected element type " + first.getClass)
+      if (other == null) {
+        dataOut.writeInt(SpecialLengths.NULL)
+        writeIteratorToStream(iter, dataOut)
--- End diff --

Right, but that's a private API, it doesn't matter. Does our own code do it? Basically I'm worried that this significantly complicates our code for something that shouldn't happen. I'd rather have an NPE if our own code later passes nulls here (cause it really shouldn't be doing that since we control everything we pass in).
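The framing being debated above can be modeled in a few lines (a standalone sketch; `NULL_MARKER` is a made-up stand-in for `SpecialLengths.NULL`, and this is not PythonRDD's actual code): each element goes out as a length prefix followed by its bytes, with a negative sentinel in the length slot standing in for a null element.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

val NULL_MARKER = -5 // hypothetical stand-in for SpecialLengths.NULL

// Length-prefixed framing with a sentinel length for null elements.
def writeElements(elems: Seq[Array[Byte]], out: DataOutputStream): Unit =
  elems.foreach {
    case null  => out.writeInt(NULL_MARKER)
    case bytes => out.writeInt(bytes.length); out.write(bytes)
  }

val buf = new ByteArrayOutputStream()
val out = new DataOutputStream(buf)
writeElements(Seq("hi".getBytes("UTF-8"), null, "spark".getBytes("UTF-8")), out)
out.flush()
// 4 + 2 bytes for "hi", 4 for the null marker, 4 + 5 for "spark" = 19 bytes
println(buf.size()) // 19
```

The trade-off in the thread is exactly this extra `case null` branch: the reader must also learn the sentinel, versus letting a null fail fast with an NPE at the write site.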
[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1651#issuecomment-50575608 QA tests have started for PR 1651. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17421/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-50575597 That could be a bug, not something due to the timeout. It complains of the test taking 20s. Can you look at what that test was doing? As far as I can tell it's not even running in cluster mode.
[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50575485 If you can't figure out whether this is possible, consider pinging Josh or Davies too. I'd be surprised if there's no way around this because there are a *lot* of top-level packages in Python. There's gotta be a way to import our own vs importing theirs.
[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1645#issuecomment-50575430

QA results for PR 1645:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17416/consoleFull
[GitHub] spark pull request: [SPARK-2702][Core] Upgrade Tachyon dependency ...
GitHub user haoyuan opened a pull request: https://github.com/apache/spark/pull/1651

[SPARK-2702][Core] Upgrade Tachyon dependency to 0.5.0

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/haoyuan/spark upgrade-tachyon

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1651.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1651

commit 6f3f98fb08a9be6b582672bb0bb7756b98ae6193
Author: Haoyuan Li
Date: 2014-07-30T05:33:58Z

    upgrade tachyon to 0.5.0
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-50575398 I did a pass through this -- looks pretty good.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567109

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@

 package org.apache.spark.scheduler

-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer

-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

Ah dunno, I just thought we were doing that, but I guess it's not in https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Imports. No need to do it then.
[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1650#issuecomment-50575367 QA tests have started for PR 1650. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17420/consoleFull
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567087

--- Diff: core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala ---
@@ -89,7 +91,7 @@ class ContextCleanerSuite extends FunSuite with BeforeAndAfter with LocalSparkCo
   }

   test("automatically cleanup RDD") {
-    var rdd = newRDD.persist()
+    var rdd = newRDD().persist()
--- End diff --

Does assertCleanup below also check that this RDD's broadcast was cleaned? It seems like it doesn't, since you only pass in the RDD ID. Maybe we can also grab its broadcast ID somehow.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567061

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -57,6 +57,12 @@ private[spark] object Utils extends Logging {
     new File(sparkHome + File.separator + "bin", which + suffix)
   }

+  /** Serialize an object using the closure serializer. */
+  def serializeTaskClosure[T: ClassTag](o: T): Array[Byte] = {
+    val ser = SparkEnv.get.closureSerializer.newInstance()
--- End diff --

that's a good idea actually.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567051

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -691,47 +689,81 @@ class DAGScheduler(
     }
   }

-  /** Called when stage's parents are available and we can now do its task. */
   private def submitMissingTasks(stage: Stage, jobId: Int) {
     logDebug("submitMissingTasks(" + stage + ")")
     // Get our pending tasks and remember them in our pendingTasks entry
     stage.pendingTasks.clear()
     var tasks = ArrayBuffer[Task[_]]()
+
+    val properties = if (jobIdToActiveJob.contains(jobId)) {
+      jobIdToActiveJob(stage.jobId).properties
+    } else {
+      // this stage will be assigned to "default" pool
+      null
+    }
+
+    runningStages += stage
+    // SparkListenerStageSubmitted should be posted before testing whether tasks are
+    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
+    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
+    // event.
+    listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
+
+    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
+    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
+    // the serialized copy of the RDD and for each task we will deserialize it, which means each
+    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
+    // might modify state of objects referenced in their closures. This is necessary in Hadoop
+    // where the JobConf/Configuration object is not thread-safe.
+    var taskBinary: Broadcast[Array[Byte]] = null
+    try {
+      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
+      // For ResultTask, serialize and broadcast (rdd, func).
+      val taskBinaryBytes: Array[Byte] =
+        if (stage.isShuffleMap) {
+          Utils.serializeTaskClosure((stage.rdd, stage.shuffleDep.get) : AnyRef)
+        } else {
+          Utils.serializeTaskClosure((stage.rdd, stage.resultOfJob.get.func) : AnyRef)
+        }
+      taskBinary = sc.broadcast(taskBinaryBytes)
+    } catch {
+      // In the case of a failure during serialization, abort the stage.
+      case e: NotSerializableException =>
+        abortStage(stage, "Task not serializable: " + e.toString)
+        runningStages -= stage
+        return
+      case NonFatal(e) =>
+        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
+        runningStages -= stage
+        return
+    }
+
     if (stage.isShuffleMap) {
       for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
         val locs = getPreferredLocs(stage.rdd, p)
-        tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
+        val part = stage.rdd.partitions(p)
+        tasks += new ShuffleMapTask(stage.id, taskBinary, part, locs)
       }
     } else {
       // This is a final stage; figure out its job's missing partitions
       val job = stage.resultOfJob.get
       for (id <- 0 until job.numPartitions if !job.finished(id)) {
-        val partition = job.partitions(id)
-        val locs = getPreferredLocs(stage.rdd, partition)
-        tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
+        val p: Int = job.partitions(id)
+        val part = stage.rdd.partitions(p)
+        val locs = getPreferredLocs(stage.rdd, p)
+        tasks += new ResultTask(stage.id, taskBinary, part, locs, id)
      }
    }

-    val properties = if (jobIdToActiveJob.contains(jobId)) {
-      jobIdToActiveJob(stage.jobId).properties
-    } else {
-      // this stage will be assigned to "default" pool
-      null
-    }
-
     if (tasks.size > 0) {
-      runningStages += stage
-      // SparkListenerStageSubmitted should be posted before testing whether tasks are
-      // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
-      // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
-      // event.
-      listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
-
       // Preemptively serialize a task to make sure it can be serialized. We are catching this
       // exception here because it would be fairly hard to catch the non-serializable exception
      // down the road, where we have several different implementations for local sched
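The isolation property the diff's comment explains, where each task deserializes its own private copy of the broadcast RDD binary, can be modeled with plain JDK serialization (a sketch only; Spark uses its own closure serializer and a real `Broadcast` variable rather than these helpers):

```scala
import java.io._

// Serialize once, the way the scheduler builds taskBinaryBytes.
def serialize(o: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(o)
  oos.close()
  bos.toByteArray
}

// Deserialize per task, yielding an independent object graph each time.
def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject().asInstanceOf[T] finally in.close()
}

val taskBinary = serialize(Array(1, 2, 3)) // "broadcast" the bytes once
val taskA = deserialize[Array[Int]](taskBinary) // each task gets its own copy
val taskB = deserialize[Array[Int]](taskBinary)
taskA(0) = 99        // task A mutates its copy...
println(taskB(0))    // 1 -- ...task B's copy is unaffected
```

This is why the scheme is safe for non-thread-safe state such as a Hadoop `JobConf` referenced from a closure: no two tasks ever share the deserialized instance.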
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567033 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567030

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -57,6 +57,12 @@ private[spark] object Utils extends Logging {
     new File(sparkHome + File.separator + "bin", which + suffix)
   }

+  /** Serialize an object using the closure serializer. */
+  def serializeTaskClosure[T: ClassTag](o: T): Array[Byte] = {
+    val ser = SparkEnv.get.closureSerializer.newInstance()
--- End diff --

We could also save an instance of the closure serializer in DAGScheduler instead, since it executes everything in one thread. Probably not a big deal though but it's something to consider if you're refactoring this code.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567022

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@

 package org.apache.spark.scheduler

-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer

-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

I didn't know about it. Where is it discussed?
[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1643#issuecomment-50575076

QA results for PR 1643:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15567013

--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -691,47 +689,81 @@ class DAGScheduler(
     }
   }

-  /** Called when stage's parents are available and we can now do its task. */
   private def submitMissingTasks(stage: Stage, jobId: Int) {
     logDebug("submitMissingTasks(" + stage + ")")
     // Get our pending tasks and remember them in our pendingTasks entry
     stage.pendingTasks.clear()
     var tasks = ArrayBuffer[Task[_]]()
+
+    val properties = if (jobIdToActiveJob.contains(jobId)) {
+      jobIdToActiveJob(stage.jobId).properties
+    } else {
+      // this stage will be assigned to "default" pool
+      null
+    }
+
+    runningStages += stage
+    // SparkListenerStageSubmitted should be posted before testing whether tasks are
+    // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
+    // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
+    // event.
+    listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
+
+    // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
+    // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
+    // the serialized copy of the RDD and for each task we will deserialize it, which means each
+    // task gets a different copy of the RDD. This provides stronger isolation between tasks that
+    // might modify state of objects referenced in their closures. This is necessary in Hadoop
+    // where the JobConf/Configuration object is not thread-safe.
+    var taskBinary: Broadcast[Array[Byte]] = null
+    try {
+      // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
+      // For ResultTask, serialize and broadcast (rdd, func).
+      val taskBinaryBytes: Array[Byte] =
+        if (stage.isShuffleMap) {
+          Utils.serializeTaskClosure((stage.rdd, stage.shuffleDep.get) : AnyRef)
+        } else {
+          Utils.serializeTaskClosure((stage.rdd, stage.resultOfJob.get.func) : AnyRef)
+        }
+      taskBinary = sc.broadcast(taskBinaryBytes)
+    } catch {
+      // In the case of a failure during serialization, abort the stage.
+      case e: NotSerializableException =>
+        abortStage(stage, "Task not serializable: " + e.toString)
+        runningStages -= stage
+        return
+      case NonFatal(e) =>
+        abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}")
+        runningStages -= stage
+        return
+    }
+
     if (stage.isShuffleMap) {
       for (p <- 0 until stage.numPartitions if stage.outputLocs(p) == Nil) {
         val locs = getPreferredLocs(stage.rdd, p)
-        tasks += new ShuffleMapTask(stage.id, stage.rdd, stage.shuffleDep.get, p, locs)
+        val part = stage.rdd.partitions(p)
+        tasks += new ShuffleMapTask(stage.id, taskBinary, part, locs)
       }
     } else {
       // This is a final stage; figure out its job's missing partitions
       val job = stage.resultOfJob.get
       for (id <- 0 until job.numPartitions if !job.finished(id)) {
-        val partition = job.partitions(id)
-        val locs = getPreferredLocs(stage.rdd, partition)
-        tasks += new ResultTask(stage.id, stage.rdd, job.func, partition, locs, id)
+        val p: Int = job.partitions(id)
+        val part = stage.rdd.partitions(p)
+        val locs = getPreferredLocs(stage.rdd, p)
+        tasks += new ResultTask(stage.id, taskBinary, part, locs, id)
       }
     }
-    val properties = if (jobIdToActiveJob.contains(jobId)) {
-      jobIdToActiveJob(stage.jobId).properties
-    } else {
-      // this stage will be assigned to "default" pool
-      null
-    }
-
     if (tasks.size > 0) {
-      runningStages += stage
-      // SparkListenerStageSubmitted should be posted before testing whether tasks are
-      // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
-      // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
-      // event.
-      listenerBus.post(SparkListenerStageSubmitted(stage.info, properties))
-
       // Preemptively serialize a task to make sure it can be serialized. We are catching this
       // exception here because it would be fairly hard to catch the non-serializable exception
       // down the road, where we have several different implementations for local sch
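The isolation property described in the diff's comment (each task deserializes its own copy of the RDD from the broadcast bytes, so tasks cannot corrupt each other's state) can be illustrated outside Spark with plain pickle. This is a minimal sketch, not Spark code; `make_task_binary` and `run_task` are hypothetical names standing in for the broadcast and task-deserialization steps:

```python
import pickle

def make_task_binary(payload):
    # Serialize once on the "driver"; in Spark this byte array would be broadcast.
    return pickle.dumps(payload)

def run_task(task_binary):
    # Each "task" deserializes its own private copy of the payload,
    # so mutations in one task are invisible to the others.
    conf = pickle.loads(task_binary)
    conf["jobconf"]["touched"] = True
    return conf

binary = make_task_binary({"jobconf": {"touched": False}})
a = run_task(binary)
b = run_task(binary)
# a and b are independent copies; mutating one does not affect the other,
# which is why a non-thread-safe object like a Hadoop JobConf stays safe.
```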
[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1649#issuecomment-50575027 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-2552][MLLIB] stabilize logistic functio...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1493#discussion_r15567017

--- Diff: python/pyspark/mllib/classification.py ---
@@ -63,7 +63,10 @@ class LogisticRegressionModel(LinearModel):
     def predict(self, x):
         _linear_predictor_typecheck(x, self._coeff)
         margin = _dot(x, self._coeff) + self._intercept
-        prob = 1/(1 + exp(-margin))
+        if margin > 0:
+            prob = 1 / (1 + exp(-margin))
+        else:
+            prob = 1 - 1 / (1 + exp(margin))
--- End diff --

Yes, that is definitely better. Could you submit a PR? We don't need a JIRA for small changes. Btw, please cache `exp(margin)` instead of computing it twice.
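The stabilization under discussion avoids overflow: for a large negative margin, `exp(-margin)` overflows a double, while `exp(margin)` merely underflows to 0. A standalone sketch of the idea (with the caching of `exp(margin)` the reviewer asks for; `stable_sigmoid` is an illustrative name, not the pyspark method):

```python
from math import exp

def stable_sigmoid(margin):
    """Numerically stable logistic function 1 / (1 + exp(-margin))."""
    if margin > 0:
        # exp(-margin) <= 1 here, so no overflow is possible.
        return 1 / (1 + exp(-margin))
    # For margin <= 0, compute exp(margin) once (it is <= 1, so it can only
    # underflow to 0.0, never overflow) and reuse the cached value.
    exp_margin = exp(margin)
    return exp_margin / (1 + exp_margin)
```

Note that `exp_margin / (1 + exp_margin)` is algebraically the same as the diff's `1 - 1 / (1 + exp(margin))`, just written to use the cached value directly.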
[GitHub] spark pull request: [SPARK-2734][SQL] Remove tables from cache whe...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/1650

[SPARK-2734][SQL] Remove tables from cache when DROP TABLE is run.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marmbrus/spark dropCached

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1650.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1650

commit c3f535d751c177d2f608e2299b8543d3c72dae5f
Author: Michael Armbrust
Date: 2014-07-30T05:25:19Z

Remove tables from cache when DROP TABLE is run.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15566990

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -17,134 +17,55 @@
 package org.apache.spark.scheduler

-import scala.language.existentials
+import java.nio.ByteBuffer

 import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
-
-import scala.collection.mutable.HashMap

 import org.apache.spark._
-import org.apache.spark.rdd.{RDD, RDDCheckpointData}
-
-private[spark] object ResultTask {
-
-  // A simple map between the stage id to the serialized byte array of a task.
-  // Served as a cache for task serialization because serialization can be
-  // expensive on the master node if it needs to launch thousands of tasks.
-  private val serializedInfoCache = new HashMap[Int, Array[Byte]]
-
-  def serializeInfo(stageId: Int, rdd: RDD[_], func: (TaskContext, Iterator[_]) => _): Array[Byte] =
-  {
-    synchronized {
-      val old = serializedInfoCache.get(stageId).orNull
-      if (old != null) {
-        old
-      } else {
-        val out = new ByteArrayOutputStream
-        val ser = SparkEnv.get.closureSerializer.newInstance()
-        val objOut = ser.serializeStream(new GZIPOutputStream(out))
-        objOut.writeObject(rdd)
-        objOut.writeObject(func)
-        objOut.close()
-        val bytes = out.toByteArray
-        serializedInfoCache.put(stageId, bytes)
-        bytes
-      }
-    }
-  }
-
-  def deserializeInfo(stageId: Int, bytes: Array[Byte]): (RDD[_], (TaskContext, Iterator[_]) => _) =
-  {
-    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
-    val ser = SparkEnv.get.closureSerializer.newInstance()
-    val objIn = ser.deserializeStream(in)
-    val rdd = objIn.readObject().asInstanceOf[RDD[_]]
-    val func = objIn.readObject().asInstanceOf[(TaskContext, Iterator[_]) => _]
-    (rdd, func)
-  }
-
-  def removeStage(stageId: Int) {
-    serializedInfoCache.remove(stageId)
-  }
-
-  def clearCache() {
-    synchronized {
-      serializedInfoCache.clear()
-    }
-  }
-}
-
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.rdd.RDD

 /**
  * A task that sends back the output to the driver application.
  *
- * See [[org.apache.spark.scheduler.Task]] for more information.
+ * See [[Task]] for more information.
  *
  * @param stageId id of the stage this task belongs to
- * @param rdd input to func
- * @param func a function to apply on a partition of the RDD
- * @param _partitionId index of the number in the RDD
+ * @param taskBinary broadcasted version of the serialized RDD and the function to apply on each
+ *   partition of the given RDD.
--- End diff --

Maybe give its type here rather than below to have it all in one place. Same for ShuffleMapTask.
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1498#discussion_r15566973

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -17,134 +17,54 @@
 package org.apache.spark.scheduler

-import scala.language.existentials
-
-import java.io._
-import java.util.zip.{GZIPInputStream, GZIPOutputStream}
+import java.nio.ByteBuffer

-import scala.collection.mutable.HashMap
+import scala.language.existentials
--- End diff --

Didn't we decide to put these language features at the very top of the import list?
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50574926 Ok, I will try it.
[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1338#issuecomment-50574875 QA tests have started for PR 1338. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull
[GitHub] spark pull request: [SPARK-1812] [WIP] Scala 2.11 support
GitHub user avati opened a pull request: https://github.com/apache/spark/pull/1649

[SPARK-1812] [WIP] Scala 2.11 support

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/avati/spark scala-2.11

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1649.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1649

commit f8b5e96fca20c13308cb2a9a6c18049bcdd0a7ba
Author: Anand Avati
Date: 2014-07-30T05:12:30Z

SPARK-1812: use consistent protobuf version across build

Moving to akka-2.3 reveals issues where differing protobuf versions within the same build results in runtime incompatibility and exceptions like:

    java.lang.VerifyError: class akka.remote.WireFormats$AkkaControlMessage overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)

Signed-off-by: Anand Avati

commit c6e0c6ea3b0f5b3316f01f1a743f4de0a6cdf816
Author: Anand Avati
Date: 2014-07-26T07:48:49Z

SPARK-1812: sql/catalyst - upgrade to scala-logging-slf4j 2.1.2

Signed-off-by: Anand Avati

commit 722aee26399b9bf4b725d17f5cfcfad99464af35
Author: Anand Avati
Date: 2014-07-27T02:20:34Z

SPARK-1812: core - upgrade to akka 2.3.4

Signed-off-by: Anand Avati

commit 0d3e37df46c6ce03fe94e10ec26d5bcb96ee86c5
Author: Anand Avati
Date: 2014-07-27T03:39:15Z

SPARK-1812: core - upgrade to json4s 3.2.10

Signed-off-by: Anand Avati

commit 1d2600d41caf6bc47977a5aee9ded7a7aaf85818
Author: Anand Avati
Date: 2014-07-27T03:43:27Z

SPARK-1812: core - upgrade to unofficial chill 1.2 (tv.cntt)

Signed-off-by: Anand Avati

commit 0d44a59234f94246dd414ef136647fc4483a9180
Author: Anand Avati
Date: 2014-07-26T00:19:17Z

SPARK-1812: core - Fix overloaded methods with default arguments

Signed-off-by: Anand Avati

commit 96f449416a13cba033ff17806a9c56e7aea2692f
Author: Anand Avati
Date: 2014-07-26T00:21:39Z

SPARK-1812: Fix @transient annotation errors
- convert to @transient val when unsure
- remove @transient annotation elsewhere

Signed-off-by: Anand Avati

commit d39f8678fd4b2aff96d7d6fe15b2878ef085ce0b
Author: Anand Avati
Date: 2014-07-26T04:06:48Z

SPARK-1812: mllib - upgrade to breeze 0.8.1

Signed-off-by: Anand Avati

commit d7cf8c8c0476d3c6e7adcea434bb87a2aa665bcf
Author: Anand Avati
Date: 2014-07-26T04:16:52Z

SPARK-1812: streaming - Fix overloaded methods with default arguments

Signed-off-by: Anand Avati

commit caef846be53f83498b0603e569404389c792c0b1
Author: Anand Avati
Date: 2014-07-26T00:20:03Z

SPARK-1812: core - [FIXWARNING] try { } without catch { } warning

Signed-off-by: Anand Avati

commit 0f68c91d6001f3db516a2e9d6d936d65122bf997
Author: Anand Avati
Date: 2014-07-26T08:02:49Z

SPARK-1812: temporarily disable
- remove spark-repl from assembly dependency
- temporarily disable external/kafka module (dependency unavailable)
- examples module - kafka_2.11 (transitive) - algebird_2.11 (com.twitter)

Signed-off-by: Anand Avati

commit c4f92b6f9bb798dafae69ad8432292be8c3a58d4
Author: Anand Avati
Date: 2014-07-26T08:15:22Z

SPARK-1812: move to akka-zeromq-2.11.0-M3 version 2.2.0 .. until akka-zeromq-2.11 gets published

Signed-off-by: Anand Avati

commit 7f4d34bf499d2d238925ca777c6f6a78a31904af
Author: Anand Avati
Date: 2014-07-26T23:35:51Z

SPARK-1812: core - [FIXWARNING] replace scala.concurrent.{future,promise} with {Future,Promise}

Signed-off-by: Anand Avati

commit c50ff00c97d3de41576a7d61ee4d214cafc9b089
Author: Anand Avati
Date: 2014-07-26T08:01:32Z

SPARK-1812: update all artifactId to spark-${module}_2.11

Signed-off-by: Anand Avati

commit 8e621813d77266a51a09d4a217352d61b7e1861e
Author: Anand Avati
Date: 2014-07-27T04:33:42Z

SPARK-1812: sql/core - [FIXWARNING] replace scala.concurrent.future with Future

Signed-off-by: Anand Avati
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user miccagiann commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15566894

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4

+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+    val L2 : Int = 0
+    val L1 : Int = 1
+    val NONE : Int = 2
+  }
--- End diff --

Ok! I will do it with strings both in Python and in Scala.
[GitHub] spark pull request: [SPARK-2737] Add retag() method for changing R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1639#issuecomment-50574640 QA tests have started for PR 1639. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17418/consoleFull
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50574626 LGTM. Waiting for Jenkins. Btw, @witgo if you have a big dataset to test, could you try to set the storage level of ratings and user/product in/out links to `MEMORY_AND_DISK_SER` and enable `spark.rdd.compress`? It will save a lot of memory with a little overhead on the speed.
[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1562#discussion_r15566850

--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -103,26 +107,49 @@ class RangePartitioner[K : Ordering : ClassTag, V](
     private var ascending: Boolean = true)
--- End diff --

It'd be great to update the documentation on when this results in two passes vs one pass. We should probably update the documentation for sortByKey and various other sorts that use this too. Let's do that in another PR.
[GitHub] spark pull request: [SPARK-2568] RangePartitioner should run only ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1562#issuecomment-50574526 LGTM. Merging in master.
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15566835

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4

+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+    val L2 : Int = 0
+    val L1 : Int = 1
+    val NONE : Int = 2
+  }
--- End diff --

Using strings with a clear doc should be sufficient. Then you can map the string to `L1Updater` or `SquaredUpdater` inside `PythonMLLibAPI`.
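The string-based dispatch suggested here (map a regularizer-type string to the right updater instead of exposing magic integers) can be sketched as a plain lookup table. This is a hypothetical illustration, not the PythonMLLibAPI code; the string values returned stand in for the MLlib updater instances the reviewer names:

```python
def updater_for(reg_type):
    # Hypothetical dispatch table: in MLlib the values would be instances of
    # L1Updater, SquaredL2Updater, and SimpleUpdater rather than strings.
    updaters = {
        "l1": "L1Updater",
        "l2": "SquaredL2Updater",
        "none": "SimpleUpdater",
    }
    try:
        return updaters[reg_type.lower()]
    except KeyError:
        raise ValueError("unknown regularizer type: %r" % reg_type)
```

The advantage over the integer enumeration is that callers pass a self-documenting value ("l1", "l2", "none"), and an invalid value fails loudly with a clear message instead of silently selecting the wrong updater.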
[GitHub] spark pull request: [SPARK-2724] Python version of RandomRDDGenera...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1628#issuecomment-50574345 @dorx I tried `import pyspark.mllib.random` and it failed. It has to be `from pyspark.mllib import random`. And to use `RandomRDDGenerators`, I need to call `random.RandomRDDGenerators`. Ideally, it should be `from pyspark.mllib.random import RandomRDDGenerators`. If we know how to handle the name `random` now, maybe we can create `random.py` under `mllib` and define class `RandomRDDGenerators` there. If it is not easy to do that because of Python's own `random` package, it should be fine to rename the package to `rand` in both Python and Scala.
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/929#discussion_r15566754

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
         rank, lambda, alpha, YtY)
       previousProducts.unpersist()
       logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
+      if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

It's a good idea. Done.
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user miccagiann commented on the pull request: https://github.com/apache/spark/pull/1624#issuecomment-50574026 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user miccagiann commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15566690

--- Diff: python/pyspark/mllib/regression.py ---
@@ -109,18 +109,35 @@ class LinearRegressionModel(LinearRegressionModelBase):
     True
     """

+class RegularizerType(object):
+    L2 = 0
+    L1 = 1
+    NONE = 2
--- End diff --

The same enumeration is also provided in `regression.py`, purely in Python, so users can use it directly and avoid passing raw integer values to the `regType` parameter of `LinearRegressionWithSGD.train`.
[GitHub] spark pull request: [SPARK-2550][MLLIB][APACHE SPARK] Support regu...
Github user miccagiann commented on a diff in the pull request: https://github.com/apache/spark/pull/1624#discussion_r15566654

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -42,6 +43,16 @@ class PythonMLLibAPI extends Serializable {
   private val DENSE_MATRIX_MAGIC: Byte = 3
   private val LABELED_POINT_MAGIC: Byte = 4

+  /**
+   * Enumeration used to define the type of Regularizer
+   * used for linear methods.
+   */
+  object RegularizerType extends Serializable {
+    val L2 : Int = 0
+    val L1 : Int = 1
+    val NONE : Int = 2
+  }
--- End diff --

I used this kind of enumeration to distinguish between the different update methods (regularizers) the user can choose when building a model from training data. I tried to extend the object from Scala's `Enumeration`, but from what I have seen it uses reflection heavily and does not work well with serialized objects or with py4j...
[GitHub] spark pull request: Example pyspark-inputformat for Avro file form...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1536#issuecomment-50573759 Great! If you don't plan to work on this anytime soon, could you close this PR? Thanks!
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/929#discussion_r15566629

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
         rank, lambda, alpha, YtY)
       previousProducts.unpersist()
       logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
+      if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

Do we need to checkpoint the first RDD? If `iter` starts at `1`, we can use `iter % 3 == 0` and hence the checkpoint RDDs are `product-3`, `product-6`, `product-9`, etc.
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/929#discussion_r15566577

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala ---
@@ -255,6 +255,9 @@ class ALS private (
         rank, lambda, alpha, YtY)
      previousProducts.unpersist()
      logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
+      if (sc.checkpointDir.isDefined && (iter % 3 == 1)) {
--- End diff --

`iter` runs from 1 to `iterations`. The checkpointed RDDs are `products-1`, `products-4`, `products-7`, `products-10`, ...
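The two predicates being compared select different checkpoint iterations; a quick sketch makes the difference concrete for a hypothetical run of 10 iterations:

```python
iterations = 10

# iter % 3 == 1 (the version in the diff) also checkpoints the very first
# RDD: products-1, products-4, products-7, products-10.
every_third_from_one = [i for i in range(1, iterations + 1) if i % 3 == 1]

# iter % 3 == 0 (the reviewer's suggestion) skips the first RDD and
# checkpoints products-3, products-6, products-9.
every_third_from_three = [i for i in range(1, iterations + 1) if i % 3 == 0]
```

The reviewer's point is that the first RDD has a short lineage anyway, so checkpointing it buys little.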
[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1645#issuecomment-50573567 QA tests have started for PR 1645. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17416/consoleFull
[GitHub] spark pull request: SPARK-2740: allow user to specify ascending an...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1645#issuecomment-50573389 Jenkins, add to whitelist.
[GitHub] spark pull request: [SPARK-2544][MLLIB] Improve ALS algorithm reso...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-50573353 @mengxr The tests pass.
[GitHub] spark pull request: [SPARK-2729] [SQL] Forgot to match Timestamp t...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1636#issuecomment-50573263 Will you have time to add a test case before Friday (the merge deadline for 1.1) or should I?
[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1648

[SPARK-2585] Remove special handling of Hadoop JobConf.

This is based on #1498. Diff here: https://github.com/rxin/spark/commit/5904cb6649b03d48e3465ccab664b506cc69327b

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark jobconf

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1648.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1648

commit cae0af33b535a7772fd2861851dca056e0c2186c
Author: Reynold Xin
Date: 2014-07-19T06:52:47Z

[SPARK-2521] Broadcast RDD object (instead of sending it along with every task).

Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send the RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables.

The patch uses broadcast to send RDD objects and the closures to executors, and uses Akka to only send a reference to the broadcast RDD/closure along with the partition-specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large.

The user-facing impact of the change includes:

1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations
2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput.
In addition, the change simplifies some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast the JobConf (which also led to a deadlock recently). A simple way to test this:

```scala
val a = new Array[Byte](1000 * 1000)
scala.util.Random.nextBytes(a)
sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
```

Numbers on 3 r3.8xlarge instances on EC2:

```
master branch:    5.648436068 s, 4.715361895 s, 5.360161877 s
with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
```

Author: Reynold Xin Closes #1452 from rxin/broadcast-task and squashes the following commits: 762e0be [Reynold Xin] Warn large broadcasts. ade6eac [Reynold Xin] Log broadcast size. c3b6f11 [Reynold Xin] Added a unit test for clean up. 754085f [Reynold Xin] Explain why broadcasting serialized copy of the task. 04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet (instead of sending it for every task). (cherry picked from commit 7b8cd175254d42c8e82f0aa8eb4b7f3508d8fde2) Signed-off-by: Reynold Xin commit d256b456b8450706ecacd98033c3f4d40b37814c Author: Reynold Xin Date: 2014-07-20T07:00:12Z Fixed unit test failures. One more to go. commit cc152fcd14bb13104f391da0fb703a1d2203e3a6 Author: Reynold Xin Date: 2014-07-21T01:48:18Z Don't cache the RDD broadcast variable. commit de779f8704a7f586190dc0e25642836e06136cbb Author: Reynold Xin Date: 2014-07-21T07:21:13Z Fix TaskContextSuite. commit 991c002fc4238308108c07fb40b3400a3d448e2f Author: Reynold Xin Date: 2014-07-23T05:41:53Z Use HttpBroadcast. commit cf384501c1874284e8412439466eb6e22d5fe6d6 Author: Reynold Xin Date: 2014-07-25T07:10:18Z Use TorrentBroadcastFactory. commit bab1d8b601d88b946e3770c611c33d2040472492 Author: Reynold Xin Date: 2014-07-28T22:38:57Z Check for NotSerializableException in submitMissingTasks.
commit 797c247ba12dd3eeaa5dda621b9db5a8419732f0 Author: Reynold Xin Date: 2014-07-29T01:20:13Z Properly send SparkListenerStageSubmitted and SparkListenerStageCompleted. commit 111007d719e9c23e5baf4fc3dc374d01115c0e1f Author: Reynold Xin Date: 2014-07-29T01:29:33Z Fix broadcast tests. commit 252238da16fe3a3dfd4a067ca4a9ac47d4fac025 Author: Reynold Xin Date: 2014-07-29T04:31:20Z Serialize the final task closure as well as ShuffleDependency in taskBinary. commit f8535dc959b6d3b733fd46adbfa07708557a1d05 Author: Reynold Xin Date: 2014-07-29T04:35:23Z Fixed the style violation. commit 5904cb6649b03d48e3465ccab664b506cc69327b Author: Reynold Xin Date: 2014-07-30T04:39:14Z [SPARK-2585] Remove special handling of Hadoop JobConf.
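The explicit-broadcast workaround that this patch makes largely unnecessary can be sketched as follows. This is a minimal illustration, not code from the PR; `sc` is an active `SparkContext` and `bigLookup` is a hypothetical large value:

```scala
// Before SPARK-2521: a large object captured in a closure was shipped with
// every task, so users broadcast it explicitly to ship it once per executor.
val bigLookup: Map[Int, String] = (1 to 1000000).map(i => i -> i.toString).toMap
val bcLookup = sc.broadcast(bigLookup)          // serialized and shipped once
val counts = sc.parallelize(1 to 1000, 100)
  .map(x => bcLookup.value.getOrElse(x, "?"))   // tasks hold only a reference
  .count()
```

With the patch, the RDD object and its closures are broadcast automatically, so explicit broadcasting remains useful mainly when the same large object is reused across several different operations.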
[GitHub] spark pull request: [SQL] Handle null values in debug()
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1646#issuecomment-50572929 QA results for PR 1646: - This patch PASSES unit tests. - This patch merges cleanly. - This patch adds no public classes. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17409/consoleFull
[GitHub] spark pull request: [SPARK-1740] [PySpark] kill the python worker
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1643#issuecomment-50572879 QA tests have started for PR 1643. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17413/consoleFull
[GitHub] spark pull request: [SPARK-2743][SQL] Resolve original attributes ...
GitHub user marmbrus opened a pull request: https://github.com/apache/spark/pull/1647 [SPARK-2743][SQL] Resolve original attributes in ParquetTableScan You can merge this pull request into a Git repository by running: $ git pull https://github.com/marmbrus/spark parquetCase Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1647.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1647 commit 539a2e1f6c94782d916e7eac12ed1614f0ebfc35 Author: Michael Armbrust Date: 2014-07-30T04:37:23Z Resolve original attributes in ParquetTableScan
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-50572425 QA results for PR 1290: - This patch FAILED unit tests. - This patch merges cleanly. - This patch adds the following public classes (experimental): abstract class GeneralizedSteepestDescendModel(val weights: Vector); trait ANN; class LeastSquaresGradientANN; class ANNUpdater extends Updater; class ParallelANN. For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-50572266 QA tests have started for PR 1290. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17412/consoleFull
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-50572128 @bgreeven Jenkins will be automatically triggered for future updates.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-50572111 Jenkins, test this please.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-50572101 Jenkins, add to whitelist.
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15566077 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, DeveloperApi} +import org.apache.spark.streaming.dstream.DStream + +/** + * :: DeveloperApi :: + * StreamingRegression implements methods for training + * a linear regression model on streaming data, and using it + * for prediction on streaming data. + * + * This class takes as type parameters a GeneralizedLinearModel, + * and a GeneralizedLinearAlgorithm, making it easy to extend to construct + * streaming versions of arbitrary regression analyses. For example usage, + * see StreamingLinearRegressionWithSGD. + * + */ +@DeveloperApi +@Experimental +abstract class StreamingRegression[ +M <: GeneralizedLinearModel, +A <: GeneralizedLinearAlgorithm[M]] extends Logging { + + /** The model to be updated and used for prediction. */ + var model: M + + /** The algorithm to use for updating. 
*/ + val algorithm: A + /** Return the latest model. */ + def latest(): M = { +model + } + /** + * Update the model by training on batches of data from a DStream. + * This operation registers a DStream for training the model, + * and updates the model based on every subsequent non-empty + * batch of data from the stream. + * + * @param data DStream containing labeled data + */ + def trainOn(data: DStream[LabeledPoint]) { +data.foreachRDD{ + rdd => +if (rdd.count() > 0) { + model = algorithm.run(rdd, model.weights) + logInfo("Model updated") --- End diff -- Maybe we can add more information to it, for example, the current time.
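The reviewer's suggestion (include the current time in the log message) could be met with the `foreachRDD` overload that also passes the batch `Time`. A hedged sketch of the revised method body, assuming the same `model`/`algorithm` members as the quoted diff:

```scala
// Sketch of trainOn logging the batch time, via DStream.foreachRDD's
// (RDD[T], Time) => Unit overload.
def trainOn(data: DStream[LabeledPoint]) {
  data.foreachRDD { (rdd, time) =>
    if (rdd.count() > 0) {
      model = algorithm.run(rdd, model.weights)
      logInfo(s"Model updated at time $time")
    }
  }
}
```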
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15566064 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingRegression.scala --- @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, DeveloperApi} +import org.apache.spark.streaming.dstream.DStream + +/** + * :: DeveloperApi :: + * StreamingRegression implements methods for training + * a linear regression model on streaming data, and using it + * for prediction on streaming data. + * + * This class takes as type parameters a GeneralizedLinearModel, + * and a GeneralizedLinearAlgorithm, making it easy to extend to construct + * streaming versions of arbitrary regression analyses. For example usage, + * see StreamingLinearRegressionWithSGD. + * + */ +@DeveloperApi +@Experimental +abstract class StreamingRegression[ +M <: GeneralizedLinearModel, +A <: GeneralizedLinearAlgorithm[M]] extends Logging { + + /** The model to be updated and used for prediction. */ + var model: M + + /** The algorithm to use for updating. 
*/ + val algorithm: A + /** Return the latest model. */ + def latest(): M = { +model + } + /** + * Update the model by training on batches of data from a DStream. + * This operation registers a DStream for training the model, + * and updates the model based on every subsequent non-empty + * batch of data from the stream. + * + * @param data DStream containing labeled data + */ + def trainOn(data: DStream[LabeledPoint]) { +data.foreachRDD{ + rdd => --- End diff -- Spark's code style prefers the following: ~~~ data.foreachRDD { rdd => ... ~~~
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15566050 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegression.scala --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.annotation.Experimental + +/** + * Train or predict a linear regression model on streaming data. Training uses + * Stochastic Gradient Descent to update the model based on each new batch of + * incoming data from a DStream (see LinearRegressionWithSGD for model equation) + * + * Each batch of data is assumed to be an RDD of LabeledPoints. + * The number of data points per batch can vary, but the number + * of features must be constant. + */ +@Experimental +class StreamingLinearRegressionWithSGD private ( +private var stepSize: Double, +private var numIterations: Int, +private var miniBatchFraction: Double, +private var numFeatures: Int) --- End diff -- If we ask the users to provide the initial weight, we don't need this argument. 
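The suggestion above (have users supply the initial weights and drop `numFeatures`) might look like the following. This constructor shape is an assumption for illustration, not code from the PR:

```scala
import org.apache.spark.mllib.linalg.Vector

// Hypothetical constructor: numFeatures is dropped because the length of the
// user-supplied initial weight vector already fixes the number of features.
class StreamingLinearRegressionWithSGD private (
    private var stepSize: Double,
    private var numIterations: Int,
    private var miniBatchFraction: Double,
    private var initialWeights: Vector) {

  // Recover the feature count from the initial weights when it is needed.
  private def numFeatures: Int = initialWeights.size
}
```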
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15566035 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearRegression.scala --- @@ -0,0 +1,104 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.annotation.Experimental + +/** + * Train or predict a linear regression model on streaming data. Training uses + * Stochastic Gradient Descent to update the model based on each new batch of + * incoming data from a DStream (see LinearRegressionWithSGD for model equation) + * + * Each batch of data is assumed to be an RDD of LabeledPoints. + * The number of data points per batch can vary, but the number + * of features must be constant. + */ +@Experimental +class StreamingLinearRegressionWithSGD private ( +private var stepSize: Double, +private var numIterations: Int, +private var miniBatchFraction: Double, --- End diff -- For streaming updates, the RDDs are usually small. Maybe it is not necessary to use `miniBatchFraction`. But it is fine to keep this option. 
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15565983 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.File + +import com.google.common.io.Files +import org.apache.commons.io.FileUtils +import org.scalatest.FunSuite +import org.apache.spark.SparkConf +import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext} +import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, LocalSparkContext} + +import scala.collection.mutable.ArrayBuffer --- End diff -- move scala imports before 3rd party imports. usually the imports are organized into 4 groups in the following order: 1. java imports (java.*) 2. scala imports (scala.*) 3. 3rd-party imports 4. spark import (org.apache.spark.*)
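Applied to the imports quoted in that diff, the four-group ordering would look like this (each group separated by a blank line):

```scala
// Group 1: java imports
import java.io.File

// Group 2: scala imports
import scala.collection.mutable.ArrayBuffer

// Group 3: 3rd-party imports
import com.google.common.io.Files
import org.apache.commons.io.FileUtils
import org.scalatest.FunSuite

// Group 4: spark imports
import org.apache.spark.SparkConf
import org.apache.spark.mllib.util.{MLStreamingUtils, LinearDataGenerator, LocalSparkContext}
import org.apache.spark.streaming.{Milliseconds, Seconds, StreamingContext}
```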
[GitHub] spark pull request: Streaming mllib [SPARK-2438][MLLIB]
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1361#discussion_r15565956 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionSuite.scala --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.File + +import com.google.common.io.Files +import org.apache.commons.io.FileUtils +import org.scalatest.FunSuite --- End diff -- add an empty line to separate 3rd party imports from spark imports
[GitHub] spark pull request: [SPARK-2054][SQL] Code Generation for Expressi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/993