[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-49701430 QA tests have started for PR 1056. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16945/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2612/Fix data skew in ALS
GitHub user renozhang opened a pull request: https://github.com/apache/spark/pull/1521 SPARK-2612/Fix data skew in ALS You can merge this pull request into a Git repository by running: $ git pull https://github.com/renozhang/spark fix-als Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1521.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1521 commit 1a4f7a02ea9dc077837c19cc64d2135275d5a258 Author: peng.zhang peng.zh...@xiaomi.com Date: 2014-07-22T05:59:23Z Fix data skew in ALS --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2612/Fix data skew in ALS
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49701893 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1729. Make Flume pull data from source, ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/807#issuecomment-49703872 Hey folks - I just took a light pass on this. One thing I'm a bit confused about - the sink here is not really specific to Spark at all (there is no dependency on Spark). The classes used are entirely general avro classes that other applications could use easily. So why is this being contributed to Spark instead of being put inside of Flume as a general purpose sink that supports polling from external sources? I could imagine that other systems might also want to integrate with flume via a pull/polling based model than via a push model. Was there any previous consideration to put this directly in the Flume project? Adding this in Spark adds some nontrivial steps to the Spark build (such as an avro compiler). So naturally a question is whether it belongs in Spark or Flume. I don't think we can easily change this later since we expose the `SparkFlumePollingEvent` type directly to users, and that type is in the spark package. Actually one related thing, can we not convert these to `SparkFlumeEvents`'s so that the types do not differ depending on which type of flume integration they are using? This could be confusing for users to have to deal with a different type. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213062 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. + */ + def nextValue(): Double + + /** + * @return A copy of the DistributionGenerator with a new instance of the rng object used in the + * class when applicable. Each partition has a unique seed and therefore requires its + * own instance of the DistributionGenerator. + */ + def newInstance(): DistributionGenerator --- End diff -- I saw your argument about using `clone`. Between `copy` and `newInstance`, I think `copy` is better. For example, in Poisson, we need to copy the mean, which is not reflected in `newInstance`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213061 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. + */ + def nextValue(): Double + + /** + * @return A copy of the DistributionGenerator with a new instance of the rng object used in the + * class when applicable. Each partition has a unique seed and therefore requires its --- End diff -- `partition` has no context here. Maybe simply mention that this is for running multiple instances concurrently. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213059 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. --- End diff -- Change `@return` to `Returns`. Otherwise the summary will be empty in the generated docs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213064 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. + */ + def nextValue(): Double + + /** + * @return A copy of the DistributionGenerator with a new instance of the rng object used in the + * class when applicable. Each partition has a unique seed and therefore requires its + * own instance of the DistributionGenerator. + */ + def newInstance(): DistributionGenerator +} + +/** + * Generates i.i.d. samples from U[0.0, 1.0] + */ +class UniformGenerator() extends DistributionGenerator { + + // XORShiftRandom for better performance. Thread safety isn't necessary here. + private val random = new XORShiftRandom() + + /** + * @return An i.i.d sample as a Double from U[0.0, 1.0]. + */ + override def nextValue(): Double = { +random.nextDouble() + } + + /** Set random seed. */ --- End diff -- Doc is not necessary for the overloaded methods, unless you want to update it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213072 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import org.apache.spark.SparkContext +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.rdd.{RandomVectorRDD, RandomRDD} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +// TODO add Scaladocs once API fully approved +// Alternatively, we can use the generator pattern to set numPartitions, seed, etc instead to bring +// down the number of methods here. +object RandomRDDGenerators { + + def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: Long): RDD[Double] = { +val uniform = new UniformGenerator() +randomRDD(sc, size, numPartitions, uniform, seed) + } + + def uniformRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = { +uniformRDD(sc, size, sc.defaultParallelism, seed) + } + + def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int): RDD[Double] = { +uniformRDD(sc, size, numPartitions, Utils.random.nextLong) + } + + def uniformRDD(sc: SparkContext, size: Long): RDD[Double] = { +uniformRDD(sc, size, sc.defaultParallelism, Utils.random.nextLong) + } + + def normalRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: Long): RDD[Double] = { +val normal = new StandardNormalGenerator() +randomRDD(sc, size, numPartitions, normal, seed) + } + + def normalRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = { +normalRDD(sc, size, sc.defaultParallelism, seed) + } + + def normalRDD(sc: SparkContext, size: Long, numPartitions: Int): RDD[Double] = { +normalRDD(sc, size, numPartitions, Utils.random.nextLong) + } + + def normalRDD(sc: SparkContext, size: Long): RDD[Double] = { +normalRDD(sc, size, sc.defaultParallelism, Utils.random.nextLong) + } + + def poissonRDD(sc: SparkContext, + size: Long, + numPartitions: Int, + mean: Double, + seed: Long): RDD[Double] = { +val poisson = new PoissonGenerator(mean) +randomRDD(sc, size, numPartitions, poisson, seed) + } + + def poissonRDD(sc: SparkContext, size: Long, mean: Double, seed: Long): RDD[Double] = { +poissonRDD(sc, size, sc.defaultParallelism, mean, seed) + } + + def poissonRDD(sc: SparkContext, size: Long, numPartitions: Int, mean: Double): RDD[Double] = { +poissonRDD(sc, size, numPartitions, mean, Utils.random.nextLong) + } + + def poissonRDD(sc: SparkContext, size: Long, mean: Double): RDD[Double] = { +poissonRDD(sc, size, sc.defaultParallelism, mean, Utils.random.nextLong) + } + + def randomRDD(sc: SparkContext, + size: Long, + numPartitions: Int, + distribution: DistributionGenerator, --- End diff -- We need to be consistent on the argument name. `distribution` is used here but `rng` is used in `randomVectorRDD`. `generator` sounds better to me than `distribution` because `DistributionGenerator` is a generator but not a distribution and in the context we don't need to use `distributionGenerator`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213069 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala --- @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import org.apache.spark.SparkContext +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.rdd.{RandomVectorRDD, RandomRDD} +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +// TODO add Scaladocs once API fully approved +// Alternatively, we can use the generator pattern to set numPartitions, seed, etc instead to bring +// down the number of methods here. +object RandomRDDGenerators { + + def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: Long): RDD[Double] = { +val uniform = new UniformGenerator() +randomRDD(sc, size, numPartitions, uniform, seed) + } + + def uniformRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = { +uniformRDD(sc, size, sc.defaultParallelism, seed) + } + + def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int): RDD[Double] = { --- End diff -- It is very confusing to have both `(SparkContext, Long, Int)` and `(SparkContext, Long, Long)`. If a user doesn't see `(SparkContext, Long, Int)` and treat the third argument as the seed and set an integer, it actually sets the number of partitions. Maybe we should only allow default values at the end. That is, ~~~ def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: Long) def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int) ~~~ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213063 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. + */ + def nextValue(): Double + + /** + * @return A copy of the DistributionGenerator with a new instance of the rng object used in the + * class when applicable. Each partition has a unique seed and therefore requires its + * own instance of the DistributionGenerator. + */ + def newInstance(): DistributionGenerator +} + +/** + * Generates i.i.d. samples from U[0.0, 1.0] + */ +class UniformGenerator() extends DistributionGenerator { --- End diff -- Is `()` necessary? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213096 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, +val rng: DistributionGenerator, +val seed: Long) extends Partition { + + override val index: Int = idx + +} + +// These two classes are necessary since Range objects in Scala cannot have size Int.MaxValue +private[mllib] class RandomRDD(@transient private var sc: SparkContext, +size: Long, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Double] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getPointIterator(split) + } + + override def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] class RandomVectorRDD(@transient private var sc: SparkContext, +size: Long, +vectorSize: Int, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + require(vectorSize 0, Positive vector size required.) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Vector] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getVectorIterator(split, vectorSize) + } + + override protected def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] object RandomRDD { + + private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: DistributionGenerator) +extends Iterator[Double] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Double = { + currentSize += 1 + rng.nextValue() +} + } + + private[mllib] class FixedSizeVectorIterator(val numElem: Long, + val vectorSize: Int, + val rng: DistributionGenerator) +extends Iterator[Vector] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Vector = { + currentSize += 1 + new DenseVector((0 until vectorSize).map { _ = rng.nextValue() }.toArray) +} + } + + def getPartitions(size: Long, + numSlices: Int, + rng: DistributionGenerator, + seed: Long): Array[Partition] = { + +val firstPartitionSize = size / numSlices + size % numSlices --- End diff -- This is not evenly distributed, for example, when `size = 1000` and `numSlices = 101`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213082 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, +val rng: DistributionGenerator, +val seed: Long) extends Partition { + + override val index: Int = idx --- End diff -- You can put `override` directly in the constructor. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213078 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, --- End diff -- Could you double check whether we can have more than `Int.MaxValue` items in a single partition? It may break storage and couple RDD functions like `glob`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213091 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, +val rng: DistributionGenerator, +val seed: Long) extends Partition { + + override val index: Int = idx + +} + +// These two classes are necessary since Range objects in Scala cannot have size Int.MaxValue +private[mllib] class RandomRDD(@transient private var sc: SparkContext, +size: Long, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Double] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getPointIterator(split) + } + + override def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] class RandomVectorRDD(@transient private var sc: SparkContext, +size: Long, +vectorSize: Int, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + require(vectorSize 0, Positive vector size required.) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Vector] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getVectorIterator(split, vectorSize) + } + + override protected def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] object RandomRDD { + + private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: DistributionGenerator) --- End diff -- If we couldn't have more than `Int.MaxValue` items per iteration, this is `Iterator.fill(numElem)(rng.nextValue())`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15213097 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, +val rng: DistributionGenerator, +val seed: Long) extends Partition { + + override val index: Int = idx + +} + +// These two classes are necessary since Range objects in Scala cannot have size Int.MaxValue +private[mllib] class RandomRDD(@transient private var sc: SparkContext, +size: Long, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Double] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getPointIterator(split) + } + + override def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] class RandomVectorRDD(@transient private var sc: SparkContext, +size: Long, +vectorSize: Int, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + require(vectorSize 0, Positive vector size required.) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Vector] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getVectorIterator(split, vectorSize) + } + + override protected def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] object RandomRDD { + + private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: DistributionGenerator) +extends Iterator[Double] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Double = { + currentSize += 1 + rng.nextValue() +} + } + + private[mllib] class FixedSizeVectorIterator(val numElem: Long, + val vectorSize: Int, + val rng: DistributionGenerator) +extends Iterator[Vector] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Vector = { + currentSize += 1 + new DenseVector((0 until vectorSize).map { _ = rng.nextValue() }.toArray) +} + } + + def getPartitions(size: Long, + numSlices: Int, + rng: DistributionGenerator, + seed: Long): Array[Partition] = { + +val firstPartitionSize = size / numSlices + size % numSlices +val otherPartitionSize = size / numSlices + +val partitions = new Array[RandomRDDPartition](numSlices) +var i = 0 +while (i numSlices) { + partitions(i) = if (i == 0) { +new RandomRDDPartition(i, firstPartitionSize, rng, seed) --- End diff -- It is safer to make a copy in `compute` than here. In local mode, this may cause problems if a user uses the same generator to create two random RDDs. --- If your
[GitHub] spark pull request: SPARK-1729. Make Flume pull data from source, ...
Github user harishreedharan commented on the pull request: https://github.com/apache/spark/pull/807#issuecomment-49704453 Thanks for taking a look Patrick. The reason this really is not going into Flume itself is that Flume sinks are push model (this sink is hacking around it). For now, I don't expect a polling model sink to get committed to Flume, plus this is a very special case compared to the normal sinks - and it is more specific than general purpose. I will look into that SparkFlumeEvent - for now, it seems like it should be possible. Will update this later if needed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1487#issuecomment-49704715 Jenkins, retest this please. Jenkins, add to whitelist. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1487#issuecomment-49705027 QA tests have started for PR 1487. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16946/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49706031 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1399#discussion_r15213826 --- Diff: docs/sql-programming-guide.md --- @@ -573,4 +572,170 @@ prefixed with a tick (`'`). Implicit conversions turn these symbols into expres evaluated by the SQL execution engine. A full list of the functions supported can be found in the [ScalaDoc](api/scala/index.html#org.apache.spark.sql.SchemaRDD). -!-- TODO: Include the table of operations here. -- \ No newline at end of file +!-- TODO: Include the table of operations here. -- + +## Running the Thrift JDBC server + +The Thrift JDBC server implemented here corresponds to the [`HiveServer2`] +(https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) in Hive 0.12. You can test +the JDBC server with the beeline script comes with either Spark or Hive 0.12. + +To start the JDBC server, run the following in the Spark directory: + +./sbin/start-thriftserver.sh + +The default port the server listens on is 1. Now you can use beeline to test the Thrift JDBC +server: + +./bin/beeline + +Connect to the JDBC server in beeline with: + +beeline !connect jdbc:hive2://localhost:1 + +Beeline will ask you for a username and password. In non-secure mode, simply enter the username on +your machine and a blank password. For secure mode, please follow the instructions given in the +[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients) + +Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. + +You may also use the beeline script comes with Hive. + +### Migration Guide for Shark Users + + Reducer number + +In Shark, default reducer number is 1, and can be tuned by property `mapred.reduce.tasks`. In Spark SQL, reducer number is default to 200, and can be customized by the `spark.sql.shuffle.partitions` property: --- End diff -- Yep, exactly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49706388 QA tests have started for PR 1521. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16947/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49706936 @aarondav pushed an update to the grow code as well, which will now use estimateSize. I implemented the same thing in EAOM. @colorant fixed those, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49707121 QA tests have started for PR 1499. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16948/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1510#discussion_r15214158 --- Diff: tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala --- @@ -99,9 +99,25 @@ object GenerateMIMAIgnore { (ignoredClasses.flatMap(c = Seq(c, c.replace($, #))).toSet, ignoredMembers.toSet) } + /** Scala reflection does not let us see inner function even if they are upgraded +* to public for some reason. So had to resort to java reflection to get all inner +* functions with $$ in there name. +*/ + def getInnerFunc(classSymbol: unv.ClassSymbol): Seq[String] = { --- End diff -- Can this be called `getInnerFunctions` since it enumerates multiple inner functions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1510#issuecomment-49707182 Small comment but looks good. Have you tested this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...
GitHub user chenghao-intel opened a pull request: https://github.com/apache/spark/pull/1522 [SPARK-2615] [SQL] Add Equal Sign == Support for HiveQl Currently, the == in HiveQL expression will cause exception thrown, this patch will fix it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chenghao-intel/spark equal Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1522.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1522 commit f62a0ff31627d7e11c9c598869d3439890185518 Author: Cheng Hao hao.ch...@intel.com Date: 2014-07-22T07:24:43Z Add == Support for HiveQl --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-49707381 QA results for PR 1056:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds the following public classes (experimental):brcase class Heartbeat(brclass HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {brcase class SparkListenerExecutorMetricsUpdate(execId: String,brcase class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMasterbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16944/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49707382 @renozhang The change looks good to me and thanks for finding it! Do you mind adding `[MLLIB]` to the title of this PR and also remove `productPartitioner` from `updateFeatures` since it is no longer used there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49707420 Jenkins, add to whitelist. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2452] Create a new valid for each inste...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1441#issuecomment-49707481 Okay merging this into master. For the 1.0 branch we just reverted the fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1522#issuecomment-49707497 QA tests have started for PR 1522. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16949/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2452] Create a new valid for each inste...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1441 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1253#issuecomment-49707690 LGTM - @mateiz @rxin any final comments here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1502#issuecomment-49707787 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1056#issuecomment-49707887 QA results for PR 1056:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds the following public classes (experimental):brcase class Heartbeat(brclass HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {brcase class SparkListenerExecutorMetricsUpdate(execId: String,brcase class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMasterbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16945/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1253#discussion_r15214419 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -55,6 +55,7 @@ private[spark] class SparkSubmitArguments(args: Seq[String]) { var verbose: Boolean = false var isPython: Boolean = false var pyFiles: String = null + var sparkProperties: HashMap[String, String] = new HashMap[String, String]() --- End diff -- I think you want a val here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49707966 @willb mind throwing [SQL] in the title here? It helps this filter nicely on some internal tools we have to triage PR's. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1253#discussion_r15214475 --- Diff: docs/configuration.md --- @@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf()) Then, you can supply configuration values at runtime: {% highlight bash %} -./bin/spark-submit --name My fancy app --master local[4] myApp.jar +./bin/spark-submit --name My app --master local[4] myApp.jar --conf spark.shuffle.spill=false --- End diff -- It'd be great to have an example with values that contain spaces so quotes are needed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1502#discussion_r15214488 --- Diff: core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala --- @@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C]( if (it.hasNext) { var kc = it.next() kcPairs += kc -val minHash = getKeyHashCode(kc) +val minHash = hashKey(kc) --- End diff -- Just curious, is this more efficient than calling ExternalAppendOnlyMap.hash directly as we did before? It was kind of weird that we were doing it in an inner class, maybe it created another pointer derefernece. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-49708099 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1253#discussion_r15214491 --- Diff: docs/configuration.md --- @@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf()) Then, you can supply configuration values at runtime: {% highlight bash %} -./bin/spark-submit --name My fancy app --master local[4] myApp.jar +./bin/spark-submit --name My app --master local[4] myApp.jar --conf spark.shuffle.spill=false {% endhighlight %} The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. The first are command line options, -such as `--master`, as shown above. Running `./bin/spark-submit --help` will show the entire list -of options. +such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf` +flag, but uses special flags for properties that play a part in launching the Spark application. +Running `./bin/spark-submit --help` will show the entire list of these options. --- End diff -- can we also make spark-submit without any options print the help? (maybe you are already doing that) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-49708172 Somehow this makes the unit test taking very long to finish. I also suspect there is some racing condition in the cleaning code. This PR makes them manifest more often. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-49708325 QA tests have started for PR 1498. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16952/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1502#issuecomment-49708347 QA tests have started for PR 1502. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1498#issuecomment-49708333 I was doing some random other thing locally and noticed that there was some weird issue with synchronization around the TorrentBroadcast lock (there is a single shared lock used a bunch inside of that) a bunch of tasks were waiting for the lock for a long time. Maybe somehow that is slowing things down the tests as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1195#issuecomment-49708363 QA tests have started for PR 1195. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16954/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/1338#issuecomment-49708463 @kanzhang @mateiz Yeah this is one issue with Pyrolite vs MsgPack. MsgPack supported case classes out the box, which would likely be a bit more common that beans. I'd say that custom serde via `Converter` will be far more common (as we've already seen with various Avro commentary etc). Thinking about it some more, I would be ok to remove from the docs. This would still be available as undocumented functionality so if relevant use cases did come up on the mailing list, we could point to it and in the unlikely case that there was demand we could simply document it as read-only functionality. Bearing in mind this is also still marked experimental and we'll need to see how users use it in the wild a bit and make any amendments as required. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15214670 --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala --- @@ -124,15 +124,20 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging { private def putInBlockManager[T]( key: BlockId, values: Iterator[T], - storageLevel: StorageLevel, - updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)]): Iterator[T] = { - -if (!storageLevel.useMemory) { - /* This RDD is not to be cached in memory, so we can just pass the computed values + level: StorageLevel, + updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)], + effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = { --- End diff -- This should be documented in the javadoc here - it's not at all obvious what it means. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15214812 --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala --- @@ -124,15 +124,20 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging { private def putInBlockManager[T]( key: BlockId, values: Iterator[T], - storageLevel: StorageLevel, - updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)]): Iterator[T] = { - -if (!storageLevel.useMemory) { - /* This RDD is not to be cached in memory, so we can just pass the computed values + level: StorageLevel, + updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)], + effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = { + +val putLevel = effectiveStorageLevel.getOrElse(level) +if (!putLevel.useMemory) { + /* + * This RDD is not to be cached in memory, so we can just pass the computed values * as an iterator directly to the BlockManager, rather than first fully unrolling * it in memory. The latter option potentially uses much more memory and risks OOM --- End diff -- This doc is a bit outdated now (we no longer risk OOM ideally). I think it's a bit overly complicated anyways. I'd just make it something like: ``` // This RDD is not being cached in memory, so pass an iterator directly to the block manager. ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Graphx example
GitHub user CrazyJvm opened a pull request: https://github.com/apache/spark/pull/1523 Graphx example fix examples You can merge this pull request into a Git repository by running: $ git pull https://github.com/CrazyJvm/spark graphx-example Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1523 commit 7cfff1d029ace9bdb2cd39e726d144a1cb8d868f Author: CrazyJvm crazy...@gmail.com Date: 2014-07-22T06:56:28Z fix example for joinVertices commit 663457a9f63c6e7bb1087e1ca4ed2a483ad3aa7a Author: CrazyJvm crazy...@gmail.com Date: 2014-07-22T07:04:03Z outDegrees does not take parameters --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15214933 --- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala --- @@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: BlockManager) extends Logging { throw new BlockException(key, sBlock manager failed to return cached value for $key!) } } else { - /* This RDD is to be cached in memory. In this case we cannot pass the computed values + /* + * This RDD is to be cached in memory. In this case we cannot pass the computed values * to the BlockManager as an iterator and expect to read it back later. This is because - * we may end up dropping a partition from memory store before getting it back, e.g. - * when the entirety of the RDD does not fit in memory. */ - val elements = new ArrayBuffer[Any] - elements ++= values - updatedBlocks ++= blockManager.put(key, elements, storageLevel, tellMaster = true) - elements.iterator.asInstanceOf[Iterator[T]] + * we may end up dropping a partition from memory store before getting it back. + * + * In addition, we must be careful to not unroll the entire partition in memory at once. + * Otherwise, we may cause an OOM exception if the JVM does not have enough space for this + * single partition. Instead, we unroll the values cautiously, potentially aborting and + * dropping the partition to disk if applicable. + */ + blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match { +case Left(arr) = + // We have successfully unrolled the entire partition, so cache it in memory + updatedBlocks ++= +blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel) + arr.iterator.asInstanceOf[Iterator[T]] +case Right(it) = + // There is not enough space to cache this partition in memory + logWarning(sNot enough space to cache $key in memory! + +sFree memory is ${blockManager.memoryStore.freeMemory} bytes.) + var returnValues = it.asInstanceOf[Iterator[T]] --- End diff -- can you restructure this to not use a val?: ``` if (putLevel.useDisk) { logWarning(sPersisting $key to disk instead.) diskOnlyLevel = StorageLevel(...) putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel)) } else { it.asInstanceOf[Iterator[T]] } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1165#discussion_r15214991 --- Diff: core/src/main/scala/org/apache/spark/util/collection/SizeTracker.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util.collection + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.util.SizeEstimator + +/** + * A general interface for collections to keep track of their estimated sizes in bytes. + * We sample with a slow exponential back-off using the SizeEstimator to amortize the time, + * as each call to SizeEstimator is somewhat expensive (order of a few milliseconds). + */ +private[spark] trait SizeTracker { + + import SizeTracker._ + + /** + * Controls the base of the exponential which governs the rate of sampling. + * E.g., a value of 2 would mean we sample at 1, 2, 4, 8, ... elements. + */ + private val SAMPLE_GROWTH_RATE = 1.1 + + /** Samples taken since last resetSamples(). Only the last two are kept for extrapolation. */ + private val samples = new ArrayBuffer[Sample] + + /** The average number of bytes per update between our last two samples. */ + private var bytesPerUpdate: Double = _ + + /** Total number of insertions and updates into the map since the last resetSamples(). */ + private var numUpdates: Long = _ + + /** The value of 'numUpdates' at which we will take our next sample. */ + private var nextSampleNum: Long = _ + + resetSamples() + + /** + * Reset samples collected so far. + * This should be called after the collection undergoes a dramatic change in size. + */ + protected def resetSamples(): Unit = { +numUpdates = 1 +nextSampleNum = 1 +samples.clear() +takeSample() + } + + /** + * Callback to be invoked after every update. + */ + protected def afterUpdate(): Unit = { +numUpdates += 1 +if (nextSampleNum == numUpdates) { + takeSample() +} + } + + /** + * Take a new sample of the current collection's size. + */ + private def takeSample(): Unit = { +samples += Sample(SizeEstimator.estimate(this), numUpdates) +// Only use the last two samples to extrapolate +if (samples.size 2) { + samples.remove(0) +} +val bytesDelta = samples.toSeq.reverse match { + case latest :: previous :: tail = +(latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates) + // If fewer than 2 samples, assume no change + case _ = 0 +} +bytesPerUpdate = math.max(0, bytesDelta) +nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong + } + + /** + * Estimate the current size of the collection in bytes. O(1) time. + */ + def estimateSize(): Long = { +assert(samples.nonEmpty) --- End diff -- Okay - got it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1253#discussion_r15215030 --- Diff: docs/configuration.md --- @@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf()) Then, you can supply configuration values at runtime: {% highlight bash %} -./bin/spark-submit --name My fancy app --master local[4] myApp.jar +./bin/spark-submit --name My app --master local[4] myApp.jar --conf spark.shuffle.spill=false --- End diff -- ah good call - asked about this earlier but forgot to follow up --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1253#discussion_r15215054 --- Diff: docs/configuration.md --- @@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf()) Then, you can supply configuration values at runtime: {% highlight bash %} -./bin/spark-submit --name My fancy app --master local[4] myApp.jar +./bin/spark-submit --name My app --master local[4] myApp.jar --conf spark.shuffle.spill=false {% endhighlight %} The Spark shell and [`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit) tool support two ways to load configurations dynamically. The first are command line options, -such as `--master`, as shown above. Running `./bin/spark-submit --help` will show the entire list -of options. +such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf` +flag, but uses special flags for properties that play a part in launching the Spark application. +Running `./bin/spark-submit --help` will show the entire list of these options. --- End diff -- That is already the case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49709515 QA tests have started for PR 1521. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16955/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49709494 QA tests have started for PR 1460. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16957/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Graphx example
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1523#issuecomment-49709531 QA tests have started for PR 1523. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16956/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1502#issuecomment-49709588 This looks good to me other than the question above. This will be pretty cool for sorting multiple columnar data structures. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1253#issuecomment-49709826 Can you support `-c` in addition to `--conf`? Also, the spark-submit doc (http://spark.apache.org/docs/latest/submitting-applications.html) should be updated to list this option. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15215432 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala --- @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.rdd + +import org.apache.spark.{Partition, SparkContext, TaskContext} +import org.apache.spark.mllib.linalg.{DenseVector, Vector} +import org.apache.spark.mllib.random.DistributionGenerator +import org.apache.spark.rdd.RDD +import org.apache.spark.util.Utils + +private[mllib] class RandomRDDPartition(val idx: Int, +val size: Long, +val rng: DistributionGenerator, +val seed: Long) extends Partition { + + override val index: Int = idx + +} + +// These two classes are necessary since Range objects in Scala cannot have size Int.MaxValue +private[mllib] class RandomRDD(@transient private var sc: SparkContext, +size: Long, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Double] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getPointIterator(split) + } + + override def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] class RandomVectorRDD(@transient private var sc: SparkContext, +size: Long, +vectorSize: Int, +numSlices: Int, +@transient rng: DistributionGenerator, +@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, Nil) { + + require(size 0, Positive RDD size required.) + require(numSlices 0, Positive number of partitions required) + require(vectorSize 0, Positive vector size required.) + + override def compute(splitIn: Partition, context: TaskContext): Iterator[Vector] = { +val split = splitIn.asInstanceOf[RandomRDDPartition] +RandomRDD.getVectorIterator(split, vectorSize) + } + + override protected def getPartitions: Array[Partition] = { +RandomRDD.getPartitions(size, numSlices, rng, seed) + } +} + +private[mllib] object RandomRDD { + + private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: DistributionGenerator) +extends Iterator[Double] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Double = { + currentSize += 1 + rng.nextValue() +} + } + + private[mllib] class FixedSizeVectorIterator(val numElem: Long, + val vectorSize: Int, + val rng: DistributionGenerator) +extends Iterator[Vector] { + +private var currentSize = 0 + +override def hasNext: Boolean = currentSize numElem + +override def next(): Vector = { + currentSize += 1 + new DenseVector((0 until vectorSize).map { _ = rng.nextValue() }.toArray) +} + } + + def getPartitions(size: Long, + numSlices: Int, + rng: DistributionGenerator, + seed: Long): Array[Partition] = { + +val firstPartitionSize = size / numSlices + size % numSlices +val otherPartitionSize = size / numSlices + +val partitions = new Array[RandomRDDPartition](numSlices) +var i = 0 +while (i numSlices) { + partitions(i) = if (i == 0) { +new RandomRDDPartition(i, firstPartitionSize, rng, seed) + } else { +new RandomRDDPartition(i, otherPartitionSize, rng.newInstance(), seed) + } + i += 1 +} +partitions.asInstanceOf[Array[Partition]]
[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1520#discussion_r15215522 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala --- @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.random + +import cern.jet.random.Poisson +import cern.jet.random.engine.DRand + +import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom} + +/** + * Trait for random number generators that generate i.i.d values from a distribution. + */ +trait DistributionGenerator extends Pseudorandom with Serializable { + + /** + * @return An i.i.d sample as a Double from an underlying distribution. + */ + def nextValue(): Double + + /** + * @return A copy of the DistributionGenerator with a new instance of the rng object used in the + * class when applicable. Each partition has a unique seed and therefore requires its + * own instance of the DistributionGenerator. + */ + def newInstance(): DistributionGenerator +} + +/** + * Generates i.i.d. samples from U[0.0, 1.0] + */ +class UniformGenerator() extends DistributionGenerator { + + // XORShiftRandom for better performance. Thread safety isn't necessary here. + private val random = new XORShiftRandom() + + /** + * @return An i.i.d sample as a Double from U[0.0, 1.0]. + */ + override def nextValue(): Double = { +random.nextDouble() + } + + /** Set random seed. */ + override def setSeed(seed: Long) = random.setSeed(seed) + + override def newInstance(): UniformGenerator = new UniformGenerator() +} + +/** + * Generates i.i.d. samples from the Standard Normal Distribution. + */ +class StandardNormalGenerator() extends DistributionGenerator { + + // XORShiftRandom for better performance. Thread safety isn't necessary here. + private val random = new XORShiftRandom() --- End diff -- Is it allowed to use a DistributionGenerator before calling setSeed? It would seem simpler to disallow that, but it seems to be something it got from trait Pseudorandom. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1502#discussion_r15215866 --- Diff: LICENSE --- @@ -483,6 +483,24 @@ SUCH DAMAGE. +For Timsort: --- End diff -- Say which source file this is in (core/src/main/java/...) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1502#discussion_r15215883 --- Diff: core/src/main/scala/org/apache/spark/util/collection/Sorter.java --- @@ -0,0 +1,915 @@ +/* --- End diff -- This is a Java file, but you put it in the src/main/scala folder! That works with SBT but it's pretty confusing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/1425#discussion_r15216588 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala --- @@ -18,28 +18,90 @@ package org.apache.spark.mllib.util import org.apache.spark.mllib.linalg.Vector +import org.scalatest.exceptions.TestFailedException object TestingUtils { + val defaultEpsilon = 1E-10 + implicit class DoubleWithAlmostEquals(val x: Double) { -// An improved version of AlmostEquals would always divide by the larger number. -// This will avoid the problem of diving by zero. -def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = { - if(x == y) { + +def almostEquals(y: Double, eps: Double = defaultEpsilon): Boolean = { + val absX = math.abs(x) + val absY = math.abs(y) + val diff = math.abs(x - y) + if (x == y) { true - } else if(math.abs(x) math.abs(y)) { -math.abs(x - y) / math.abs(x) epsilon + } else if (absX 1E-15 || absY 1E-15) { --- End diff -- See commentary at https://issues.apache.org/jira/browse/SPARK-2599?focusedCommentId=14068293page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14068293 I can see the idea here, but all of these kinds of efforts seem to lead to errors or unintuitive behavior. For example this line means that: Fails: expected = 1e-15 actual = 2e-15 eps = 0.1 Passes: expected = 1e-16 actual = 2e-16 eps = 0.1 Why is 1e-15 special anyway? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1487#issuecomment-49713371 QA results for PR 1487:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16946/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/1524 [CORE] Fixed stage description in stage info page Stage description should be a `String`, but was changed to an `Option[String]` by mistake: ![stage-desc-small](https://cloud.githubusercontent.com/assets/230655/3655611/f6d0b0f6-117b-11e4-83ed-71000dcd5009.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark fix-stage-desc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1524.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1524 commit a57f80d67c48243051629630b8ed859fafe027c5 Author: Cheng Lian lian.cs@gmail.com Date: 2014-07-22T08:09:58Z Fixed stage description in stage info page --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1524#issuecomment-49713650 QA tests have started for PR 1524. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16958/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49713754 Is it possible to support syntax like `0.3 +- 0.1` for absolute error, and `0.3 +- 10%` for relative error? Seems like the kind of crazy thing that Scala just might support. Maybe it's a nice way to support both semantics; I think relative error semantics have to be a separate method anwayy. Also, there is the method `Math.ulp` (http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#ulp(double)) whichs tell you how big the gap is between floating-point values around a given value. How about using this to pick a reasonable absolute error around the test's expected value? At least it scales automatically. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...
GitHub user li-zhihui opened a pull request: https://github.com/apache/spark/pull/1525 Fix race condition at SchedulerBackend.isReady in standalone mode In SPARK-1946(PR #900), configuration codespark.scheduler.minRegisteredExecutorsRatio/code was introduced. However, in standalone mode, there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set. Because expected executors is uncertain in standalone mode, the PR try to use CPU cores(code--total-executor-cores/code) as expected resources to judge whether SchedulerBackend is ready. You can merge this pull request into a Git repository by running: $ git pull https://github.com/li-zhihui/spark fixre4s Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1525.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1525 commit 8b54316c77d086ea3454419ebba92003707bbd76 Author: li-zhihui zhihui...@intel.com Date: 2014-07-22T08:15:40Z Fix race condition at SchedulerBackend.isReady in standalone mode --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49714576 QA results for PR 1521:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16947/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1525#issuecomment-49714817 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...
Github user li-zhihui commented on the pull request: https://github.com/apache/spark/pull/1525#issuecomment-49714878 @kayousterhout @tgravescs --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1399#issuecomment-49715371 QA tests have started for PR 1399. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16959/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1522#issuecomment-49715697 QA results for PR 1522:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16949/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1499#issuecomment-49715947 QA results for PR 1499:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16948/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1416: PySpark support for SequenceFile a...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/455#issuecomment-49716773 @ericgarcia hey that's cool. Would you mind creating a pull request to add that to the Spark examples? It would be very useful to others to see how to make a fully generic Avro converter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/1526 [SPARK-2617] Correct doc and usages of preservesPartitioning The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages. This PR 1. adds notes in `maPartitions*`, 2. makes `RDD.sample` preserve partitioner, 3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD, 4. fixes some wrong usages in MLlib. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mengxr/spark preserve-partitioner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1526 commit d1caa65c7440cb92aed40dcab4b95c03389350c4 Author: Xiangrui Meng m...@databricks.com Date: 2014-07-22T09:15:44Z add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1195#issuecomment-49717415 QA results for PR 1195:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16954/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1526#issuecomment-49717533 QA tests have started for PR 1526. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49717664 Merged into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1521 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1502#issuecomment-49717790 QA results for PR 1502:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds the following public classes (experimental):br* This trait extends Any to ensure it is universal (and thus compiled to a Java interface).brclass KVArraySortDataFormat[K, T : AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {brclass SorterK, Buffer {brbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1460#issuecomment-49718092 QA results for PR 1460:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds the following public classes (experimental):brclass AutoSerializer(FramedSerializer):brclass Aggregator(object):brclass Merger(object):brclass InMemoryMerger(Merger):brclass ExternalMerger(Merger):brbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16957/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-1022] Kafka unit test that actually sen...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/557#issuecomment-49718360 Hi TD, I've fixed this flaky issue in my local test, how should I submit my patch, should I push to your branch or create a new PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Graphx example
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1523#issuecomment-49718923 QA results for PR 1523:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16956/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1521#issuecomment-49719044 QA results for PR 1521:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16955/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1399#issuecomment-49721444 QA results for PR 1399:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds the following public classes (experimental):brclass SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {brbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16959/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2387: remove stage barrier
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/1328#discussion_r15220164 --- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala --- @@ -340,6 +459,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf) */ private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) { protected val mapStatuses = new HashMap[Int, Array[MapStatus]] +with mutable.SynchronizedMap[Int, Array[MapStatus]] --- End diff -- I think `ConcurrentHashMap` is better in most cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1524#issuecomment-49722653 QA results for PR 1524:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16958/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1526#issuecomment-49724616 QA results for PR 1526:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726353 @pwendell Sure, and I'll do this in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726591 @marmbrus I've made the changes you suggested and moved the rule to the Resolution batch but now the newly-inserted attribute references remain unresolved after analysis. I'm going to try and figure out where things are going wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1399#discussion_r15222777 --- Diff: bin/beeline --- @@ -0,0 +1,45 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Figure out where Spark is installed +FWDIR=$(cd `dirname $0`/..; pwd) + +# Find the java binary +if [ -n ${JAVA_HOME} ]; then --- End diff -- Hit a bug while fixing this: application options with names that `SparkSubmitArguments` recognizes are stolen by `SparkSubmit` instead of passed to the application. For example, when running `BeeLine` with `spark-submit`, passing `--help` option shows the usage message of `SparkSubmit` rather than `BeeLine`. Since the `spark-internal` issue also touches this part of code, I tried to fix this bug together within this PR to avoid further conflict. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [Build] SPARK-2614: Add the spark-examples-xxx...
GitHub user tzolov opened a pull request: https://github.com/apache/spark/pull/1527 [Build] SPARK-2614: Add the spark-examples-xxx-.jar to the generated debian package. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tzolov/spark SPARK-2614 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1527 commit 32ef6ddc8043b000f90e802b9d8db0a244887d96 Author: tzolov christian.tzo...@gmail.com Date: 2014-07-22T12:18:02Z [SPARK-2614] Add the spark-examples-xxx-.jar to the Debian package --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [Build] SPARK-2614: Add the spark-examples-xxx...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1527#issuecomment-49731850 Can one of the admins verify this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/1094#issuecomment-49733427 Jenkins, test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1094#issuecomment-49733823 QA tests have started for PR 1094. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16962/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-49733822 QA tests have started for PR 1269. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16961/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1094#issuecomment-49733906 QA results for PR 1094:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16962/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---