[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-49701430
  
QA tests have started for PR 1056. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16945/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2612/Fix data skew in ALS

2014-07-22 Thread renozhang
GitHub user renozhang opened a pull request:

https://github.com/apache/spark/pull/1521

SPARK-2612/Fix data skew in ALS



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/renozhang/spark fix-als

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1521.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1521


commit 1a4f7a02ea9dc077837c19cc64d2135275d5a258
Author: peng.zhang peng.zh...@xiaomi.com
Date:   2014-07-22T05:59:23Z

Fix data skew in ALS






[GitHub] spark pull request: SPARK-2612/Fix data skew in ALS

2014-07-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49701893
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-1729. Make Flume pull data from source, ...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/807#issuecomment-49703872
  
Hey folks - I just took a light pass on this.

One thing I'm a bit confused about - the sink here is not really specific 
to Spark at all (there is no dependency on Spark). The classes used are 
entirely general avro classes that other applications could use easily. So why 
is this being contributed to Spark instead of being put inside of Flume as a 
general-purpose sink that supports polling from external sources? I could imagine that other systems might also want to integrate with Flume via a pull/polling-based model rather than via a push model.

Was there any previous consideration to put this directly in the Flume 
project?

Adding this in Spark adds some nontrivial steps to the Spark build (such as 
an avro compiler). So naturally a question is whether it belongs in Spark or 
Flume. I don't think we can easily change this later since we expose the 
`SparkFlumePollingEvent` type directly to users, and that type is in the spark 
package. Actually, one related thing: can we not convert these to `SparkFlumeEvent`s so that the types do not differ depending on which type of Flume integration is being used? Having to deal with a different type could be confusing for users.






[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213062
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
+   */
+  def nextValue(): Double
+
+  /**
+   * @return A copy of the DistributionGenerator with a new instance of 
the rng object used in the
+   * class when applicable. Each partition has a unique seed and 
therefore requires its
+   * own instance of the DistributionGenerator.
+   */
+  def newInstance(): DistributionGenerator
--- End diff --

I saw your argument about using `clone`. Between `copy` and `newInstance`, 
I think `copy` is better. For example, in Poisson, we need to copy the mean, 
which is not reflected in `newInstance`.
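
For illustration, a minimal sketch of what the `copy`-based variant could look like for the Poisson case (hypothetical code, not the PR's; the PR currently declares `newInstance`):

~~~
import cern.jet.random.Poisson
import cern.jet.random.engine.DRand

// Hypothetical variant of the trait with copy() in place of newInstance().
trait DistributionGenerator extends Serializable {
  def nextValue(): Double
  def setSeed(seed: Long): Unit
  def copy(): DistributionGenerator
}

class PoissonGenerator(val mean: Double) extends DistributionGenerator {
  private var rng = new Poisson(mean, new DRand)

  override def nextValue(): Double = rng.nextDouble()

  // Re-seeding replaces the underlying colt generator.
  override def setSeed(seed: Long): Unit = {
    rng = new Poisson(mean, new DRand(seed.toInt))
  }

  // copy() carries the mean over to the fresh instance, which the name
  // newInstance() would not suggest.
  override def copy(): PoissonGenerator = new PoissonGenerator(mean)
}
~~~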




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213061
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
+   */
+  def nextValue(): Double
+
+  /**
+   * @return A copy of the DistributionGenerator with a new instance of 
the rng object used in the
+   * class when applicable. Each partition has a unique seed and 
therefore requires its
--- End diff --

`partition` has no context here. Maybe simply mention that this is for 
running multiple instances concurrently.




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213059
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
--- End diff --

Change `@return` to `Returns`. Otherwise the summary will be empty in the 
generated docs.
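
For reference, the doc comment would then read along these lines (illustrative sketch):

~~~
trait DistributionGenerator extends Serializable {
  /** Returns an i.i.d. sample as a Double from an underlying distribution. */
  def nextValue(): Double
}
~~~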




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213064
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
+   */
+  def nextValue(): Double
+
+  /**
+   * @return A copy of the DistributionGenerator with a new instance of 
the rng object used in the
+   * class when applicable. Each partition has a unique seed and 
therefore requires its
+   * own instance of the DistributionGenerator.
+   */
+  def newInstance(): DistributionGenerator
+}
+
+/**
+ * Generates i.i.d. samples from U[0.0, 1.0]
+ */
+class UniformGenerator() extends DistributionGenerator {
+
+  // XORShiftRandom for better performance. Thread safety isn't necessary 
here.
+  private val random = new XORShiftRandom()
+
+  /**
+   * @return An i.i.d sample as a Double from U[0.0, 1.0].
+   */
+  override def nextValue(): Double = {
+random.nextDouble()
+  }
+
+  /** Set random seed. */
--- End diff --

Doc is not necessary for the overloaded methods, unless you want to update 
it. 




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213072
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.rdd.{RandomVectorRDD, RandomRDD}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+// TODO add Scaladocs once API fully approved
+// Alternatively, we can use the generator pattern to set numPartitions, 
seed, etc instead to bring
+// down the number of methods here.
+object RandomRDDGenerators {
+
+  def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: 
Long): RDD[Double] = {
+val uniform = new UniformGenerator()
+randomRDD(sc, size, numPartitions, uniform, seed)
+  }
+
+  def uniformRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = {
+uniformRDD(sc, size, sc.defaultParallelism, seed)
+  }
+
+  def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int): 
RDD[Double] = {
+uniformRDD(sc, size, numPartitions, Utils.random.nextLong)
+  }
+
+  def uniformRDD(sc: SparkContext, size: Long): RDD[Double] = {
+uniformRDD(sc, size, sc.defaultParallelism, Utils.random.nextLong)
+  }
+
+  def normalRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: 
Long): RDD[Double] = {
+val normal = new StandardNormalGenerator()
+randomRDD(sc, size, numPartitions, normal, seed)
+  }
+
+  def normalRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = {
+normalRDD(sc, size, sc.defaultParallelism, seed)
+  }
+
+  def normalRDD(sc: SparkContext, size: Long, numPartitions: Int): 
RDD[Double] = {
+normalRDD(sc, size, numPartitions, Utils.random.nextLong)
+  }
+
+  def normalRDD(sc: SparkContext, size: Long): RDD[Double] = {
+normalRDD(sc, size, sc.defaultParallelism, Utils.random.nextLong)
+  }
+
+  def poissonRDD(sc: SparkContext,
+  size: Long,
+  numPartitions: Int,
+  mean: Double,
+  seed: Long): RDD[Double] = {
+val poisson = new PoissonGenerator(mean)
+randomRDD(sc, size, numPartitions, poisson, seed)
+  }
+
+  def poissonRDD(sc: SparkContext, size: Long, mean: Double, seed: Long): 
RDD[Double] = {
+poissonRDD(sc, size, sc.defaultParallelism, mean, seed)
+  }
+
+  def poissonRDD(sc: SparkContext, size: Long, numPartitions: Int, mean: 
Double): RDD[Double] = {
+poissonRDD(sc, size, numPartitions, mean, Utils.random.nextLong)
+  }
+
+  def poissonRDD(sc: SparkContext, size: Long, mean: Double): RDD[Double] 
= {
+poissonRDD(sc, size, sc.defaultParallelism, mean, 
Utils.random.nextLong)
+  }
+
+  def randomRDD(sc: SparkContext,
+  size: Long,
+  numPartitions: Int,
+  distribution: DistributionGenerator,
--- End diff --

We need to be consistent on the argument name. `distribution` is used here 
but `rng` is used in `randomVectorRDD`. `generator` sounds better to me than 
`distribution` because `DistributionGenerator` is a generator, not a distribution, and in this context we don't need to spell out `distributionGenerator`.




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213069
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/RandomRDDGenerators.scala ---
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import org.apache.spark.SparkContext
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.rdd.{RandomVectorRDD, RandomRDD}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+// TODO add Scaladocs once API fully approved
+// Alternatively, we can use the generator pattern to set numPartitions, 
seed, etc instead to bring
+// down the number of methods here.
+object RandomRDDGenerators {
+
+  def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: 
Long): RDD[Double] = {
+val uniform = new UniformGenerator()
+randomRDD(sc, size, numPartitions, uniform, seed)
+  }
+
+  def uniformRDD(sc: SparkContext, size: Long, seed: Long): RDD[Double] = {
+uniformRDD(sc, size, sc.defaultParallelism, seed)
+  }
+
+  def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int): 
RDD[Double] = {
--- End diff --

It is very confusing to have both `(SparkContext, Long, Int)` and `(SparkContext, Long, Long)`. If a user doesn't notice `(SparkContext, Long, Int)`, treats the third argument as the seed, and passes an integer, it actually sets the number of partitions. Maybe we should only allow default values at the end. That is,

~~~
def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int, seed: Long)

def uniformRDD(sc: SparkContext, size: Long, numPartitions: Int)
~~~
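
For example (illustrative call, not from the PR), a call meant to pass a seed can silently resolve to the `numPartitions` overload:

~~~
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDGenerators

val sc = new SparkContext("local", "overload-example")
// Intended as "1000 samples with seed 42", but the Int literal 42 matches the
// (SparkContext, Long, Int) overload, so it becomes the number of partitions
// and the seed is chosen at random.
val rdd = RandomRDDGenerators.uniformRDD(sc, 1000L, 42)
~~~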




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213063
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
+   */
+  def nextValue(): Double
+
+  /**
+   * @return A copy of the DistributionGenerator with a new instance of 
the rng object used in the
+   * class when applicable. Each partition has a unique seed and 
therefore requires its
+   * own instance of the DistributionGenerator.
+   */
+  def newInstance(): DistributionGenerator
+}
+
+/**
+ * Generates i.i.d. samples from U[0.0, 1.0]
+ */
+class UniformGenerator() extends DistributionGenerator {
--- End diff --

Is `()` necessary?




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213096
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
+val rng: DistributionGenerator,
+val seed: Long) extends Partition {
+
+  override val index: Int = idx
+
+}
+
+// These two classes are necessary since Range objects in Scala cannot have size > Int.MaxValue
+private[mllib] class RandomRDD(@transient private var sc: SparkContext,
+size: Long,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Double] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getPointIterator(split)
+  }
+
+  override def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] class RandomVectorRDD(@transient private var sc: 
SparkContext,
+size: Long,
+vectorSize: Int,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+  require(vectorSize > 0, "Positive vector size required.")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Vector] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getVectorIterator(split, vectorSize)
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] object RandomRDD {
+
+  private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: 
DistributionGenerator)
+extends Iterator[Double] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Double = {
+  currentSize += 1
+  rng.nextValue()
+}
+  }
+
+  private[mllib] class FixedSizeVectorIterator(val numElem: Long,
+  val vectorSize: Int,
+  val rng: DistributionGenerator)
+extends Iterator[Vector] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Vector = {
+  currentSize += 1
+  new DenseVector((0 until vectorSize).map { _ => rng.nextValue() }.toArray)
+}
+  }
+
+  def getPartitions(size: Long,
+  numSlices: Int,
+  rng: DistributionGenerator,
+  seed: Long): Array[Partition] = {
+
+val firstPartitionSize = size / numSlices + size % numSlices
--- End diff --

This is not evenly distributed, for example, when `size = 1000` and 
`numSlices = 101`.
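
To make the skew concrete, a worked example with those numbers (not from the PR):

~~~
val size = 1000L
val numSlices = 101
val firstPartitionSize = size / numSlices + size % numSlices  // 9 + 91 = 100
val otherPartitionSize = size / numSlices                     // 9
// Total: 100 + 100 * 9 = 1000, but the first partition is ~11x larger than the rest.
~~~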




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213082
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
+val rng: DistributionGenerator,
+val seed: Long) extends Partition {
+
+  override val index: Int = idx
--- End diff --

You can put `override` directly in the constructor.
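
i.e. something along these lines (a sketch of the same class):

~~~
package org.apache.spark.mllib.rdd

import org.apache.spark.Partition
import org.apache.spark.mllib.random.DistributionGenerator

// `override` on the constructor parameter replaces the separate `index` member.
private[mllib] class RandomRDDPartition(
    override val index: Int,
    val size: Long,
    val rng: DistributionGenerator,
    val seed: Long) extends Partition
~~~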




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213078
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
--- End diff --

Could you double check whether we can have more than `Int.MaxValue` items 
in a single partition? It may break storage and a couple of RDD functions like `glob`.




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213091
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
+val rng: DistributionGenerator,
+val seed: Long) extends Partition {
+
+  override val index: Int = idx
+
+}
+
+// These two classes are necessary since Range objects in Scala cannot have size > Int.MaxValue
+private[mllib] class RandomRDD(@transient private var sc: SparkContext,
+size: Long,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Double] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getPointIterator(split)
+  }
+
+  override def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] class RandomVectorRDD(@transient private var sc: 
SparkContext,
+size: Long,
+vectorSize: Int,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+  require(vectorSize > 0, "Positive vector size required.")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Vector] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getVectorIterator(split, vectorSize)
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] object RandomRDD {
+
+  private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: 
DistributionGenerator)
--- End diff --

If we couldn't have more than `Int.MaxValue` items per iteration, this is 
`Iterator.fill(numElem)(rng.nextValue())`.
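
A sketch of that simplification, assuming the per-partition count fits in an Int (object and method names here are illustrative):

~~~
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.mllib.random.DistributionGenerator

object RandomIterators {
  // Equivalent to FixedSizePointIterator when the per-partition count fits in an Int.
  def pointIterator(numElem: Int, rng: DistributionGenerator): Iterator[Double] =
    Iterator.fill(numElem)(rng.nextValue())

  // The vector case follows the same pattern.
  def vectorIterator(numElem: Int, vectorSize: Int, rng: DistributionGenerator): Iterator[Vector] =
    Iterator.fill(numElem)(new DenseVector(Array.fill(vectorSize)(rng.nextValue())))
}
~~~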




[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15213097
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
+val rng: DistributionGenerator,
+val seed: Long) extends Partition {
+
+  override val index: Int = idx
+
+}
+
+// These two classes are necessary since Range objects in Scala cannot have size > Int.MaxValue
+private[mllib] class RandomRDD(@transient private var sc: SparkContext,
+size: Long,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Double] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getPointIterator(split)
+  }
+
+  override def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] class RandomVectorRDD(@transient private var sc: 
SparkContext,
+size: Long,
+vectorSize: Int,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required")
+  require(vectorSize > 0, "Positive vector size required.")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Vector] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getVectorIterator(split, vectorSize)
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] object RandomRDD {
+
+  private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: 
DistributionGenerator)
+extends Iterator[Double] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Double = {
+  currentSize += 1
+  rng.nextValue()
+}
+  }
+
+  private[mllib] class FixedSizeVectorIterator(val numElem: Long,
+  val vectorSize: Int,
+  val rng: DistributionGenerator)
+extends Iterator[Vector] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Vector = {
+  currentSize += 1
+  new DenseVector((0 until vectorSize).map { _ => rng.nextValue() }.toArray)
+}
+  }
+
+  def getPartitions(size: Long,
+  numSlices: Int,
+  rng: DistributionGenerator,
+  seed: Long): Array[Partition] = {
+
+val firstPartitionSize = size / numSlices + size % numSlices
+val otherPartitionSize = size / numSlices
+
+val partitions = new Array[RandomRDDPartition](numSlices)
+var i = 0
+while (i < numSlices) {
+  partitions(i) = if (i == 0) {
+new RandomRDDPartition(i, firstPartitionSize, rng, seed)
--- End diff --

It is safer to make a copy in `compute` than here. In local mode, this may 
cause problems if a user uses the same generator to create two random RDDs.
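
A rough sketch of that direction (hypothetical class; it reuses the PR's `newInstance`, `RandomRDDPartition`, and `RandomRDD.getPartitions`):

~~~
package org.apache.spark.mllib.rdd

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.mllib.random.DistributionGenerator
import org.apache.spark.rdd.RDD

// Sketch only: defer copying the generator to compute, so each task (and each
// RDD built from the same generator in local mode) works on its own instance.
private[mllib] class RandomRDDSketch(
    @transient private var sc: SparkContext,
    size: Long,
    numSlices: Int,
    @transient rng: DistributionGenerator,
    @transient seed: Long) extends RDD[Double](sc, Nil) {

  override def compute(splitIn: Partition, context: TaskContext): Iterator[Double] = {
    val split = splitIn.asInstanceOf[RandomRDDPartition]
    val generator = split.rng.newInstance()  // copy made here, not in getPartitions
    generator.setSeed(split.seed)
    Iterator.fill(split.size.toInt)(generator.nextValue())
  }

  override def getPartitions: Array[Partition] =
    RandomRDD.getPartitions(size, numSlices, rng, seed)
}
~~~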



[GitHub] spark pull request: SPARK-1729. Make Flume pull data from source, ...

2014-07-22 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/807#issuecomment-49704453
  
Thanks for taking a look Patrick. The reason this really is not going into 
Flume itself is that Flume sinks are push model (this sink is hacking around 
it). For now, I don't expect a polling model sink to get committed to Flume, 
plus this is a very special case compared to the normal sinks - and it is more 
specific than general purpose.

I will look into that SparkFlumeEvent - for now, it seems like it should be 
possible. Will update this later if needed.




[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1487#issuecomment-49704715
  
Jenkins, retest this please. Jenkins, add to whitelist.




[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1487#issuecomment-49705027
  
QA tests have started for PR 1487. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16946/consoleFull




[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS

2014-07-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49706031
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1399#discussion_r15213826
  
--- Diff: docs/sql-programming-guide.md ---
@@ -573,4 +572,170 @@ prefixed with a tick (`'`).  Implicit conversions 
turn these symbols into expres
 evaluated by the SQL execution engine.  A full list of the functions 
supported can be found in the
 [ScalaDoc](api/scala/index.html#org.apache.spark.sql.SchemaRDD).
 
-<!-- TODO: Include the table of operations here. -->
\ No newline at end of file
+<!-- TODO: Include the table of operations here. -->
+
+## Running the Thrift JDBC server
+
+The Thrift JDBC server implemented here corresponds to the [`HiveServer2`]
+(https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) 
in Hive 0.12. You can test
+the JDBC server with the beeline script comes with either Spark or Hive 
0.12.
+
+To start the JDBC server, run the following in the Spark directory:
+
+./sbin/start-thriftserver.sh
+
+The default port the server listens on is 10000.  Now you can use beeline
to test the Thrift JDBC
+server:
+
+./bin/beeline
+
+Connect to the JDBC server in beeline with:
+
+beeline> !connect jdbc:hive2://localhost:10000
+
+Beeline will ask you for a username and password. In non-secure mode, 
simply enter the username on
+your machine and a blank password. For secure mode, please follow the 
instructions given in the
+[beeline 
documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients)
+
+Configuration of Hive is done by placing your `hive-site.xml` file in 
`conf/`.
+
+You may also use the beeline script comes with Hive.
+
+### Migration Guide for Shark Users
+
+ Reducer number
+
+In Shark, default reducer number is 1, and can be tuned by property 
`mapred.reduce.tasks`. In Spark SQL, reducer number is default to 200, and can 
be customized by the `spark.sql.shuffle.partitions` property:
--- End diff --

Yep, exactly




[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49706388
  
QA tests have started for PR 1521. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16947/consoleFull




[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1499#issuecomment-49706936
  
@aarondav pushed an update to the grow code as well, which will now use 
estimateSize. I implemented the same thing in EAOM.

@colorant fixed those, thanks.




[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1499#issuecomment-49707121
  
QA tests have started for PR 1499. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16948/consoleFull




[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1510#discussion_r15214158
  
--- Diff: 
tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala ---
@@ -99,9 +99,25 @@ object GenerateMIMAIgnore {
 (ignoredClasses.flatMap(c => Seq(c, c.replace("$", "#"))).toSet, ignoredMembers.toSet)
   }
 
+  /** Scala reflection does not let us see inner function even if they are 
upgraded
+* to public for some reason. So had to resort to java reflection to 
get all inner
+* functions with $$ in there name.
+*/
+  def getInnerFunc(classSymbol: unv.ClassSymbol): Seq[String] = {
--- End diff --

Can this be called `getInnerFunctions` since it enumerates multiple inner 
functions?




[GitHub] spark pull request: [SPARK-2549] Functions defined inside of other...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1510#issuecomment-49707182
  
Small comment but looks good. Have you tested this?




[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...

2014-07-22 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/1522

[SPARK-2615] [SQL] Add Equal Sign == Support for HiveQl

Currently, a `==` in a HiveQL expression causes an exception to be thrown; this patch fixes it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark equal

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1522


commit f62a0ff31627d7e11c9c598869d3439890185518
Author: Cheng Hao hao.ch...@intel.com
Date:   2014-07-22T07:24:43Z

Add == Support for HiveQl






[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-49707381
  
QA results for PR 1056:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Heartbeat(
class HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {
case class SparkListenerExecutorMetricsUpdate(execId: String,
case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16944/consoleFull




[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS

2014-07-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49707382
  
@renozhang The change looks good to me and thanks for finding it! Do you 
mind adding `[MLLIB]` to the title of this PR and also removing `productPartitioner` from `updateFeatures`, since it is no longer used there?




[GitHub] spark pull request: [SPARK-2612] Fix data skew in ALS

2014-07-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49707420
  
Jenkins, add to whitelist.




[GitHub] spark pull request: [SPARK-2452] Create a new valid for each inste...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1441#issuecomment-49707481
  
Okay merging this into master. For the 1.0 branch we just reverted the fix.




[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1522#issuecomment-49707497
  
QA tests have started for PR 1522. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16949/consoleFull




[GitHub] spark pull request: [SPARK-2452] Create a new valid for each inste...

2014-07-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1441




[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1253#issuecomment-49707690
  
LGTM - @mateiz @rxin any final comments here?




[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49707787
  
Jenkins, retest this please.




[GitHub] spark pull request: SPARK-2099. Report progress while task is runn...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1056#issuecomment-49707887
  
QA results for PR 1056:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Heartbeat(
class HeartbeatReceiver(scheduler: TaskScheduler) extends Actor {
case class SparkListenerExecutorMetricsUpdate(execId: String,
case class BlockManagerHeartbeat(blockManagerId: BlockManagerId) extends ToBlockManagerMaster

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16945/consoleFull




[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1253#discussion_r15214419
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -55,6 +55,7 @@ private[spark] class SparkSubmitArguments(args: 
Seq[String]) {
   var verbose: Boolean = false
   var isPython: Boolean = false
   var pyFiles: String = null
+  var sparkProperties: HashMap[String, String] = new HashMap[String, 
String]()
--- End diff --

I think you want a val here?
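
A minimal sketch of the difference (the field name comes from the diff above; this is illustrative, not the PR's code):

```
import scala.collection.mutable.HashMap

// A val is enough here: the HashMap itself is mutable, so entries can still
// be added or changed; only the reference never needs to be reassigned.
val sparkProperties = new HashMap[String, String]()
sparkProperties("spark.shuffle.spill") = "false"  // still compiles with a val
```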


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49707966
  
@willb mind adding [SQL] to the title here? It helps this filter nicely 
in some internal tools we use to triage PRs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1253#discussion_r15214475
  
--- Diff: docs/configuration.md ---
@@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf())
 
 Then, you can supply configuration values at runtime:
 {% highlight bash %}
-./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar
+./bin/spark-submit --name "My app" --master local[4] myApp.jar --conf spark.shuffle.spill=false
--- End diff --

It'd be great to have an example with values that contain spaces so quotes 
are needed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1502#discussion_r15214488
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala
 ---
@@ -252,7 +251,7 @@ class ExternalAppendOnlyMap[K, V, C](
   if (it.hasNext) {
 var kc = it.next()
 kcPairs += kc
-val minHash = getKeyHashCode(kc)
+val minHash = hashKey(kc)
--- End diff --

Just curious, is this more efficient than calling 
ExternalAppendOnlyMap.hash directly as we did before? It was kind of weird that 
we were doing it in an inner class; maybe it created another pointer 
dereference.
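
For readers following along, a rough sketch of the two call shapes being compared (simplified, illustrative names; not the actual ExternalAppendOnlyMap code):

```
object MapCompanion {
  // the shared hash helper on the companion object
  def hash[T](obj: T): Int = if (obj == null) 0 else obj.hashCode()
}

class MapLike[K, C] {
  // the private wrapper the new code routes through
  private def hashKey(kc: (K, C)): Int = MapCompanion.hash(kc._1)

  private class InnerIterator(it: Iterator[(K, C)]) {
    // before: MapCompanion.hash(kc._1) was called here directly;
    // after:  the enclosing class's hashKey(kc) is called instead
    def minHashOf(kc: (K, C)): Int = hashKey(kc)
  }
}
```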


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-49708099
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1253#discussion_r15214491
  
--- Diff: docs/configuration.md ---
@@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf())
 
 Then, you can supply configuration values at runtime:
 {% highlight bash %}
-./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar
+./bin/spark-submit --name "My app" --master local[4] myApp.jar --conf spark.shuffle.spill=false
 {% endhighlight %}
 
 The Spark shell and 
[`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit)
 tool support two ways to load configurations dynamically. The first are 
command line options,
-such as `--master`, as shown above. Running `./bin/spark-submit --help` 
will show the entire list
-of options.
+such as `--master`, as shown above. `spark-submit` can accept any Spark 
property using the `--conf`
+flag, but uses special flags for properties that play a part in launching 
the Spark application.
+Running `./bin/spark-submit --help` will show the entire list of these 
options.
--- End diff --

can we also make spark-submit without any options print the help? (maybe 
you are already doing that)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-22 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-49708172
  
Somehow this makes the unit tests take very long to finish. I also suspect 
there is a race condition in the cleaning code; this PR just makes it 
manifest more often. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-49708325
  
QA tests have started for PR 1498. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16952/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49708347
  
QA tests have started for PR 1502. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2521] Broadcast RDD object (instead of ...

2014-07-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1498#issuecomment-49708333
  
I was doing some unrelated work locally and noticed a weird synchronization 
issue around the TorrentBroadcast lock (there is a single shared lock used 
heavily inside that class): a bunch of tasks were waiting on the lock for a 
long time. Maybe that is also slowing down the tests. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1195#issuecomment-49708363
  
QA tests have started for PR 1195. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16954/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2024] Add saveAsSequenceFile to PySpark

2014-07-22 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/1338#issuecomment-49708463
  
@kanzhang @mateiz Yeah, this is one issue with Pyrolite vs MsgPack. MsgPack 
supported case classes out of the box, which are likely a bit more common 
than beans.

I'd say that custom serde via `Converter` will be far more common (as we've 
already seen with various Avro commentary etc).

Thinking about it some more, I would be ok to remove from the docs. This 
would still be available as undocumented functionality so if relevant use cases 
did come up on the mailing list, we could point to it and in the unlikely case 
that there was demand we could simply document it as read-only functionality.

Bearing in mind this is also still marked experimental and we'll need to 
see how users use it in the wild a bit and make any amendments as required.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15214670
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -124,15 +124,20 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
   private def putInBlockManager[T](
   key: BlockId,
   values: Iterator[T],
-  storageLevel: StorageLevel,
-  updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)]): Iterator[T] = {
-
-if (!storageLevel.useMemory) {
-  /* This RDD is not to be cached in memory, so we can just pass the 
computed values
+  level: StorageLevel,
+  updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
+  effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
--- End diff --

This should be documented in the javadoc here - it's not at all obvious 
what it means.
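
Something along these lines in the scaladoc would make the intent clear (the wording is only a suggestion, not the PR's actual doc):

```
/**
 * @param effectiveStorageLevel the level at which the block is actually stored,
 *        if it differs from the level the caller originally requested (for
 *        example when a partition does not fit in memory and is written to
 *        disk instead). `None` means the requested level is used as-is.
 */
```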


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15214812
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -124,15 +124,20 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
   private def putInBlockManager[T](
   key: BlockId,
   values: Iterator[T],
-  storageLevel: StorageLevel,
-  updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)]): Iterator[T] = {
-
-if (!storageLevel.useMemory) {
-  /* This RDD is not to be cached in memory, so we can just pass the 
computed values
+  level: StorageLevel,
+  updatedBlocks: ArrayBuffer[(BlockId, BlockStatus)],
+  effectiveStorageLevel: Option[StorageLevel] = None): Iterator[T] = {
+
+val putLevel = effectiveStorageLevel.getOrElse(level)
+if (!putLevel.useMemory) {
+  /*
+   * This RDD is not to be cached in memory, so we can just pass the 
computed values
* as an iterator directly to the BlockManager, rather than first 
fully unrolling
* it in memory. The latter option potentially uses much more memory 
and risks OOM
--- End diff --

This doc is a bit outdated now (we no longer risk OOM ideally). I think 
it's a bit overly complicated anyways. I'd just make it something like:

```
// This RDD is not being cached in memory, so pass an iterator directly to the block manager.
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Graphx example

2014-07-22 Thread CrazyJvm
GitHub user CrazyJvm opened a pull request:

https://github.com/apache/spark/pull/1523

Graphx example

fix examples

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/CrazyJvm/spark graphx-example

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1523


commit 7cfff1d029ace9bdb2cd39e726d144a1cb8d868f
Author: CrazyJvm crazy...@gmail.com
Date:   2014-07-22T06:56:28Z

fix example for joinVertices

commit 663457a9f63c6e7bb1087e1ca4ed2a483ad3aa7a
Author: CrazyJvm crazy...@gmail.com
Date:   2014-07-22T07:04:03Z

outDegrees does not take parameters




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15214933
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -140,14 +145,36 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
      throw new BlockException(key, s"Block manager failed to return cached value for $key!")
   }
 } else {
-  /* This RDD is to be cached in memory. In this case we cannot pass 
the computed values
+  /*
+   * This RDD is to be cached in memory. In this case we cannot pass 
the computed values
* to the BlockManager as an iterator and expect to read it back 
later. This is because
-   * we may end up dropping a partition from memory store before 
getting it back, e.g.
-   * when the entirety of the RDD does not fit in memory. */
-  val elements = new ArrayBuffer[Any]
-  elements ++= values
-  updatedBlocks ++= blockManager.put(key, elements, storageLevel, 
tellMaster = true)
-  elements.iterator.asInstanceOf[Iterator[T]]
+   * we may end up dropping a partition from memory store before 
getting it back.
+   *
+   * In addition, we must be careful to not unroll the entire 
partition in memory at once.
+   * Otherwise, we may cause an OOM exception if the JVM does not have 
enough space for this
+   * single partition. Instead, we unroll the values cautiously, 
potentially aborting and
+   * dropping the partition to disk if applicable.
+   */
+  blockManager.memoryStore.unrollSafely(key, values, updatedBlocks) match {
+case Left(arr) =>
+  // We have successfully unrolled the entire partition, so cache it in memory
+  updatedBlocks ++=
+blockManager.putArray(key, arr, level, tellMaster = true, effectiveStorageLevel)
+  arr.iterator.asInstanceOf[Iterator[T]]
+case Right(it) =>
+  // There is not enough space to cache this partition in memory
+  logWarning(s"Not enough space to cache $key in memory! " +
+s"Free memory is ${blockManager.memoryStore.freeMemory} bytes.")
+  var returnValues = it.asInstanceOf[Iterator[T]]
--- End diff --

can you restructure this to not use a val?:

```
if (putLevel.useDisk) {
  logWarning(s"Persisting $key to disk instead.")
  val diskOnlyLevel = StorageLevel(...)
  putInBlockManager[T](key, returnValues, level, updatedBlocks, Some(diskOnlyLevel))
} else {
  it.asInstanceOf[Iterator[T]]
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1777] Prevent OOMs from single partitio...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1165#discussion_r15214991
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/SizeTracker.scala ---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util.collection
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.util.SizeEstimator
+
+/**
+ * A general interface for collections to keep track of their estimated 
sizes in bytes.
+ * We sample with a slow exponential back-off using the SizeEstimator to 
amortize the time,
+ * as each call to SizeEstimator is somewhat expensive (order of a few 
milliseconds).
+ */
+private[spark] trait SizeTracker {
+
+  import SizeTracker._
+
+  /**
+   * Controls the base of the exponential which governs the rate of 
sampling.
+   * E.g., a value of 2 would mean we sample at 1, 2, 4, 8, ... elements.
+   */
+  private val SAMPLE_GROWTH_RATE = 1.1
+
+  /** Samples taken since last resetSamples(). Only the last two are kept 
for extrapolation. */
+  private val samples = new ArrayBuffer[Sample]
+
+  /** The average number of bytes per update between our last two samples. 
*/
+  private var bytesPerUpdate: Double = _
+
+  /** Total number of insertions and updates into the map since the last 
resetSamples(). */
+  private var numUpdates: Long = _
+
+  /** The value of 'numUpdates' at which we will take our next sample. */
+  private var nextSampleNum: Long = _
+
+  resetSamples()
+
+  /**
+   * Reset samples collected so far.
+   * This should be called after the collection undergoes a dramatic 
change in size.
+   */
+  protected def resetSamples(): Unit = {
+numUpdates = 1
+nextSampleNum = 1
+samples.clear()
+takeSample()
+  }
+
+  /**
+   * Callback to be invoked after every update.
+   */
+  protected def afterUpdate(): Unit = {
+numUpdates += 1
+if (nextSampleNum == numUpdates) {
+  takeSample()
+}
+  }
+
+  /**
+   * Take a new sample of the current collection's size.
+   */
+  private def takeSample(): Unit = {
+samples += Sample(SizeEstimator.estimate(this), numUpdates)
+// Only use the last two samples to extrapolate
+if (samples.size > 2) {
+  samples.remove(0)
+}
+val bytesDelta = samples.toSeq.reverse match {
+  case latest :: previous :: tail =>
+    (latest.size - previous.size).toDouble / (latest.numUpdates - previous.numUpdates)
+  // If fewer than 2 samples, assume no change
+  case _ => 0
+}
+bytesPerUpdate = math.max(0, bytesDelta)
+nextSampleNum = math.ceil(numUpdates * SAMPLE_GROWTH_RATE).toLong
+  }
+
+  /**
+   * Estimate the current size of the collection in bytes. O(1) time.
+   */
+  def estimateSize(): Long = {
+assert(samples.nonEmpty)
--- End diff --

Okay - got it



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1253#discussion_r15215030
  
--- Diff: docs/configuration.md ---
@@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf())
 
 Then, you can supply configuration values at runtime:
 {% highlight bash %}
-./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar
+./bin/spark-submit --name "My app" --master local[4] myApp.jar --conf spark.shuffle.spill=false
--- End diff --

ah good call - asked about this earlier but forgot to follow up


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1253#discussion_r15215054
  
--- Diff: docs/configuration.md ---
@@ -42,13 +42,14 @@ val sc = new SparkContext(new SparkConf())
 
 Then, you can supply configuration values at runtime:
 {% highlight bash %}
-./bin/spark-submit --name "My fancy app" --master local[4] myApp.jar
+./bin/spark-submit --name "My app" --master local[4] myApp.jar --conf spark.shuffle.spill=false
 {% endhighlight %}
 
 The Spark shell and 
[`spark-submit`](cluster-overview.html#launching-applications-with-spark-submit)
 tool support two ways to load configurations dynamically. The first are 
command line options,
-such as `--master`, as shown above. Running `./bin/spark-submit --help` 
will show the entire list
-of options.
+such as `--master`, as shown above. `spark-submit` can accept any Spark 
property using the `--conf`
+flag, but uses special flags for properties that play a part in launching 
the Spark application.
+Running `./bin/spark-submit --help` will show the entire list of these 
options.
--- End diff --

That is already the case


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49709515
  
QA tests have started for PR 1521. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16955/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49709494
  
QA tests have started for PR 1460. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16957/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Graphx example

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1523#issuecomment-49709531
  
QA tests have started for PR 1523. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16956/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49709588
  
This looks good to me other than the question above. This will be pretty 
cool for sorting multiple columnar data structures.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2310. Support arbitrary Spark properties...

2014-07-22 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1253#issuecomment-49709826
  
Can you support `-c` in addition to `--conf`?

Also, the spark-submit doc 
(http://spark.apache.org/docs/latest/submitting-applications.html) should be 
updated to list this option.
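
A minimal sketch of how the alias could be wired into the argument parsing (the option names follow the discussion above; the parse loop itself is illustrative, not the actual SparkSubmitArguments code):

```
import scala.collection.mutable.HashMap

val sparkProperties = new HashMap[String, String]()

def parse(args: List[String]): Unit = args match {
  case ("--conf" | "-c") :: value :: tail =>
    value.split("=", 2) match {
      case Array(k, v) => sparkProperties(k) = v
      case _ => sys.error(s"Spark config without '=': $value")
    }
    parse(tail)
  case _ :: tail => parse(tail)  // other flags handled elsewhere
  case Nil =>
}
```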


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15215432
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/RandomRDD.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.rdd
+
+import org.apache.spark.{Partition, SparkContext, TaskContext}
+import org.apache.spark.mllib.linalg.{DenseVector, Vector}
+import org.apache.spark.mllib.random.DistributionGenerator
+import org.apache.spark.rdd.RDD
+import org.apache.spark.util.Utils
+
+private[mllib] class RandomRDDPartition(val idx: Int,
+val size: Long,
+val rng: DistributionGenerator,
+val seed: Long) extends Partition {
+
+  override val index: Int = idx
+
+}
+
+// These two classes are necessary since Range objects in Scala cannot have size > Int.MaxValue
+private[mllib] class RandomRDD(@transient private var sc: SparkContext,
+size: Long,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Double](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required.")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Double] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getPointIterator(split)
+  }
+
+  override def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] class RandomVectorRDD(@transient private var sc: 
SparkContext,
+size: Long,
+vectorSize: Int,
+numSlices: Int,
+@transient rng: DistributionGenerator,
+@transient seed: Long = Utils.random.nextLong) extends RDD[Vector](sc, 
Nil) {
+
+  require(size > 0, "Positive RDD size required.")
+  require(numSlices > 0, "Positive number of partitions required.")
+  require(vectorSize > 0, "Positive vector size required.")
+
+  override def compute(splitIn: Partition, context: TaskContext): 
Iterator[Vector] = {
+val split = splitIn.asInstanceOf[RandomRDDPartition]
+RandomRDD.getVectorIterator(split, vectorSize)
+  }
+
+  override protected def getPartitions: Array[Partition] = {
+RandomRDD.getPartitions(size, numSlices, rng, seed)
+  }
+}
+
+private[mllib] object RandomRDD {
+
+  private[mllib] class FixedSizePointIterator(val numElem: Long, val rng: 
DistributionGenerator)
+extends Iterator[Double] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Double = {
+  currentSize += 1
+  rng.nextValue()
+}
+  }
+
+  private[mllib] class FixedSizeVectorIterator(val numElem: Long,
+  val vectorSize: Int,
+  val rng: DistributionGenerator)
+extends Iterator[Vector] {
+
+private var currentSize = 0
+
+override def hasNext: Boolean = currentSize < numElem
+
+override def next(): Vector = {
+  currentSize += 1
+  new DenseVector((0 until vectorSize).map { _ => rng.nextValue() }.toArray)
+}
+  }
+
+  def getPartitions(size: Long,
+  numSlices: Int,
+  rng: DistributionGenerator,
+  seed: Long): Array[Partition] = {
+
+val firstPartitionSize = size / numSlices + size % numSlices
+val otherPartitionSize = size / numSlices
+
+val partitions = new Array[RandomRDDPartition](numSlices)
+var i = 0
+while (i < numSlices) {
+  partitions(i) = if (i == 0) {
+new RandomRDDPartition(i, firstPartitionSize, rng, seed)
+  } else {
+new RandomRDDPartition(i, otherPartitionSize, rng.newInstance(), 
seed)
+  }
+  i += 1
+}
+partitions.asInstanceOf[Array[Partition]]
  

[GitHub] spark pull request: [SPARK-2514] [mllib] Random RDD generator

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1520#discussion_r15215522
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/random/DistributionGenerator.scala 
---
@@ -0,0 +1,105 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.random
+
+import cern.jet.random.Poisson
+import cern.jet.random.engine.DRand
+
+import org.apache.spark.util.random.{XORShiftRandom, Pseudorandom}
+
+/**
+ * Trait for random number generators that generate i.i.d values from a 
distribution.
+ */
+trait DistributionGenerator extends Pseudorandom with Serializable {
+
+  /**
+   * @return An i.i.d sample as a Double from an underlying distribution.
+   */
+  def nextValue(): Double
+
+  /**
+   * @return A copy of the DistributionGenerator with a new instance of 
the rng object used in the
+   * class when applicable. Each partition has a unique seed and 
therefore requires its
+   * own instance of the DistributionGenerator.
+   */
+  def newInstance(): DistributionGenerator
+}
+
+/**
+ * Generates i.i.d. samples from U[0.0, 1.0]
+ */
+class UniformGenerator() extends DistributionGenerator {
+
+  // XORShiftRandom for better performance. Thread safety isn't necessary 
here.
+  private val random = new XORShiftRandom()
+
+  /**
+   * @return An i.i.d sample as a Double from U[0.0, 1.0].
+   */
+  override def nextValue(): Double = {
+random.nextDouble()
+  }
+
+  /** Set random seed. */
+  override def setSeed(seed: Long) = random.setSeed(seed)
+
+  override def newInstance(): UniformGenerator = new UniformGenerator()
+}
+
+/**
+ * Generates i.i.d. samples from the Standard Normal Distribution.
+ */
+class StandardNormalGenerator() extends DistributionGenerator {
+
+  // XORShiftRandom for better performance. Thread safety isn't necessary 
here.
+  private val random = new XORShiftRandom()
--- End diff --

Is it allowed to use a DistributionGenerator before calling setSeed? It 
would seem simpler to disallow that, but it seems to be something it got from 
trait Pseudorandom.
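
One hedged sketch of the stricter behaviour being suggested (illustrative only, not the PR's code): track whether a seed has been set and fail fast otherwise.

```
class StrictUniformGenerator {
  private var random: Option[java.util.Random] = None

  def setSeed(seed: Long): Unit = {
    random = Some(new java.util.Random(seed))
  }

  // throws if the generator is used before setSeed, instead of silently
  // producing values from an implicit default seed
  def nextValue(): Double =
    random.map(_.nextDouble()).getOrElse {
      throw new IllegalStateException("setSeed must be called before nextValue")
    }
}
```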


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1502#discussion_r15215866
  
--- Diff: LICENSE ---
@@ -483,6 +483,24 @@ SUCH DAMAGE.
 
 
 
+For Timsort:
--- End diff --

Say which source file this is in (core/src/main/java/...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1502#discussion_r15215883
  
--- Diff: core/src/main/scala/org/apache/spark/util/collection/Sorter.java 
---
@@ -0,0 +1,915 @@
+/*
--- End diff --

This is a Java file, but you put it in the src/main/scala folder! That 
works with SBT but it's pretty confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1425#discussion_r15216588
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala ---
@@ -18,28 +18,90 @@
 package org.apache.spark.mllib.util
 
 import org.apache.spark.mllib.linalg.Vector
+import org.scalatest.exceptions.TestFailedException
 
 object TestingUtils {
 
+  val defaultEpsilon = 1E-10
+
   implicit class DoubleWithAlmostEquals(val x: Double) {
-// An improved version of AlmostEquals would always divide by the 
larger number.
-// This will avoid the problem of diving by zero.
-def almostEquals(y: Double, epsilon: Double = 1E-10): Boolean = {
-  if(x == y) {
+
+def almostEquals(y: Double, eps: Double = defaultEpsilon): Boolean = {
+  val absX = math.abs(x)
+  val absY = math.abs(y)
+  val diff = math.abs(x - y)
+  if (x == y) {
 true
-  } else if(math.abs(x) > math.abs(y)) {
-math.abs(x - y) / math.abs(x) < epsilon
+  } else if (absX < 1E-15 || absY < 1E-15) {
--- End diff --

See commentary at 
https://issues.apache.org/jira/browse/SPARK-2599?focusedCommentId=14068293&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14068293
I can see the idea here, but all of these kinds of efforts seem to lead to 
errors or unintuitive behavior. For example, this line means that:

Fails:
expected = 1e-15
actual = 2e-15
eps = 0.1

Passes:
expected = 1e-16
actual = 2e-16
eps = 0.1

Why is 1e-15 special anyway?
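
To make the asymmetry concrete, here is a small self-contained reproduction of the two cases above. It assumes the elided branch of the diff falls back to comparing the absolute difference against eps (the body of that branch is not shown in the excerpt), so treat it as a sketch rather than the PR's exact logic:

```
def almostEquals(x: Double, y: Double, eps: Double): Boolean = {
  val absX = math.abs(x)
  val absY = math.abs(y)
  val diff = math.abs(x - y)
  if (x == y) true
  else if (absX < 1e-15 || absY < 1e-15) diff < eps   // assumed absolute branch
  else diff / math.max(absX, absY) < eps               // relative branch
}

almostEquals(1e-15, 2e-15, 0.1)  // false: relative error is 0.5
almostEquals(1e-16, 2e-16, 0.1)  // true: falls into the absolute branch
```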


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2269 Refactor mesos scheduler resourceOf...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1487#issuecomment-49713371
  
QA results for PR 1487:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16946/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...

2014-07-22 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/1524

[CORE] Fixed stage description in stage info page

Stage description should be a `String`, but was changed to an 
`Option[String]` by mistake:


![stage-desc-small](https://cloud.githubusercontent.com/assets/230655/3655611/f6d0b0f6-117b-11e4-83ed-71000dcd5009.png)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark fix-stage-desc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1524.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1524


commit a57f80d67c48243051629630b8ed859fafe027c5
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-07-22T08:09:58Z

Fixed stage description in stage info page




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1524#issuecomment-49713650
  
QA tests have started for PR 1524. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16958/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-22 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49713754
  
Is it possible to support syntax like `0.3 +- 0.1` for absolute error, and 
`0.3 +- 10%` for relative error? Seems like the kind of crazy thing that Scala 
just might support. Maybe it's a nice way to support both semantics; I think 
relative error semantics have to be a separate method anyway.

Also, there is the method `Math.ulp` 
(http://docs.oracle.com/javase/7/docs/api/java/lang/Math.html#ulp(double)) 
which tells you how big the gap is between floating-point values around a given 
value. How about using this to pick a reasonable absolute error around the 
test's expected value? At least it scales automatically.
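
Scala does allow that kind of syntax; here is a minimal sketch (not the project's TestingUtils API) of how `expected +- tolerance` could be expressed with an implicit class:

```
// carries the expected value and an absolute tolerance
case class Tolerant(expected: Double, eps: Double) {
  def matches(actual: Double): Boolean = math.abs(actual - expected) <= eps
}

implicit class DoubleWithTolerance(val x: Double) extends AnyVal {
  def +-(eps: Double): Tolerant = Tolerant(x, eps)
}

(0.3 +- 0.1).matches(0.25)   // true: absolute-error semantics
(0.3 +- 0.01).matches(0.25)  // false: outside the tolerance
```

A second overload of `+-` taking a percentage-like wrapper could cover the relative-error form in the same style.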


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...

2014-07-22 Thread li-zhihui
GitHub user li-zhihui opened a pull request:

https://github.com/apache/spark/pull/1525

Fix race condition at SchedulerBackend.isReady in standalone mode

In SPARK-1946 (PR #900), the configuration 
`spark.scheduler.minRegisteredExecutorsRatio` was introduced. 
However, in standalone mode, there is a race condition where isReady() can 
return true because totalExpectedExecutors has not been correctly set.

Because the number of expected executors is uncertain in standalone mode, this PR tries to 
use CPU cores (`--total-executor-cores`) as the expected resources to 
judge whether the SchedulerBackend is ready.
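
A hedged sketch of the idea (the names are illustrative, not the actual scheduler-backend fields): compare the cores registered so far against the cores requested with `--total-executor-cores`, scaled by the configured minimum ratio.

```
// illustrative names only
def isReady(totalRegisteredCores: Int,
            totalExpectedCores: Int,
            minRegisteredRatio: Double): Boolean = {
  totalExpectedCores > 0 &&
    totalRegisteredCores >= totalExpectedCores * minRegisteredRatio
}
```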

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/li-zhihui/spark fixre4s

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1525


commit 8b54316c77d086ea3454419ebba92003707bbd76
Author: li-zhihui zhihui...@intel.com
Date:   2014-07-22T08:15:40Z

Fix race condition at SchedulerBackend.isReady in standalone mode




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49714576
  
QA results for PR 1521:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16947/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...

2014-07-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1525#issuecomment-49714817
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix race condition at SchedulerBackend.isReady...

2014-07-22 Thread li-zhihui
Github user li-zhihui commented on the pull request:

https://github.com/apache/spark/pull/1525#issuecomment-49714878
  
@kayousterhout @tgravescs


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-49715371
  
QA tests have started for PR 1399. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16959/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2615] [SQL] Add Equal Sign == Support...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1522#issuecomment-49715697
  
QA results for PR 1522:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16949/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: (WIP) SPARK-2045 Sort-based shuffle

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1499#issuecomment-49715947
  
QA results for PR 1499:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16948/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1416: PySpark support for SequenceFile a...

2014-07-22 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/455#issuecomment-49716773
  
@ericgarcia hey that's cool. Would you mind creating a pull request to add 
that to the Spark examples? It would be very useful to others to see how to 
make a fully generic Avro converter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...

2014-07-22 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/1526

[SPARK-2617] Correct doc and usages of preservesPartitioning

The name `preservesPartitioning` is ambiguous: 1) preserves the indices of 
partitions, 2) preserves the partitioner. The latter is correct and 
`preservesPartitioning` should really be called `preservesPartitioner` to avoid 
confusion. Unfortunately, this is already part of the API and we cannot change it. 
We should be clear in the docs and fix the wrong usages.

This PR

1. adds notes in `mapPartitions*` (see the sketch after this list),
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys 
of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.
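
A short illustration of what the flag means in practice (a sketch assuming an existing SparkContext `sc`, not code from this PR):

```
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val pairs = sc.parallelize(1 to 100).map(i => (i, i)).partitionBy(new HashPartitioner(4))

// Keys are untouched, so keeping the partitioner is safe:
val doubled = pairs.mapPartitions(_.map { case (k, v) => (k, v * 2) }, preservesPartitioning = true)
// doubled.partitioner == Some(HashPartitioner(4))

// Keys change, so the partitioner must not be preserved:
val rekeyed = pairs.map { case (k, v) => (k + 1, v) }
// rekeyed.partitioner == None
```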

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark preserve-partitioner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1526.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1526


commit d1caa65c7440cb92aed40dcab4b95c03389350c4
Author: Xiangrui Meng m...@databricks.com
Date:   2014-07-22T09:15:44Z

add doc to explain preservesPartitioning
fix wrong usage of preservesPartitioning
make sample preserse partitioning




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1730. Make receiver store data reliably ...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1195#issuecomment-49717415
  
QA results for PR 1195:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16954/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1526#issuecomment-49717533
  
QA tests have started for PR 1526. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS

2014-07-22 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49717664
  
Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS

2014-07-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1521


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2047: Introduce an in-mem Sorter, and us...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1502#issuecomment-49717790
  
QA results for PR 1502:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  * This trait extends Any to ensure it is universal (and thus compiled to a Java interface).
  class KVArraySortDataFormat[K, T <: AnyRef : ClassTag] extends SortDataFormat[K, Array[T]] {
  class Sorter<K, Buffer> {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16953/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2538] [PySpark] Hash based disk spillin...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1460#issuecomment-49718092
  
QA results for PR 1460:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class AutoSerializer(FramedSerializer):
  class Aggregator(object):
  class Merger(object):
  class InMemoryMerger(Merger):
  class ExternalMerger(Merger):

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16957/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1022] Kafka unit test that actually sen...

2014-07-22 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/557#issuecomment-49718360
  
Hi TD, I've fixed this flaky issue in my local test. How should I submit my 
patch: should I push to your branch or create a new PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Graphx example

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1523#issuecomment-49718923
  
QA results for PR 1523:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16956/consoleFull




[GitHub] spark pull request: [SPARK-2612] [mllib] Fix data skew in ALS

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1521#issuecomment-49719044
  
QA results for PR 1521:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16955/consoleFull




[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1399#issuecomment-49721444
  
QA results for PR 1399:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
  class SparkSQLOperationManager(hiveContext: HiveContext) extends OperationManager with Logging {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16959/consoleFull




[GitHub] spark pull request: SPARK-2387: remove stage barrier

2014-07-22 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/1328#discussion_r15220164
  
--- Diff: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ---
@@ -340,6 +459,7 @@ private[spark] class MapOutputTrackerMaster(conf: SparkConf)
  */
 private[spark] class MapOutputTrackerWorker(conf: SparkConf) extends MapOutputTracker(conf) {
   protected val mapStatuses = new HashMap[Int, Array[MapStatus]]
+    with mutable.SynchronizedMap[Int, Array[MapStatus]]
--- End diff --

I think `ConcurrentHashMap` is better in most cases.
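
For context, the two options being compared are Scala's `SynchronizedMap` mixin, which guards every operation with a single lock on the map, and `java.util.concurrent.ConcurrentHashMap`, which uses finer-grained synchronization so concurrent readers do not block each other. A small illustrative Scala sketch follows, using the Scala 2.10-era `SynchronizedMap` trait and a placeholder element type instead of `MapStatus` so it compiles on its own:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.mutable

object MapStores {
  // Mixin-based synchronization (Scala 2.10 era): every call takes the same lock.
  val synchronizedStyle: mutable.Map[Int, Array[String]] =
    new mutable.HashMap[Int, Array[String]] with mutable.SynchronizedMap[Int, Array[String]]

  // ConcurrentHashMap: finer-grained synchronization, so readers do not block
  // each other and writers to different bins rarely contend.
  val concurrentStyle = new ConcurrentHashMap[Int, Array[String]]()

  def main(args: Array[String]): Unit = {
    synchronizedStyle(1) = Array("status-for-shuffle-1")
    concurrentStyle.put(1, Array("status-for-shuffle-1"))
    concurrentStyle.putIfAbsent(2, Array("status-for-shuffle-2"))
    println(synchronizedStyle.keySet)
    println(concurrentStyle.keySet())
  }
}
```

`ConcurrentHashMap` also avoids taking a global lock on reads, which matters for a structure that many tasks query at once.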




[GitHub] spark pull request: [CORE] Fixed stage description in stage info p...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1524#issuecomment-49722653
  
QA results for PR 1524:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16958/consoleFull




[GitHub] spark pull request: [SPARK-2617] Correct doc and usages of preserv...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1526#issuecomment-49724616
  
QA results for PR 1526:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16960/consoleFull




[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726353
  
@pwendell Sure, and I'll do this in the future.




[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726591
  
@marmbrus I've made the changes you suggested and moved the rule to the Resolution batch, but now the newly inserted attribute references remain unresolved after analysis. I'm going to try to figure out where things are going wrong.




[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

2014-07-22 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1399#discussion_r15222777
  
--- Diff: bin/beeline ---
@@ -0,0 +1,45 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Figure out where Spark is installed
+FWDIR=$(cd `dirname $0`/..; pwd)
+
+# Find the java binary
+if [ -n ${JAVA_HOME} ]; then
--- End diff --

Hit a bug while fixing this: application options whose names `SparkSubmitArguments` recognizes are stolen by `SparkSubmit` instead of being passed to the application. For example, when running `BeeLine` with `spark-submit`, passing the `--help` option shows the usage message of `SparkSubmit` rather than `BeeLine`.

Since the `spark-internal` issue also touches this part of the code, I tried to fix this bug within this PR to avoid further conflicts.
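
To make the symptom concrete: the launcher consumes any flag it recognizes before deciding what belongs to the application. One common way to avoid that is to stop interpreting flags as soon as the primary resource has been seen and pass everything after it through untouched. Below is a minimal Scala sketch of that splitting rule; the names and the flag set are illustrative and do not reflect how `SparkSubmitArguments` is actually implemented.

```scala
// A toy splitter: launcher flags are only interpreted up to the primary
// resource; everything after it belongs to the application verbatim.
object ArgSplitter {
  // Flags the launcher itself understands (illustrative subset).
  private val launcherFlags = Set("--master", "--class", "--name")

  /** Returns (launcherArgs, primaryResource, applicationArgs). */
  def split(args: Seq[String]): (Seq[String], Option[String], Seq[String]) = {
    var i = 0
    val launcher = Seq.newBuilder[String]
    while (i < args.length) {
      val a = args(i)
      if (launcherFlags.contains(a) && i + 1 < args.length) {
        launcher += a += args(i + 1)
        i += 2
      } else {
        // First non-flag token is the primary resource; the rest goes to the app.
        return (launcher.result(), Some(a), args.drop(i + 1))
      }
    }
    (launcher.result(), None, Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    val cmd = Seq("--master", "yarn", "--class", "org.apache.hive.beeline.BeeLine",
                  "spark-internal", "--help")
    println(split(cmd))
    // (List(--master, yarn, --class, org.apache.hive.beeline.BeeLine), Some(spark-internal), List(--help))
  }
}
```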




[GitHub] spark pull request: [Build] SPARK-2614: Add the spark-examples-xxx...

2014-07-22 Thread tzolov
GitHub user tzolov opened a pull request:

https://github.com/apache/spark/pull/1527

[Build] SPARK-2614: Add the spark-examples-xxx-.jar to the generated Debian
package.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tzolov/spark SPARK-2614

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1527


commit 32ef6ddc8043b000f90e802b9d8db0a244887d96
Author: tzolov christian.tzo...@gmail.com
Date:   2014-07-22T12:18:02Z

[SPARK-2614] Add the spark-examples-xxx-.jar to the Debian package






[GitHub] spark pull request: [Build] SPARK-2614: Add the spark-examples-xxx...

2014-07-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1527#issuecomment-49731850
  
Can one of the admins verify this patch?




[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...

2014-07-22 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/1094#issuecomment-49733427
  
Jenkins, test this please 




[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1094#issuecomment-49733823
  
QA tests have started for PR 1094. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16962/consoleFull




[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-49733822
  
QA tests have started for PR 1269. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16961/consoleFull




[GitHub] spark pull request: SPARK-2150: Provide direct link to finished ap...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1094#issuecomment-49733906
  
QA results for PR 1094:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16962/consoleFull



