[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1632#issuecomment-52279573
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18600/consoleFull) for PR 1632 at commit [`66cfff7`](https://github.com/apache/spark/commit/66cfff765f44aea7ff2bf5afe1a29403e79b7951).
  * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-14 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/1632#discussion_r16282025
  
--- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
@@ -836,9 +845,14 @@ private[spark] class ConnectionManager(
   def sendMessageReliably(connectionManagerId: ConnectionManagerId, message: Message)
       : Future[Message] = {
     val promise = Promise[Message]()
+
+    val ackTimeoutMonitor = new Timer(s"Ack Timeout Monitor-" +
--- End diff --

I've modified for ConnectionManager-wide timer.
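For illustration, a minimal sketch of the shared-timer pattern being described, assuming a generic `Message` type and a pending `Promise[Message]` as in the diff above (a sketch, not the actual patch):

```
import java.io.IOException
import java.util.{Timer, TimerTask}
import scala.concurrent.Promise

class AckTimeouts[Message] {
  // One daemon Timer shared by the whole ConnectionManager,
  // rather than one Timer per message.
  private val ackTimeoutMonitor = new Timer("AckTimeoutMonitor", true)

  def scheduleTimeout(promise: Promise[Message], timeoutMs: Long): Unit = {
    ackTimeoutMonitor.schedule(new TimerTask {
      override def run(): Unit = {
        // Fails the promise only if the ack has not already completed it.
        promise.tryFailure(new IOException(s"Ack not received within $timeoutMs ms"))
      }
    }, timeoutMs)
  }
}
```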





[GitHub] spark pull request: [SPARK-3063][SQL] ExistingRdd should convert M...

2014-08-14 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/1963

[SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map.

Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` values.
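For illustration, a minimal sketch of the kind of recursive conversion involved, with the `Map` case this PR adds (a sketch, not the actual Catalyst code):

```
// Recursively convert Scala values to their Catalyst representation.
def convertToCatalyst(a: Any): Any = a match {
  case s: Seq[_]    => s.map(convertToCatalyst)
  case m: Map[_, _] => m.map { case (k, v) => convertToCatalyst(k) -> convertToCatalyst(v) }
  case p: Product   => p.productIterator.map(convertToCatalyst).toSeq
  case other        => other
}
```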

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-3063

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1963


commit d8a900ad1408ea3662d2e43b2df24db485cd28e5
Author: Takuya UESHIN 
Date:   2014-08-15T06:48:52Z

Make ExistingRdd.convertToCatalyst be able to convert Map value.







[GitHub] spark pull request: [SPARK-2970] [SQL] spark-sql script ends with ...

2014-08-14 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1891#issuecomment-52278697
  
Sorry, I didn't realize that `ShutdownHookManager` is only available in Hadoop 2,
so compilation fails when building Spark with Hadoop 1. Filed
[SPARK-3062](https://issues.apache.org/jira/browse/SPARK-3062) to track this issue.

@marmbrus Maybe we should revert this PR first, to be safe.
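For context, one common way to keep such a call compatible with both Hadoop versions is to go through reflection; a hedged sketch (illustrative helper and priority, not the actual fix):

```
// Use Hadoop 2's ShutdownHookManager when it is on the classpath,
// otherwise fall back to a plain JVM shutdown hook (all Hadoop 1 offers).
def addShutdownHook(hook: Runnable): Unit = {
  try {
    val clazz = Class.forName("org.apache.hadoop.util.ShutdownHookManager")
    val manager = clazz.getMethod("get").invoke(null)
    clazz.getMethod("addShutdownHook", classOf[Runnable], classOf[Int])
      .invoke(manager, hook, Int.box(50)) // 50 is an arbitrary illustrative priority
  } catch {
    case _: ClassNotFoundException =>
      Runtime.getRuntime.addShutdownHook(new Thread(hook))
  }
}
```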





[GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...

2014-08-14 Thread GrahamDennis
Github user GrahamDennis commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52278145
  
I've convinced myself that this PR fixes SPARK-2878, and think it should be 
merged.  Thanks @rxin!





[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

2014-08-14 Thread chouqin
Github user chouqin commented on the pull request:

https://github.com/apache/spark/pull/1941#issuecomment-52277965
  
@mengxr @jkbradley never mind, I will help you review #1950 :)





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-14 Thread harishreedharan
Github user harishreedharan commented on a diff in the pull request:

https://github.com/apache/spark/pull/1958#discussion_r16281516
  
--- Diff: external/flume-sink/src/test/scala/org/apache/spark/streaming/flume/sink/SparkSinkSuite.scala ---
@@ -0,0 +1,208 @@
+package org.apache.spark.streaming.flume.sink
+
+import java.net.InetSocketAddress
+import java.util.concurrent.atomic.AtomicInteger
+import java.util.concurrent.{CountDownLatch, Executors}
+
+import scala.collection.JavaConversions._
+import scala.concurrent.{Promise, Future}
+import scala.util.{Failure, Success, Try}
+
+import com.google.common.util.concurrent.ThreadFactoryBuilder
+import org.apache.avro.ipc.NettyTransceiver
+import org.apache.avro.ipc.specific.SpecificRequestor
+import org.apache.flume.Context
+import org.apache.flume.channel.MemoryChannel
+import org.apache.flume.event.EventBuilder
+import org.apache.spark.streaming.TestSuiteBase
+import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory
+
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
--- End diff --

Thanks! Done.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-5222
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18599/consoleFull) for PR 1958 at commit [`f2c56c9`](https://github.com/apache/spark/commit/f2c56c976bc6faa83b8357c80caad1f4839eb06d).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281414
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -728,8 +659,10 @@ object DecisionTree extends Serializable with Logging {
   arr
 }
 
- // Find feature bins for all nodes at a level.
+// Find feature bins for all nodes at a level.
+timer.start("findBinsForLevel")
 val binMappedRDD = input.map(x => findBinsForLevel(x))
+timer.stop("findBinsForLevel")
--- End diff --

This timer may also be useless: the `map` below is lazy, so nothing is actually measured here.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281396
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Logging {
    */
   def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
 
+    val timer = new TimeTracker()
+
+    timer.start("total")
+
     // Cache input RDD for speedup during multiple passes.
-    val retaggedInput = input.retag(classOf[LabeledPoint]).cache()
+    timer.start("init")
+    val retaggedInput = input.retag(classOf[LabeledPoint])
     logDebug("algo = " + strategy.algo)
+    timer.stop("init")
 
     // Find the splits and the corresponding bins (interval between the splits) using a sample
     // of the input data.
+    timer.start("findSplitsBins")
     val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, strategy)
     val numBins = bins(0).length
+    timer.stop("findSplitsBins")
     logDebug("numBins = " + numBins)
 
+    timer.start("init")
+    val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache()
+    timer.stop("init")
--- End diff --

I think the timer for `convertToTreeRDD` may not be useful, since the `map` 
from `retaggedInput` to `treeInput` is evaluated lazily.
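A self-contained sketch of the point being made: transformations are lazy, so a timer around `map` records almost nothing, and the cost only shows up at the first action (`count()` here is illustrative):

```
import org.apache.spark.{SparkConf, SparkContext}

object LazyTimingDemo extends App {
  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("lazy-timing"))
  val input = sc.parallelize(1 to 1000000)

  var t0 = System.nanoTime()
  val mapped = input.map(_ * 2) // lazy: only records the transformation
  println(s"map() took ${(System.nanoTime() - t0) / 1e6} ms")   // ~0 ms

  t0 = System.nanoTime()
  mapped.count()                // action: the map actually runs here
  println(s"count() took ${(System.nanoTime() - t0) / 1e6} ms") // the real cost
  sc.stop()
}
```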





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281333
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging {
      *         bin index for this labeledPoint
      *         (or InvalidBinIndex if labeledPoint is not handled by this node)
      */
-    def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+    def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
       // Calculate bin index and label per feature per node.
       val arr = new Array[Double](1 + (numFeatures * numNodes))
       // First element of the array is the label of the instance.
-      arr(0) = labeledPoint.label
+      arr(0) = treePoint.label
       // Iterate over nodes.
       var nodeIndex = 0
       while (nodeIndex < numNodes) {
         val parentFilters = findParentFilters(nodeIndex)
         // Find out whether the sample qualifies for the particular node.
-        val sampleValid = isSampleValid(parentFilters, labeledPoint)
+        val sampleValid = isSampleValid(parentFilters, treePoint)
         val shift = 1 + numFeatures * nodeIndex
         if (!sampleValid) {
           // Mark one bin as -1 is sufficient.
           arr(shift) = InvalidBinIndex
         } else {
           var featureIndex = 0
+          // TODO: Vectorize this
           while (featureIndex < numFeatures) {
-            val featureInfo = strategy.categoricalFeaturesInfo.get(featureIndex)
-            val isFeatureContinuous = featureInfo.isEmpty
-            if (isFeatureContinuous) {
-              arr(shift + featureIndex)
-                = findBin(featureIndex, labeledPoint, isFeatureContinuous, false)
-            } else {
-              val featureCategories = featureInfo.get
-              val isSpaceSufficientForAllCategoricalSplits
-                = numBins > math.pow(2, featureCategories.toInt - 1) - 1
-              arr(shift + featureIndex)
-                = findBin(featureIndex, labeledPoint, isFeatureContinuous,
-                  isSpaceSufficientForAllCategoricalSplits)
-            }
+            arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

Since the features array is the same for all nodes on which this labeled point
is valid, is it really necessary for every node to have a copy of it?

In #1941, I have changed the `arr` structure; does this memory saving help?
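For reference, a sketch of the layout under discussion (illustrative sizes and values): the binned features are written once per node, even though they are identical for every node that accepts the point.

```
val numFeatures = 3
val numNodes = 4
val binnedFeatures = Array(7.0, 2.0, 5.0) // identical for every valid node

// arr(0) = label; arr(1 + numFeatures * n + f) = bin of feature f for node n
val arr = new Array[Double](1 + numFeatures * numNodes)
arr(0) = 1.0 // label
for (n <- 0 until numNodes; f <- 0 until numFeatures) {
  arr(1 + numFeatures * n + f) = binnedFeatures(f) // numNodes copies of the same data
}
```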





[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1912#issuecomment-52277037
  
I have added Broadcast.unpersist(blocking=False).

Because we keep a copy on disk, we can read it from there when the user wants
to access it in the driver, and keep SparkContext.broadcast() unchanged.





[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1845#issuecomment-52276624
  
Hey @andrewor14 it's great to see this being fixed. I think things can be
simplified a lot by deferring to bash's `eval` to deal with the argument
parsing. It's worth seeing if that handles all the cases here.





[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1912#issuecomment-52276573
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18598/consoleFull) for PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee).
  * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...

2014-08-14 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1845#discussion_r16281080
  
--- Diff: bin/utils.sh ---
@@ -0,0 +1,108 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# * ---------------------------------------------------- *
+# |  Utility functions for launching Spark applications  |
+# * ---------------------------------------------------- *
+
+# Parse the value of a config from a java properties file according to the
+# specifications in
+# http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#load(java.io.Reader).
+# This accepts the name of the config as an argument, and expects the path
+# of the property file to be found in PROPERTIES_FILE. The value is returned
+# through JAVA_PROPERTY_VALUE.
+parse_java_property() {
+  JAVA_PROPERTY_VALUE=""  # return value
+  config_buffer=""        # buffer for collecting parts of a config value
+  multi_line=0            # whether this config is spanning multiple lines
+  while read -r line; do
+    # Strip leading and trailing whitespace
+    line=$(echo "$line" | sed "s/^[[:space:]]\(.*\)[[:space:]]*$/\1/")
+    contains_config=$(echo "$line" | grep -e "^$1")
+    if [[ -n "$contains_config" || "$multi_line" == 1 ]]; then
+      has_more_lines=$(echo "$line" | grep -e "\\\\$")
+      if [[ -n "$has_more_lines" ]]; then
+        # Strip trailing backslash
+        line=$(echo "$line" | sed "s/\\\\$//")
+        config_buffer="$config_buffer $line"
+        multi_line=1
+      else
+        JAVA_PROPERTY_VALUE="$config_buffer $line"
+        break
+      fi
+    fi
+  done < "$PROPERTIES_FILE"
+
+  # Actually extract the value of the config
+  JAVA_PROPERTY_VALUE=$( \
+    echo "$JAVA_PROPERTY_VALUE" | \
+    sed "s/$1//" | \
+    sed "s/^[[:space:]]*[:=]\{0,1\}//" | \
+    sed "s/^[[:space:]]*\(.*\)[[:space:]]*$/\1/g" \
+  )
+  export JAVA_PROPERTY_VALUE
+}
+
+# Properly split java options, dealing with whitespace, double quotes and
+# backslashes. This accepts a string and returns the resulting list through
+# SPLIT_JAVA_OPTS.
+split_java_options() {
+  SPLIT_JAVA_OPTS=()  # return value
--- End diff --

I think this could be solved using an `eval` statement much more easily. At
first I thought there were some security concerns, but I don't think this
allows people to do anything they couldn't already do:
```
test="One \"This \\\"is two\" Three"
eval "set -- $test"
for i in "$@"
do
  echo $i
done
```





[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...

2014-08-14 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1845#discussion_r16280819
  
--- Diff: bin/spark-class ---
@@ -101,11 +106,16 @@ fi
 # Set JAVA_OPTS to be able to load native libraries and to set heap size
 JAVA_OPTS="-XX:MaxPermSize=128m $OUR_JAVA_OPTS"
 JAVA_OPTS="$JAVA_OPTS -Xms$OUR_JAVA_MEM -Xmx$OUR_JAVA_MEM"
+
 # Load extra JAVA_OPTS from conf/java-opts, if it exists
 if [ -e "$FWDIR/conf/java-opts" ] ; then
   JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`"
 fi
-export JAVA_OPTS
+
+# Split JAVA_OPTS properly to handle whitespace, double quotes and 
backslashes
+# This exports the split java options into SPLIT_JAVA_OPTS
+split_java_options "$JAVA_OPTS"
--- End diff --

This actually solves a more general problem than those reported in 
SPARK-2849 and SPARK-2914... the problem/feature is that in general we don't 
support quotes in any of the java option strings we have. I tested this in 
master and confirmed it doesn't work:

```
$ export SPARK_JAVA_OPTS="-Dfoo=\"bar baz\""
$ ./bin/spark-shell
Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
Error: Could not find or load main class baz"
```
So it might be good to create another JIRA as well for this PR. One that 
just explains that none of the JAVA_OPTS variants we have correctly support 
quoted strings.





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52275457
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull) for PR 1960 at commit [`3debe7c`](https://github.com/apache/spark/commit/3debe7c246b58345d0495b52f70bdd0be1b4f5e3).
  * This patch merges cleanly.





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52275324
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-3040] pick up a more proper local ip ad...

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1946#issuecomment-52275310
  
I have seen cases where a bad interface was chosen, so this does seem like a
good idea. But for Windows, does this mean that the wrong interface is chosen?
Since we support Windows, should we add a check here to confirm we are on a
Unix-like system?
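For illustration, a hedged sketch of the interface-scanning approach under review (illustrative function, not the exact patch):

```
import java.net.{Inet4Address, InetAddress, NetworkInterface}
import scala.collection.JavaConversions._

// Prefer a non-loopback IPv4 address when InetAddress.getLocalHost
// resolves to loopback.
def findLocalIpAddress(): String = {
  val default = InetAddress.getLocalHost
  if (!default.isLoopbackAddress) return default.getHostAddress
  val candidates = for {
    ni <- NetworkInterface.getNetworkInterfaces.toSeq
    addr <- ni.getInetAddresses.toSeq
    if !addr.isLoopbackAddress && addr.isInstanceOf[Inet4Address]
  } yield addr.getHostAddress
  candidates.headOption.getOrElse(default.getHostAddress)
}
```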





[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...

2014-08-14 Thread ngbinh
Github user ngbinh closed the pull request at:

https://github.com/apache/spark/pull/1302





[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...

2014-08-14 Thread ngbinh
Github user ngbinh commented on the pull request:

https://github.com/apache/spark/pull/1302#issuecomment-52275260
  
thanks. closing





[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1535





[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1302#issuecomment-52275218
  
This was fixed in 3a8b68b7 so I think we can close this issue.





[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-52275155
  
Hey, this looks good. Merging it now into master. Sorry about the delay.





[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...

2014-08-14 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1845#discussion_r16280454
  
--- Diff: bin/utils.sh ---
@@ -0,0 +1,108 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# * ---------------------------------------------------- *
+# |  Utility functions for launching Spark applications  |
+# * ---------------------------------------------------- *
+
+# Parse the value of a config from a java properties file according to the
+# specifications in
+# http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#load(java.io.Reader).
+# This accepts the name of the config as an argument, and expects the path
+# of the property file to be found in PROPERTIES_FILE. The value is returned
+# through JAVA_PROPERTY_VALUE.
+parse_java_property() {
+  JAVA_PROPERTY_VALUE=""  # return value
+  config_buffer=""        # buffer for collecting parts of a config value
+  multi_line=0            # whether this config is spanning multiple lines
+  while read -r line; do
+    # Strip leading and trailing whitespace
+    line=$(echo "$line" | sed "s/^[[:space:]]\(.*\)[[:space:]]*$/\1/")
+    contains_config=$(echo "$line" | grep -e "^$1")
+    if [[ -n "$contains_config" || "$multi_line" == 1 ]]; then
+      has_more_lines=$(echo "$line" | grep -e "\\\\$")
+      if [[ -n "$has_more_lines" ]]; then
--- End diff --

We discussed this already today, but I'd prefer not to deal with multiline 
statements at all here to keep this simpler.
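For comparison, `java.util.Properties` already implements the full spec, line continuations included, so multiline handling could be left to the JVM side entirely; a small sketch (illustrative property values):

```
import java.io.StringReader
import java.util.Properties

val props = new Properties()
props.load(new StringReader(
  """spark.executor.extraJavaOptions=-XX:+PrintGC \
    |  -XX:+PrintGCDetails
    |spark.master=local[2]
    |""".stripMargin))
// The trailing backslash joins the two lines, discarding leading whitespace:
println(props.getProperty("spark.executor.extraJavaOptions"))
// prints: -XX:+PrintGC -XX:+PrintGCDetails
```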





[GitHub] spark pull request: SPARK-2955 [BUILD] Test code fails to compile ...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1879





[GitHub] spark pull request: [SPARK-2917] [SQL] Avoid table creation in log...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1846#issuecomment-52274987
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18596/consoleFull) for PR 1846 at commit [`5be9554`](https://github.com/apache/spark/commit/5be9554254312a78a94dd023494478a7db5d3243).
  * This patch merges cleanly.





[GitHub] spark pull request: SPARK-2955 [BUILD] Test code fails to compile ...

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1879#issuecomment-52274970
  
Ah, seems like a good change. Thanks Sean, I merged this.





[GitHub] spark pull request: [SPARK-2912] [Spark QA] Include commit hash in...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1816





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-14 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/1958#discussion_r16280387
  
--- Diff: external/flume-sink/src/test/scala/org/apache/spark/streaming/flume/sink/SparkSinkSuite.scala ---
@@ -0,0 +1,208 @@
+package org.apache.spark.streaming.flume.sink
+
+import java.net.InetSocketAddress
+import java.util.concurrent.atomic.AtomicInteger
+import java.util.concurrent.{CountDownLatch, Executors}
+
+import scala.collection.JavaConversions._
+import scala.concurrent.{Promise, Future}
+import scala.util.{Failure, Success, Try}
+
+import com.google.common.util.concurrent.ThreadFactoryBuilder
+import org.apache.avro.ipc.NettyTransceiver
+import org.apache.avro.ipc.specific.SpecificRequestor
+import org.apache.flume.Context
+import org.apache.flume.channel.MemoryChannel
+import org.apache.flume.event.EventBuilder
+import org.apache.spark.streaming.TestSuiteBase
+import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory
+
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
--- End diff --

Hey, Hari, the ASF header should be at the top of the file :).





[GitHub] spark pull request: [SPARK-2917] [SQL] Avoid table creation in log...

2014-08-14 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1846#issuecomment-52274734
  
test this please





[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...

2014-08-14 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1928#issuecomment-52274671
  
@davies that's true. One thing I like about Python lists is that they get
mapped to Java Lists (symmetry). However, it's not an array any more.





[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...

2014-08-14 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1928#issuecomment-52274398
  
@kanzhang I think either way is fine; a tuple is like a read-only list. An RDD
is read-only, so there is no difference between having a list or a tuple in an RDD.





[GitHub] spark pull request: [SPARK-2918] [SQL] [WIP] Support the extended ...

2014-08-14 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1847#issuecomment-52274317
  
@marmbrus I've created another PR, #1962, which only provides the `extended`
support (but doesn't support `explain CTAS`); I hope we can merge that first.
I am not sure we are still able to merge `explain CTAS` into the Spark 1.1
release, as there is another dependency failing in the unit tests.





[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...

2014-08-14 Thread kanzhang
Github user kanzhang commented on the pull request:

https://github.com/apache/spark/pull/1928#issuecomment-52274255
  
Now that I think about it, our ```WritableToJavaConverter``` should probably
convert arrays to a Java collection (which gets pickled to a Python list),
instead of ```Object[]``` (which gets pickled to a Python tuple), as it is now.
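A hedged sketch of that change (not the converter's actual code): emitting a `java.util.List` makes Pyrolite pickle the value as a Python list rather than a tuple.

```
import java.util.{ArrayList => JArrayList}

def arrayToJavaList(arr: Array[Any]): JArrayList[Any] = {
  val list = new JArrayList[Any](arr.length)
  arr.foreach(x => list.add(x)) // a java.util.List pickles to a Python list
  list
}
```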





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52274276
  
QA tests have started for PR 1961. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18595/consoleFull





[GitHub] spark pull request: [SPARK-3058] [SQL] Support EXTENDED for EXPLAI...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1962#issuecomment-52274263
  
QA tests have started for PR 1962. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18594/consoleFull





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52274194
  
LGTM





[GitHub] spark pull request: [SPARK-3058] [SQL] Support EXTENDED for EXPLAI...

2014-08-14 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/1962

[SPARK-3058] [SQL] Support EXTENDED for EXPLAIN

Provide `extended` keyword support for the `explain` command in SQL, e.g.
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
 Filter ('key = 1)
  UnresolvedRelation None, src, None

== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
  MetastoreRelation default, src, None

== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  MetastoreRelation default, src, None

== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
 Filter (CAST(key#8, DoubleType) = 1.0)
  HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), 
None

Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
  MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
  MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
  MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
  MappedRDD[10] at map at TableReader.scala:240
  HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```

It's a subtask of #1847, but it can go in without any dependency.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark explain_extended

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1962.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1962


commit 48bc989e3740c5700f29aef153b0c4bb279c52fc
Author: Cheng Hao 
Date:   2014-08-08T02:45:02Z

Support EXTENDED for EXPLAIN







[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52274079
  
Oops, fixed.





[GitHub] spark pull request: [SPARK-3048][MLLIB] add LabeledPoint.parse and...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1952#issuecomment-52273898
  
QA tests have started for PR 1952. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18593/consoleFull





[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1912#issuecomment-52273858
  
It occurs to me: what if we had .value retrieve and depickle the value from 
the JVM?  Also, won't we still experience memory leaks in the JVM if we 
iteratively create broadcast variables, since we will never clean up those 
pickled values?

One approach is to have .value() depickle the JVM value (so we're not 
changing the user-facing API) and add a Python equivalent of 
Broadcast.destroy() for performing permanent cleanup of a broadcast's 
resources.  What do you think of this approach?





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52273690
  
Oops compilation failed.





[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1912#discussion_r16279926
  
--- Diff: python/pyspark/broadcast.py ---
@@ -52,17 +47,31 @@ class Broadcast(object):
 Access its value through C{.value}.
 """
 
-    def __init__(self, bid, value, java_broadcast=None, pickle_registry=None):
+    def __init__(self, bid, value, java_broadcast=None, pickle_registry=None, keep=True):
         """
         Should not be called directly by users -- use
         L{SparkContext.broadcast()}
         instead.
         """
-        self.value = value
         self.bid = bid
+        if keep:
+            self.value = value
         self._jbroadcast = java_broadcast
         self._pickle_registry = pickle_registry
+        self.keep = keep
 
     def __reduce__(self):
         self._pickle_registry.add(self)
         return (_from_id, (self.bid, ))
+
+    def __getattr__(self, item):
+        if item == 'value' and not self.keep:
+            raise Exception("please create broadcast with keep=True to make"
+                            " it accessable in driver")
--- End diff --

Typo: this should be spelled "accessible." Also, maybe the error message could
be a little clearer about how broadcast variables are created and why the call
failed. I'm thinking of something like "please call sc.broadcast() with
keep=True to make values accessible in the driver".





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52273493
  
QA results for PR 1961:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18591/consoleFull





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52273513
  
QA tests have started for PR 1960. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52273441
  
Jenkins, retest this please.





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52273425
  
LGTM pending Jenkins.





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52273376
  
QA tests have started for PR 1961. This patch merges cleanly. View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18591/consoleFull





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279842
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -41,6 +43,9 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext {
       prediction != expected.label
     }
     val accuracy = (input.length - numOffPredictions).toDouble / input.length
+    if (accuracy < requiredAccuracy) {
+      println(s"validateClassifier calculated accuracy $accuracy but required $requiredAccuracy.")
--- End diff --

put the error message as the second argument of `assert` below
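That is, something along these lines (a sketch using the names from the diff above):

```
assert(accuracy >= requiredAccuracy,
  s"validateClassifier calculated accuracy $accuracy but required $requiredAccuracy.")
```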





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279841
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -17,6 +17,8 @@
 
 package org.apache.spark.mllib.tree
 
+import org.apache.spark.mllib.tree.impl.TreePoint
--- End diff --

organize imports





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279835
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.model.Bin
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * Internal representation of LabeledPoint for DecisionTree.
+ * This bins feature values based on a subsampled of data as follows:
+ *  (a) Continuous features are binned into ranges.
+ *  (b) Unordered categorical features are binned based on subsets of 
feature values.
+ *  "Unordered categorical features" are categorical features with low 
arity used in
+ *  multiclass classification.
+ *  (c) Ordered categorical features are binned based on feature values.
+ *  "Ordered categorical features" are categorical features with high 
arity,
+ *  or any categorical feature used in regression or binary 
classification.
+ *
+ * @param label  Label from LabeledPoint
+ * @param features  Binned feature values.
+ *  Same length as LabeledPoint.features, but values are 
bin indices.
+ */
+private[tree] class TreePoint(val label: Double, val features: Array[Int]) 
extends Serializable {
+}
+
+
+private[tree] object TreePoint {
+
+  /**
+   * Convert an input dataset into its TreePoint representation,
+   * binning feature values in preparation for DecisionTree training.
+   * @param input Input dataset.
+   * @param strategy  DecisionTree training info, used for dataset 
metadata.
+   * @param bins  Bins for features, of size (numFeatures, numBins).
+   * @return  TreePoint dataset representation
+   */
+  def convertToTreeRDD(
+  input: RDD[LabeledPoint],
+  strategy: Strategy,
+  bins: Array[Array[Bin]]): RDD[TreePoint] = {
+input.map { x =>
+  TreePoint.labeledPointToTreePoint(x, 
strategy.isMulticlassClassification, bins,
+strategy.categoricalFeaturesInfo)
+}
+  }
+
+  /**
+   * Convert one LabeledPoint into its TreePoint representation.
+   * @param bins  Bins for features, of size (numFeatures, numBins).
+   * @param categoricalFeaturesInfo  Map over categorical features: 
feature index --> feature arity
+   */
+  private def labeledPointToTreePoint(
+  labeledPoint: LabeledPoint,
+  isMulticlassClassification: Boolean,
+  bins: Array[Array[Bin]],
+  categoricalFeaturesInfo: Map[Int, Int]): TreePoint = {
+
+val numFeatures = labeledPoint.features.size
+val numBins = bins(0).size
+val arr = new Array[Int](numFeatures)
+var featureIndex = 0 // offset by 1 for label
+while (featureIndex < numFeatures) {
+  val featureInfo = categoricalFeaturesInfo.get(featureIndex)
+  val isFeatureContinuous = featureInfo.isEmpty
+  if (isFeatureContinuous) {
+arr(featureIndex) = findBin(featureIndex, labeledPoint, 
isFeatureContinuous, false,
+  bins, categoricalFeaturesInfo)
+  } else {
+val featureCategories = featureInfo.get
+val isSpaceSufficientForAllCategoricalSplits
+  = numBins > math.pow(2, featureCategories.toInt - 1) - 1
+val isUnorderedFeature =
+  isMulticlassClassification && 
isSpaceSufficientForAllCategoricalSplits
+arr(featureIndex) = findBin(featureIndex, labeledPoint, 
isFeatureContinuous,
+  isUnorderedFeature, bins, categoricalFeaturesInfo)
+  }
+  featureIndex += 1
+}
+
+new TreePoint(labeledPoint.label, arr)
+  }
+
+
+  /**
+   * Find bin for one (labeledPoint, feature).
+   *
+   * @param isUnordere

[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279832
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+  private def labeledPointToTreePoint(
+  labeledPoint: LabeledPoint,
+  isMulticlassClassification: Boolean,
+  bins: Array[Array[Bin]],
+  categoricalFeaturesInfo: Map[Int, Int]): TreePoint = {
+
+val numFeatures = labeledPoint.features.size
+val numBins = bins(0).size
+val arr = new Array[Int](numFeatures)
+var featureIndex = 0 // offset by 1 for label
--- End diff --

Where does `offset by 1` happen?





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279839
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+  /**
+   * Find bin for one (labeledPoint, feature).
+   *
+   * @param isUnordere

[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279828
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+ * @param label  Label from LabeledPoint
+ * @param features  Binned feature values.
+ *  Same length as LabeledPoint.features, but values are 
bin indices.
+ */
+private[tree] class TreePoint(val label: Double, val features: Array[Int]) 
extends Serializable {
--- End diff --

another candidate is `binIndices`





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279831
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+ * @param label  Label from LabeledPoint
+ * @param features  Binned feature values.
+ *  Same length as LabeledPoint.features, but values are 
bin indices.
+ */
+private[tree] class TreePoint(val label: Double, val features: Array[Int]) 
extends Serializable {
+}
+
+
--- End diff --

remove extra empty line





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279840
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+  /**
+   * Find bin for one (labeledPoint, feature).
+   *
+   * @param isUnordere

[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279837
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+  /**
+   * Find bin for one (labeledPoint, feature).
+   *
+   * @param isUnordere

[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279834
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+new TreePoint(labeledPoint.label, arr)
+  }
+
+
--- End diff --

remove extra empty line



[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279819
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1281,7 +1226,8 @@ object DecisionTree extends Serializable with Logging 
{
   val gains = Array.ofDim[InformationGainStats](numFeatures, numBins - 
1)
 
   for (featureIndex <- 0 until numFeatures) {
-for (splitIndex <- 0 until numBins - 1) {
+val numSplitsForFeature = getNumSplitsForFeature(featureIndex)
+for (splitIndex <- 0 until numSplitsForFeature) {
--- End diff --

ditto: use `while`





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279816
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -1281,7 +1226,8 @@ object DecisionTree extends Serializable with Logging 
{
   val gains = Array.ofDim[InformationGainStats](numFeatures, numBins - 
1)
 
   for (featureIndex <- 0 until numFeatures) {
--- End diff --

use `while` instead of `for`
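A minimal sketch of the rewrite, reusing the bounds from the quoted diff:

```scala
// Equivalent while loop; avoids the Range/closure overhead that
// `for (i <- 0 until n)` carries on hot paths.
var featureIndex = 0
while (featureIndex < numFeatures) {
  // ... same body as the original for expression ...
  featureIndex += 1
}
```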





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279806
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -550,25 +579,35 @@ object DecisionTree extends Serializable with Logging 
{
 
   // Apply each filter and check sample validity. Return false when 
invalid condition found.
   for (filter <- parentFilters) {
--- End diff --

Please update this line to `parentFilters.foreach { filter =>`, which is 
faster than `for`.
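A sketch of the suggested form (the loop body is elided; it applies each filter exactly as before):

```scala
// Direct foreach call on the collection, as suggested.
parentFilters.foreach { filter =>
  // apply the filter and check sample validity, returning false when invalid
}
```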





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279814
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -973,10 +907,13 @@ object DecisionTree extends Serializable with Logging 
{
   combinedAggregate
 }
 
+
--- End diff --

remove extra empty line





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279801
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -144,6 +165,11 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
 // Build the full tree using the node info calculated in the 
level-wise best split calculations.
 topNode.build(nodes)
 
+timer.stop("total")
+
+logDebug("Internal timing for DecisionTree:")
--- End diff --

`logDebug` -> `logInfo`? There are not a lot of messages here and the 
running times are useful.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279805
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -200,6 +226,7 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
   }
 }
 
+
--- End diff --

remove extra empty line





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279800
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -112,16 +126,23 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
   logDebug("level = " + level)
   logDebug("#")
 
+
--- End diff --

remove extra empty line





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279812
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
   // Calculate bin index and label per feature per node.
   val arr = new Array[Double](1 + (numFeatures * numNodes))
   // First element of the array is the label of the instance.
-  arr(0) = labeledPoint.label
+  arr(0) = treePoint.label
   // Iterate over nodes.
   var nodeIndex = 0
   while (nodeIndex < numNodes) {
 val parentFilters = findParentFilters(nodeIndex)
 // Find out whether the sample qualifies for the particular node.
-val sampleValid = isSampleValid(parentFilters, labeledPoint)
+val sampleValid = isSampleValid(parentFilters, treePoint)
 val shift = 1 + numFeatures * nodeIndex
 if (!sampleValid) {
   // Mark one bin as -1 is sufficient.
   arr(shift) = InvalidBinIndex
 } else {
   var featureIndex = 0
+  // TODO: Vectorize this
   while (featureIndex < numFeatures) {
-val featureInfo = 
strategy.categoricalFeaturesInfo.get(featureIndex)
-val isFeatureContinuous = featureInfo.isEmpty
-if (isFeatureContinuous) {
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous, 
false)
-} else {
-  val featureCategories = featureInfo.get
-  val isSpaceSufficientForAllCategoricalSplits
-= numBins > math.pow(2, featureCategories.toInt - 1) - 1
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous,
-isSpaceSufficientForAllCategoricalSplits)
-}
+arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

use `Array.copy(treePoint.features, 0, arr, shift, numFeatures)` to replace 
the while loop?





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16279798
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
*/
   def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
 
+val timer = new TimeTracker()
+
+timer.start("total")
+
 // Cache input RDD for speedup during multiple passes.
-val retaggedInput = input.retag(classOf[LabeledPoint]).cache()
+timer.start("init")
+val retaggedInput = input.retag(classOf[LabeledPoint])
 logDebug("algo = " + strategy.algo)
+timer.stop("init")
 
 // Find the splits and the corresponding bins (interval between the 
splits) using a sample
 // of the input data.
+timer.start("findSplitsBins")
 val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, 
strategy)
 val numBins = bins(0).length
+timer.stop("findSplitsBins")
 logDebug("numBins = " + numBins)
 
+timer.start("init")
+val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, 
bins).cache()
--- End diff --

`.cache()` -> `.persist(StorageLevel.MEMORY_AND_DISK)`? There is a 
computation/storage trade-off here, maybe worth testing.
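For reference, a sketch of the suggested alternative using the standard storage-level API:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK spills binned points to disk under memory pressure
// instead of dropping and recomputing them.
val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins)
  .persist(StorageLevel.MEMORY_AND_DISK)
```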





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-14 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/1961

SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetrics...

...Update

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-3028

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1961.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1961


commit f883dedf2910380796bc2c148d3bc28f562b967a
Author: Sandy Ryza 
Date:   2014-08-15T04:16:07Z

SPARK-3028. sparkEventToJson should support 
SparkListenerExecutorMetricsUpdate







[GitHub] spark pull request: [SPARK-3035] Wrong example with SparkContext.a...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1942#issuecomment-52272419
  
QA tests have started for PR 1942. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18590/consoleFull





[GitHub] spark pull request: [SPARK-3035] Wrong example with SparkContext.a...

2014-08-14 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1942#issuecomment-52272341
  
Jenkins, test this please.





[GitHub] spark pull request: SPARK-3025 [SQL]: Allow JDBC clients to set a ...

2014-08-14 Thread tianyi
Github user tianyi commented on the pull request:

https://github.com/apache/spark/pull/1937#issuecomment-52272355
  
Hi @pwendell, I have two questions:
1. Why not handle this set command in "SetCommand"?
2. Do we have to save this pool name in LocalProperty (see the sketch below)?
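
For context, a sketch of how a pool is usually attached through a local property (the pool name here is hypothetical):

```scala
// Standard SparkContext API: the fair scheduler reads this thread-local
// property when assigning jobs to pools.
sc.setLocalProperty("spark.scheduler.pool", "myPool")
```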





[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/654#issuecomment-52271621
  
QA tests have started for PR 654. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18589/consoleFull





[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...

2014-08-14 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/654#issuecomment-52271303
  
Yes, that's my main concern with putting that logic in the logger. Also, 
doing that means any other listener that wants to use a timestamp would have 
to do the same hack; so with that approach, it would be better to change 
the SparkListener interface to take a timestamp parameter alongside the events.

I like having the timestamp with the event better because of that; it's 
ultimately a property of the event, and placing it somewhere else means you 
need to handle two pieces of information everywhere you need them, instead of 
everything being encapsulated in the event itself.

I agree all events should have a timestamp, but I don't have a solution for 
easily doing that (well, aside from investigating a saner approach to 
generating JSON than manually writing the conversion code, as is currently 
done). Perhaps it would be better to switch to something like Jackson soon 
and keep the current code just for backwards compatibility with older 
event logs.
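
For illustration, the event-carried-timestamp shape under discussion might look like this (a sketch only; the `time` field name and final signature are assumptions, not the merged API):

```scala
import org.apache.spark.storage.BlockManagerId

// Sketch, as it would appear inside org.apache.spark.scheduler where the
// sealed SparkListenerEvent trait lives; the timestamp travels with the event.
case class SparkListenerBlockManagerAdded(time: Long, blockManagerId: BlockManagerId,
    maxMem: Long) extends SparkListenerEvent
```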





[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/731#issuecomment-52270667
  
QA results for PR 731: This patch PASSES unit tests. For more 
information see test 
output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18587/consoleFull





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread shivaram
Github user shivaram commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52270393
  
LGTM (pending tests)





[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

2014-08-14 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1931#issuecomment-52269810
  
Yeah, sounds good. I guess we'll use a `ReferenceQueueWithSize` or 
something instead





[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...

2014-08-14 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/654#issuecomment-52269494
  
@vanzin I think it would be useful to timestamp all events, either in this 
patch or in the future. Is the reason we have to add the timestamp field to 
the event case classes themselves the following: we can't use the time when the 
logger actually logs the event, because it's technically inaccurate? E.g. the 
block manager might have been added ages ago, and by the time we actually 
process the event many milliseconds may have already passed. This isn't a huge 
problem if the event queues are small, but we have seen them grow to O(100) 
even after #1679. I guess this means we need to do the ugly thing of adding a 
"timestamp" field to each of these case classes. Do you see a better 
alternative?





[GitHub] spark pull request: [SPARK-2492][Streaming] kafkaReceiver minor ch...

2014-08-14 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/1420#issuecomment-52269406
  
Hi @tdas, would you mind taking a look at this? Thanks a lot.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/1950#issuecomment-52269207
  
:+1: 

LGTM! I have some minor comments. I look forward to the performance 
improvements based upon these changes. The multiclass bug fix is a great catch. 
Finally, I see the timer functionality being added soon to other parts of 
Spark. :-)





[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...

2014-08-14 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/654#discussion_r16278258
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -448,12 +450,14 @@ private[spark] object JsonProtocol {
   def blockManagerAddedFromJson(json: JValue): 
SparkListenerBlockManagerAdded = {
 val blockManagerId = blockManagerIdFromJson(json \ "Block Manager ID")
 val maxMem = (json \ "Maximum Memory").extract[Long]
-SparkListenerBlockManagerAdded(blockManagerId, maxMem)
+val time = (json \ "Timestamp").extract[Option[Long]].getOrElse(-1L)
--- End diff --

`Utils.jsonOption`
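
Presumably something along these lines (a sketch; assuming `Utils.jsonOption` returns an `Option[JValue]` for a possibly-missing field):

```scala
// Yields None when "Timestamp" is absent, so logs written before this
// patch still parse; fall back to -1L as in the quoted diff.
val time = Utils.jsonOption(json \ "Timestamp").map(_.extract[Long]).getOrElse(-1L)
```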





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16278221
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.model.Bin
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * Internal representation of LabeledPoint for DecisionTree.
+ * This bins feature values based on a subsample of the data as follows:
+ *  (a) Continuous features are binned into ranges.
+ *  (b) Unordered categorical features are binned based on subsets of 
feature values.
+ *  "Unordered categorical features" are categorical features with low 
arity used in
+ *  multiclass classification.
+ *  (c) Ordered categorical features are binned based on feature values.
+ *  "Ordered categorical features" are categorical features with high 
arity,
+ *  or any categorical feature used in regression or binary 
classification.
+ *
+ * @param label  Label from LabeledPoint
+ * @param features  Binned feature values.
+ *  Same length as LabeledPoint.features, but values are 
bin indices.
+ */
+private[tree] class TreePoint(val label: Double, val features: Array[Int]) 
extends Serializable {
--- End diff --

Minor: Could we use ```binnedFeatures``` instead of ```features```?

Discussion: I guess we are making ```Int``` the default choice for bins. 
That's great, since it avoids the additional book-keeping we would need if 
we packed bins into multiples of ```Byte```s. That may be an optimization 
worth considering in the future, but I think ```Int``` is a good tradeoff 
for now. A sketch of the tradeoff follows.
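To make the tradeoff concrete, a minimal sketch (the field name is the rename suggested above; the `Byte`-packed variant is purely hypothetical and not part of this patch):

```scala
package org.apache.spark.mllib.tree.impl

// Sketch only: bin indices stored as Ints, the simple default with no
// extra book-keeping.
private[tree] class TreePoint(
    val label: Double,
    val binnedFeatures: Array[Int])
  extends Serializable

// Hypothetical future optimization: pack bin indices into Bytes when every
// feature has at most 256 bins, cutting memory roughly 4x at the cost of
// the additional book-keeping mentioned above.
private[tree] class CompactTreePoint(
    val label: Double,
    val binnedFeatures: Array[Byte])
  extends Serializable
```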





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-14 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16278126
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TimeTracker.scala ---
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import scala.collection.mutable.{HashMap => MutableHashMap}
+
+import org.apache.spark.annotation.Experimental
+
+/**
+ * Time tracker implementation which holds labeled timers.
+ */
+@Experimental
+private[tree]
--- End diff --

Very minor: I'm not sure about the style, but should `private[tree]` be on 
the same line as the declaration? Feel free to ignore.
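For clarity, the suggested placement would look like this (style only; the class name is taken from the file under review, and the body is elided):

```scala
package org.apache.spark.mllib.tree.impl

import org.apache.spark.annotation.Experimental

// Style suggestion: keep the visibility modifier on the same line as the
// class declaration.
@Experimental
private[tree] class TimeTracker {
  // ...
}
```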





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52268330
  
QA tests have started for PR 1960. This patch merges cleanly. View 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-14 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1960

Remove netty-test-file.txt.

Instead, use randomly generated temporary data. 

cc @shivaram

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark netty

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1960.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1960


commit 3debe7c246b58345d0495b52f70bdd0be1b4f5e3
Author: Reynold Xin 
Date:   2014-08-15T02:19:33Z

Remove netty-test-file.txt.







[GitHub] spark pull request: [SPARK-2759][CORE] Generic Binary File Support...

2014-08-14 Thread loveconan1988
Github user loveconan1988 commented on the pull request:

https://github.com/apache/spark/pull/1658#issuecomment-52267833
  
That should solve it.

 

 -- Original Message --
 From: "Matei Zaharia"
 Sent: Friday, August 15, 2014, 9:55 AM
 To: "apache/spark"
 
 Subject: Re: [spark] [SPARK-2759][CORE] Generic Binary File Support in 
Spark (#1658)

 

 
@kmader to allow streams to be shuffle-able, how about the following? We 
create a class called BinaryData with a method open() that returns an 
InputStream. The BinaryData object has information about an InputSplit inside 
and can thus be shuffled across the network, cached, etc. And whenever users 
want to read it, they can get a stream. Would this solve the problem?
 
—
Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: SPARK-3052. Misleading and spurious FileSystem...

2014-08-14 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/1956#issuecomment-52267702
  
> Ah and the order they should be shut down in is RecordReader then
FileSystem?

Right
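For illustration, a sketch of that shutdown order with a hypothetical helper (not the actual patch):

```scala
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.RecordReader

// Close the RecordReader first, since it may still hold streams backed by
// the FileSystem; close the FileSystem afterwards even if the reader fails.
def closeInOrder(reader: RecordReader[_, _], fs: FileSystem): Unit = {
  try {
    reader.close()
  } finally {
    fs.close()
  }
}
```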





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-14 Thread loveconan1988
Github user loveconan1988 commented on the pull request:

https://github.com/apache/spark/pull/1907#issuecomment-52267441
  
Got it, thanks.
  

 

 -- Original Message --
 From: "asfgit"
 Sent: Friday, August 15, 2014, 10:03 AM
 To: "apache/spark"
 
 Subject: Re: [spark] [SPARK-2468] Netty based block server / client 
module (#1907)

 

 
Closed #1907 via 3a8b68b.
 
—
Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-2736] PySpark converter and example scr...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1916





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-14 Thread loveconan1988
Github user loveconan1988 commented on the pull request:

https://github.com/apache/spark/pull/1907#issuecomment-52267389
  
Got it, thanks.

 

 -- Original Message --
 From: "Reynold Xin"
 Sent: Friday, August 15, 2014, 10:01 AM
 To: "apache/spark"
 
 Subject: Re: [spark] [SPARK-2468] Netty based block server / client 
module (#1907)

 

 
Thanks for looking at this. Merging it in master & branch-1.1. This is off 
by default, so the danger is very small. We will continue improving it, and 
changes like SPARK-3019 will make it easier to use.
 
—
Reply to this email directly or view it on GitHub.





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1907





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1907#issuecomment-52267255
  
Thanks for looking at this. Merging it in master & branch-1.1. This is off 
by default, so the danger is very small. We will continue improving it, and 
changes like SPARK-3019 will make it easier to use.





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-14 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1907#issuecomment-52267195
  
I reviewed the changes that are relevant to Spark core and they LGTM.





[GitHub] spark pull request: [SPARK-2736] PySpark converter and example scr...

2014-08-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1916#issuecomment-52267186
  
Alright, going to merge this. Thanks for putting this together!





[GitHub] spark pull request: [SPARK-3011][SQL] _temporary directory should ...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1959#issuecomment-52267094
  
QA results for PR 1959:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18586/consoleFull





[GitHub] spark pull request: [SPARK-2759][CORE] Generic Binary File Support...

2014-08-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1658#issuecomment-52266956
  
@kmader to allow streams to be shuffle-able, how about the following? We 
create a class called `BinaryData` with a method open() that returns an 
InputStream. The BinaryData object has information about an InputSplit inside 
and can thus be shuffled across the network, cached, etc. And whenever users 
want to read it, they can get a stream. Would this solve the problem?
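For what it's worth, a rough sketch of what that API might look like (the field names and the reopen logic are assumptions, not a worked-out design):

```scala
import java.io.InputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Sketch of the proposed BinaryData: it carries only split metadata
// (path/offset/length), so it stays small and serializable, and can be
// shuffled across the network or cached like any other record.
class BinaryData(path: String, offset: Long, length: Long)
  extends Serializable {

  // The stream is opened lazily on whichever node ends up reading the data.
  def open(): InputStream = {
    val hadoopPath = new Path(path)
    // Assumption: reopen via the Hadoop FileSystem API and seek to the
    // split's starting offset; error handling omitted for brevity.
    val stream = hadoopPath.getFileSystem(new Configuration()).open(hadoopPath)
    stream.seek(offset)
    stream
  }
}
```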





[GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...

2014-08-14 Thread GrahamDennis
Github user GrahamDennis commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52266597
  
@rxin: You're right about me forgetting to run `sbt assembly`.  Some of the 
results of my test case don't quite match what I was seeing before (I was 
seeing only one of the executors having the .jar in its work directory).  Let 
me verify that my test case fails on the latest master branch without your 
fix, and also test this fix on my cluster.  If that checks out, then I'm 
happy.





[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...

2014-08-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/731#issuecomment-52266419
  
QA tests have started for PR 731. This patch DID NOT merge cleanly! 
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18587/consoleFull





[GitHub] spark pull request: [SPARK-3027] TaskContext: tighten visibility a...

2014-08-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1938





[GitHub] spark pull request: [SPARK-3027] TaskContext: tighten visibility a...

2014-08-14 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1938#issuecomment-52265995
  
Thanks. Merging this in master & branch-1.1.




