[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-15 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/1912#issuecomment-52277037
  
I have added Broadcast.unpersist(blocking=False).

Because we keep a copy on disk, we can read it from there when the user wants 
to access it on the driver, and so keep SparkContext.broadcast() unchanged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281333
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
   // Calculate bin index and label per feature per node.
   val arr = new Array[Double](1 + (numFeatures * numNodes))
   // First element of the array is the label of the instance.
-  arr(0) = labeledPoint.label
+  arr(0) = treePoint.label
   // Iterate over nodes.
   var nodeIndex = 0
   while (nodeIndex < numNodes) {
 val parentFilters = findParentFilters(nodeIndex)
 // Find out whether the sample qualifies for the particular node.
-val sampleValid = isSampleValid(parentFilters, labeledPoint)
+val sampleValid = isSampleValid(parentFilters, treePoint)
 val shift = 1 + numFeatures * nodeIndex
 if (!sampleValid) {
   // Mark one bin as -1 is sufficient.
   arr(shift) = InvalidBinIndex
 } else {
   var featureIndex = 0
+  // TODO: Vectorize this
   while (featureIndex < numFeatures) {
-val featureInfo = 
strategy.categoricalFeaturesInfo.get(featureIndex)
-val isFeatureContinuous = featureInfo.isEmpty
-if (isFeatureContinuous) {
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous, 
false)
-} else {
-  val featureCategories = featureInfo.get
-  val isSpaceSufficientForAllCategoricalSplits
-= numBins > math.pow(2, featureCategories.toInt - 1) - 1
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous,
-isSpaceSufficientForAllCategoricalSplits)
-}
+arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

Since the features array is the same for all nodes on which this 
labeled point is valid, is it really necessary for every node to have a copy 
of it?

In #1941 I changed the `arr` structure; would that memory saving 
help? 
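chouqin's concern can be made concrete outside Spark. A plain-Python sketch (the sizes and node count here are illustrative, and Spark's real buffers are Scala `Array[Double]`, not Python lists) comparing per-node copies against a single shared array:

```python
import sys

features = [0.5] * 1000      # one record's binned feature values
num_nodes = 32               # nodes at the current tree level

# What the flat per-node array layout effectively does: one copy per node.
copied = [list(features) for _ in range(num_nodes)]

# What the review suggests: every node references the same array.
shared = [features for _ in range(num_nodes)]

copied_bytes = sum(sys.getsizeof(c) for c in copied)
shared_bytes = sys.getsizeof(features)  # the single shared buffer

# Copies scale linearly with the number of nodes; sharing does not.
print(copied_bytes > 10 * shared_bytes)  # True
```

The ratio grows with the number of nodes at the level, which is why the duplication matters most deep in the tree.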





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281396
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
*/
   def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
 
+val timer = new TimeTracker()
+
+timer.start("total")
+
 // Cache input RDD for speedup during multiple passes.
-val retaggedInput = input.retag(classOf[LabeledPoint]).cache()
+timer.start("init")
+val retaggedInput = input.retag(classOf[LabeledPoint])
 logDebug("algo = " + strategy.algo)
+timer.stop("init")
 
 // Find the splits and the corresponding bins (interval between the 
splits) using a sample
 // of the input data.
+timer.start("findSplitsBins")
 val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, 
strategy)
 val numBins = bins(0).length
+timer.stop("findSplitsBins")
 logDebug("numBins = " + numBins)
 
+timer.start("init")
+val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, 
bins).cache()
+timer.stop("init")
--- End diff --

I think the timer for `convertToTreeRDD` may not be useful, since the `map` 
from `retaggedInput` to `treeInput` is evaluated lazily.
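The pitfall generalizes beyond Spark: Python's own lazy `map` shows the same effect (the `sleep` is a stand-in for per-record work, and the analogy to RDD transformations vs. actions is loose):

```python
import time

def slow_square(x):
    time.sleep(0.01)  # stand-in for per-record work
    return x * x

data = range(20)

# map() only builds a lazy pipeline -- analogous to an RDD
# transformation that records lineage without computing anything.
start = time.perf_counter()
lazy = map(slow_square, data)
build_time = time.perf_counter() - start

# The cost is paid only when something forces evaluation,
# analogous to a Spark action such as collect().
start = time.perf_counter()
result = list(lazy)
eval_time = time.perf_counter() - start

print(build_time < eval_time)  # True: building the pipeline is ~free
```

A timer wrapped around only the `map` call would therefore report near-zero time regardless of the real cost.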





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread chouqin
Github user chouqin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16281414
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -728,8 +659,10 @@ object DecisionTree extends Serializable with Logging {
   arr
 }
 
- // Find feature bins for all nodes at a level.
+// Find feature bins for all nodes at a level.
+timer.start("findBinsForLevel")
 val binMappedRDD = input.map(x => findBinsForLevel(x))
+timer.stop("findBinsForLevel")
--- End diff --

This timer may also be useless, for the same reason.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-5222
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18599/consoleFull)
 for   PR 1958 at commit 
[`f2c56c9`](https://github.com/apache/spark/commit/f2c56c976bc6faa83b8357c80caad1f4839eb06d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread harishreedharan
Github user harishreedharan commented on a diff in the pull request:

https://github.com/apache/spark/pull/1958#discussion_r16281516
  
--- Diff: 
external/flume-sink/src/test/scala/org/apache/spark/streaming/flume/sink/SparkSinkSuite.scala
 ---
@@ -0,0 +1,208 @@
+package org.apache.spark.streaming.flume.sink
+
+import java.net.InetSocketAddress
+import java.util.concurrent.atomic.AtomicInteger
+import java.util.concurrent.{CountDownLatch, Executors}
+
+import scala.collection.JavaConversions._
+import scala.concurrent.{Promise, Future}
+import scala.util.{Failure, Success, Try}
+
+import com.google.common.util.concurrent.ThreadFactoryBuilder
+import org.apache.avro.ipc.NettyTransceiver
+import org.apache.avro.ipc.specific.SpecificRequestor
+import org.apache.flume.Context
+import org.apache.flume.channel.MemoryChannel
+import org.apache.flume.event.EventBuilder
+import org.apache.spark.streaming.TestSuiteBase
+import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory
+
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
--- End diff --

Thanks! Done.





[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

2014-08-15 Thread chouqin
Github user chouqin commented on the pull request:

https://github.com/apache/spark/pull/1941#issuecomment-52277965
  
@mengxr @jkbradley never mind, I will help you review #1950 :)





[GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...

2014-08-15 Thread GrahamDennis
Github user GrahamDennis commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52278145
  
I've convinced myself that this PR fixes SPARK-2878, and think it should be 
merged.  Thanks @rxin!





[GitHub] spark pull request: [SPARK-2970] [SQL] spark-sql script ends with ...

2014-08-15 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1891#issuecomment-52278697
  
Sorry, didn't realize that `ShutdownHookManager` is only available in 
Hadoop 2. Compilation fails when building Spark with Hadoop 1. Filed 
[SPARK-3062](https://issues.apache.org/jira/browse/SPARK-3062) to track this 
issue.

@marmbrus Maybe we should revert this PR first, to be safe.





[GitHub] spark pull request: [SPARK-3063][SQL] ExistingRdd should convert M...

2014-08-15 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/1963

[SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map.

Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` value.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-3063

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1963.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1963


commit d8a900ad1408ea3662d2e43b2df24db485cd28e5
Author: Takuya UESHIN ues...@happy-camper.st
Date:   2014-08-15T06:48:52Z

Make ExistingRdd.convertToCatalyst be able to convert Map value.







[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-15 Thread sarutak
Github user sarutak commented on a diff in the pull request:

https://github.com/apache/spark/pull/1632#discussion_r16282025
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/ConnectionManager.scala ---
@@ -836,9 +845,14 @@ private[spark] class ConnectionManager(
   def sendMessageReliably(connectionManagerId: ConnectionManagerId, 
message: Message)
   : Future[Message] = {
 val promise = Promise[Message]()
+
+val ackTimeoutMonitor = new Timer(s"Ack Timeout Monitor-" +
--- End diff --

I've modified it to use a ConnectionManager-wide timer.





[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1632#issuecomment-52279573
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18600/consoleFull)
 for   PR 1632 at commit 
[`66cfff7`](https://github.com/apache/spark/commit/66cfff765f44aea7ff2bf5afe1a29403e79b7951).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3063][SQL] ExistingRdd should convert M...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1963#issuecomment-52279818
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18601/consoleFull)
 for   PR 1963 at commit 
[`d8a900a`](https://github.com/apache/spark/commit/d8a900ad1408ea3662d2e43b2df24db485cd28e5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...

2014-08-15 Thread chouqin
Github user chouqin closed the pull request at:

https://github.com/apache/spark/pull/1941





[GitHub] spark pull request: [SPARK-2929][SQL] Refactored Thrift server and...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1856#issuecomment-52280206
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18602/consoleFull)
 for   PR 1856 at commit 
[`c783024`](https://github.com/apache/spark/commit/c7830247747c797e57e0eb7ab44b95dd7cb18812).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2929][SQL] Refactored Thrift server and...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1856#issuecomment-52280390
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18602/consoleFull)
 for   PR 1856 at commit 
[`c783024`](https://github.com/apache/spark/commit/c7830247747c797e57e0eb7ab44b95dd7cb18812).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52280431
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2929][SQL] Refactored Thrift server and...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1856#issuecomment-52280478
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18603/consoleFull)
 for   PR 1856 at commit 
[`a175255`](https://github.com/apache/spark/commit/a175255092b7657e3fd8ff77daeee842b96405cd).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3040] pick up a more proper local ip ad...

2014-08-15 Thread advancedxy
Github user advancedxy commented on the pull request:

https://github.com/apache/spark/pull/1946#issuecomment-52280562
  
Sorry, I didn't realize Windows is supported. In that case, I believe a 
check is necessary. I will update the PR.





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52280678
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18604/consoleFull)
 for   PR 1960 at commit 
[`3debe7c`](https://github.com/apache/spark/commit/3debe7c246b58345d0495b52f70bdd0be1b4f5e3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3063][SQL] ExistingRdd should convert M...

2014-08-15 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/1963#issuecomment-52280976
  
LGTM.





[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1632#issuecomment-52280997
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18605/consoleFull)
 for   PR 1632 at commit 
[`7ed48be`](https://github.com/apache/spark/commit/7ed48be337f469b75a1ba0c85b6817e5beb9f3a6).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-08-15 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-52282101
  
@bgreeven I've tried to train `ParallelANNWithSGD` with 3 layers 
(1000x500x18), numIterations 1000, stepSize 1. My dataset has ~2000 instances, 
1000 features, and 18 classes. After 17 hours it still hadn't finished, so I 
killed the Spark process. I think there are some performance issues. I'll try 
to look at your code, but without comments it will be challenging :)





[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-15 Thread yu-iskw
GitHub user yu-iskw opened a pull request:

https://github.com/apache/spark/pull/1964

[SPARK-3012] Standardized Distance Functions between two Vectors for MLlib

https://issues.apache.org/jira/browse/SPARK-3012

I implemented some distance measures between two Vector instances, such as 
Manhattan distance and Euclidean distance, because standardized distance 
functions help us implement more machine learning algorithms efficiently.
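For reference, the two measures named above reduce to a few lines each. This plain-Python sketch is illustrative only: the PR itself operates on MLlib `Vector` instances, and the function names here are hypothetical:

```python
import math

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2 distance: square root of the sum of squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(manhattan([1.0, 2.0], [4.0, 6.0]))  # 7.0
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Having one shared, tested implementation of these lets algorithms such as k-means swap distance measures without duplicating the math.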

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yu-iskw/spark branch-1.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1964.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1964


commit 0e81b1239fe82d55feb9dca0ca7441d2e61aea14
Author: Yuu ISHIKAWA yuu.ishik...@gmail.com
Date:   2014-08-15T07:35:52Z

[SPARK-3012] Standardized Distance Functions between two Vectors for MLlib

https://issues.apache.org/jira/browse/SPARK-3012







[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1964#issuecomment-52282833
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3040] pick up a more proper local ip ad...

2014-08-15 Thread advancedxy
Github user advancedxy commented on the pull request:

https://github.com/apache/spark/pull/1946#issuecomment-52283168
  
@pwendell, would you look at this? It's a fairly simple fix. I don't use 
Windows as my primary OS, so I haven't confirmed it there. I hope someone who 
uses Windows can confirm this behavior.





[GitHub] spark pull request: [SPARK-3011][SQL] _temporary directory should ...

2014-08-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1959#discussion_r16284043
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
 }
 ParquetRelation.enableLogForwarding()
 
+// NOTE: Explicitly list "_temporary" because Hadoop 0.23 removed the 
variable TEMP_DIR_NAME
+// from FileOutputCommitter. See MAPREDUCE-5229 for details.
 val children = fs.listStatus(path).filterNot { status =>
   val name = status.getPath.getName
-  name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+  name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || 
name == "_temporary"
--- End diff --

How about ignoring any file starting with _ ? Hadoop (also) uses this 
convention, for things like the `_SUCCESS` file.
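The suggested rule collapses the growing `||` chain into a single prefix check. A plain-Python sketch of the filter (file names illustrative):

```python
def is_hidden(name: str) -> bool:
    # Skip Hadoop bookkeeping files: dotfiles plus anything starting
    # with an underscore (_SUCCESS, _temporary, _metadata, ...).
    return name.startswith(".") or name.startswith("_")

listing = ["part-00000", "_SUCCESS", "_temporary", ".crc", "part-00001"]
children = [n for n in listing if not is_hidden(n)]
print(children)  # ['part-00000', 'part-00001']
```

The prefix check also future-proofs the filter against any new underscore-prefixed marker files Hadoop might add.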





[GitHub] spark pull request: remove MaxPermSize option for jvm 1.8

2014-08-15 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/1965

remove MaxPermSize option for jvm 1.8

In JVM 1.8.0, MaxPermSize is no longer supported.
In Spark's `stderr` output, there would be a line like:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option 
MaxPermSize=128m; support was removed in 8.0



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark maxpermsize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1965


commit 18c36b21303d9ca115326dda9b11e3782bfc7390
Author: Daoyuan Wang daoyuan.w...@intel.com
Date:   2014-08-15T08:43:59Z

remove MaxPermSize option for jvm 1.8







[GitHub] spark pull request: [SPARK-3058] [SQL] Support EXTENDED for EXPLAI...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1962#issuecomment-52286428
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18606/consoleFull)
 for   PR 1962 at commit 
[`295db74`](https://github.com/apache/spark/commit/295db7406beca519f0d169dca8bc9b433b0bc329).
 * This patch merges cleanly.





[GitHub] spark pull request: remove MaxPermSize option for jvm 1.8

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1965#issuecomment-52286793
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18607/consoleFull) for PR 1965 at commit [`18c36b2`](https://github.com/apache/spark/commit/18c36b21303d9ca115326dda9b11e3782bfc7390).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3039: Allow spark to be built using avro...

2014-08-15 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1945#issuecomment-52287539
  
I think it works with the invocation you describe. Honestly, it's not a big 
priority for this version, but it's nice to get it right. Want to open a JIRA to 
track updating/deleting the info from README.md? I think it needs to be fixed 
one way or the other. 





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread adrian-wang
Github user adrian-wang closed the pull request at:

https://github.com/apache/spark/pull/1965





[GitHub] spark pull request: [SPARK-3067] JobProgressPage could not show Fa...

2014-08-15 Thread YanTangZhai
GitHub user YanTangZhai opened a pull request:

https://github.com/apache/spark/pull/1966

[SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section 
sometimes

JobProgressPage sometimes fails to show the Fair Scheduler Pools section.
SparkContext starts the web UI and then calls postEnvironmentUpdate. If 
JobProgressPage is accessed between the web UI starting and 
postEnvironmentUpdate, the lazy val isFairScheduler is evaluated to false, and 
the Fair Scheduler Pools section is never displayed afterwards.
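
The race described above comes down to how Scala caches a `lazy val` on first access. A minimal sketch in plain Scala — not the actual Spark UI code, and the names are only illustrative:

```scala
// Sketch of the lazy-val pitfall behind SPARK-3067: once the value is
// forced, later updates to the state it was computed from are invisible.
object JobProgressPageSketch {
  // In Spark this field would be filled in by postEnvironmentUpdate.
  var schedulingMode: String = "FIFO"

  // Evaluated once, on first access, and cached forever after.
  lazy val isFairScheduler: Boolean = schedulingMode == "FAIR"

  def main(args: Array[String]): Unit = {
    // A page request arrives before the environment update is posted...
    println(isFairScheduler)      // false: the lazy val is forced too early
    // ...then the real scheduling mode shows up, too late.
    schedulingMode = "FAIR"
    println(isFairScheduler)      // still false: the cached value never changes
  }
}
```

Making the flag a `def`, or delaying its first access until after postEnvironmentUpdate, avoids caching the wrong answer.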

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/YanTangZhai/spark SPARK-3067

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1966.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1966


commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai hakeemz...@tencent.com
Date:   2014-08-06T13:07:08Z

Merge pull request #1 from apache/master

update

commit aac7f7b67d83d4175018d58568cfbd1a639e3d7e
Author: yantangzhai tyz0...@163.com
Date:   2014-08-15T09:04:24Z

[SPARK-3067] JobProgressPage could not show Fair Scheduler Pools section 
sometimes







[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread adrian-wang
GitHub user adrian-wang opened a pull request:

https://github.com/apache/spark/pull/1967

[SPARK-3068]remove MaxPermSize option for jvm 1.8

In JVM 1.8.0, MaxPermSize is no longer supported.
Spark's `stderr` output then contains a line like:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option 
MaxPermSize=128m; support was removed in 8.0


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/adrian-wang/spark maxpermsize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1967


commit 9c32941f0607500e71786848c589cdff73b7d7ea
Author: Daoyuan Wang daoyuan.w...@intel.com
Date:   2014-08-15T09:09:09Z

remove MaxPermSize option for jvm 1.8







[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1967#issuecomment-52288134
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18608/consoleFull) for PR 1967 at commit [`9c32941`](https://github.com/apache/spark/commit/9c32941f0607500e71786848c589cdff73b7d7ea).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3067] JobProgressPage could not show Fa...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1966#issuecomment-52288130
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18609/consoleFull) for PR 1966 at commit [`aac7f7b`](https://github.com/apache/spark/commit/aac7f7b67d83d4175018d58568cfbd1a639e3d7e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1965#issuecomment-52288743
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18607/consoleFull) for PR 1965 at commit [`18c36b2`](https://github.com/apache/spark/commit/18c36b21303d9ca115326dda9b11e3782bfc7390).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-3039: Allow spark to be built using avro...

2014-08-15 Thread bbossy
Github user bbossy commented on the pull request:

https://github.com/apache/spark/pull/1945#issuecomment-52290012
  
Created the issue: https://issues.apache.org/jira/browse/SPARK-3069 (Build 
instructions in README are outdated)

@srowen: Thank you for your input!







[GitHub] spark pull request: [SPARK-3065][SQL] Add Locale setting to HiveCo...

2014-08-15 Thread luogankun
GitHub user luogankun opened a pull request:

https://github.com/apache/spark/pull/1968

[SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite to fix run 
udf_unix_timestamp with not America/Los_Angeles TimeZone

Running the udf_unix_timestamp test case of 
org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
with a TimeZone other than America/Los_Angeles throws an error. 
[https://issues.apache.org/jira/browse/SPARK-3065]
Add a Locale setting in the beforeAll and afterAll methods to fix this bug in 
the HiveCompatibilitySuite test case.
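
A minimal sketch of the kind of pinning the fix describes, assuming the suite can override `beforeAll`/`afterAll`; the object and method names below are illustrative, not the actual HiveCompatibilitySuite code:

```scala
import java.util.{Locale, TimeZone}

// Pin Locale and TimeZone for the duration of the tests and restore the
// originals afterwards, so timestamp-formatting results such as
// udf_unix_timestamp do not depend on the machine's defaults.
object LocalePinningSketch {
  private val originalLocale = Locale.getDefault
  private val originalTimeZone = TimeZone.getDefault

  def beforeAll(): Unit = {
    Locale.setDefault(Locale.US)
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
  }

  def afterAll(): Unit = {
    Locale.setDefault(originalLocale)
    TimeZone.setDefault(originalTimeZone)
  }

  def main(args: Array[String]): Unit = {
    beforeAll()
    println(Locale.getDefault)           // en_US
    println(TimeZone.getDefault.getID)   // America/Los_Angeles
    afterAll()
  }
}
```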



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/luogankun/spark SPARK-3065

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1968


commit 0a25e3a4f1cb80e027ee0fd05fd898651a3f7074
Author: luogankun luogan...@gmail.com
Date:   2014-08-15T07:31:47Z

[SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite

commit c167832d161dd559a67061094d77fa454ec24fa8
Author: luogankun luogan...@gmail.com
Date:   2014-08-15T09:07:32Z

[SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite







[GitHub] spark pull request: [SPARK-3065][SQL] Add Locale setting to HiveCo...

2014-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1968#issuecomment-52290933
  
Can one of the admins verify this patch?





[GitHub] spark pull request: Use user defined $SPARK_HOME in spark-submit i...

2014-08-15 Thread iven
GitHub user iven opened a pull request:

https://github.com/apache/spark/pull/1969

Use user defined $SPARK_HOME in spark-submit if possible



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/iven/spark spark-home

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1969.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1969


commit 8dc9f7f16d414ce2fd285243afe8fb87c33e9a8d
Author: Xu Lijian xulij...@qiyi.com
Date:   2014-08-07T08:46:08Z

Use user defined $SPARK_HOME in spark-submit if possible







[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1967#issuecomment-52291120
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18608/consoleFull) for PR 1967 at commit [`9c32941`](https://github.com/apache/spark/commit/9c32941f0607500e71786848c589cdff73b7d7ea).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Use user defined $SPARK_HOME in spark-submit i...

2014-08-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1969#issuecomment-52291226
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-3040] pick up a more proper local ip ad...

2014-08-15 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1946#issuecomment-52292439
  
Is this relying on documented behaviour or on observed evidence?
If the latter, it is iffy at best to include this.
 On 15-Aug-2014 1:29 pm, 叶先进 notificati...@github.com wrote:

 @pwendell https://github.com/pwendell, would you look at this? It's a
 fairly simple fix. I don't have windows for primary use, so it's not
 confirmed on windows. I hope someone who uses windows can confirm this
 behavior.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1946#issuecomment-52283168.






[GitHub] spark pull request: Added support for :cp jar that was broken in...

2014-08-15 Thread rcsenkbeil
Github user rcsenkbeil commented on the pull request:

https://github.com/apache/spark/pull/1929#issuecomment-52309557
  
@mateiz @som-snytt The summary I have is that this should work for 2.10.x, 
as it doesn't appear that they are trickling the removal of the class 
invalidation from [scala/scala#3884](https://github.com/scala/scala/pull/3884) 
down to 2.10. I've looked more closely at the internal classpath editing in 2.10 
and I don't see any potential side effects from appending to the existing merged 
classpath.

In terms of Spark moving to 2.11(.3 or higher) in the future, one option 
would be to simply include a trait, mixed in with Global, that keeps the 
functionality they removed. From what I've seen, the only reason they removed 
it was that it wasn't used in the compiler or sbt, not that it had any 
issues. The underlying functionality wasn't broken, so I don't see any harm in 
doing that as a backup.





[GitHub] spark pull request: Use user defined $SPARK_HOME in spark-submit i...

2014-08-15 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1969#issuecomment-52309962
  
I once submitted a similar patch, but the latest (merged?) solution is that 
we do not send the local SPARK_HOME to the remote end at all. @andrewor14?





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16294020
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-    def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+    def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
       // Calculate bin index and label per feature per node.
       val arr = new Array[Double](1 + (numFeatures * numNodes))
       // First element of the array is the label of the instance.
-      arr(0) = labeledPoint.label
+      arr(0) = treePoint.label
       // Iterate over nodes.
       var nodeIndex = 0
       while (nodeIndex < numNodes) {
         val parentFilters = findParentFilters(nodeIndex)
         // Find out whether the sample qualifies for the particular node.
-        val sampleValid = isSampleValid(parentFilters, labeledPoint)
+        val sampleValid = isSampleValid(parentFilters, treePoint)
         val shift = 1 + numFeatures * nodeIndex
         if (!sampleValid) {
           // Mark one bin as -1 is sufficient.
           arr(shift) = InvalidBinIndex
         } else {
           var featureIndex = 0
+          // TODO: Vectorize this
           while (featureIndex < numFeatures) {
-            val featureInfo = strategy.categoricalFeaturesInfo.get(featureIndex)
-            val isFeatureContinuous = featureInfo.isEmpty
-            if (isFeatureContinuous) {
-              arr(shift + featureIndex)
-                = findBin(featureIndex, labeledPoint, isFeatureContinuous, false)
-            } else {
-              val featureCategories = featureInfo.get
-              val isSpaceSufficientForAllCategoricalSplits
-                = numBins > math.pow(2, featureCategories.toInt - 1) - 1
-              arr(shift + featureIndex)
-                = findBin(featureIndex, labeledPoint, isFeatureContinuous,
-                    isSpaceSufficientForAllCategoricalSplits)
-            }
+            arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

@mengxr that should work. @chouqin I thought we need this per node info for 
the aggregate step. Could you point to the relevant code in your commit to 
better understand your suggestion?





[GitHub] spark pull request: [SPARK-2927][SQL] Add a conf to configure if w...

2014-08-15 Thread chutium
Github user chutium commented on a diff in the pull request:

https://github.com/apache/spark/pull/1855#discussion_r16294353
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -403,7 +406,10 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
* @param conf The Hadoop configuration to use.
* @return A list of attributes that make up the schema.
*/
-  def readSchemaFromFile(origPath: Path, conf: Option[Configuration]): 
Seq[Attribute] = {
+  def readSchemaFromFile(
+  origPath: Path,
+  conf: Option[Configuration],
+  isBinaryAsString: Boolean): Seq[Attribute] = {
 val keyValueMetadata: java.util.Map[String, String] =
   readMetaData(origPath, conf)
 .getFileMetaData
--- End diff --

this patch will be great for impala users like us :) thanks. Moreover, 
there is a ```getCreatedBy``` method in ```readMetaData(origPath, 
conf).getFileMetaData```, and Impala always writes its own CreatedBy 
information into the Parquet files it creates (it always contains the string 
"impala"), so maybe we can do some auto-detection like 
(https://github.com/apache/spark/pull/1599/files)
```
if (fileMetaData.getCreatedBy.contains("impala")) {
  isBinaryAsString = true
}
```
does this auto-detection make sense?
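
A runnable sketch of that auto-detection, with the metadata access reduced to a plain string so the logic stands alone; the helper name is made up, and the real code would obtain the string via readMetaData(origPath, conf).getFileMetaData.getCreatedBy:

```scala
object ImpalaDetectionSketch {
  // Decide whether binary columns should be read as strings: either the
  // user configured it explicitly, or the file was written by Impala
  // (whose created-by string contains "impala").
  def binaryAsString(createdBy: String, configured: Boolean): Boolean =
    configured || createdBy.toLowerCase(java.util.Locale.ROOT).contains("impala")

  def main(args: Array[String]): Unit = {
    println(binaryAsString("impala version 1.4", configured = false))       // true
    println(binaryAsString("parquet-mr version 1.4.3", configured = false)) // false
  }
}
```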





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16298815
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
*/
   def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
 
+    val timer = new TimeTracker()
+
+    timer.start("total")
+
     // Cache input RDD for speedup during multiple passes.
-    val retaggedInput = input.retag(classOf[LabeledPoint]).cache()
+    timer.start("init")
+    val retaggedInput = input.retag(classOf[LabeledPoint])
     logDebug("algo = " + strategy.algo)
+    timer.stop("init")

     // Find the splits and the corresponding bins (interval between the splits) using a sample
     // of the input data.
+    timer.start("findSplitsBins")
     val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, strategy)
     val numBins = bins(0).length
+    timer.stop("findSplitsBins")
     logDebug("numBins = " + numBins)

+    timer.start("init")
+    val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache()
+    timer.stop("init")
--- End diff --

nice catch!





[GitHub] spark pull request: [SPARK-3011][SQL] _temporary directory should ...

2014-08-15 Thread joesu
Github user joesu commented on a diff in the pull request:

https://github.com/apache/spark/pull/1959#discussion_r16299347
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends 
Logging {
 }
 ParquetRelation.enableLogForwarding()
 
+    // NOTE: Explicitly list "_temporary" because hadoop 0.23 removed the variable TEMP_DIR_NAME
+    // from FileOutputCommitter. Check MAPREDUCE-5229 for the detail.
     val children = fs.listStatus(path).filterNot { status =>
       val name = status.getPath.getName
-      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || name == "_temporary"
--- End diff --

Unfortunately, that would ignore the metadata file _metadata as well. 
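
The trade-off in this thread is easy to see in isolation: a blanket underscore-prefix check would drop "_metadata" too, so "_temporary" (and the success marker) have to be named explicitly. A plain-Scala sketch with the committer constant inlined — illustrative names, no Hadoop dependency:

```scala
object ParquetChildFilterSketch {
  // Inlined stand-in for FileOutputCommitter.SUCCEEDED_FILE_NAME.
  val SucceededFileName = "_SUCCESS"

  // Keep data files and "_metadata"; drop hidden files, the success
  // marker, and the "_temporary" work directory.
  def keep(name: String): Boolean =
    name(0) != '.' && name != SucceededFileName && name != "_temporary"

  def main(args: Array[String]): Unit = {
    val names = Seq("part-r-00000.parquet", "_metadata", "_temporary", "_SUCCESS", ".hidden")
    println(names.filter(keep))  // List(part-r-00000.parquet, _metadata)
  }
}
```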





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1967#discussion_r16299399
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala ---
@@ -73,8 +73,17 @@ object CommandUtils extends Logging {
   extraEnvironment = command.environment)
 val userClassPath = command.classPathEntries ++ Seq(classPath)
 
-    Seq("-cp", userClassPath.filterNot(_.isEmpty).mkString(File.pathSeparator)) ++
-      permGenOpt ++ libraryOpts ++ workerLocalOpts ++ command.javaOpts ++ memoryOpts
+    val runner = getEnv("JAVA_HOME", command).map(_ + "/bin/java").getOrElse("java")
--- End diff --

Is all this code and munging in case JAVA_HOME is not the same Java as the 
Worker was started with? Otherwise it seems we could just do 
`sys.env("java.version")`.





[GitHub] spark pull request: [SPARK-2924] remove default args to overloaded...

2014-08-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1704#issuecomment-52322700
  
Thanks - I've merged this





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52322797
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-3068]remove MaxPermSize option for jvm ...

2014-08-15 Thread aarondav
Github user aarondav commented on a diff in the pull request:

https://github.com/apache/spark/pull/1967#discussion_r16299443
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/worker/CommandUtils.scala ---
@@ -73,8 +73,17 @@ object CommandUtils extends Logging {
   extraEnvironment = command.environment)
 val userClassPath = command.classPathEntries ++ Seq(classPath)
 
-    Seq("-cp", userClassPath.filterNot(_.isEmpty).mkString(File.pathSeparator)) ++
-      permGenOpt ++ libraryOpts ++ workerLocalOpts ++ command.javaOpts ++ memoryOpts
+    val runner = getEnv("JAVA_HOME", command).map(_ + "/bin/java").getOrElse("java")
+    val jvmversion = Utils.executeAndGetOutput(Seq(runner, "-version"),
+      extraEnvironment = command.environment)
+    val version = jvmversion.substring(jvmversion.indexOf("\"") + 1, jvmversion.indexOf("_"))
+    if (version.compareTo("1.8.0") < 0) {
+      Seq("-cp", userClassPath.filterNot(_.isEmpty).mkString(File.pathSeparator)) ++
+        permGenOpt ++ libraryOpts ++ workerLocalOpts ++ command.javaOpts ++ memoryOpts
--- End diff --

Let's not copy-paste this, just make `permGenOpt = if (version < "1.8.0") 
Some("-XX:MaxPermSize=128m") else None` and unconditionally include it in this 
Seq.
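
The suggestion above can be sketched in isolation. The comparison mirrors the version check in the quoted patch (lexicographic `compareTo` against "1.8.0", which is good enough for the 1.x version strings involved); `-XX:MaxPermSize` is the real HotSpot flag removed in Java 8, everything else is illustrative:

```scala
object PermGenOptSketch {
  // Compute the PermGen flag once as an Option: present for pre-1.8 JVMs,
  // absent for 1.8+, so it can be concatenated unconditionally.
  def permGenOpt(javaVersion: String): Option[String] =
    if (javaVersion.compareTo("1.8.0") < 0) Some("-XX:MaxPermSize=128m") else None

  def main(args: Array[String]): Unit = {
    val baseOpts = Seq("-cp", "app.jar")
    println(baseOpts ++ permGenOpt("1.7.0"))  // includes the PermGen flag
    println(baseOpts ++ permGenOpt("1.8.0"))  // omits it
  }
}
```

This avoids duplicating the whole option `Seq` for the two JVM cases.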





[GitHub] spark pull request: [SPARK-3065][SQL] Add Locale setting to HiveCo...

2014-08-15 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1968#issuecomment-52322858
  
test this please





[GitHub] spark pull request: [SPARK-2924] remove default args to overloaded...

2014-08-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1704





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52323644
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18610/consoleFull) for PR 1961 at commit [`dccdff5`](https://github.com/apache/spark/commit/dccdff55612ef798227ed1e8102489d01e6e7d07).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-52327168
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18611/consoleFull)
 for   PR 1958 at commit 
[`f2c56c9`](https://github.com/apache/spark/commit/f2c56c976bc6faa83b8357c80caad1f4839eb06d).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-52326820
  
Jenkins, test this please





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16301157
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
   // Calculate bin index and label per feature per node.
   val arr = new Array[Double](1 + (numFeatures * numNodes))
   // First element of the array is the label of the instance.
-  arr(0) = labeledPoint.label
+  arr(0) = treePoint.label
   // Iterate over nodes.
   var nodeIndex = 0
   while (nodeIndex < numNodes) {
 val parentFilters = findParentFilters(nodeIndex)
 // Find out whether the sample qualifies for the particular node.
-val sampleValid = isSampleValid(parentFilters, labeledPoint)
+val sampleValid = isSampleValid(parentFilters, treePoint)
 val shift = 1 + numFeatures * nodeIndex
 if (!sampleValid) {
   // Mark one bin as -1 is sufficient.
   arr(shift) = InvalidBinIndex
 } else {
   var featureIndex = 0
+  // TODO: Vectorize this
   while (featureIndex < numFeatures) {
-val featureInfo = 
strategy.categoricalFeaturesInfo.get(featureIndex)
-val isFeatureContinuous = featureInfo.isEmpty
-if (isFeatureContinuous) {
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous, 
false)
-} else {
-  val featureCategories = featureInfo.get
-  val isSpaceSufficientForAllCategoricalSplits
-= numBins > math.pow(2, featureCategories.toInt - 1) - 1
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous,
-isSpaceSufficientForAllCategoricalSplits)
-}
+arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

I agree with @chouqin that TreePoint simply stores feature values (exactly 
the same data as LabeledPoint) for ordered categorical features.  We could save 
some space by not making a copy of those (up to 2x the storage).  The main 
issue is that indexing would be a little trickier since we do not separate out 
these various features.

@chouqin  About the arr structure, I am doing that in my next updates, 
where I eliminate that entire part of the aggregation step.  I.e., I eliminate 
the call to findBinsForLevel() and keep the binMappedRDD.aggregate().
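The TreePoint idea under discussion can be sketched as follows (a minimal illustration, not Spark's actual implementation; `find_bin` and `to_tree_point` are hypothetical names): each point's raw feature values are converted to bin indices once, up front, so per-level aggregation only copies precomputed indices instead of recomputing findBin at every level.

```python
# Minimal sketch (hypothetical names, not Spark's implementation) of the
# TreePoint idea: bin each feature value once, up front, instead of calling
# findBin again for every node at every level during aggregation.

def find_bin(value, thresholds):
    """Index of the first bin whose upper threshold is >= value."""
    for i, t in enumerate(thresholds):
        if value <= t:
            return i
    return len(thresholds)  # open-ended last bin

def to_tree_point(label, features, splits):
    """Precompute one bin index per feature: the 'TreePoint' representation."""
    return label, [find_bin(v, splits[j]) for j, v in enumerate(features)]

# Two continuous features with candidate split thresholds.
splits = [[0.5, 1.5], [10.0]]
label, binned = to_tree_point(1.0, [0.7, 25.0], splits)
# binned is [1, 1]: 0.7 lands in bin 1 of [0.5, 1.5]; 25.0 is past 10.0.
```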






[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/1950#issuecomment-52327651
  
@chouqin  Thank you for the comments!  I'll make those fixes and get the 
other PRs done ASAP.





[GitHub] spark pull request: [SPARK-3065][SQL] Add Locale setting to HiveCo...

2014-08-15 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/1968#issuecomment-52329690
  
Jenkins, test this please.





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52330255
  
Jenkins, retest this please.





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52331037
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18612/consoleFull)
 for   PR 1961 at commit 
[`dccdff5`](https://github.com/apache/spark/commit/dccdff55612ef798227ed1e8102489d01e6e7d07).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-52332252
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18611/consoleFull)
 for   PR 1958 at commit 
[`f2c56c9`](https://github.com/apache/spark/commit/f2c56c976bc6faa83b8357c80caad1f4839eb06d).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3062] [SPARK-2970] [SQL] spark-sql scri...

2014-08-15 Thread sarutak
GitHub user sarutak opened a pull request:

https://github.com/apache/spark/pull/1970

[SPARK-3062] [SPARK-2970] [SQL] spark-sql script ends with IOException when 
EventLogging is enabled

#1891 was meant to avoid an IOException when EventLogging is enabled.
That solution used ShutdownHookManager, but it is defined only in Hadoop 2.x;
Hadoop 1.x doesn't have ShutdownHookManager, so #1891 doesn't compile on Hadoop 1.x.

Now I have a compromise solution that works on both Hadoop 1.x and 2.x:
a unique FileSystem object is created only for FileLogger.
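The hazard being worked around can be illustrated with a toy sketch (simplified hypothetical classes, not Hadoop's real implementation): Hadoop caches FileSystem instances per URI, so a shutdown hook that closes the shared cached instance can break every other component holding the same reference, while an instance created just for the event logger can be closed safely.

```python
# Toy illustration (simplified, not Hadoop's real classes) of why closing a
# process-wide cached FileSystem from a shutdown hook is dangerous, and why
# giving FileLogger its own instance avoids the IOException.

class ToyFileSystem:
    _cache = {}

    def __init__(self):
        self.closed = False

    @classmethod
    def get(cls, uri):
        # Cached: every caller asking for the same URI shares one instance.
        return cls._cache.setdefault(uri, cls())

    @classmethod
    def new_instance(cls, uri):
        # Unique: the caller owns this instance and may close it freely.
        return cls()

    def close(self):
        self.closed = True

shared = ToyFileSystem.get("hdfs://namenode")       # shared cached instance
logger_fs = ToyFileSystem.new_instance("hdfs://namenode")  # private instance

logger_fs.close()         # the event logger shuts down its private copy...
assert not shared.closed  # ...without breaking everyone sharing the cache
```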

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sarutak/spark SPARK-2970

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1970.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1970


commit e1262ecd4d1a21df1e881ed3881f1cee4128dfb4
Author: Kousuke Saruta saru...@oss.nttdata.co.jp
Date:   2014-08-15T17:14:01Z

Modified Filelogger to use unique FileSystem instance







[GitHub] spark pull request: [SPARK-2970] [SQL] spark-sql script ends with ...

2014-08-15 Thread sarutak
Github user sarutak commented on the pull request:

https://github.com/apache/spark/pull/1891#issuecomment-5255
  
@liancheng @marmbrus Sorry, that was my mistake.
I now have a compromise solution in #1970.





[GitHub] spark pull request: [SPARK-3062] [SPARK-2970] [SQL] spark-sql scri...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1970#issuecomment-52333871
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18613/consoleFull)
 for   PR 1970 at commit 
[`e1262ec`](https://github.com/apache/spark/commit/e1262ecd4d1a21df1e881ed3881f1cee4128dfb4).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-2333 - Allow option to specify/reuse a s...

2014-08-15 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1899#issuecomment-52335044
  
The committers use the 
[merge_spark_pr.py](https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py)
 script for merging pull requests.  This script will squash together all of 
your commits into a single commit that uses the title of the PR plus the PR's 
description as the commit message, so you no longer need to worry about 
rebasing your patch into a single commit.

I'd just copy-paste the description from your last commit into the PR 
description above, so that it becomes the commit message.
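The squash behavior described above can be sketched as follows (a simplified illustration; the real dev/merge_spark_pr.py does considerably more, e.g. performing the actual git squash, collecting authors from the commits, and cherry-picking into release branches):

```python
# Simplified sketch of assembling a squash-merge commit message from PR
# metadata, in the spirit of dev/merge_spark_pr.py: PR title becomes the
# subject, the PR description becomes the body, and a "Closes #N" line lets
# GitHub auto-close the pull request.

def squash_commit_message(pr_number, title, description, authors):
    lines = [title, "", description, ""]
    lines += ["Author: %s" % a for a in authors]
    lines.append("Closes #%d from this pull request." % pr_number)
    return "\n".join(lines)

msg = squash_commit_message(
    1899,
    "SPARK-2333 - Allow option to specify/reuse a security group",
    "Adds an option to specify or reuse an existing EC2 security group.",
    ["Example Author <author@example.com>"],
)
```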





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-15 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/1960





[GitHub] spark pull request: Remove netty-test-file.txt.

2014-08-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1960#issuecomment-52335199
  
Closing this one since it will be combined with another pull request.





[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...

2014-08-15 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1971

[SPARK-2468] Netty based block server / client module

Previous pull request (#1907) was reverted. This brings it back. Still 
looking into the hang.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark netty1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1971.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1971


commit 754f2f732337ec65e2bb5ff7308cff73967cb3c3
Author: Reynold Xin r...@apache.org
Date:   2014-08-15T17:34:25Z

Revert "Revert "[SPARK-2468] Netty based block server / client module""

This reverts commit fd9fcd25e93c727b327909cde0027426204ca6c3, which was itself a 
revert; that is to say, this adds the Netty module back.

commit 9629a1eeb7c67b5e369540b29785de801ea9d508
Author: Reynold Xin r...@apache.org
Date:   2014-08-15T02:19:33Z

Remove netty-test-file.txt.

(cherry picked from commit 3debe7c246b58345d0495b52f70bdd0be1b4f5e3)
Signed-off-by: Reynold Xin r...@apache.org







[GitHub] spark pull request: SPARK-2333 - Allow option to specify/reuse a s...

2014-08-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/1899#discussion_r16304946
  
--- Diff: ec2/spark_ec2.py ---
@@ -440,14 +449,22 @@ def launch_cluster(conn, opts, cluster_name):
 print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
 # Give the instances descriptive names
+# TODO: Add retry logic for tagging with name since it's used to 
identify a cluster.
 for master in master_nodes:
-master.add_tag(
-key='Name',
-value='{cn}-master-{iid}'.format(cn=cluster_name, 
iid=master.id))
+name = '{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id)
+for i in range(0, 5):
+master.add_tag(key='Name', value=name)
--- End diff --

What happens if an `add_tag` call fails?  My bet is that it throws an 
exception rather than silently failing, in which case this re-try logic won't 
run.  Rather than using this set-and-test logic, maybe we can just wrap the 
call in a try-except block?

@shivaram Did the eventual-consistency issue that you saw result in 
exceptions from `add_tag`?
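The suggested try-except retry might look roughly like this (an illustrative sketch; `add_tag_with_retry`, the retry counts, and the `FlakyInstance` stub are not part of spark_ec2.py — the stub just stands in for a boto EC2 instance):

```python
# Sketch of the try-except retry suggested above. boto's add_tag raises on
# failure (e.g. EC2 eventual consistency reporting "Instance not found")
# rather than failing silently, so set-and-test retry logic never runs;
# catching the exception does.

import time

def add_tag_with_retry(instance, key, value, attempts=5, delay=5, sleep=time.sleep):
    for i in range(attempts):
        try:
            instance.add_tag(key, value)
            return True
        except Exception:
            if i == attempts - 1:
                raise          # give up after the last attempt
            sleep(delay)       # give EC2 time to become consistent

class FlakyInstance:
    """Stand-in for a boto EC2 instance whose first two add_tag calls fail."""
    def __init__(self, failures):
        self.failures = failures
        self.tags = {}

    def add_tag(self, key, value):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("Instance not found")
        self.tags[key] = value

inst = FlakyInstance(failures=2)
ok = add_tag_with_retry(inst, "Name", "mycluster-master-i-123", sleep=lambda _: None)
```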





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52336221
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull)
 for   PR 1689 at commit 
[`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread erikerlandson
Github user erikerlandson commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52336202
  
Latest push updates the RangePartitioner sampling job to be async, and updates 
the async action functions so that they properly enclose the sampling job 
induced by calling `partitions`.





[GitHub] spark pull request: [SPARK-3046] use executor's class loader as th...

2014-08-15 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1972

[SPARK-3046] use executor's class loader as the default serializer 
classloader



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark kryoBug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1972.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1972


commit d879e67de639ab441ba97147a95642df5703de64
Author: Reynold Xin r...@apache.org
Date:   2014-08-15T17:46:22Z

[SPARK-3046] use executor's class loader as the default serializer class 
loader.







[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52336674
  
I actually separated the pull request into two: this one and #1972.

This one is only about the API update for serializer, and #1972 is the bug 
fix.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16305258
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
   // Calculate bin index and label per feature per node.
   val arr = new Array[Double](1 + (numFeatures * numNodes))
   // First element of the array is the label of the instance.
-  arr(0) = labeledPoint.label
+  arr(0) = treePoint.label
   // Iterate over nodes.
   var nodeIndex = 0
   while (nodeIndex < numNodes) {
 val parentFilters = findParentFilters(nodeIndex)
 // Find out whether the sample qualifies for the particular node.
-val sampleValid = isSampleValid(parentFilters, labeledPoint)
+val sampleValid = isSampleValid(parentFilters, treePoint)
 val shift = 1 + numFeatures * nodeIndex
 if (!sampleValid) {
   // Mark one bin as -1 is sufficient.
   arr(shift) = InvalidBinIndex
 } else {
   var featureIndex = 0
+  // TODO: Vectorize this
   while (featureIndex < numFeatures) {
-val featureInfo = 
strategy.categoricalFeaturesInfo.get(featureIndex)
-val isFeatureContinuous = featureInfo.isEmpty
-if (isFeatureContinuous) {
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous, 
false)
-} else {
-  val featureCategories = featureInfo.get
-  val isSpaceSufficientForAllCategoricalSplits
-= numBins > math.pow(2, featureCategories.toInt - 1) - 1
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous,
-isSpaceSufficientForAllCategoricalSplits)
-}
+arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

@chouqin @jkbradley Got it. Thanks.





[GitHub] spark pull request: SPARK-2333 - Allow option to specify/reuse a s...

2014-08-15 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/1899#discussion_r16305293
  
--- Diff: ec2/spark_ec2.py ---
@@ -440,14 +449,22 @@ def launch_cluster(conn, opts, cluster_name):
 print "Launched master in %s, regid = %s" % (zone, master_res.id)
 
 # Give the instances descriptive names
+# TODO: Add retry logic for tagging with name since it's used to 
identify a cluster.
 for master in master_nodes:
-master.add_tag(
-key='Name',
-value='{cn}-master-{iid}'.format(cn=cluster_name, 
iid=master.id))
+name = '{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id)
+for i in range(0, 5):
+master.add_tag(key='Name', value=name)
--- End diff --

Yes - I am pretty sure it throws an exception. I don't remember what the 
type is; all I see in my notes is that the exception says 'Instance not found'.





[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52336859
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18617/consoleFull)
 for   PR 1948 at commit 
[`724f7c8`](https://github.com/apache/spark/commit/724f7c80d5add0596625328b144cb086198ad4aa).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3046] use executor's class loader as th...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1972#issuecomment-52336840
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18616/consoleFull)
 for   PR 1972 at commit 
[`d879e67`](https://github.com/apache/spark/commit/d879e67de639ab441ba97147a95642df5703de64).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52337059
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18612/consoleFull)
 for   PR 1961 at commit 
[`dccdff5`](https://github.com/apache/spark/commit/dccdff55612ef798227ed1e8102489d01e6e7d07).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3046] use executor's class loader as th...

2014-08-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1972#issuecomment-52337105
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52337271
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18617/consoleFull)
 for   PR 1948 at commit 
[`724f7c8`](https://github.com/apache/spark/commit/724f7c80d5add0596625328b144cb086198ad4aa).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52337471
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18619/consoleFull)
 for   PR 1948 at commit 
[`9a4dbcc`](https://github.com/apache/spark/commit/9a4dbccae2eafb24cd5ccbf7c30e94276cce7cc6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3046] use executor's class loader as th...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1972#issuecomment-52337483
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18618/consoleFull) for PR 1972 at commit [`7204c33`](https://github.com/apache/spark/commit/7204c331ad2db3fd160f7d979e6e03fea3d57072).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52337863
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18619/consoleFull) for PR 1948 at commit [`9a4dbcc`](https://github.com/apache/spark/commit/9a4dbccae2eafb24cd5ccbf7c30e94276cce7cc6).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3046] use executor's class loader as th...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1972#issuecomment-52338627
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18620/consoleFull) for PR 1972 at commit [`c1c7bf0`](https://github.com/apache/spark/commit/c1c7bf0ac1c4b6c355f70bbb3e39c047cbcfdf60).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3045] Make Serializer interface Java fr...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1948#issuecomment-52338620
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18621/consoleFull) for PR 1948 at commit [`f1a88ab`](https://github.com/apache/spark/commit/f1a88ab8eaa25389fed9da0a4f6970dd3ddaaeea).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52339006
  
Excellent!  I'll try to find some time to review this soon.





[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1958#issuecomment-52339243
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18622/consoleFull) for PR 1958 at commit [`7b9b649`](https://github.com/apache/spark/commit/7b9b649612bd61dae44f4b0212160b59fca86b73).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3062] [SPARK-2970] [SQL] spark-sql scri...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1970#issuecomment-52339299
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18613/consoleFull) for PR 1970 at commit [`e1262ec`](https://github.com/apache/spark/commit/e1262ecd4d1a21df1e881ed3881f1cee4128dfb4).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: Use user defined $SPARK_HOME in spark-submit i...

2014-08-15 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1969#issuecomment-52342115
  
There was a bunch of prior discussion about this in an old pull request for 
[SPARK-1110](http://issues.apache.org/jira/browse/SPARK-1110) (I'd link to it, 
but it's from the now-deleted `incubator-spark` GitHub repo).

I think we decided that it didn't make sense for workers to inherit 
`SPARK_HOME` from the driver; there were some later patches that removed this 
dependency, if I recall.

@iven Was this pull request motivated by an issue that you saw when 
deploying Spark?  Which version were you using, and on what platform?





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1961#issuecomment-52342051
  
Okay looks good - I'm merging this.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16308033
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging 
{
  *bin index for this labeledPoint
  *(or InvalidBinIndex if labeledPoint is not handled by 
this node)
  */
-def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = {
+def findBinsForLevel(treePoint: TreePoint): Array[Double] = {
   // Calculate bin index and label per feature per node.
   val arr = new Array[Double](1 + (numFeatures * numNodes))
   // First element of the array is the label of the instance.
-  arr(0) = labeledPoint.label
+  arr(0) = treePoint.label
   // Iterate over nodes.
   var nodeIndex = 0
   while (nodeIndex < numNodes) {
 val parentFilters = findParentFilters(nodeIndex)
 // Find out whether the sample qualifies for the particular node.
-val sampleValid = isSampleValid(parentFilters, labeledPoint)
+val sampleValid = isSampleValid(parentFilters, treePoint)
 val shift = 1 + numFeatures * nodeIndex
 if (!sampleValid) {
   // Mark one bin as -1 is sufficient.
   arr(shift) = InvalidBinIndex
 } else {
   var featureIndex = 0
+  // TODO: Vectorize this
   while (featureIndex < numFeatures) {
-val featureInfo = 
strategy.categoricalFeaturesInfo.get(featureIndex)
-val isFeatureContinuous = featureInfo.isEmpty
-if (isFeatureContinuous) {
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous, 
false)
-} else {
-  val featureCategories = featureInfo.get
-  val isSpaceSufficientForAllCategoricalSplits
-= numBins > math.pow(2, featureCategories.toInt - 1) - 1
-  arr(shift + featureIndex)
-= findBin(featureIndex, labeledPoint, isFeatureContinuous,
-isSpaceSufficientForAllCategoricalSplits)
-}
+arr(shift + featureIndex) = treePoint.features(featureIndex)
--- End diff --

@mengxr  I tried Array.copy but ran into issues with the Java code being 
unhappy with types.  I'd prefer to skip this change for now, since this part of 
the code is eliminated when I get rid of this part of the aggregation in a 
subsequent optimization.
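For reference, the flat aggregation array that `findBinsForLevel` builds (the label in slot 0, then `numFeatures` bin slots per node, with `InvalidBinIndex` marking points that do not reach a node) can be sketched as follows. This is a minimal Python sketch; the names and the `node_is_valid` callback are illustrative, not Spark's actual API.

```python
# Sketch of the per-level array layout used by findBinsForLevel:
# arr[0] = label, then numFeatures slots per node holding bin indices.
INVALID_BIN_INDEX = -1

def find_bins_for_level(label, binned_features, num_nodes, node_is_valid):
    """binned_features: one bin index per feature (the TreePoint form).
    node_is_valid(node_index) -> whether this point reaches that node."""
    num_features = len(binned_features)
    arr = [0.0] * (1 + num_features * num_nodes)
    arr[0] = label  # first element is the label of the instance
    for node_index in range(num_nodes):
        shift = 1 + num_features * node_index
        if not node_is_valid(node_index):
            # Marking one bin as invalid is sufficient for this node.
            arr[shift] = INVALID_BIN_INDEX
        else:
            for feature_index in range(num_features):
                arr[shift + feature_index] = binned_features[feature_index]
    return arr
```

For example, a point with label 1.0 and bin indices `[2, 0]` that only reaches node 0 of two nodes yields `[1.0, 2, 0, -1, 0.0]`.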





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16308047
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.model.Bin
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * Internal representation of LabeledPoint for DecisionTree.
+ * This bins feature values based on a subsample of the data as follows:
+ *  (a) Continuous features are binned into ranges.
+ *  (b) Unordered categorical features are binned based on subsets of 
feature values.
+ *  Unordered categorical features are categorical features with low 
arity used in
+ *  multiclass classification.
+ *  (c) Ordered categorical features are binned based on feature values.
+ *  Ordered categorical features are categorical features with high 
arity,
+ *  or any categorical feature used in regression or binary 
classification.
+ *
+ * @param label  Label from LabeledPoint
+ * @param features  Binned feature values.
+ *  Same length as LabeledPoint.features, but values are 
bin indices.
+ */
+private[tree] class TreePoint(val label: Double, val features: Array[Int]) 
extends Serializable {
+}
+
+
+private[tree] object TreePoint {
+
+  /**
+   * Convert an input dataset into its TreePoint representation,
+   * binning feature values in preparation for DecisionTree training.
+   * @param input Input dataset.
+   * @param strategy  DecisionTree training info, used for dataset 
metadata.
+   * @param bins  Bins for features, of size (numFeatures, numBins).
+   * @return  TreePoint dataset representation
+   */
+  def convertToTreeRDD(
+  input: RDD[LabeledPoint],
+  strategy: Strategy,
+  bins: Array[Array[Bin]]): RDD[TreePoint] = {
+input.map { x =>
+  TreePoint.labeledPointToTreePoint(x, 
strategy.isMulticlassClassification, bins,
+strategy.categoricalFeaturesInfo)
+}
+  }
+
+  /**
+   * Convert one LabeledPoint into its TreePoint representation.
+   * @param bins  Bins for features, of size (numFeatures, numBins).
+   * @param categoricalFeaturesInfo  Map over categorical features: 
feature index --> feature arity
+   */
+  private def labeledPointToTreePoint(
+  labeledPoint: LabeledPoint,
+  isMulticlassClassification: Boolean,
+  bins: Array[Array[Bin]],
+  categoricalFeaturesInfo: Map[Int, Int]): TreePoint = {
+
+val numFeatures = labeledPoint.features.size
+val numBins = bins(0).size
+val arr = new Array[Int](numFeatures)
+var featureIndex = 0 // offset by 1 for label
--- End diff --

That was an old comment; I removed it, thanks!
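The TreePoint conversion described in the scaladoc above can be sketched in Python under two simplifying assumptions: continuous features are binned by sorted split thresholds, and ordered categorical features use the category value directly as the bin index (the unordered-categorical subset case is omitted). All names here are illustrative, not Spark's API.

```python
import bisect

def to_tree_point(label, features, continuous_splits, categorical_features):
    """continuous_splits: feature index -> sorted list of split thresholds.
    categorical_features: set of feature indices treated as categorical."""
    binned = []
    for i, value in enumerate(features):
        if i in categorical_features:
            # Ordered categorical: the category value is the bin index.
            binned.append(int(value))
        else:
            # Continuous: bin index = number of thresholds the value exceeds.
            binned.append(bisect.bisect_right(continuous_splits[i], value))
    return (label, binned)
```

For example, with splits `[1.0, 3.0, 5.0]` on a continuous feature 0 and a categorical feature 1, `to_tree_point(0.0, [3.5, 2.0], {0: [1.0, 3.0, 5.0]}, {1})` returns `(0.0, [2, 2])`.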





[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...

2014-08-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1961





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1689#issuecomment-52342401
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for PR 1689 at commit [`09f0637`](https://github.com/apache/spark/commit/09f0637ac5ff986701d76c874b6567313022a0ab).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1632#issuecomment-52343198
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18623/consoleFull) for PR 1632 at commit [`7ed48be`](https://github.com/apache/spark/commit/7ed48be337f469b75a1ba0c85b6817e5beb9f3a6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...

2014-08-15 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/1950#discussion_r16308658
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
@@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) 
extends Serializable with Lo
*/
   def train(input: RDD[LabeledPoint]): DecisionTreeModel = {
 
+val timer = new TimeTracker()
+
+timer.start("total")
+
 // Cache input RDD for speedup during multiple passes.
-val retaggedInput = input.retag(classOf[LabeledPoint]).cache()
+timer.start("init")
+val retaggedInput = input.retag(classOf[LabeledPoint])
 logDebug("algo = " + strategy.algo)
+timer.stop("init")
 
 // Find the splits and the corresponding bins (interval between the splits) using a sample
 // of the input data.
+timer.start("findSplitsBins")
 val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, strategy)
 val numBins = bins(0).length
+timer.stop("findSplitsBins")
 logDebug("numBins = " + numBins)
 
+timer.start("init")
+val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache()
--- End diff --

I'll try testing this.
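The timer usage in the diff above (note that "init" is started and stopped twice, so a label's timings must accumulate) suggests a pattern like the following minimal Python sketch. This is an assumption about the behavior implied by the diff, not Spark's actual TimeTracker implementation.

```python
import time
from collections import defaultdict

class TimeTracker:
    """Named timers whose elapsed times accumulate across start/stop pairs."""

    def __init__(self):
        self.totals = defaultdict(float)  # label -> accumulated seconds
        self.starts = {}                  # label -> start timestamp

    def start(self, name):
        self.starts[name] = time.perf_counter()

    def stop(self, name):
        # Add this interval to the running total for the label.
        self.totals[name] += time.perf_counter() - self.starts.pop(name)
```

With this shape, calling `start("init")`/`stop("init")` twice, as `train()` does in the diff, reports the combined time of both phases under one label.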




