[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1632#issuecomment-52279573 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18600/consoleFull) for PR 1632 at commit [`66cfff7`](https://github.com/apache/spark/commit/66cfff765f44aea7ff2bf5afe1a29403e79b7951). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2677] BasicBlockFetchIterator#next can ...
Github user sarutak commented on a diff in the pull request: https://github.com/apache/spark/pull/1632#discussion_r16282025 --- Diff: core/src/main/scala/org/apache/spark/network/ConnectionManager.scala --- @@ -836,9 +845,14 @@ private[spark] class ConnectionManager( def sendMessageReliably(connectionManagerId: ConnectionManagerId, message: Message) : Future[Message] = { val promise = Promise[Message]() + +val ackTimeoutMonitor = new Timer(s"Ack Timeout Monitor-" + --- End diff -- I've modified for ConnectionManager-wide timer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3063][SQL] ExistingRdd should convert M...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/1963 [SPARK-3063][SQL] ExistingRdd should convert Map to catalyst Map. Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` value. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ueshin/apache-spark issues/SPARK-3063 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1963.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1963 commit d8a900ad1408ea3662d2e43b2df24db485cd28e5 Author: Takuya UESHIN Date: 2014-08-15T06:48:52Z Make ExistingRdd.convertToCatalyst be able to convert Map value. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2970] [SQL] spark-sql script ends with ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1891#issuecomment-52278697 Sorry, didn't realize that `ShutdownHookManager` is only available in Hadoop 2. Compilation fails when building Spark with Hadoop 1. Filed [SPARK-3062](https://issues.apache.org/jira/browse/SPARK-3062) to track this issue. @marmbrus Maybe we should revert this PR first for safe. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...
Github user GrahamDennis commented on the pull request: https://github.com/apache/spark/pull/1948#issuecomment-52278145 I've convinced myself that this PR fixes SPARK-2878, and think it should be merged. Thanks @rxin! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [mllib] FindBinsForLevel in decis...
Github user chouqin commented on the pull request: https://github.com/apache/spark/pull/1941#issuecomment-52277965 @mengxr @jkbradley never mind, I will help you review @1950 :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...
Github user harishreedharan commented on a diff in the pull request: https://github.com/apache/spark/pull/1958#discussion_r16281516 --- Diff: external/flume-sink/src/test/scala/org/apache/spark/streaming/flume/sink/SparkSinkSuite.scala --- @@ -0,0 +1,208 @@ +package org.apache.spark.streaming.flume.sink + +import java.net.InetSocketAddress +import java.util.concurrent.atomic.AtomicInteger +import java.util.concurrent.{CountDownLatch, Executors} + +import scala.collection.JavaConversions._ +import scala.concurrent.{Promise, Future} +import scala.util.{Failure, Success, Try} + +import com.google.common.util.concurrent.ThreadFactoryBuilder +import org.apache.avro.ipc.NettyTransceiver +import org.apache.avro.ipc.specific.SpecificRequestor +import org.apache.flume.Context +import org.apache.flume.channel.MemoryChannel +import org.apache.flume.event.EventBuilder +import org.apache.spark.streaming.TestSuiteBase +import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory + + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more --- End diff -- Thanks! Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1958#issuecomment-5222 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18599/consoleFull) for PR 1958 at commit [`f2c56c9`](https://github.com/apache/spark/commit/f2c56c976bc6faa83b8357c80caad1f4839eb06d). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16281414 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -728,8 +659,10 @@ object DecisionTree extends Serializable with Logging { arr } - // Find feature bins for all nodes at a level. +// Find feature bins for all nodes at a level. +timer.start("findBinsForLevel") val binMappedRDD = input.map(x => findBinsForLevel(x)) +timer.stop("findBinsForLevel") --- End diff -- this timer may also be useless --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16281396 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo */ def train(input: RDD[LabeledPoint]): DecisionTreeModel = { +val timer = new TimeTracker() + +timer.start("total") + // Cache input RDD for speedup during multiple passes. -val retaggedInput = input.retag(classOf[LabeledPoint]).cache() +timer.start("init") +val retaggedInput = input.retag(classOf[LabeledPoint]) logDebug("algo = " + strategy.algo) +timer.stop("init") // Find the splits and the corresponding bins (interval between the splits) using a sample // of the input data. +timer.start("findSplitsBins") val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, strategy) val numBins = bins(0).length +timer.stop("findSplitsBins") logDebug("numBins = " + numBins) +timer.start("init") +val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache() +timer.stop("init") --- End diff -- I think the timer for `convertToTreeRDD` may not be useful, since the `map` from `retaggedInput` to `treeInput` is evaluated lazily. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user chouqin commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16281333 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging { *bin index for this labeledPoint *(or InvalidBinIndex if labeledPoint is not handled by this node) */ -def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = { +def findBinsForLevel(treePoint: TreePoint): Array[Double] = { // Calculate bin index and label per feature per node. val arr = new Array[Double](1 + (numFeatures * numNodes)) // First element of the array is the label of the instance. - arr(0) = labeledPoint.label + arr(0) = treePoint.label // Iterate over nodes. var nodeIndex = 0 while (nodeIndex < numNodes) { val parentFilters = findParentFilters(nodeIndex) // Find out whether the sample qualifies for the particular node. -val sampleValid = isSampleValid(parentFilters, labeledPoint) +val sampleValid = isSampleValid(parentFilters, treePoint) val shift = 1 + numFeatures * nodeIndex if (!sampleValid) { // Mark one bin as -1 is sufficient. arr(shift) = InvalidBinIndex } else { var featureIndex = 0 + // TODO: Vectorize this while (featureIndex < numFeatures) { -val featureInfo = strategy.categoricalFeaturesInfo.get(featureIndex) -val isFeatureContinuous = featureInfo.isEmpty -if (isFeatureContinuous) { - arr(shift + featureIndex) -= findBin(featureIndex, labeledPoint, isFeatureContinuous, false) -} else { - val featureCategories = featureInfo.get - val isSpaceSufficientForAllCategoricalSplits -= numBins > math.pow(2, featureCategories.toInt - 1) - 1 - arr(shift + featureIndex) -= findBin(featureIndex, labeledPoint, isFeatureContinuous, -isSpaceSufficientForAllCategoricalSplits) -} +arr(shift + featureIndex) = treePoint.features(featureIndex) --- End diff -- Since the features array is the same of all nodes on which this labeledpoint is valid, is it really necessary for every node to have a copy of it? In #1941 , I have changed the `arr` structure, does this memory saving help? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52277037 I had add Broadcast.unpersist(blocking=False). Because we have an copy in disks, so read it from there when user want to access it driver, then we can keep the SparkContext.broadcast() unchanged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1845#issuecomment-52276624 Hey @andrewor14 it's great to see this being fixed. I think things can be simplified a lot by deferring to the `eval` from bash to deal with the argumnet parsing. It's worth seeing if that handles all the cases here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52276573 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18598/consoleFull) for PR 1912 at commit [`e06df4a`](https://github.com/apache/spark/commit/e06df4a8c211f53a5b7d176c6ec655033f1419ee). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1845#discussion_r16281080 --- Diff: bin/utils.sh --- @@ -0,0 +1,108 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# * * +# | Utility functions for launching Spark applications | +# * * + +# Parse the value of a config from a java properties file according to the specifications in +# http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#load(java.io.Reader). +# This accepts the name of the config as an argument, and expects the path of the property +# file to be found in PROPERTIES_FILE. The value is returned through JAVA_PROPERTY_VALUE. +parse_java_property() { + JAVA_PROPERTY_VALUE="" # return value + config_buffer=""# buffer for collecting parts of a config value + multi_line=0# whether this config is spanning multiple lines + while read -r line; do +# Strip leading and trailing whitespace +line=$(echo "$line" | sed "s/^[[:space:]]\(.*\)[[:space:]]*$/\1/") +contains_config=$(echo "$line" | grep -e "^$1") +if [[ -n "$contains_config" || "$multi_line" == 1 ]]; then + has_more_lines=$(echo "$line" | grep -e "$") + if [[ -n "$has_more_lines" ]]; then +# Strip trailing backslash +line=$(echo "$line" | sed "s/$//") +config_buffer="$config_buffer $line" +multi_line=1 + else +JAVA_PROPERTY_VALUE="$config_buffer $line" +break + fi +fi + done < "$PROPERTIES_FILE" + + # Actually extract the value of the config + JAVA_PROPERTY_VALUE=$( \ +echo "$JAVA_PROPERTY_VALUE" | \ +sed "s/$1//" | \ +sed "s/^[[:space:]]*[:=]\{0,1\}//" | \ +sed "s/^[[:space:]]*\(.*\)[[:space:]]*$/\1/g" \ + ) + export JAVA_PROPERTY_VALUE +} + +# Properly split java options, dealing with whitespace, double quotes and backslashes. +# This accepts a string and returns the resulting list through SPLIT_JAVA_OPTS. +split_java_options() { + SPLIT_JAVA_OPTS=() # return value --- End diff -- I think this could be solved using an `eval` statement much more easily. At first I though there are some security concerns, but I don't think this allows people to do anything they couldn't already do: ``` test="One \"This \\\"is two\" Three" eval "set -- $test" for i in "$@" do echo $i done ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1845#discussion_r16280819 --- Diff: bin/spark-class --- @@ -101,11 +106,16 @@ fi # Set JAVA_OPTS to be able to load native libraries and to set heap size JAVA_OPTS="-XX:MaxPermSize=128m $OUR_JAVA_OPTS" JAVA_OPTS="$JAVA_OPTS -Xms$OUR_JAVA_MEM -Xmx$OUR_JAVA_MEM" + # Load extra JAVA_OPTS from conf/java-opts, if it exists if [ -e "$FWDIR/conf/java-opts" ] ; then JAVA_OPTS="$JAVA_OPTS `cat $FWDIR/conf/java-opts`" fi -export JAVA_OPTS + +# Split JAVA_OPTS properly to handle whitespace, double quotes and backslashes +# This exports the split java options into SPLIT_JAVA_OPTS +split_java_options "$JAVA_OPTS" --- End diff -- This actually solves a more general problem than those reported in SPARK-2849 and SPARK-2914... the problem/feature is that in general we don't support quotes in any of the java option strings we have. I tested this in master and confirmed it doesn't work: ``` $ export SPARK_JAVA_OPTS="-Dfoo=\"bar baz\"" $ ./bin/spark-shell Spark assembly has been built with Hive, including Datanucleus jars on classpath Error: Could not find or load main class baz" ``` So it might be good to create another JIRA as well for this PR. One that just explains that none of the JAVA_OPTS variants we have correctly support quoted strings. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52275457 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull) for PR 1960 at commit [`3debe7c`](https://github.com/apache/spark/commit/3debe7c246b58345d0495b52f70bdd0be1b4f5e3). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52275324 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3040] pick up a more proper local ip ad...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1946#issuecomment-52275310 I have seen cases where a bad interface was chose, so this does seem like a good idea. But for windows, does this mean that the wrong interface is chosen? Since we support windows, should we add a check here to confirm we are on a unix-like system? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...
Github user ngbinh closed the pull request at: https://github.com/apache/spark/pull/1302 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...
Github user ngbinh commented on the pull request: https://github.com/apache/spark/pull/1302#issuecomment-52275260 thanks. closing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Add caching information to rdd.toDebugString
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1535 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2367: update io.netty from 4.0.17 to 4.0...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1302#issuecomment-52275218 This was fixed in 3a8b68b7 so I think we can close this issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Add caching information to rdd.toDebugString
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1535#issuecomment-52275155 Hey this looks good. Merging it now into mater. Sorry about the delay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2849 / 2914] Handle certain Spark confi...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1845#discussion_r16280454 --- Diff: bin/utils.sh --- @@ -0,0 +1,108 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# * * +# | Utility functions for launching Spark applications | +# * * + +# Parse the value of a config from a java properties file according to the specifications in +# http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html#load(java.io.Reader). +# This accepts the name of the config as an argument, and expects the path of the property +# file to be found in PROPERTIES_FILE. The value is returned through JAVA_PROPERTY_VALUE. +parse_java_property() { + JAVA_PROPERTY_VALUE="" # return value + config_buffer=""# buffer for collecting parts of a config value + multi_line=0# whether this config is spanning multiple lines + while read -r line; do +# Strip leading and trailing whitespace +line=$(echo "$line" | sed "s/^[[:space:]]\(.*\)[[:space:]]*$/\1/") +contains_config=$(echo "$line" | grep -e "^$1") +if [[ -n "$contains_config" || "$multi_line" == 1 ]]; then + has_more_lines=$(echo "$line" | grep -e "$") + if [[ -n "$has_more_lines" ]]; then --- End diff -- We discussed this already today, but I'd prefer not to deal with multiline statements at all here to keep this simpler. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2955 [BUILD] Test code fails to compile ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1879 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2917] [SQL] Avoid table creation in log...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1846#issuecomment-52274987 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18596/consoleFull) for PR 1846 at commit [`5be9554`](https://github.com/apache/spark/commit/5be9554254312a78a94dd023494478a7db5d3243). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2955 [BUILD] Test code fails to compile ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1879#issuecomment-52274970 Ah, seems like a good change. Thanks Sean, I merged this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2912] [Spark QA] Include commit hash in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1816 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3054][STREAMING] Add unit tests for Spa...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/1958#discussion_r16280387 --- Diff: external/flume-sink/src/test/scala/org/apache/spark/streaming/flume/sink/SparkSinkSuite.scala --- @@ -0,0 +1,208 @@ +package org.apache.spark.streaming.flume.sink + +import java.net.InetSocketAddress +import java.util.concurrent.atomic.AtomicInteger +import java.util.concurrent.{CountDownLatch, Executors} + +import scala.collection.JavaConversions._ +import scala.concurrent.{Promise, Future} +import scala.util.{Failure, Success, Try} + +import com.google.common.util.concurrent.ThreadFactoryBuilder +import org.apache.avro.ipc.NettyTransceiver +import org.apache.avro.ipc.specific.SpecificRequestor +import org.apache.flume.Context +import org.apache.flume.channel.MemoryChannel +import org.apache.flume.event.EventBuilder +import org.apache.spark.streaming.TestSuiteBase +import org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory + + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more --- End diff -- Hey, Hari, ASF header should be at the top of file :). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2917] [SQL] Avoid table creation in log...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1846#issuecomment-52274734 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...
Github user kanzhang commented on the pull request: https://github.com/apache/spark/pull/1928#issuecomment-52274671 @davies that's true. One thing I like about Python list is that it gets mapped to Java List (symmetry). However, it's not array any more. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1928#issuecomment-52274398 @kanzhang I think either way is fine, tuple is like read-only list. RDD is read only, so there is no difference between having list or tuple in RDD. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2918] [SQL] [WIP] Support the extended ...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/1847#issuecomment-52274317 @marmbrus I've create another PR #1962 which only provide the `extended` support (but doesn't support the `explain CTAS`), hope we can merge that first. I am not sure if we still able to merge the `explain CTAS` in spark 1.1 release, as there is another dependency failed in unit test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3013] [SQL] [PySpark] convert array int...
Github user kanzhang commented on the pull request: https://github.com/apache/spark/pull/1928#issuecomment-52274255 Now that I think about it, our ```WritableToJavaConverter``` should probably convert array to java collection (which gets pickled to Python list), instead of ```Object[]``` (which gets picked to Python tuple), as it is now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52274276 QA tests have started for PR 1961. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18595/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3058] [SQL] Support EXTENDED for EXPLAI...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1962#issuecomment-52274263 QA tests have started for PR 1962. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18594/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52274194 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3058] [SQL] Support EXTENDED for EXPLAI...
GitHub user chenghao-intel opened a pull request: https://github.com/apache/spark/pull/1962 [SPARK-3058] [SQL] Support EXTENDED for EXPLAIN Provide `extended` keyword support for `explain` command in SQL. e.g. ``` explain extended select key as a1, value as a2 from src where key=1; == Parsed Logical Plan == Project ['key AS a1#3,'value AS a2#4] Filter ('key = 1) UnresolvedRelation None, src, None == Analyzed Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType)) MetastoreRelation default, src, None == Optimized Logical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) MetastoreRelation default, src, None == Physical Plan == Project [key#8 AS a1#3,value#9 AS a2#4] Filter (CAST(key#8, DoubleType) = 1.0) HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None Code Generation: false == RDD == (2) MappedRDD[14] at map at HiveContext.scala:350 MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42 MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57 MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112 MappedRDD[10] at map at TableReader.scala:240 HadoopRDD[9] at HadoopRDD at TableReader.scala:230 ``` It's the sub task of #1847. But can go without any dependency. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chenghao-intel/spark explain_extended Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1962.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1962 commit 48bc989e3740c5700f29aef153b0c4bb279c52fc Author: Cheng Hao Date: 2014-08-08T02:45:02Z Support EXTENDED for EXPLAIN --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52274079 Ooops, fixed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3048][MLLIB] add LabeledPoint.parse and...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1952#issuecomment-52273898 QA tests have started for PR 1952. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18593/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52273858 It occurs to me: what if we had .value retrieve and depickle the value from the JVM? Also, won't we still experience memory leaks in the JVM if we iteratively create broadcast variables, since we will never clean up those pickled values? One approach is to have .value() depickle the JVM value (so we're not changing the user-facing API) and add a Python equivalent of Broadcast.destroy() for performing permanent cleanup of a broadcast's resources. What do you think of this approach? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52273690 Oops compilation failed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1912#discussion_r16279926 --- Diff: python/pyspark/broadcast.py --- @@ -52,17 +47,31 @@ class Broadcast(object): Access its value through C{.value}. """ -def __init__(self, bid, value, java_broadcast=None, pickle_registry=None): +def __init__(self, bid, value, java_broadcast=None, pickle_registry=None, keep=True): """ Should not be called directly by users -- use L{SparkContext.broadcast()} instead. """ -self.value = value self.bid = bid +if keep: +self.value = value self._jbroadcast = java_broadcast self._pickle_registry = pickle_registry +self.keep = keep def __reduce__(self): self._pickle_registry.add(self) return (_from_id, (self.bid, )) + +def __getattr__(self, item): +if item == 'value' and not self.keep: +raise Exception("please create broadcast with keep=True to make" +" it accessable in driver") --- End diff -- Typo: should be spelled "accessible." Also, maybe the error message could be a little clearer about how broadcast variables are created and why call failed. I'm thinking of something like "please call sc.broadcast() with keep=True to make values accessible in the driver". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52273493 QA results for PR 1961:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18591/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52273513 QA tests have started for PR 1960. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52273441 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52273425 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1961#issuecomment-52273376 QA tests have started for PR 1961. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18591/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279842 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala --- @@ -41,6 +43,9 @@ class DecisionTreeSuite extends FunSuite with LocalSparkContext { prediction != expected.label } val accuracy = (input.length - numOffPredictions).toDouble / input.length +if (accuracy < requiredAccuracy) { + println(s"validateClassifier calculated accuracy $accuracy but required $requiredAccuracy.") --- End diff -- put the error message as the second argument of `assert` below --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279841 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala --- @@ -17,6 +17,8 @@ package org.apache.spark.mllib.tree +import org.apache.spark.mllib.tree.impl.TreePoint --- End diff -- organize imports --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279835 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label +while (featureIndex < numFeatures) { + val featureInfo = categoricalFeaturesInfo.get(featureIndex) + val isFeatureContinuous = featureInfo.isEmpty + if (isFeatureContinuous) { +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, false, + bins, categoricalFeaturesInfo) + } else { +val featureCategories = featureInfo.get +val isSpaceSufficientForAllCategoricalSplits + = numBins > math.pow(2, featureCategories.toInt - 1) - 1 +val isUnorderedFeature = + isMulticlassClassification && isSpaceSufficientForAllCategoricalSplits +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, + isUnorderedFeature, bins, categoricalFeaturesInfo) + } + featureIndex += 1 +} + +new TreePoint(labeledPoint.label, arr) + } + + + /** + * Find bin for one (labeledPoint, feature). + * + * @param isUnordere
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279832 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label --- End diff -- Where does `offset by 1` happen? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279839 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label +while (featureIndex < numFeatures) { + val featureInfo = categoricalFeaturesInfo.get(featureIndex) + val isFeatureContinuous = featureInfo.isEmpty + if (isFeatureContinuous) { +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, false, + bins, categoricalFeaturesInfo) + } else { +val featureCategories = featureInfo.get +val isSpaceSufficientForAllCategoricalSplits + = numBins > math.pow(2, featureCategories.toInt - 1) - 1 +val isUnorderedFeature = + isMulticlassClassification && isSpaceSufficientForAllCategoricalSplits +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, + isUnorderedFeature, bins, categoricalFeaturesInfo) + } + featureIndex += 1 +} + +new TreePoint(labeledPoint.label, arr) + } + + + /** + * Find bin for one (labeledPoint, feature). + * + * @param isUnordere
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279828 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { --- End diff -- another candidate is `binIndices` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279831 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279840 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label +while (featureIndex < numFeatures) { + val featureInfo = categoricalFeaturesInfo.get(featureIndex) + val isFeatureContinuous = featureInfo.isEmpty + if (isFeatureContinuous) { +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, false, + bins, categoricalFeaturesInfo) + } else { +val featureCategories = featureInfo.get +val isSpaceSufficientForAllCategoricalSplits + = numBins > math.pow(2, featureCategories.toInt - 1) - 1 +val isUnorderedFeature = + isMulticlassClassification && isSpaceSufficientForAllCategoricalSplits +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, + isUnorderedFeature, bins, categoricalFeaturesInfo) + } + featureIndex += 1 +} + +new TreePoint(labeledPoint.label, arr) + } + + + /** + * Find bin for one (labeledPoint, feature). + * + * @param isUnordere
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279837 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label +while (featureIndex < numFeatures) { + val featureInfo = categoricalFeaturesInfo.get(featureIndex) + val isFeatureContinuous = featureInfo.isEmpty + if (isFeatureContinuous) { +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, false, + bins, categoricalFeaturesInfo) + } else { +val featureCategories = featureInfo.get +val isSpaceSufficientForAllCategoricalSplits + = numBins > math.pow(2, featureCategories.toInt - 1) - 1 +val isUnorderedFeature = + isMulticlassClassification && isSpaceSufficientForAllCategoricalSplits +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, + isUnorderedFeature, bins, categoricalFeaturesInfo) + } + featureIndex += 1 +} + +new TreePoint(labeledPoint.label, arr) + } + + + /** + * Find bin for one (labeledPoint, feature). + * + * @param isUnordere
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279834 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { +} + + +private[tree] object TreePoint { + + /** + * Convert an input dataset into its TreePoint representation, + * binning feature values in preparation for DecisionTree training. + * @param input Input dataset. + * @param strategy DecisionTree training info, used for dataset metadata. + * @param bins Bins for features, of size (numFeatures, numBins). + * @return TreePoint dataset representation + */ + def convertToTreeRDD( + input: RDD[LabeledPoint], + strategy: Strategy, + bins: Array[Array[Bin]]): RDD[TreePoint] = { +input.map { x => + TreePoint.labeledPointToTreePoint(x, strategy.isMulticlassClassification, bins, +strategy.categoricalFeaturesInfo) +} + } + + /** + * Convert one LabeledPoint into its TreePoint representation. + * @param bins Bins for features, of size (numFeatures, numBins). + * @param categoricalFeaturesInfo Map over categorical features: feature index --> feature arity + */ + private def labeledPointToTreePoint( + labeledPoint: LabeledPoint, + isMulticlassClassification: Boolean, + bins: Array[Array[Bin]], + categoricalFeaturesInfo: Map[Int, Int]): TreePoint = { + +val numFeatures = labeledPoint.features.size +val numBins = bins(0).size +val arr = new Array[Int](numFeatures) +var featureIndex = 0 // offset by 1 for label +while (featureIndex < numFeatures) { + val featureInfo = categoricalFeaturesInfo.get(featureIndex) + val isFeatureContinuous = featureInfo.isEmpty + if (isFeatureContinuous) { +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, false, + bins, categoricalFeaturesInfo) + } else { +val featureCategories = featureInfo.get +val isSpaceSufficientForAllCategoricalSplits + = numBins > math.pow(2, featureCategories.toInt - 1) - 1 +val isUnorderedFeature = + isMulticlassClassification && isSpaceSufficientForAllCategoricalSplits +arr(featureIndex) = findBin(featureIndex, labeledPoint, isFeatureContinuous, + isUnorderedFeature, bins, categoricalFeaturesInfo) + } + featureIndex += 1 +} + +new TreePoint(labeledPoint.label, arr) + } + + --- End diff -- remove extra empty line --- If your project is set up for it, you can
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279819 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1281,7 +1226,8 @@ object DecisionTree extends Serializable with Logging { val gains = Array.ofDim[InformationGainStats](numFeatures, numBins - 1) for (featureIndex <- 0 until numFeatures) { -for (splitIndex <- 0 until numBins - 1) { +val numSplitsForFeature = getNumSplitsForFeature(featureIndex) +for (splitIndex <- 0 until numSplitsForFeature) { --- End diff -- ditto: use `while` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279816 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -1281,7 +1226,8 @@ object DecisionTree extends Serializable with Logging { val gains = Array.ofDim[InformationGainStats](numFeatures, numBins - 1) for (featureIndex <- 0 until numFeatures) { --- End diff -- use `while` instead of `for` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279806 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -550,25 +579,35 @@ object DecisionTree extends Serializable with Logging { // Apply each filter and check sample validity. Return false when invalid condition found. for (filter <- parentFilters) { --- End diff -- Please update this line to `parentFilter.foreach { filter =>` which is faster than `for`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279814 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -973,10 +907,13 @@ object DecisionTree extends Serializable with Logging { combinedAggregate } + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279801 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -144,6 +165,11 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo // Build the full tree using the node info calculated in the level-wise best split calculations. topNode.build(nodes) +timer.stop("total") + +logDebug("Internal timing for DecisionTree:") --- End diff -- `logDebug` -> `logInfo`? There are not a lot of messages here and the running times are useful. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279805 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -200,6 +226,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279800 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -112,16 +126,23 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo logDebug("level = " + level) logDebug("#") + --- End diff -- remove extra empty line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279812 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -689,37 +631,26 @@ object DecisionTree extends Serializable with Logging { *bin index for this labeledPoint *(or InvalidBinIndex if labeledPoint is not handled by this node) */ -def findBinsForLevel(labeledPoint: LabeledPoint): Array[Double] = { +def findBinsForLevel(treePoint: TreePoint): Array[Double] = { // Calculate bin index and label per feature per node. val arr = new Array[Double](1 + (numFeatures * numNodes)) // First element of the array is the label of the instance. - arr(0) = labeledPoint.label + arr(0) = treePoint.label // Iterate over nodes. var nodeIndex = 0 while (nodeIndex < numNodes) { val parentFilters = findParentFilters(nodeIndex) // Find out whether the sample qualifies for the particular node. -val sampleValid = isSampleValid(parentFilters, labeledPoint) +val sampleValid = isSampleValid(parentFilters, treePoint) val shift = 1 + numFeatures * nodeIndex if (!sampleValid) { // Mark one bin as -1 is sufficient. arr(shift) = InvalidBinIndex } else { var featureIndex = 0 + // TODO: Vectorize this while (featureIndex < numFeatures) { -val featureInfo = strategy.categoricalFeaturesInfo.get(featureIndex) -val isFeatureContinuous = featureInfo.isEmpty -if (isFeatureContinuous) { - arr(shift + featureIndex) -= findBin(featureIndex, labeledPoint, isFeatureContinuous, false) -} else { - val featureCategories = featureInfo.get - val isSpaceSufficientForAllCategoricalSplits -= numBins > math.pow(2, featureCategories.toInt - 1) - 1 - arr(shift + featureIndex) -= findBin(featureIndex, labeledPoint, isFeatureContinuous, -isSpaceSufficientForAllCategoricalSplits) -} +arr(shift + featureIndex) = treePoint.features(featureIndex) --- End diff -- use `Array.copy(treePoint.features, 0, arr, shift, numFeatures)` to replace the while loop? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16279798 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala --- @@ -53,16 +55,28 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo */ def train(input: RDD[LabeledPoint]): DecisionTreeModel = { +val timer = new TimeTracker() + +timer.start("total") + // Cache input RDD for speedup during multiple passes. -val retaggedInput = input.retag(classOf[LabeledPoint]).cache() +timer.start("init") +val retaggedInput = input.retag(classOf[LabeledPoint]) logDebug("algo = " + strategy.algo) +timer.stop("init") // Find the splits and the corresponding bins (interval between the splits) using a sample // of the input data. +timer.start("findSplitsBins") val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, strategy) val numBins = bins(0).length +timer.stop("findSplitsBins") logDebug("numBins = " + numBins) +timer.start("init") +val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache() --- End diff -- `.cache()` -> `.persist(StorageLevel.MEMORY_AND_DISK)`? There is a computation/storage trade-off here, maybe worth testing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3028. sparkEventToJson should support Sp...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/1961 SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetrics... ...Update You can merge this pull request into a Git repository by running: $ git pull https://github.com/sryza/spark sandy-spark-3028 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1961.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1961 commit f883dedf2910380796bc2c148d3bc28f562b967a Author: Sandy Ryza Date: 2014-08-15T04:16:07Z SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetricsUpdate --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3035] Wrong example with SparkContext.a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1942#issuecomment-52272419 QA tests have started for PR 1942. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18590/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3035] Wrong example with SparkContext.a...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1942#issuecomment-52272341 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3025 [SQL]: Allow JDBC clients to set a ...
Github user tianyi commented on the pull request: https://github.com/apache/spark/pull/1937#issuecomment-52272355 hi @pwendell , I have two questions: 1 Why don't handle this set command in "SetCommand" ? 2 Do we have to save this pool name in LocalProperty? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/654#issuecomment-52271621 QA tests have started for PR 654. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18589/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/654#issuecomment-52271303 Yes, that's my main concern with putting that logic in the logger. Also, doing that means that another listener that wants to use a timestamp would have to do the same hack; so if using that approach, it would be better to change the SparkListener interface to take a timestamp parameter aside from the events. I like having the timestamp with the event better because of that; it's ultimately a property of the event, and placing it somewhere else means you need to handle two pieces of information everywhere you need them, instead of everything being encapsulated in the event itself. I agree all events should have a timestamp, but I don't have a solution for easily doing that (well, aside from investigating a saner approach to generating JSON than manually writing the conversion code like it's currently done). Perhaps it would be better soon to just switch to using something like Jackson and keep the current code just for backwards compatibility with older event logs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-52270667 QA results for PR 731:- This patch PASSES unit tests.For more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18587/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52270393 LGTM (pending tests) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1931#issuecomment-52269810 Yeah, sounds good. I guess we'll use a `ReferenceQueueWithSize` or something instead --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/654#issuecomment-52269494 @vanzin I think it would be useful to time stamp all events, either in this patch or in the future. Is the reason why we have to add the timestamp field to the event case classes themselves the following: we can't use the time when the logger actually logs the event because it's technically inaccurate? E.g. the block manager might have been added ages ago, and by the time we actually process the event many milliseconds may have already passed. This isn't a huge problem if the event queues are small, but we have seen them go up to O(100) even after #1679. I guess this means we need to do the ugly thing of adding a "timestamp" field to each of these case classes. Do you see a better alternative? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2492][Streaming] kafkaReceiver minor ch...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/1420#issuecomment-52269406 Hi @tdas , would you mind taking a look at this? thanks a lot. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/1950#issuecomment-52269207 :+1: LGTM! I have some minor comments. I look forward to the performance improvements based upon these changes. The multiclass bug fix is a great catch. Finally, I see the timer functionality being added soon to other parts of Spark. :-) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2845] Add timestamps to block manager e...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/654#discussion_r16278258 --- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala --- @@ -448,12 +450,14 @@ private[spark] object JsonProtocol { def blockManagerAddedFromJson(json: JValue): SparkListenerBlockManagerAdded = { val blockManagerId = blockManagerIdFromJson(json \ "Block Manager ID") val maxMem = (json \ "Maximum Memory").extract[Long] -SparkListenerBlockManagerAdded(blockManagerId, maxMem) +val time = (json \ "Timestamp").extract[Option[Long]].getOrElse(-1L) --- End diff -- `Utils.jsonOption` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16278221 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala --- @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.tree.configuration.Strategy +import org.apache.spark.mllib.tree.model.Bin +import org.apache.spark.rdd.RDD + + +/** + * Internal representation of LabeledPoint for DecisionTree. + * This bins feature values based on a subsampled of data as follows: + * (a) Continuous features are binned into ranges. + * (b) Unordered categorical features are binned based on subsets of feature values. + * "Unordered categorical features" are categorical features with low arity used in + * multiclass classification. + * (c) Ordered categorical features are binned based on feature values. + * "Ordered categorical features" are categorical features with high arity, + * or any categorical feature used in regression or binary classification. + * + * @param label Label from LabeledPoint + * @param features Binned feature values. + * Same length as LabeledPoint.features, but values are bin indices. + */ +private[tree] class TreePoint(val label: Double, val features: Array[Int]) extends Serializable { --- End diff -- Minor: Could we use ```binnedFeatures``` instead of ```features```? Discussion: I guess we are making ```Int`` as the default choice for bins. It's great since we don't have to do additional book-keeping if we used multiple of ```Byte```s. May be an optimization we should consider for the future. I think it's a good tradeoff for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3022] [SPARK-3041] [mllib] Call findBin...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/1950#discussion_r16278126 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TimeTracker.scala --- @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impl + +import scala.collection.mutable.{HashMap => MutableHashMap} + +import org.apache.spark.annotation.Experimental + +/** + * Time tracker implementation which holds labeled timers. + */ +@Experimental +private[tree] --- End diff -- Very minor: not sure about the style but should private[tree] be on the same line? Feel free to ignore. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1960#issuecomment-52268330 QA tests have started for PR 1960. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Remove netty-test-file.txt.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1960 Remove netty-test-file.txt. Instead, use randomly generated temporary data. cc @shivaram You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark netty Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1960.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1960 commit 3debe7c246b58345d0495b52f70bdd0be1b4f5e3 Author: Reynold Xin Date: 2014-08-15T02:19:33Z Remove netty-test-file.txt. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2759][CORE] Generic Binary File Support...
Github user loveconan1988 commented on the pull request: https://github.com/apache/spark/pull/1658#issuecomment-52267833 it should can be solve ã -- åå§é®ä»¶ -- å件人: "Matei Zaharia";; åéæ¶é´: 2014å¹´8æ15æ¥(ææäº) ä¸å9:55 æ¶ä»¶äºº: "apache/spark"; 主é¢: Re: [spark] [SPARK-2759][CORE] Generic Binary File Support in Spark(#1658) @kmader to allow streams to be shuffle-able, how about the following? We create a class called BinaryData with a method open() that returns an InputStream. The BinaryData object has information about an InputSplit inside and can thus be shuffled across the network, cached, etc. And whenever users want to read it, they can get a stream. Would this solve the problem? â Reply to this email directly or view it on GitHub. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3052. Misleading and spurious FileSystem...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/1956#issuecomment-52267702 > Ah and the order they should be shut down in is RecordReader then FileSystem? Right --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...
Github user loveconan1988 commented on the pull request: https://github.com/apache/spark/pull/1907#issuecomment-52267441 I have know it,thanks. -- åå§é®ä»¶ -- å件人: "asfgit";; åéæ¶é´: 2014å¹´8æ15æ¥(ææäº) ä¸å10:03 æ¶ä»¶äºº: "apache/spark"; 主é¢: Re: [spark] [SPARK-2468] Netty based block server / client module(#1907) Closed #1907 via 3a8b68b. â Reply to this email directly or view it on GitHub. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2736] PySpark converter and example scr...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1916 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...
Github user loveconan1988 commented on the pull request: https://github.com/apache/spark/pull/1907#issuecomment-52267389 I have know it,thanks. -- åå§é®ä»¶ -- å件人: "Reynold Xin";; åéæ¶é´: 2014å¹´8æ15æ¥(ææäº) ä¸å10:01 æ¶ä»¶äºº: "apache/spark"; 主é¢: Re: [spark] [SPARK-2468] Netty based block server / client module(#1907) Thanks for looking at this. Merging it in master & branch-1.1. This is off by default so the danger is very small. We will continue improving it and with changes like SPARK-3019 it will make it easier to use. â Reply to this email directly or view it on GitHub. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1907 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1907#issuecomment-52267255 Thanks for looking at this. Merging it in master & branch-1.1. This is off by default so the danger is very small. We will continue improving it and with changes like SPARK-3019 it will make it easier to use. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2468] Netty based block server / client...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1907#issuecomment-52267195 I reviewed the changes that are relevant to spark core and they LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2736] PySpark converter and example scr...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1916#issuecomment-52267186 Alright, going to merge this. Thanks for putting this together! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3011][SQL] _temporary directory should ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1959#issuecomment-52267094 QA results for PR 1959:- This patch FAILED unit tests.- This patch merges cleanly- This patch adds no public classesFor more information see test ouptut:https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18586/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-2759][CORE] Generic Binary File Support...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1658#issuecomment-52266956 @kmader to allow streams to be shuffle-able, how about the following? We create a class called `BinaryData` with a method open() that returns an InputStream. The BinaryData object has information about an InputSplit inside and can thus be shuffled across the network, cached, etc. And whenever users want to read it, they can get a stream. Would this solve the problem? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...
Github user GrahamDennis commented on the pull request: https://github.com/apache/spark/pull/1948#issuecomment-52266597 @rxin: You right about me forgetting to run `sbt assemble`. Some of the results of my testcase don't quite match what I was seeing before (I was seeing only one of the executor's having the .jar in their work directory). Let me verify that my testcase fails on the latest master branch without your fix, and also test this fix on my cluster. If that check's out, then I'm happy. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-52266419 QA tests have started for PR 731. This patch DID NOT merge cleanly! View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18587/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3027] TaskContext: tighten visibility a...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1938 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3027] TaskContext: tighten visibility a...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1938#issuecomment-52265995 Thanks. Merging this in master & branch-1.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org