[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-66416442 Jenkins, retest this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-66416417 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24298/ Test FAILed.
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21588804
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector) extends Serializable {
    */
   def transform(dataset: RDD[Vector]): RDD[Vector] = {
     val bcIdf = dataset.context.broadcast(idf)
-    dataset.mapPartitions { iter =>
-      val thisIdf = bcIdf.value
-      iter.map { v =>
-        val n = v.size
-        v match {
-          case sv: SparseVector =>
-            val nnz = sv.indices.size
-            val newValues = new Array[Double](nnz)
-            var k = 0
-            while (k < nnz) {
-              newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
-              k += 1
-            }
-            Vectors.sparse(n, sv.indices, newValues)
-          case dv: DenseVector =>
-            val newValues = new Array[Double](n)
-            var j = 0
-            while (j < n) {
-              newValues(j) = dv.values(j) * thisIdf(j)
-              j += 1
-            }
-            Vectors.dense(newValues)
-          case other =>
-            throw new UnsupportedOperationException(
-              s"Only sparse and dense vectors are supported but got ${other.getClass}.")
-        }
-      }
-    }
+    dataset.mapPartitions(iter => iter.map(v => IDFModel.transform(bcIdf.value, v)))
   }
   /**
+   * Transforms tern frequency (TF) vectors to a TF-IDF vector
--- End diff --
https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7c5eb57aa2d7d6da7afb24b85429ac14L181
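The diff above collapses the per-partition loop into a call that transforms one vector at a time. As a rough illustration of the arithmetic that helper has to perform, here is a minimal sketch in plain Java with no Spark dependency. `SparseTf` and `IdfSketch` are hypothetical names invented for this example, not Spark classes, and the real `IDFModel.transform(idf, v)` may be organized differently.

```java
/** Hypothetical stand-in for a sparse vector: parallel index/value arrays. */
final class SparseTf {
    final int size;          // logical length of the vector
    final int[] indices;     // positions of the stored entries
    final double[] values;   // term frequencies at those positions

    SparseTf(int size, int[] indices, double[] values) {
        this.size = size;
        this.indices = indices;
        this.values = values;
    }
}

/** Hypothetical helper: scales term frequencies by IDF weights. */
final class IdfSketch {
    /** Sparse case: only stored entries are touched, so the cost is O(nnz). */
    static SparseTf transformSparse(double[] idf, SparseTf tf) {
        int nnz = tf.indices.length;
        double[] newValues = new double[nnz];
        for (int k = 0; k < nnz; k++) {
            newValues[k] = tf.values[k] * idf[tf.indices[k]];
        }
        return new SparseTf(tf.size, tf.indices, newValues);
    }

    /** Dense case: every position is multiplied by its IDF weight. */
    static double[] transformDense(double[] idf, double[] tf) {
        double[] out = new double[tf.length];
        for (int j = 0; j < tf.length; j++) {
            out[j] = tf[j] * idf[j];
        }
        return out;
    }
}
```

Keeping the per-vector logic in one place is what lets the same code serve both the RDD path and a single local vector.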
[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-66417200 [Test build #24300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24300/consoleFull) for PR 3564 at commit [`b583f81`](https://github.com/apache/spark/commit/b583f8199229f176c462e4095c8d196c0fc21bba). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21588828
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -53,6 +53,19 @@ class IDFSuite extends FunSuite with MLlibTestSparkContext {
     val tfidf2 = tfidf(2L).asInstanceOf[SparseVector]
     assert(tfidf2.indices === Array(1))
     assert(tfidf2.values(0) ~== (1.0 * expected(1)) absTol 1e-12)
+
+    // Transforms local vectors
--- End diff --
https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cR44
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21588814
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -17,12 +17,10 @@
 package org.apache.spark.mllib.feature
-import org.scalatest.FunSuite
-
-import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
+import org.scalatest.FunSuite
--- End diff --
https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cL20
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21588839
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -86,6 +101,19 @@ class IDFSuite extends FunSuite with MLlibTestSparkContext {
     val tfidf2 = tfidf(2L).asInstanceOf[SparseVector]
     assert(tfidf2.indices === Array(1))
     assert(tfidf2.values(0) ~== (1.0 * expected(1)) absTol 1e-12)
+
+    // Transforms local vectors
--- End diff --
https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cR85
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/3603#discussion_r21588845
--- Diff: python/pyspark/mllib/feature.py ---
@@ -220,12 +220,15 @@ def transform(self, dataset):
         the terms which occur in fewer than `minDocFreq` documents will have an entry of 0.
-        :param dataset: an RDD of term frequency vectors
-        :return: an RDD of TF-IDF vectors
+        :param data: an RDD of term frequency vectors or a term frequency vector
+        :return: an RDD of TF-IDF vectors or a TF-IDF vector
-        if not isinstance(dataset, RDD):
+        if isinstance(data, RDD):
+            return JavaVectorTransformer.transform(self, data)
+        elif isinstance(data, Vector):
--- End diff --
https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-722e3d483892191debee07edd1a85fc8R226
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66417392 @jkbradley Thank you for your comments. I added the `[mllib]` tag to the PR title and modified the source code following your advice. Could you please review the changes?
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66417572 [Test build #24301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24301/consoleFull) for PR 3603 at commit [`a3bf566`](https://github.com/apache/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66419204 Thanks. However, I cannot see why this change would break anything. Please let me know where it causes problems, as it seems to pass tests now. In fact, this PR does not change much. The original code closes and reopens `FileInputStream` for every batch read. This PR keeps the stream open across these batches. Other parts are untouched.
[GitHub] spark pull request: [SPARK-3611] Show number of cores for each exe...
Github user devldevelopment closed the pull request at: https://github.com/apache/spark/pull/2980
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3661
[SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup'
Used `Condition` to rewrite `ContextWaiter` because it provides a convenient API `awaitNanos` for timeout.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zsxwing/spark SPARK-4813
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3661.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #3661
commit e06bd4fdc7d052ef55e2d98e68441586fe9d2026
Author: zsxwing zsxw...@gmail.com
Date: 2014-12-10T08:25:39Z
Fix the issue that ContextWaiter didn't handle 'spurious wakeup'
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66421083 [Test build #24302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24302/consoleFull) for PR 3661 at commit [`e06bd4f`](https://github.com/apache/spark/commit/e06bd4fdc7d052ef55e2d98e68441586fe9d2026). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66423087 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24299/ Test FAILed.
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user pkallos commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66423122 :+1:
[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66423082 [Test build #24299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24299/consoleFull) for PR 3651 at commit [`125b0b6`](https://github.com/apache/spark/commit/125b0b64efc22c5a573aea00bf9bfdb53393cdbe). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/3541#issuecomment-66424149
On Thu, Dec 4, 2014 at 2:57 AM, Davies Liu notificati...@github.com wrote:
@davies https://github.com/davies I am not sure I completely understood your comment.
Sorry for that, maybe I did not explain it clearly.
As detailed above, there are multiple reasons why a task can fail, and quite a lot of them are non-fatal from the 'rescheduling the task on the same host' point of view: in particular, races in Spark between reporting an executor going down, shutdown hooks running, and task scheduling due to locality preference. So we need a per-executor blacklist (note that this is just temporary) to either allow the executor to recover (in case task failures are due to transient reasons), or allow the task to get scheduled elsewhere in the meantime (if schedule locality constraints can be satisfied).
Agreed that the executor based blacklist worked for you, and I think the host based blacklist will also work for you (there is a little regression about locality).
It is not a small regression: if you have 4 - 8 executors on a host (as is common here), this change will blacklist all of them instead of blacklisting a single executor. This is a fairly severe regression, which is why I said I am -1 on modifying existing behavior unless the new functionality allows the existing feature to continue to work as currently expected.
The thing to understand is that the executor blacklist is not subsumed by the host blacklist other than in a very crude model. A different set of criteria would apply when we want to do a host level blacklist: when we have determined that the node is unusable, and so tasks fail on all executors on the node. Due to the NODE_LOCAL locality level, we would keep trying other executors on the same node in case the executor blacklist kicks in; so in case the node is temporarily unusable, the executor blacklist might not help. So we need a host based blacklist.
Yes, the reasons why we need a host blacklist are valid and separate from why we need an executor blacklist. They might overlap in some degenerate cases (since obviously host level issues do impact executors too): the executor blacklist is more fine grained, while host level issues are coarser in comparison. While the executor blacklist might alleviate the lack of a host blacklist to some extent (as exists currently), it is suboptimal to do so, so the need for a host blacklist is justified.
The timeout based temporary executor blacklist we currently have is still a stop gap solution which solves immediate problems observed at that time, without which Spark was becoming unusable in large enough multi-tenant clusters.
Agreed.
If we want to take it to a host level and do a principled solution, then we need a lot of other pieces to be put into place (since currently we only take task scheduling into account, which is insufficient). Off the top of my head: remove it from RDD replication, de-allocate executors already on the node, move existing RDD blocks away from the executors on the node, blacklist the node from further allocation requests (yarn, mesos), and so on. I am sure @kayousterhout https://github.com/kayousterhout might have other thoughts on this.
Agreed. Figuring out the failure domain is hard in a distributed environment; I doubt anyone can contribute a principled solution for retrying failed tasks in the best position in the near term (such as rescheduling on the same executor, a different executor on the same host, a different host, or a different rack). I think the host based blacklist is the simplest solution and works well in most failure cases.
Unfortunately, I do not have the bandwidth to engage on this, so I am hoping the right thing gets done. Whatever it is, I am -1 on removing the executor level blacklist: that is something we heavily depend on to get our jobs to work. A better solution that does not regress on this functionality is most welcome!
Really appreciate your comments here, to have a better solution. Could you describe detailed cases where the host based blacklist would break your job? Maybe there are some cases I did not figure out in your situation, please correct me.
The primary reason for the executor blacklist, as @kayousterhout https://github.com/kayousterhout also referred to, was initially quite simple: a task gets submitted to the same executor repeatedly due to locality constraints, but keeps failing on that executor since the executor might be in an inconsistent state (like in the middle of shutdown, etc.). This very quickly
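To make the granularity argument in this thread concrete, here is a small, purely illustrative Java sketch of a timeout-based blacklist evaluated at executor level and at host level. `BlacklistSketch` and all of its methods are hypothetical names made up for this example; it is not Spark's scheduler code and ignores locality levels, per-task state, and everything else the real scheduler tracks.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not how Spark's scheduler is implemented. */
final class BlacklistSketch {
    private final long timeoutMs;
    // executorId -> time (ms) of the most recent task failure on that executor
    private final Map<String, Long> lastFailure = new HashMap<>();
    // executorId -> host the executor runs on
    private final Map<String, String> hostOf = new HashMap<>();

    BlacklistSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void registerExecutor(String executorId, String host) {
        hostOf.put(executorId, host);
    }

    void recordFailure(String executorId, long nowMs) {
        lastFailure.put(executorId, nowMs);
    }

    /** Executor-level rule: skip only the executor that failed within the timeout. */
    boolean executorBlacklisted(String executorId, long nowMs) {
        Long t = lastFailure.get(executorId);
        return t != null && nowMs - t < timeoutMs;
    }

    /** Host-level rule: skip every executor that shares a host with a recent failure. */
    boolean hostBlacklisted(String executorId, long nowMs) {
        String host = hostOf.get(executorId);
        if (host == null) {
            return false;
        }
        for (Map.Entry<String, String> e : hostOf.entrySet()) {
            if (host.equals(e.getValue()) && executorBlacklisted(e.getKey(), nowMs)) {
                return true;
            }
        }
        return false;
    }
}
```

With two executors on one host, a single failure leaves the sibling executor usable under the executor-level rule but excludes it under the host-level rule, which is the locality regression being debated. Both rules expire after the timeout.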
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21591585
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy(lock)
--- End diff --
Minor point - these are not in the JDK but in a Findbugs library for JSR-305. It's not used in Spark, and happens to be a dependency now. Maybe not worth using it in just one place?
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21591750
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy(lock)
   private var error: Throwable = null
+
+  @GuardedBy(lock)
   private var stopped: Boolean = false
-  def notifyError(e: Throwable) = synchronized {
-    error = e
-    notifyAll()
+  def notifyError(e: Throwable) = {
+    lock.lock()
+    try {
+      error = e
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
-  def notifyStop() = synchronized {
-    stopped = true
-    notifyAll()
+  def notifyStop() = {
+    lock.lock()
+    try {
+      stopped = true
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
-  def waitForStopOrError(timeout: Long = -1) = synchronized {
-    // If already had error, then throw it
-    if (error != null) {
-      throw error
-    }
+  /**
+   * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or
+   * `false` if the waiting time detectably elapsed before return from the method.
+   */
+  def waitForStopOrError(timeout: Long = -1): Boolean = {
+    lock.lock()
+    try {
+      if (timeout < 0) {
+        while (true) {
--- End diff --
Maybe it's just me but it feels like these loops would be simpler just testing `while (!stopped && error == null)`? `nanos` would be tested in the other one too. This avoids duplication, and also avoids the unreachable return value, because you check these conditions in one place at the end.
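The review suggestion above, a single loop on `!stopped && error == null` with the outcome decided in one place afterwards, can be sketched as follows. This is a plain-Java approximation written for illustration, not the code that was merged into Spark. `WaiterSketch` is a hypothetical name, the timeout is assumed to be in milliseconds with a negative value meaning "wait indefinitely", and wrapping a checked error in `RuntimeException` is a choice made here only so the sketch compiles without `throws Throwable`.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of a waiter that tolerates spurious wakeups. */
final class WaiterSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition condition = lock.newCondition();
    private Throwable error = null;   // guarded by lock
    private boolean stopped = false;  // guarded by lock

    void notifyError(Throwable e) {
        lock.lock();
        try {
            error = e;
            condition.signalAll();
        } finally {
            lock.unlock();
        }
    }

    void notifyStop() {
        lock.lock();
        try {
            stopped = true;
            condition.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Returns true if stopped, false on timeout; rethrows a reported error. */
    boolean waitForStopOrError(long timeoutMs) throws InterruptedException {
        lock.lock();
        try {
            long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            // Re-check the state after every wakeup, so a spurious wakeup just loops again.
            while (!stopped && error == null) {
                if (timeoutMs < 0) {
                    condition.await();
                } else {
                    if (nanos <= 0) {
                        return false;  // waiting time elapsed
                    }
                    // awaitNanos returns an estimate of the remaining time to wait.
                    nanos = condition.awaitNanos(nanos);
                }
            }
            if (error != null) {
                if (error instanceof RuntimeException) throw (RuntimeException) error;
                if (error instanceof Error) throw (Error) error;
                throw new RuntimeException(error);
            }
            return stopped;
        } finally {
            lock.unlock();
        }
    }
}
```

Because `awaitNanos` reports the remaining time, the loop can resume waiting for only what is left of the timeout, which is the convenience the PR description attributes to `Condition`.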
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66425154 [Test build #24301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24301/consoleFull) for PR 3603 at commit [`a3bf566`](https://github.com/apache/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3603#issuecomment-66425157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24301/ Test PASSed.
[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66425799 [Test build #24303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24303/consoleFull) for PR 3651 at commit [`11bd041`](https://github.com/apache/spark/commit/11bd041909a20b6d7c1b5074d6b78133aa1ff547). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66426107
Hmm, this might be tricky to explain if you do not have sufficient context; let me give it a shot.
a) Streams in Java are not usually multiplexed, unless explicitly stated otherwise. With this PR, the same underlying stream (fileStream) is being reused across deserializeStream (and its users). One way this manifests is (b).
b) Most streams in Java override finalize to close their underlying stream in case they go out of scope (to prevent resource leaks, etc.); of course this is an implementation detail, but it is the general expectation. In this case, deserializeStream gets re-assigned somewhere in the method below, causing the previous 'deserializeStream' to go out of scope. When GC kicks in and finalizers are run, deserializeStream's finalize can call its close, which closes fileStream, which might by then be in use by some other deserializeStream, since it was re-used. This will cause hard-to-debug crashes and bugs.
I am sure I am missing other spectacular ways in which this can fail :-) In general, these things happen when the basic API expectation (probably implicit here) is broken. Now, we could go down this path if the operation we are saving were very expensive, which is not the case here (what is saved is a cheap file open/close).
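The hazard in (b) does not depend on finalizers: any close of a wrapper stream, explicit or not, propagates to the stream it wraps. The sketch below shows that propagation deterministically with JDK classes. `TrackingStream` and `SharedStreamSketch` are hypothetical helpers written for this example, and a `ByteArrayInputStream` stands in for the shared `FileInputStream`. Finalizer timing itself is not demonstrated, since it depends on the garbage collector.

```java
import java.io.BufferedInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper that records whether close() reached the underlying stream. */
final class TrackingStream extends FilterInputStream {
    boolean closed = false;

    TrackingStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

final class SharedStreamSketch {
    /** Wraps the shared stream, reads one byte, and closes only the wrapper. */
    static int readOneThenCloseWrapper(InputStream shared) throws IOException {
        InputStream wrapper = new BufferedInputStream(shared);
        try {
            return wrapper.read();
        } finally {
            // Closing the wrapper also closes `shared`, so any other reader
            // still holding `shared` would find it closed.
            wrapper.close();
        }
    }
}
```

A buffering wrapper may also read ahead of what it returns, so handing one underlying stream to successive wrappers can lose data even before any close happens.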
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66426037 [Test build #24304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24304/consoleFull) for PR 3644 at commit [`3bb8731`](https://github.com/apache/spark/commit/3bb8731a33ecf2bde076df92aa8619340fe3e84a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21592200
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy(lock)
--- End diff --
> Maybe not worth using it just 1 place?
So which one do you prefer?
1. Use comments to describe such information.
2. Use `GuardedBy` from now on.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21592261
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --
In addition, FindBugs currently does not recognize `GuardedBy` in Scala code.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21592650
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --
BTW, I switched to `GuardedBy` because @aarondav asked me to do it in #3634.
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66427302 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24304/ Test FAILed.
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66427294 [Test build #24304 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24304/consoleFull) for PR 3644 at commit [`3bb8731`](https://github.com/apache/spark/commit/3bb8731a33ecf2bde076df92aa8619340fe3e84a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait ParquetTest `
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21592824
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --
Yes, that's why I brought it up. It's not actually a standard Java annotation (unless someone tells me it just turned up in Java 8 or something) but part of JSR-305. This is a dependency of Spark core at the moment, but none of the annotations are used. I think we should just not use them rather than use this library in just 1 place.
[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66428269

I'm pretty convinced this works now. I'm diffing the test run output between master and this branch, and the Scala tests are the same. The only visible differences are that `scalatest` turns up in every module, and of course, there is output from `surefire` now.

Note that I did _not_ enable assertions in SBT now, which I mentioned in a related conversation. There's another issue with it tracked in http://issues.apache.org/jira/browse/SPARK-4814

I also think this is a predecessor to https://issues.apache.org/jira/browse/SPARK-3431

Let's see what Jenkins says. I'm calling this no longer a WIP.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r21593635
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 package org.apache.spark.streaming
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
   private var error: Throwable = null
+
+  @GuardedBy("lock")
   private var stopped: Boolean = false
-  def notifyError(e: Throwable) = synchronized {
-    error = e
-    notifyAll()
+  def notifyError(e: Throwable) = {
+    lock.lock()
+    try {
+      error = e
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
-  def notifyStop() = synchronized {
-    stopped = true
-    notifyAll()
+  def notifyStop() = {
+    lock.lock()
+    try {
+      stopped = true
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
-  def waitForStopOrError(timeout: Long = -1) = synchronized {
-    // If already had error, then throw it
-    if (error != null) {
-      throw error
-    }
+  /**
+   * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or
+   * `false` if the waiting time detectably elapsed before return from the method.
+   */
+  def waitForStopOrError(timeout: Long = -1): Boolean = {
+    lock.lock()
+    try {
+      if (timeout < 0) {
+        while (true) {
--- End diff --
It's cleaner now.
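For readers following the diff above, here is a minimal, self-contained sketch of the lock-and-condition pattern it moves to. It is not the code from the pull request: the class name `StopOrErrorWaiter` is made up for illustration, and the timed branch (a loop over `awaitNanos`) is an assumption about how the truncated part of the method could be completed.

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

// Hypothetical stand-in for ContextWaiter; names are illustrative only.
class StopOrErrorWaiter {
  private val lock = new ReentrantLock()
  private val condition = lock.newCondition()

  // Both fields are only read or written while holding `lock`.
  private var error: Throwable = null
  private var stopped: Boolean = false

  def notifyError(e: Throwable): Unit = {
    lock.lock()
    try {
      error = e
      condition.signalAll() // wake every waiting thread
    } finally {
      lock.unlock()
    }
  }

  def notifyStop(): Unit = {
    lock.lock()
    try {
      stopped = true
      condition.signalAll()
    } finally {
      lock.unlock()
    }
  }

  /**
   * Returns true if stopped, false if the timeout elapsed first, and
   * rethrows a reported error. A negative timeout waits indefinitely.
   */
  def waitForStopOrError(timeoutMillis: Long = -1): Boolean = {
    lock.lock()
    try {
      if (timeoutMillis < 0) {
        // Loop to guard against spurious wakeups.
        while (!stopped && error == null) {
          condition.await()
        }
      } else {
        var nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMillis)
        while (!stopped && error == null && nanosLeft > 0) {
          // awaitNanos returns an estimate of the remaining wait time.
          nanosLeft = condition.awaitNanos(nanosLeft)
        }
      }
      if (error != null) throw error
      stopped
    } finally {
      lock.unlock()
    }
  }
}
```

Checking the predicate in a loop around `await` is the usual way to use a `Condition`, since a waiting thread may wake up without a matching signal.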
[GitHub] spark pull request: [SPARK-4033][Examples]Input of the SparkPi too...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2874#issuecomment-66428460 @SaintBacchus why did you close this? It seems like it still needs a fix, and you had an improvement going here.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66429086 [Test build #24305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24305/consoleFull) for PR 3661 at commit [`be42bcf`](https://github.com/apache/spark/commit/be42bcfaa38a3f3fbe4fc759656a61c72f0fb556). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-66429754 **[Test build #24300 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24300/consoleFull)** for PR 3564 at commit [`b583f81`](https://github.com/apache/spark/commit/b583f8199229f176c462e4095c8d196c0fc21bba) after a configured wait of `120m`.
[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3564#issuecomment-66429761 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24300/ Test FAILed.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66430212 [Test build #24302 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24302/consoleFull) for PR 3661 at commit [`e06bd4f`](https://github.com/apache/spark/commit/e06bd4fdc7d052ef55e2d98e68441586fe9d2026). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66430222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24302/ Test PASSed.
[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-66430592

So, I may not be 100% up to speed with the new API and these changes, so my comments may be a bit off, but:

An Estimator makes a Model. To make a model, you need raw data and its interpretation, if you will. A LabeledPoint is raw data. That alone is not sufficient to train a Classifier (Estimator). Yes, this extra info has to come from somewhere. I agree that SchemaRDD contains, or could contain, or could be made to deduce, this extra interpretation, so the SchemaRDD API makes sense to me. If LabeledPoint is to remain the raw data, given the conversation here, then it has to be parameters or something.

I think you still need these for testing, right? You still need to know what the raw data means. Or is it assumed that the built Classifier / Model stores this info? This is sort of a rehash of the same exchange we just had, in that the question is caused by the input data abstraction not really containing all the input: the metadata comes along separately. That could be OK, but yes, it means this question pops up somewhere else in the API.

Yes, a Model may be able to remember the metadata and accept raw LabeledPoints in the future. You just have to make sure you are feeding raw LabeledPoints that use the same metadata, but that's a given no matter how you design this.

To answer the question: given the question, I'd hide the typed API, I suppose. I think the typed API has to take some other values to contain metadata like the type of features, etc. These could be more parameters, then? It kind of overloads the meaning, since the parameters look like they are intended to be hyperparameters. But it's not crazy.

Transformations: these feel like they could meaningfully operate on raw data, so the typed API makes sense to me and could be public now.
[GitHub] spark pull request: Print the specified number of data and handle ...
GitHub user surq opened a pull request: https://github.com/apache/spark/pull/3662

Print the specified number of data and handle all of the elements in RDD

The `DStream.print` function prints 10 elements and handles only 11 elements. This PR presents a new function, based on `DStream.print`, that prints a specified number of elements and handles all of the elements in the RDD.

A use case: `val dstream = stream.map-filter-mapPartitions-print`. The data left after `filter` needs to update a database in `mapPartitions`, but there is no need to print every element; printing the top 20 is enough to check the data processing.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/surq/spark SPARK-4817

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3662.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3662

commit 4e3f715941f94cb2467ca68b205a5fa3630130a3 Author: surq s...@asiainfo.com Date: 2014-12-10T10:49:54Z Print the specified number of data and handle all of the elements in RDD
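The behavior requested in this pull request can be sketched without Spark, using a plain iterator in place of an RDD. This is only an illustration of the idea (handle every element, print just the first `num`); `PrintSketch` and `handleAllPrintFirst` are hypothetical names and not part of the `DStream` API or of the patch.

```scala
object PrintSketch {
  /**
   * Applies `handle` to every element, prints only the first `num`
   * elements, and returns how many elements were handled in total.
   */
  def handleAllPrintFirst[T](elements: Iterator[T], num: Int)(handle: T => Unit): Int = {
    var count = 0
    elements.foreach { element =>
      handle(element)                    // side effect runs for every element
      if (count < num) println(element)  // output is capped at `num` lines
      count += 1
    }
    count
  }
}
```

For example, `PrintSketch.handleAllPrintFirst(Iterator.range(0, 50), 20)(x => ())` would print 20 values while still visiting all 50.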
[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3662#issuecomment-66434968 You should put `SPARK- [STREAMING]` in the title. But your original JIRA was a duplicate of https://issues.apache.org/jira/browse/SPARK-3325 so perhaps you can connect this to that JIRA.
[GitHub] spark pull request: [SPARK-4815][SQL] Fix: ThriftServer use only o...
GitHub user guowei2 opened a pull request: https://github.com/apache/spark/pull/3663

[SPARK-4815][SQL] Fix: ThriftServer use only one SessionState to run sql using hive

Use a `SessionState` map in `HiveContext` to store the session state for each thread id. The session state is updated when a Hive session is opened and when it is closed.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/guowei2/spark SPARK-4815

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3663.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3663

commit 0e9f2239836ce132466070f85090f282a3ff4fbe Author: guowei2 guow...@asiainfo.com Date: 2014-12-10T10:25:34Z Fix: ThriftServer use only one SessionState to run sql using hive
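The idea described in this pull request, a map from thread id to session state, can be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: `SessionInfo` and `PerThreadSessions` are hypothetical stand-ins, and the real change works with Hive's `SessionState` inside `HiveContext`.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the real per-session state.
final case class SessionInfo(user: String)

// Keeps one session per thread, keyed by the thread's id.
class PerThreadSessions {
  private val sessions = new ConcurrentHashMap[Long, SessionInfo]()

  private def threadId: Long = Thread.currentThread().getId

  /** Registers the session for the calling thread (on session open). */
  def open(info: SessionInfo): Unit = {
    sessions.put(threadId, info)
    ()
  }

  /** Looks up the calling thread's session, if one is open. */
  def current: Option[SessionInfo] = Option(sessions.get(threadId))

  /** Drops the calling thread's session (on session close). */
  def close(): Unit = {
    sessions.remove(threadId)
    ()
  }
}
```

One caveat with this design is that it ties a session to a thread, so it only behaves as intended if each session's statements always run on the thread that opened it.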
[GitHub] spark pull request: [SPARK-4815][SQL] Fix: ThriftServer use only o...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3663#issuecomment-66435369 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66435431 [Test build #24303 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24303/consoleFull) for PR 3651 at commit [`11bd041`](https://github.com/apache/spark/commit/11bd041909a20b6d7c1b5074d6b78133aa1ff547). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66435437 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24303/ Test FAILed.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66435647

I agree with you that the operation saved here is a cheap one. :-)

However, the problem you mentioned would not happen with the current version of `DeserializationStream`. Not all `InputStream` implementations close their underlying stream when they are collected by the GC. There are detailed discussions [here](http://www.coderanch.com/t/278165/java-io/java/InputStream-close-garbage-collection) and [there](http://stackoverflow.com/questions/1522370/does-input-outputstreams-close-on-destruction).

I am sure that `FileInputStream` implements `finalize` to close the underlying file. But the other streams used here do not, as the tests show. `DeserializationStream` is implemented in Spark and it has no such behavior. While modifying the code, I checked it and found that you must explicitly call its `close` to close its underlying stream. That is why it passes the tests.

I am OK with closing this PR if it causes problems. But if it would not really cause the mentioned problem, I cannot see why slightly improved performance is bad.
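The distinction being discussed, explicit `close` versus cleanup at garbage collection, can be made concrete with standard `java.io` classes. The sketch below only shows the explicit path: closing a wrapper stream closes the stream it wraps. `TrackingInputStream` is a hypothetical helper written for this illustration; nothing here models Spark's `DeserializationStream` or finalizer behavior.

```scala
import java.io.{BufferedInputStream, ByteArrayInputStream}

// Hypothetical helper that records whether close() was called on it.
class TrackingInputStream(data: Array[Byte]) extends ByteArrayInputStream(data) {
  @volatile var closed = false

  override def close(): Unit = {
    closed = true
    super.close()
  }
}

object CloseSketch {
  /** Reads one byte through a wrapper and reports (closedBefore, closedAfter). */
  def demo(): (Boolean, Boolean) = {
    val underlying = new TrackingInputStream(Array[Byte](1, 2, 3))
    val wrapper = new BufferedInputStream(underlying)
    wrapper.read()                  // using the wrapper does not close anything
    val before = underlying.closed
    wrapper.close()                 // an explicit close propagates to the wrapped stream
    (before, underlying.closed)
  }
}
```

Whether a given wrapper also releases its underlying stream when it is merely garbage collected depends on the implementation, which is the crux of the disagreement in this thread.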
[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3651#issuecomment-66435761 Jenkins, retest this.
[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3662#issuecomment-66435798 [Test build #24306 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24306/consoleFull) for PR 3662 at commit [`4e3f715`](https://github.com/apache/spark/commit/4e3f715941f94cb2467ca68b205a5fa3630130a3). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66438151 [Test build #24305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24305/consoleFull) for PR 3661 at commit [`be42bcf`](https://github.com/apache/spark/commit/be42bcfaa38a3f3fbe4fc759656a61c72f0fb556). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3661#issuecomment-66438157 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24305/ Test PASSed.
[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/3660#issuecomment-66439177 ~~We should mark `codegenEnabled` as lazy.~~ `lazy` doesn't work because `codegenEnabled` has not been used before serialization.
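The remark about `lazy` can be illustrated with plain Java serialization. This is a minimal sketch, not Spark code: `DriverContext` and `LazyOperator` are hypothetical stand-ins, and the explanation assumes the usual Scala 2 encoding of a `lazy val` as a field plus an initialization flag, both of which are serialized.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical driver-only context; deliberately not serializable.
class DriverContext(val codegen: Boolean)

// Hypothetical operator: `ctx` is transient, so it is null after deserialization.
class LazyOperator(@transient val ctx: DriverContext) extends java.io.Serializable {
  // Evaluated on first access, wherever that access happens.
  lazy val codegenEnabled: Boolean = ctx.codegen
}

object LazySerializationSketch {
  /** Serializes and deserializes a value with plain Java serialization. */
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

If `codegenEnabled` is read before `roundTrip`, the computed value travels with the object. If it is not, the first read happens on the deserialized copy, where `ctx` is null, so the initializer fails. That matches the comment above: making the field lazy only helps when something forces it before serialization.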
[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3660#issuecomment-66439622 [Test build #24307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24307/consoleFull) for PR 3660 at commit [`a3eea56`](https://github.com/apache/spark/commit/a3eea5692b7bf2fd88b27032e899b776651ef321). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66442416 [Test build #24308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24308/consoleFull) for PR 1269 at commit [`7f9b7c3`](https://github.com/apache/spark/commit/7f9b7c35c28e3399a8c34d494064a3bbd238d9c2). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66443297 I think you are missing the point - we should not rely on specific implementation details about whether this is currently done or not - that leads to a brittle codebase. `finalize()` *can* close the wrapped stream because that is the implicit contract.
[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3662#issuecomment-66444342 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24306/ Test PASSed.
[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3662#issuecomment-66444334 [Test build #24306 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24306/consoleFull) for PR 3662 at commit [`4e3f715`](https://github.com/apache/spark/commit/4e3f715941f94cb2467ca68b205a5fa3630130a3). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3660#issuecomment-66445954 [Test build #24307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24307/consoleFull) for PR 3660 at commit [`a3eea56`](https://github.com/apache/spark/commit/a3eea5692b7bf2fd88b27032e899b776651ef321). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3660#issuecomment-66445964 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24307/ Test PASSed.
[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...
Github user tsliwowicz commented on the pull request: https://github.com/apache/spark/pull/2914#issuecomment-66446290 No problem. Glad to help :-) On Wed, Dec 10, 2014 at 4:44 AM, andrewor14 notificati...@github.com wrote: Hey sorry @tsliwowicz https://github.com/tsliwowicz for using your PRs as the battleground in fixing our builds against older branches. There aren't a lot of PRs opened against older branches so these tests aren't run in this context very often. So far I think all of these test failures have nothing to do with your patch so there is no action needed on your side. On our side, we'll keep investigating why the tests are failing all the time. Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/2914#issuecomment-66396333.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66446898 I do know that `finalize` can close the wrapped stream. I did not say it would not. But it can only do so if you implement it that way. There is no such implicit contract as far as I know. As the discussions I included in my previous comment show, some InputStream implementations implement `finalize` and some do not. You cannot rely on a specific implementation found in a few InputStream types to generalize the behavior to all InputStream types. And there is an obvious example, `DeserializationStream`, which does not implement the implicit contract. If this PR would cause a problem, I just want to know why and where. You said `DeserializationStream` would cause a problem if it goes out of scope; I just showed that the problem you mentioned does not occur, as the code shows.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66448444 Apart from streams associated with files and network connections, not every stream needs to be closed when you are done with it. That is my understanding. Maybe that is why `DeserializationStream` does not implement `finalize` to close its input stream. I think that it is unnecessary to have such a long discussion for a small modification. I will close this PR later.
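The disagreement above is about whether a wrapper's `finalize` can be relied on to close the stream it wraps. One way to avoid depending on either reading is to close streams explicitly where they are used. The following is only a minimal sketch using plain `java.io` classes, not code from the PR or from Spark; `TrackingStream`, `withStream` and `demo` are hypothetical names introduced for illustration.

```scala
import java.io.{ByteArrayInputStream, FilterInputStream, InputStream}

object ExplicitCloseSketch {
  // Hypothetical wrapper (not a Spark class) that records whether close() was called.
  class TrackingStream(in: InputStream) extends FilterInputStream(in) {
    @volatile var closed = false
    override def close(): Unit = {
      closed = true
      super.close() // FilterInputStream.close() also closes the wrapped stream
    }
  }

  // Runs `body` on the stream and always closes it afterwards,
  // so nothing depends on finalize() or on garbage-collection timing.
  def withStream[S <: InputStream, A](stream: S)(body: S => A): A =
    try body(stream) finally stream.close()

  // Reads one byte, then reports the byte and whether the stream was closed.
  def demo(): (Int, Boolean) = {
    val s = new TrackingStream(new ByteArrayInputStream(Array[Byte](1, 2, 3)))
    val first = withStream(s)(_.read())
    (first, s.closed)
  }
}
```

With this pattern the stream is closed when `withStream` returns, whether or not any class in the chain overrides `finalize`.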
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya closed the pull request at: https://github.com/apache/spark/pull/3600
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66448735 [Test build #24309 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24309/consoleFull) for PR 1269 at commit [`af9bcc8`](https://github.com/apache/spark/commit/af9bcc87df561f920226342d25ca4203639bacf9). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66450681 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24308/ Test FAILed.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66450671 [Test build #24308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24308/consoleFull) for PR 1269 at commit [`7f9b7c3`](https://github.com/apache/spark/commit/7f9b7c35c28e3399a8c34d494064a3bbd238d9c2). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66451641 I think I did say this will not go into Spark at the very beginning of my review :-) On the assumption that you would want to continue to improve Spark IO, I wanted to clarify why it won't go in. This part of Spark core is critical to the correctness of IO - hence the additional scrutiny (when I get time) to ensure no bugs are introduced. We have fixed quite a lot of issues here, and relying on (existing) implementation details is asking for trouble.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66453822 Thanks. But in the end, you still have not provided a rational explanation of why it fails. At least, it is not convincing to me. :-) Anyway, thanks again for your comments and for taking the time to reply.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user bgreeven commented on a diff in the pull request: https://github.com/apache/spark/pull/1290#discussion_r21603916
--- Diff: docs/mllib-ann.md ---
@@ -0,0 +1,239 @@
+---
+layout: global
+title: Artificial Neural Networks - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Artificial Neural Networks
+---
+
+# Introduction
+
+This document describes the MLlib's Artificial Neural Network (ANN) implementation.
+
+The implementation currently consists of the following files:
+
+* 'ArtificialNeuralNetwork.scala': implements the ANN
+* 'ANNSuite': implements automated tests for the ANN and its gradient
+* 'ANNDemo': a demo that approximates three functions and shows a graphical representation of
+the result
+
+# Summary of usage
+
+The ArtificialNeuralNetwork object is used as an interface to the neural network. It is
+called as follows:
+
+```
+val annModel = ArtificialNeuralNetwork.train(rdd, hiddenLayersTopology, maxNumIterations)
+```
+
+where
+
+* `rdd` is an RDD of type (Vector,Vector), the first element containing the input vector and
+the second the associated output vector.
+* `hiddenLayersTopology` is an array of integers (Array[Int]), which contains the number of
+nodes per hidden layer, starting with the layer that takes inputs from the input layer, and
+finishing with the layer that outputs to the output layer. The bias nodes are not counted.
+* `maxNumIterations` is an upper bound to the number of iterations to be performed.
+* `annModel` contains the trained ANN parameters, and can be used to calculate the ANN's
+approximation to arbitrary input values.
+
+The approximations can be calculated as follows:
+
+```
+val v_out = annModel.predict(v_in)
+```
+
+where v_in is either a Vector or an RDD of Vectors, and v_out respectively a Vector or RDD of
+(Vector,Vector) pairs, corresponding to input and output values.
+
+Further details and other calling options will be elaborated upon below.
+
+# Architecture and Notation
+
+The file ArtificialNeuralNetwork.scala implements the ANN. The following picture shows the
+architecture of a 3-layer ANN:
+
+```
+[ASCII diagram, layout lost in extraction: input layer nodes N_0,0 ... N_I-1,0, hidden layer
+nodes N_0,1 ... N_J-1,1 and output layer nodes N_0,2 ... N_K-1,2, connected by weights Wij1
+(input to hidden) and Wjk2 (hidden to output), with a bias node (-1) in the input layer and
+in the hidden layer; columns labelled INPUT LAYER, HIDDEN LAYER, OUTPUT LAYER]
+```
+
+The i-th node in layer l is denoted by N_{i,l}, both i and l starting with 0. The weight
+between node i in layer l-1 and node j in layer l is denoted by Wijl. Layer 0 is the input
+layer, whereas layer L is the output layer.
+
+The ANN also implements bias units. These are nodes that always output the value -1. The bias
+units are in all layers except the output layer. They act similarly to other nodes, but do not
+have input.
+
+The value of node N_{j,l} is calculated as follows:
+
+`$N_{j,l} = g( \sum_{i=0}^{topology_l} W_{i,j,l}*N_{i,l-1} )$`
+
+Where g is the sigmoid function
+
+`$g(t) = \frac{e^{\beta t} }{1+e^{\beta t}}$`
+
+# LBFGS
+
+MLlib's ANN implementation uses the LBFGS optimisation algorithm for training. It minimises the
+following error function:
+
+`$E = \sum_{k=0}^{K-1} (N_{k,L} - Y_k)^2$`
+
+where Y_k is the target output given inputs N_{0,0} ... N_{I-1,0}.
+
+# Implementation Details
+
+## The ArtificialNeuralNetwork class
+
+The ArtificialNeuralNetwork class has the following constructor:
+
+```
+class
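The node-value and sigmoid formulas quoted in the documentation diff above can be illustrated with a small forward-pass sketch. This is not the PR's implementation: `AnnForwardSketch`, `sigmoid` and `layerOutput` are hypothetical names, the weights are stored as a plain nested array, and the bias unit is assumed to be the last input of each layer with a fixed output of -1, as the quoted text describes.

```scala
object AnnForwardSketch {
  // g(t) = e^{beta t} / (1 + e^{beta t}), written in the algebraically
  // equivalent form 1 / (1 + e^{-beta t}) to avoid Inf/Inf for large t.
  def sigmoid(t: Double, beta: Double = 1.0): Double =
    1.0 / (1.0 + math.exp(-beta * t))

  // Computes N_{j,l} = g(sum_i W_{i,j,l} * N_{i,l-1}) for every node j of one layer.
  // weights(i)(j) is the weight from node i of the previous layer to node j;
  // the last row of `weights` belongs to the bias unit (assumed layout).
  def layerOutput(
      prev: Array[Double],
      weights: Array[Array[Double]],
      beta: Double = 1.0): Array[Double] = {
    val withBias = prev :+ (-1.0) // the bias unit always outputs -1
    val numOut = weights.head.length
    Array.tabulate(numOut) { j =>
      val s = withBias.indices.map(i => weights(i)(j) * withBias(i)).sum
      sigmoid(s, beta)
    }
  }
}
```

A full network would apply `layerOutput` once per layer, feeding each layer's result into the next; training with LBFGS is outside the scope of this sketch.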
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-66454787 [Test build #24310 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24310/consoleFull) for PR 1290 at commit [`5e86c5e`](https://github.com/apache/spark/commit/5e86c5edab4c58fee55ddae841f29105f62ceec4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/3600#issuecomment-66455203 Anyway, thanks again for your comments and for taking the time to reply.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66456752 [Test build #24311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24311/consoleFull) for PR 1269 at commit [`b3f7a0d`](https://github.com/apache/spark/commit/b3f7a0de47497ca88a0815656451a4379fe180dc). * This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66458371 @JoshRosen @pwendell @andrewor14
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66458584 [Test build #24309 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24309/consoleFull) for PR 1269 at commit [`af9bcc8`](https://github.com/apache/spark/commit/af9bcc87df561f920226342d25ca4203639bacf9). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66458603 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24309/ Test FAILed.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-6645 [Test build #24312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull) for PR 3653 at commit [`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66459924 [Test build #24313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull) for PR 3653 at commit [`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a). * This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-66461883 [Test build #24310 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24310/consoleFull) for PR 1290 at commit [`5e86c5e`](https://github.com/apache/spark/commit/5e86c5edab4c58fee55ddae841f29105f62ceec4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OutputCanvas2D(wd: Int, ht: Int) extends Canvas ` * `class OutputFrame2D( title: String ) extends Frame( title ) ` * `class OutputCanvas3D(wd: Int, ht: Int, shadowFrac: Double) extends Canvas ` * `class OutputFrame3D(title: String, shadowFrac: Double) extends Frame(title) `
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-66461896 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24310/ Test FAILed.
[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/3409#issuecomment-66464201 I'm in favor of spark.yarn.am.* and also documenting whether it applies only to client mode. @andrewor14 @sryza votes? Let's try to resolve this today.
[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...
Github user koeninger commented on a diff in the pull request: https://github.com/apache/spark/pull/3543#discussion_r21610025
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
 def createParquetFile[A <: Product : TypeTag](
 path: String,
 allowExisting: Boolean = true,
- conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --
I seem to recall there being potential thread safety issues related to hadoop configuration objects, resulting in the need to create / clone them. Quick search turned up e.g. https://issues.apache.org/jira/browse/SPARK-2546 I'm not sure how relevant that is to all of these existing situations where new Configuration() is being called.
On Tue, Dec 9, 2014 at 5:07 PM, Tathagata Das notificati...@github.com wrote:
In sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala https://github.com/apache/spark/pull/3543#discussion-diff-21571141:
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
 def createParquetFile[A <: Product : TypeTag](
 path: String,
 allowExisting: Boolean = true,
- conf: Configuration = new Configuration()): SchemaRDD = {
I think this should be using the hadoopConfiguration object in the SparkContext. That has all the hadoop related configuration already setup and should be what is automatically used. @marmbrus https://github.com/marmbrus should have a better idea.
Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/3543/files#r21571141.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66473044 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24312/ Test PASSed.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66473033 [Test build #24312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull) for PR 3653 at commit [`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66473682 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24313/ Test PASSed.
[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3653#issuecomment-66473670 [Test build #24313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull) for PR 3653 at commit [`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66475387 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24311/ Test PASSed.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66475370 [Test build #24311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24311/consoleFull) for PR 1269 at commit [`b3f7a0d`](https://github.com/apache/spark/commit/b3f7a0de47497ca88a0815656451a4379fe180dc). * This patch **passes all tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66478011
Succeeded on the third attempt.
(5) Enumerator
@jkbradley, as you can see, I moved `Enumerator` to the `mllib/features` folder and renamed it to `TokenIndexer`. You said I should write a setter method `setRareTokenThreshold`; I see no need for it, since it is the only field. (If a setter method is a code-style and/or API requirement, I'm ready to add it.)
(6) move Dirichlet to stats
I like the idea of moving the Dirichlet pdf to stats so that everyone can use it. But I see no classes computing a pdf in the mllib/stats folder, so I have no idea what API should be implemented.
Any other remarks on code structure and/or API?
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66488488 When collecting data from a Parquet-based SchemaRDD, the underlying Parquet splits may be returned out of order, which caused occasional test failures.
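One common way to make such a test robust is to compare collected rows without regard to order. The sketch below is only an illustration of that idea in plain Scala; `UnorderedCompare` and `sameRows` are hypothetical names, not part of Spark's test utilities, and the comment above does not say how the PR itself addresses the ordering.

```scala
// Hypothetical helper for illustration; not taken from the Spark code base.
object UnorderedCompare {
  /** Counts how many times each row occurs. */
  private def counts[T](rows: Seq[T]): Map[T, Int] =
    rows.groupBy(identity).map { case (row, same) => row -> same.size }

  /** True when both sequences hold the same rows with the same multiplicities, in any order. */
  def sameRows[T](actual: Seq[T], expected: Seq[T]): Boolean =
    counts(actual) == counts(expected)
}

object UnorderedCompareDemo extends App {
  val expected = Seq((1, "a"), (2, "b"), (3, "c"))
  // Simulates rows collected from splits that were read in a different order.
  val collected = Seq((3, "c"), (1, "a"), (2, "b"))

  assert(collected != expected)                          // an ordered comparison fails
  assert(UnorderedCompare.sameRows(collected, expected)) // an unordered comparison passes
}
```

Comparing multiplicities rather than sets keeps duplicate rows significant, which matters when a query is expected to return repeated values.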
[GitHub] spark pull request: [SPARK-4453][SPARK-4213][SQL] Additional test ...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/#issuecomment-66488779 Hi @sarutak, I added a new set of Parquet test suites in #3644, which aim to replace the old `ParquetQuerySuite`. I believe Parquet filters have been tested thoroughly there.
[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3644#issuecomment-66488841
[Test build #24314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24314/consoleFull) for PR 3644 at commit [`800e745`](https://github.com/apache/spark/commit/800e7459a9261281c35e48c837dbb7de5643e4b2).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/1379#issuecomment-66490270 @dbtsai Thank you, I look forward to your code so that I can run benchmarks. Thanks again for the video! I enjoyed it, especially the Q&A after the talk. At 51:23 Prof. CJ Lin mentions that "we released" a dataset of about 600 gigabytes. Do you know where I can download it? It should be quite a challenging workload for classification in Spark!
[GitHub] spark pull request: Updated documentation and refactored code to e...
GitHub user ilganeli opened a pull request: https://github.com/apache/spark/pull/3664

Updated documentation and refactored code to extract shared variables

Hi all - cleaned up the code to get rid of the unused parameter and added some discussion of the ThreadPoolExecutor parameters to explain why we can use a single threadCount instead of providing a min/max.

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/ilganeli/spark SPARK-3607C
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/3664.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #3664

commit 3c056904570fdd97d429c10895590850bb81e759
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:35:02Z
    Updated documentation and refactored code to extract shared variables
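The email does not spell out the reasoning, but one well-known property of `java.util.concurrent.ThreadPoolExecutor` may be relevant: when the work queue is unbounded, the pool never grows beyond `corePoolSize`, so a separate maximum has no effect. The sketch below demonstrates that behaviour under the assumption of an unbounded `LinkedBlockingQueue`; `FixedSizePool` is a hypothetical name, and whether the PR's code uses this queue type is not stated here.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Hypothetical factory for illustration; `threadCount` mirrors the name used in the email.
object FixedSizePool {
  /** Builds a pool whose core and maximum sizes are the same single value. */
  def create(threadCount: Int): ThreadPoolExecutor =
    new ThreadPoolExecutor(
      threadCount, threadCount,            // core size == max size
      60L, TimeUnit.SECONDS,               // keep-alive for idle threads above core (none here)
      new LinkedBlockingQueue[Runnable]()) // unbounded queue (an assumption of this sketch)
}

object FixedSizePoolDemo extends App {
  // Deliberately ask for core = 2 and max = 8 with an unbounded queue.
  val pool = new ThreadPoolExecutor(
    2, 8, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue[Runnable]())
  val release = new CountDownLatch(1)

  // Submit far more blocking tasks than there are core threads.
  (1 to 20).foreach { _ =>
    pool.execute(new Runnable { def run(): Unit = release.await() })
  }

  // Extra tasks wait in the queue; the pool does not grow toward max = 8.
  assert(pool.getPoolSize == 2)
  assert(pool.getQueue.size == 18)

  release.countDown()
  pool.shutdown()
}
```

With a bounded queue the maximum would matter again, because the executor adds threads beyond the core size only once the queue rejects a task.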
[GitHub] spark pull request: [SPARK-3607] ConnectionManager threads.max con...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3664#issuecomment-66491450 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-1037] The name of findTaskFromList fi...
GitHub user ilganeli opened a pull request: https://github.com/apache/spark/pull/3665

[SPARK-1037] The name of findTaskFromList & findTask in TaskSetManager.scala is confusing

Hi all - I've renamed the methods referenced in this JIRA to clarify that they modify the provided arrays (find vs. dequeue).

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/ilganeli/spark SPARK-1037B
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/3665.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #3665

commit 683482afddd2ab45626fa57ccac6711314669dd1
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:43:08Z
    Renamed private methods to clarify that they modify the provided parameters

commit f27d85ebdbe1355039c80f236c9075a446e3018c
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:46:12Z
    Renamed private methods to clarify that they modify the provided parameters
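The naming distinction in this PR can be illustrated with a small sketch: a method called `find...` suggests a read-only lookup, while `dequeue...` signals that the matching element is removed from the collection passed in. The object and method bodies below are hypothetical and much simpler than the real `TaskSetManager` logic; they only show why the name matters.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical, simplified illustration; not the TaskSetManager implementation.
object TaskLists {
  /** Read-only lookup: returns the first matching task index and leaves `pending` untouched. */
  def findTask(pending: ArrayBuffer[Int], isRunnable: Int => Boolean): Option[Int] =
    pending.find(isRunnable)

  /** Mutating lookup: removes the first matching task index from `pending` and returns it. */
  def dequeueTask(pending: ArrayBuffer[Int], isRunnable: Int => Boolean): Option[Int] = {
    val i = pending.indexWhere(isRunnable)
    if (i < 0) None else Some(pending.remove(i))
  }
}

object TaskListsDemo extends App {
  val pending = ArrayBuffer(4, 7, 9)

  assert(TaskLists.findTask(pending, _ > 5) == Some(7))
  assert(pending == ArrayBuffer(4, 7, 9)) // unchanged after a find

  assert(TaskLists.dequeueTask(pending, _ > 5) == Some(7))
  assert(pending == ArrayBuffer(4, 9))    // the dequeued element is gone
}
```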
[GitHub] spark pull request: [SPARK-1037] The name of findTaskFromList fi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3665#issuecomment-66493048 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4569] Rename 'externalSorting' in Aggre...
GitHub user ilganeli opened a pull request: https://github.com/apache/spark/pull/3666

[SPARK-4569] Rename 'externalSorting' in Aggregator

Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening.

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/ilganeli/spark SPARK-4569B
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/3666.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #3666

commit 5b3f39cf4f1475a4b656eb24d563af80e4a953c9
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:51:42Z
    [SPARK-4569] Rename in Aggregator

commit d7cefec06e0e3b235ee67bcdf8bf115c92a1cbed
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:52:40Z
    [SPARK-4569] Rename 'externalSorting' in Aggregator

commit e2d20929b043ed4dbe1001bb38e3e441c8450992
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:53:53Z
    [SPARK-4569] Rename 'externalSorting' in Aggregator

commit 18103943e4b2584ce3079f466cdd7e3253675fac
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date: 2014-12-10T17:54:39Z
    [SPARK-4569] Rename 'externalSorting' in Aggregator
[GitHub] spark pull request: [SPARK-4569] Rename 'externalSorting' in Aggre...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3666#issuecomment-66493838 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/3543#discussion_r21622115

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
   def createParquetFile[A <: Product : TypeTag](
       path: String,
       allowExisting: Boolean = true,
-      conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --

@koeninger The issue that you linked is concerned with thread-safety issues when multiple threads concurrently modify the same `Configuration` instance.

It turns out that there's another, older thread-safety issue related to `Configuration`'s constructor not being thread-safe due to non-thread-safe static state: https://issues.apache.org/jira/browse/HADOOP-10456. This has been fixed in some newer Hadoop releases, but since it was only reported in April I don't think we can ignore it. As a result, https://issues.apache.org/jira/browse/SPARK-1097 implements a workaround which synchronizes on an object before calling `new Configuration`.

Currently, I think the extra synchronization logic is only implemented in `HadoopRDD`, but it should probably be used everywhere just to be safe. I think that `HadoopRDD` was the highest-risk place where we might have many threads creating Configurations at the same time, which is probably why that patch's author didn't add the synchronization everywhere.
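The workaround described in this comment, synchronizing on a shared object before calling a constructor that is not thread-safe, could be sketched roughly as follows. To keep the example free of Hadoop dependencies, `LegacyConfig` is a hypothetical stand-in whose constructor touches unsynchronized static state, and `ConfigFactory` is an assumed name; real code would construct `org.apache.hadoop.conf.Configuration` inside the synchronized block instead.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in: the constructor mutates shared static state without locking,
// which only mimics the kind of problem reported for Hadoop's Configuration.
class LegacyConfig {
  LegacyConfig.registry += this
}

object LegacyConfig {
  val registry: ArrayBuffer[LegacyConfig] = ArrayBuffer.empty[LegacyConfig]
}

// All construction goes through one lock, so constructors never run concurrently.
object ConfigFactory {
  private val lock = new Object

  def newConfig(): LegacyConfig = lock.synchronized {
    new LegacyConfig
  }
}

object ConfigFactoryDemo extends App {
  val threads = (1 to 8).map { _ =>
    new Thread(new Runnable {
      def run(): Unit = (1 to 1000).foreach(_ => ConfigFactory.newConfig())
    })
  }
  threads.foreach(_.start())
  threads.foreach(_.join())

  // Every construction was serialized, so no registration was lost.
  assert(LegacyConfig.registry.size == 8000)
}
```

The guarantee only holds if every caller goes through the same lock, which is in line with the suggestion above to apply the synchronization everywhere rather than in a single class.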
[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling
Github user akopich commented on the pull request: https://github.com/apache/spark/pull/1269#issuecomment-66498633 (5) Enumerator BTW, the names `TokenIndexer` and `TokenIndex` look confusing (though these classes rely on `breeze.util.Index`), so I renamed it to `TokenEnumerator`.