[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution

2014-12-10 Thread nchammas
Github user nchammas commented on the pull request:

https://github.com/apache/spark/pull/3564#issuecomment-66416442
  
Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3564#issuecomment-66416417
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24298/
Test FAILed.


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/3603#discussion_r21588804
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala ---
@@ -174,37 +174,18 @@ class IDFModel private[mllib] (val idf: Vector) extends Serializable {
    */
   def transform(dataset: RDD[Vector]): RDD[Vector] = {
     val bcIdf = dataset.context.broadcast(idf)
-    dataset.mapPartitions { iter =>
-      val thisIdf = bcIdf.value
-      iter.map { v =>
-        val n = v.size
-        v match {
-          case sv: SparseVector =>
-            val nnz = sv.indices.size
-            val newValues = new Array[Double](nnz)
-            var k = 0
-            while (k < nnz) {
-              newValues(k) = sv.values(k) * thisIdf(sv.indices(k))
-              k += 1
-            }
-            Vectors.sparse(n, sv.indices, newValues)
-          case dv: DenseVector =>
-            val newValues = new Array[Double](n)
-            var j = 0
-            while (j < n) {
-              newValues(j) = dv.values(j) * thisIdf(j)
-              j += 1
-            }
-            Vectors.dense(newValues)
-          case other =>
-            throw new UnsupportedOperationException(
-              s"Only sparse and dense vectors are supported but got ${other.getClass}.")
-        }
-      }
-    }
+    dataset.mapPartitions(iter => iter.map(v => IDFModel.transform(bcIdf.value, v)))
   }
 
   /**
+   * Transforms term frequency (TF) vectors to a TF-IDF vector
--- End diff --


https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7c5eb57aa2d7d6da7afb24b85429ac14L181
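For context, the refactoring above moves the per-vector logic into a shared helper so that both the RDD-based `transform` and the new single-vector `transform` can reuse it. A rough sketch of such a helper, reconstructed from the removed loop above (the actual code lives in the linked commit, so the object's name and visibility here are assumptions):

    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

    object IDFModel {  // companion object (sketch)

      /**
       * Transforms a term frequency (TF) vector to a TF-IDF vector given an IDF vector.
       * (Reconstructed from the removed per-vector loop; illustrative only.)
       */
      def transform(idf: Vector, v: Vector): Vector = {
        val n = v.size
        v match {
          case sv: SparseVector =>
            val nnz = sv.indices.size
            val newValues = new Array[Double](nnz)
            var k = 0
            while (k < nnz) {
              newValues(k) = sv.values(k) * idf(sv.indices(k))
              k += 1
            }
            Vectors.sparse(n, sv.indices, newValues)
          case dv: DenseVector =>
            val newValues = new Array[Double](n)
            var j = 0
            while (j < n) {
              newValues(j) = dv.values(j) * idf(j)
              j += 1
            }
            Vectors.dense(newValues)
          case other =>
            throw new UnsupportedOperationException(
              s"Only sparse and dense vectors are supported but got ${other.getClass}.")
        }
      }
    }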


[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3564#issuecomment-66417200
  
  [Test build #24300 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24300/consoleFull)
 for   PR 3564 at commit 
[`b583f81`](https://github.com/apache/spark/commit/b583f8199229f176c462e4095c8d196c0fc21bba).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/3603#discussion_r21588828
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -53,6 +53,19 @@ class IDFSuite extends FunSuite with MLlibTestSparkContext {
     val tfidf2 = tfidf(2L).asInstanceOf[SparseVector]
     assert(tfidf2.indices === Array(1))
     assert(tfidf2.values(0) ~== (1.0 * expected(1)) absTol 1e-12)
+
+    // Transforms local vectors
--- End diff --


https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cR44


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/3603#discussion_r21588814
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -17,12 +17,10 @@
 
 package org.apache.spark.mllib.feature
 
-import org.scalatest.FunSuite
-
-import org.apache.spark.SparkContext._
 import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.mllib.util.TestingUtils._
+import org.scalatest.FunSuite
--- End diff --


https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cL20


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/3603#discussion_r21588839
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/IDFSuite.scala ---
@@ -86,6 +101,19 @@ class IDFSuite extends FunSuite with MLlibTestSparkContext {
     val tfidf2 = tfidf(2L).asInstanceOf[SparseVector]
     assert(tfidf2.indices === Array(1))
     assert(tfidf2.values(0) ~== (1.0 * expected(1)) absTol 1e-12)
+
+    // Transforms local vectors
--- End diff --


https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-7440885aeb7f73a84564ec244399fc5cR85


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/3603#discussion_r21588845
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -220,12 +220,15 @@ def transform(self, dataset):
         the terms which occur in fewer than `minDocFreq`
         documents will have an entry of 0.
 
-        :param dataset: an RDD of term frequency vectors
-        :return: an RDD of TF-IDF vectors
+        :param data: an RDD of term frequency vectors or a term frequency vector
+        :return: an RDD of TF-IDF vectors or a TF-IDF vector
         """
-        if not isinstance(dataset, RDD):
+        if isinstance(data, RDD):
+            return JavaVectorTransformer.transform(self, data)
+        elif isinstance(data, Vector):
--- End diff --


https://github.com/yu-iskw/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360#diff-722e3d483892191debee07edd1a85fc8R226


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/3603#issuecomment-66417392
  
@jkbradley Thank you for your comments.
I added the `[mllib]` tag to the PR title and modified the source code following your advice.
Could you please review the updated diff?
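For reviewers following along, the net effect of the change is that `IDFModel.transform` accepts either an RDD of term frequency vectors or a single local vector (as the Python docstring change above suggests). A minimal usage sketch, with made-up names and assuming the single-vector overload added by this PR, plus an active SparkContext `sc`:

    import org.apache.spark.mllib.feature.IDF
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // A tiny term-frequency dataset, just for illustration.
    val tf: RDD[Vector] = sc.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 3.0),
      Vectors.sparse(3, Array(1), Array(2.0))))

    val idfModel = new IDF(minDocFreq = 1).fit(tf)        // existing API
    val tfidfRdd: RDD[Vector] = idfModel.transform(tf)    // existing RDD overload
    val tfidfVec: Vector = idfModel.transform(tf.first()) // new single-vector overload (this PR)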


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3603#issuecomment-66417572
  
  [Test build #24301 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24301/consoleFull)
 for   PR 3603 at commit 
[`a3bf566`](https://github.com/apache/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66419204
  
Thanks. However, I cannot see why this is a breaking change. Please let me
know where it causes problems, as it passes the tests now.

In fact, this PR does not change much. The original code closes and reopens
the `FileInputStream` for every batch read; this PR keeps the stream open
across those batches. Other parts are untouched.



[GitHub] spark pull request: [SPARK-3611] Show number of cores for each exe...

2014-12-10 Thread devldevelopment
Github user devldevelopment closed the pull request at:

https://github.com/apache/spark/pull/2980


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/3661

[SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup'

Rewrote `ContextWaiter` using `Condition`, because it provides a convenient `awaitNanos` API for timeouts.
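For readers less familiar with the idiom: any wait built on `Condition` still has to loop on its predicate, since `await`/`awaitNanos` may return spuriously; `awaitNanos` helps because it returns the remaining wait time. A minimal, generic sketch of that pattern (not the actual `ContextWaiter` code in this PR; all names are illustrative):

    import java.util.concurrent.TimeUnit
    import java.util.concurrent.locks.ReentrantLock

    object AwaitExample {
      private val lock = new ReentrantLock()
      private val condition = lock.newCondition()
      private var done = false  // predicate, guarded by `lock`

      /** Wait up to `timeoutMs` for `done` to become true; returns its final value. */
      def awaitDone(timeoutMs: Long): Boolean = {
        lock.lock()
        try {
          var nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
          // Re-check the predicate after every wakeup: awaitNanos can return early
          // (spurious wakeup) and reports how much wait time is left.
          while (!done && nanos > 0) {
            nanos = condition.awaitNanos(nanos)
          }
          done
        } finally {
          lock.unlock()
        }
      }

      def markDone(): Unit = {
        lock.lock()
        try {
          done = true
          condition.signalAll()
        } finally {
          lock.unlock()
        }
      }
    }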

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-4813

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3661.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3661


commit e06bd4fdc7d052ef55e2d98e68441586fe9d2026
Author: zsxwing zsxw...@gmail.com
Date:   2014-12-10T08:25:39Z

Fix the issue that ContextWaiter didn't handle 'spurious wakeup'




[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66421083
  
  [Test build #24302 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24302/consoleFull)
 for   PR 3661 at commit 
[`e06bd4f`](https://github.com/apache/spark/commit/e06bd4fdc7d052ef55e2d98e68441586fe9d2026).
 * This patch merges cleanly.


[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66423087
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24299/
Test FAILed.


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread pkallos
Github user pkallos commented on the pull request:

https://github.com/apache/spark/pull/3603#issuecomment-66423122
  
:+1:


[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66423082
  
  [Test build #24299 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24299/consoleFull)
 for   PR 3651 at commit 
[`125b0b6`](https://github.com/apache/spark/commit/125b0b64efc22c5a573aea00bf9bfdb53393cdbe).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4609] Blacklist hosts rather than execu...

2014-12-10 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3541#issuecomment-66424149
  
On Thu, Dec 4, 2014 at 2:57 AM, Davies Liu notificati...@github.com wrote:

 @davies https://github.com/davies I am not sure I completely understood
 your comment.
 Sorry for that, maybe I did not explain it clearly.

 As detailed above, there are multiple reasons why a task can fail - and
 quite a lot of them are non-fatal from 'rescheduling the task on same 
host'
 point of view : in particular race in spark between reporting executor
 going down, shutdown hooks running and task schedules due to locality
 preference.
 So we need per-executor blacklist - note that this is just a temporary -
 to either allow the executor to recover (in case task failures are due to
 transient reasons), or allow task to get scheduled elsewhere in meantime
 (if schedule locality constraints can be satisfied).

 Agreed that the executor based blacklist worked for you, and I think the
 host based blacklist will also work for you (there is a little regression
 about locality).


It is not a small regression - if you have 4 to 8 executors on a host (as is
common here), this change will blacklist all of them instead of
blacklisting a single executor.
This is a fairly severe regression, which is why I said I am -1 on modifying
existing behavior unless the new functionality allows the existing feature to
continue to work as currently expected.
The thing to understand is that the executor blacklist is not subsumed by the
host blacklist, other than in a very crude model.



  A different set of criteria would apply when we want to do host level
 blacklist - when we have determined that the node is unusable, and so task
 fails on all executors in the node : due to NODE_LOCAL locality level, we
 would keep trying other executors on the same node in case executor
 blacklist kicks in; so in case the node is temporarily unusable, executor
 black list might not help.

 So we need host based blacklist.


Yes, the reasons why we need a host blacklist are valid and separate from why
we need an executor blacklist.
They might overlap in some degenerate cases (since host-level issues obviously
impact executors too): the executor blacklist is more fine grained, while
host-level issues are coarser in comparison.
While the executor blacklist might alleviate the lack of a host blacklist to some
extent (as it does currently), it is suboptimal to do so, so the need for a host
blacklist is justified.




  The timeout based temporary executor blacklist we currently have is
 still a stop gap solution which solves immediate problems observed at that
 time : without which spark was becoming unusable in large enough
 multi-tenant clusters.

 Agreed.

 If we want to it to a host level and do a principled solution - then we
 need a lot of other pieces to be put into place (since currently we only
 take task scheduling into account; which is insufficient).
 Top of my head - remove it from rdd replication, de-allocate executors
 already on the node, moving existing rdd blocks away from the executors on
 the node, blacklisting the node from further allocation requests (yarn,
 mesos), and so on. I am sure @kayousterhout
 https://github.com/kayousterhout might have other thoughts on this.

 Agreed. Figuring out the failure domain is hard in a distributed
 environment; I doubt that anyone can contribute a principled solution for
 retrying failed tasks in the best position in the near term (such as
 rescheduling on the same executor, a different executor on the same host, a
 different host, or a different rack).

 I think the host based blacklist is the simplest solution and work well in
 most failure cases.

 Unfortunately, I do not have the bandwidth to engage on this; so I am
 hoping the right thing gets done. Whatever it is, I am -1 on removing
 executor level blacklist - that is something we heavily depend on to get
 our jobs to work. A better solution while not regressing on this
 functionality is most welcome !

 Really appreciate your comments here; they help toward a better solution. Could
 you describe detailed cases where the host-based blacklist would break your job?
 Maybe there are some cases in your situation I did not figure out; please
 correct me.



The primary reasons for the executor blacklist, as @kayousterhout
https://github.com/kayousterhout also referred to, were initially quite simple:
a task gets submitted to the same executor repeatedly due to a locality
constraint, but keeps failing on that executor since the executor might be in an
inconsistent state (like in the middle of a shutdown, etc.). This very quickly

[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21591585
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --

Minor point - these are not in the JDK but in a FindBugs library for
JSR-305. It's not used in Spark, and just happens to be a dependency now. Maybe
it's not worth using it in just one place?


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21591750
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
   private var error: Throwable = null
+
+  @GuardedBy("lock")
   private var stopped: Boolean = false
 
-  def notifyError(e: Throwable) = synchronized {
-    error = e
-    notifyAll()
+  def notifyError(e: Throwable) = {
+    lock.lock()
+    try {
+      error = e
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
 
-  def notifyStop() = synchronized {
-    stopped = true
-    notifyAll()
+  def notifyStop() = {
+    lock.lock()
+    try {
+      stopped = true
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
 
-  def waitForStopOrError(timeout: Long = -1) = synchronized {
-    // If already had error, then throw it
-    if (error != null) {
-      throw error
-    }
+  /**
+   * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or
+   * `false` if the waiting time detectably elapsed before return from the method.
+   */
+  def waitForStopOrError(timeout: Long = -1): Boolean = {
+    lock.lock()
+    try {
+      if (timeout < 0) {
+        while (true) {
--- End diff --

Maybe it's just me, but it feels like these loops would be simpler just
testing `while (!stopped && error == null)`? `nanos` would be tested in the
other one too. This avoids duplication, and also avoids the unreachable return
value, because you check these conditions in one place at the end.
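To make the suggestion concrete, a sketch of the shape being proposed (based on the `lock`, `condition`, `stopped`, and `error` fields shown in the diff above; this is an illustration, not the PR's final code):

    def waitForStopOrError(timeout: Long = -1): Boolean = {
      lock.lock()
      try {
        if (timeout < 0) {
          // Wait indefinitely, re-checking the predicate on every (possibly spurious) wakeup.
          while (!stopped && error == null) {
            condition.await()
          }
        } else {
          var nanos = TimeUnit.MILLISECONDS.toNanos(timeout)
          // Same predicate plus the remaining-time check, so both branches share one exit test.
          while (!stopped && error == null && nanos > 0) {
            nanos = condition.awaitNanos(nanos)
          }
        }
        // One place at the end to report the outcome.
        if (error != null) throw error
        stopped
      } finally {
        lock.unlock()
      }
    }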


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3603#issuecomment-66425154
  
  [Test build #24301 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24301/consoleFull)
 for   PR 3603 at commit 
[`a3bf566`](https://github.com/apache/spark/commit/a3bf566e923be8c8d5787d8c8ffb777a5886f360).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4494][mllib] IDFModel.transform() add s...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3603#issuecomment-66425157
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24301/
Test PASSed.


[GitHub] spark pull request: SPARK-4159 [CORE] [WIP] Maven build doesn't ru...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66425799
  
  [Test build #24303 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24303/consoleFull)
 for   PR 3651 at commit 
[`11bd041`](https://github.com/apache/spark/commit/11bd041909a20b6d7c1b5074d6b78133aa1ff547).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66426107
  
Hmm, this might be tricky to explain if you do not have sufficient context; let
me give it a shot.
a) Streams in Java are not usually multiplexed, unless explicitly stated
otherwise. With this PR, the same underlying stream (fileStream) is being reused
across deserializeStream instances (and their users). One way this manifests is (b).

b) Most streams in Java override finalize to close their underlying stream when
they go out of scope (to prevent resource leaks, etc.): of course this is an
implementation detail, but it is the general expectation.
In this case, deserializeStream gets re-assigned somewhere in the method below,
causing the previous 'deserializeStream' to go out of scope. When GC kicks in and
the finalizers run, the old deserializeStream's finalize can call its close,
resulting in fileStream getting closed while it may now be in use by some other
deserializeStream, since it was re-used. This will cause hard-to-debug crashes/bugs.

I am sure I am missing other spectacular ways in which this can fail :-) - in
general, these things happen when a basic API expectation (probably implicit here)
is broken. We could go down this path if the operation being saved were very
expensive, which is not the case here (it is a cheap file open/close that is
saved).
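A minimal sketch of the sharing pattern being described (illustrative names only, not the code in this PR):

    import java.io.{BufferedInputStream, FileInputStream}

    // One long-lived underlying stream, reused by successive wrapper streams.
    val fileStream = new FileInputStream("/tmp/batches.dat")

    var wrapper = new BufferedInputStream(fileStream)
    // ... read one batch through `wrapper` ...

    // Re-assigning the variable orphans the previous wrapper without closing it.
    wrapper = new BufferedInputStream(fileStream)

    // The concern above: if the orphaned wrapper were finalized in a way that closes
    // what it wraps, the shared fileStream would be closed underneath the new wrapper;
    // whether the wrappers involved here actually do that is what the reply further
    // below disputes.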


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66426037
  
  [Test build #24304 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24304/consoleFull)
 for   PR 3644 at commit 
[`3bb8731`](https://github.com/apache/spark/commit/3bb8731a33ecf2bde076df92aa8619340fe3e84a).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21592200
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --

> Maybe not worth using it just 1 place?

So which one do you prefer?
1. Use comments to describe such information.
2. Use `GuardedBy` from now on.



[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21592261
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --

In addition, now Findbugs does not recognize `GuardedBy` in Scala codes.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21592650
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --

BTW, I turned to `GuardedBy` because @aarondav asked me to do it in #3634


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66427302
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24304/
Test FAILed.


[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66427294
  
  [Test build #24304 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24304/consoleFull)
 for   PR 3644 at commit 
[`3bb8731`](https://github.com/apache/spark/commit/3bb8731a33ecf2bde076df92aa8619340fe3e84a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ParquetTest `



[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21592824
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
--- End diff --

Yes, that's why I brought it up. It's not actually a standard Java
annotation (unless someone tells me it just turned up in Java 8 or something) but
part of JSR-305. This is a dependency of Spark core at the moment, but none of
its annotations are used. I think we should just not use them, rather than use
this library in one place.


[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...

2014-12-10 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66428269
  
I'm pretty convinced this works now. I'm diffing the test run output
between master and this branch, and the Scala tests are the same. The only
visible differences are that `scalatest` turns up in every module and, of
course, there is now output from `surefire`.

Note that I did _not_ enable assertions in SBT for now, which I mentioned in a
related conversation. There's another issue with that tracked in
http://issues.apache.org/jira/browse/SPARK-4814

I also think this is a predecessor to
https://issues.apache.org/jira/browse/SPARK-3431

Let's see what Jenkins says. I'm calling this no longer a WIP.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r21593635
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,74 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.{TimeoutException, TimeUnit}
+import java.util.concurrent.locks.ReentrantLock
+import javax.annotation.concurrent.GuardedBy
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  @GuardedBy("lock")
   private var error: Throwable = null
+
+  @GuardedBy("lock")
   private var stopped: Boolean = false
 
-  def notifyError(e: Throwable) = synchronized {
-    error = e
-    notifyAll()
+  def notifyError(e: Throwable) = {
+    lock.lock()
+    try {
+      error = e
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
 
-  def notifyStop() = synchronized {
-    stopped = true
-    notifyAll()
+  def notifyStop() = {
+    lock.lock()
+    try {
+      stopped = true
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
 
-  def waitForStopOrError(timeout: Long = -1) = synchronized {
-    // If already had error, then throw it
-    if (error != null) {
-      throw error
-    }
+  /**
+   * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or
+   * `false` if the waiting time detectably elapsed before return from the method.
+   */
+  def waitForStopOrError(timeout: Long = -1): Boolean = {
+    lock.lock()
+    try {
+      if (timeout < 0) {
+        while (true) {
--- End diff --

It's cleaner now.


[GitHub] spark pull request: [SPARK-4033][Examples]Input of the SparkPi too...

2014-12-10 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/2874#issuecomment-66428460
  
@SaintBacchus why did you close this? It seems like it still needs a fix, and
you had an improvement going here.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66429086
  
  [Test build #24305 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24305/consoleFull)
 for   PR 3661 at commit 
[`be42bcf`](https://github.com/apache/spark/commit/be42bcfaa38a3f3fbe4fc759656a61c72f0fb556).
 * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3564#issuecomment-66429754
  
**[Test build #24300 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24300/consoleFull)**
 for PR 3564 at commit 
[`b583f81`](https://github.com/apache/spark/commit/b583f8199229f176c462e4095c8d196c0fc21bba)
 after a configured wait of `120m`.


[GitHub] spark pull request: [SPARK-3431] [WIP] Parallelize test execution

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3564#issuecomment-66429761
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24300/
Test FAILed.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66430212
  
  [Test build #24302 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24302/consoleFull)
 for   PR 3661 at commit 
[`e06bd4f`](https://github.com/apache/spark/commit/e06bd4fdc7d052ef55e2d98e68441586fe9d2026).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66430222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24302/
Test PASSed.


[GitHub] spark pull request: [SPARK-4789] [mllib] Standardize ML Prediction...

2014-12-10 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3637#issuecomment-66430592
  
So, I may not be 100% up to speed with the new API and these changes, so my
comments may be a bit off, but:

An Estimator makes a Model. To make a model, you need raw data and its
interpretation, if you will. A LabeledPoint is raw data. That alone is not
sufficient to train a Classifier (Estimator); yes, this extra info has to come
from somewhere.

I agree that SchemaRDD contains, or could contain, or could be made to
deduce, this extra interpretation, so the SchemaRDD API makes sense to me.

If LabeledPoint is to remain the raw data, then given the conversation here,
the extra info has to come in as parameters or something. I think you still need
these for testing, right? You still need to know what the raw data means. Or is
it assumed that the built Classifier / Model stores this info?

This is sort of a rehash of the same exchange we just had, in that the
question is caused by the input data abstraction not really containing all the
input -- the metadata comes along separately. Which could be OK, but yes, it
means this question pops up somewhere else in the API.

Yes, a Model may be able to remember the metadata and accept raw
LabeledPoints in the future. You just have to make sure you are feeding it raw
LabeledPoints that use the same metadata, but that's a given no matter how you
design this.

To answer the question: I'd hide the typed API, I suppose. I think the typed
API has to take some other values to carry metadata like the type of features,
etc. These could be more parameters, then? It kind of overloads the meaning,
since the parameters look like they are intended to be hyperparameters. But
it's not crazy.

Transformations: these feel like they could meaningfully operate on raw
data, so a typed API makes sense to me and could be public now.


[GitHub] spark pull request: Print the specified number of data and handle ...

2014-12-10 Thread surq
GitHub user surq opened a pull request:

https://github.com/apache/spark/pull/3662

Print the specified number of data and handle all of the elements in RDD

The current DStream.print function prints 10 elements but only takes 11 elements
from the RDD.
This PR adds a new function, based on DStream.print, that prints the specified
number of elements while still handling all of the elements in the RDD.
Example scenario:
val dstream = stream.map(...).filter(...).mapPartitions(...).print(...)
The data remaining after the filter needs to update a database inside
mapPartitions, but we don't need to print every element; printing the first 20
is enough to see how the data is being processed.
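As an illustration of the requested behavior (this is not the PR's implementation, just one way to express the requirement with the existing API; `updateDatabase` is a made-up placeholder for the side effect described above):

    dstream.foreachRDD { rdd =>
      rdd.cache()
      // Run the side effect over every element (a full pass over the RDD)...
      rdd.foreachPartition(iter => iter.foreach(record => updateDatabase(record)))
      // ...but print only the first 20 elements for inspection.
      rdd.take(20).foreach(println)
      rdd.unpersist()
    }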

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/surq/spark SPARK-4817

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3662


commit 4e3f715941f94cb2467ca68b205a5fa3630130a3
Author: surq s...@asiainfo.com
Date:   2014-12-10T10:49:54Z

Print the specified number of data and handle all of the elements in RDD




[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...

2014-12-10 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3662#issuecomment-66434968
  
You should put `SPARK- [STREAMING]` in the title. But your original 
JIRA was a duplicate of https://issues.apache.org/jira/browse/SPARK-3325 so 
perhaps you can connect this to that JIRA.


[GitHub] spark pull request: [SPARK-4815][SQL] Fix: ThriftServer use only o...

2014-12-10 Thread guowei2
GitHub user guowei2 opened a pull request:

https://github.com/apache/spark/pull/3663

[SPARK-4815][SQL] Fix: ThriftServer use only one SessionState to run sql 
using hive

Use a `SessionState` map in `HiveContext` to store all of the session
states, keyed by thread id.
The session state is updated when a new Hive session is opened and when the
session is closed.
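A minimal sketch of the bookkeeping described above (illustrative only; the names, and the use of a plain concurrent map keyed by thread id, are assumptions rather than the PR's actual code):

    import java.util.concurrent.ConcurrentHashMap

    class SessionState  // stand-in for Hive's SessionState

    object SessionStates {
      private val statesByThread = new ConcurrentHashMap[Long, SessionState]()

      // Register a fresh state when a session is opened on the current thread.
      def openSession(): SessionState = {
        val state = new SessionState
        statesByThread.put(Thread.currentThread().getId, state)
        state
      }

      // Look up the state belonging to the calling thread.
      def currentSession: Option[SessionState] =
        Option(statesByThread.get(Thread.currentThread().getId))

      // Drop the state when the session is closed.
      def closeSession(): Unit = {
        statesByThread.remove(Thread.currentThread().getId)
      }
    }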

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/guowei2/spark SPARK-4815

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3663.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3663


commit 0e9f2239836ce132466070f85090f282a3ff4fbe
Author: guowei2 guow...@asiainfo.com
Date:   2014-12-10T10:25:34Z

Fix: ThriftServer use only one SessionState to run sql using hive




[GitHub] spark pull request: [SPARK-4815][SQL] Fix: ThriftServer use only o...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3663#issuecomment-66435369
  
Can one of the admins verify this patch?


[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66435431
  
  [Test build #24303 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24303/consoleFull)
 for   PR 3651 at commit 
[`11bd041`](https://github.com/apache/spark/commit/11bd041909a20b6d7c1b5074d6b78133aa1ff547).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66435437
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24303/
Test FAILed.


[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66435647
  
I agree with you that the operation saved here is a cheap one. :-) However, the 
problem you mentioned would not happen with the current version of 
`DeserializationStream`.

Not all InputStreams close their underlying stream when they are collected 
by the GC. There are detailed discussions 
[here](http://www.coderanch.com/t/278165/java-io/java/InputStream-close-garbage-collection)
 and 
[there](http://stackoverflow.com/questions/1522370/does-input-outputstreams-close-on-destruction).
 I am sure that `FileInputStream` implements `finalize` to close the underlying 
file. But the other streams used here do not, as the tests show.

`DeserializationStream` is implemented in Spark and has no such behavior. While 
modifying the code, I checked it and found that you must explicitly call its 
`close` to close its underlying stream. That is why it passes the tests.

I am OK with closing this PR if it causes a problem. But if it does not really 
cause the mentioned problem, I cannot see why a slight performance improvement 
is bad.
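
A toy illustration of the distinction being discussed (this is not Spark code; the 
class name is made up): a plain wrapper stream releases its underlying stream only 
when `close()` is called explicitly, whereas `FileInputStream` also closes its file 
descriptor from `finalize()`.

```
import java.io.{ByteArrayInputStream, InputStream}

// Made-up wrapper: nothing here runs at GC time, so dropping the reference
// without calling close() never closes `in` -- unlike FileInputStream, whose
// finalize() closes the underlying file descriptor.
class WrapperStream(in: InputStream) extends InputStream {
  override def read(): Int = in.read()
  override def close(): Unit = in.close() // the only path that closes the wrapped stream
}

object WrapperStreamDemo extends App {
  val s = new WrapperStream(new ByteArrayInputStream(Array[Byte](1, 2, 3)))
  println(s.read()) // 1
  s.close()         // must be explicit; garbage collection alone would not do it
}
```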






[GitHub] spark pull request: SPARK-4159 [CORE] Maven build doesn't run JUni...

2014-12-10 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3651#issuecomment-66435761
  
Jenkins, retest this.





[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3662#issuecomment-66435798
  
  [Test build #24306 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24306/consoleFull)
 for   PR 3662 at commit 
[`4e3f715`](https://github.com/apache/spark/commit/4e3f715941f94cb2467ca68b205a5fa3630130a3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66438151
  
  [Test build #24305 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24305/consoleFull)
 for   PR 3661 at commit 
[`be42bcf`](https://github.com/apache/spark/commit/be42bcfaa38a3f3fbe4fc759656a61c72f0fb556).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3661#issuecomment-66438157
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24305/
Test PASSed.





[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...

2014-12-10 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3660#issuecomment-66439177
  
~~ we should mark codegenEnabled as lazy. ~~

`lazy` doesn't work because `codegenEnabled` has not been used before 
serialization.
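
A toy example of why (not the actual SparkPlan/codegen code, just the Scala 
semantics): a `lazy val` that was never forced on the driver is not captured at 
serialization time; it is simply recomputed at first access after deserialization, 
so it sees the remote environment.

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Toy class: the lazy val reads a "configuration" from system properties.
class Node extends Serializable {
  lazy val flag: Boolean = sys.props.getOrElse("example.codegen", "false").toBoolean
}

object LazyValSerializationDemo extends App {
  sys.props("example.codegen") = "true" // "driver-side" setting
  val node = new Node                   // flag is NOT evaluated here

  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(node)
  out.close()

  sys.props("example.codegen") = "false" // pretend this is the "executor" side
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  val copy = in.readObject().asInstanceOf[Node]

  println(copy.flag) // false: the lazy val is evaluated only now, on the "executor" side
}
```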





[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3660#issuecomment-66439622
  
  [Test build #24307 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24307/consoleFull)
 for   PR 3660 at commit 
[`a3eea56`](https://github.com/apache/spark/commit/a3eea5692b7bf2fd88b27032e899b776651ef321).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66442416
  
  [Test build #24308 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24308/consoleFull)
 for   PR 1269 at commit 
[`7f9b7c3`](https://github.com/apache/spark/commit/7f9b7c35c28e3399a8c34d494064a3bbd238d9c2).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66443297
  
I think you are missing the point - we should not rely on specific 
implementation details, regardless of whether it is currently done or not - that 
leads to a brittle codebase. finalize() *can* close the wrapped stream because 
that is the implicit contract.






[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3662#issuecomment-66444342
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24306/
Test PASSed.





[GitHub] spark pull request: [SPARK-4817][streaming]Print the specified num...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3662#issuecomment-66444334
  
  [Test build #24306 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24306/consoleFull)
 for   PR 3662 at commit 
[`4e3f715`](https://github.com/apache/spark/commit/4e3f715941f94cb2467ca68b205a5fa3630130a3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3660#issuecomment-66445954
  
  [Test build #24307 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24307/consoleFull)
 for   PR 3660 at commit 
[`a3eea56`](https://github.com/apache/spark/commit/a3eea5692b7bf2fd88b27032e899b776651ef321).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4812][SQL] Fix the initialization issue...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3660#issuecomment-66445964
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24307/
Test PASSed.





[GitHub] spark pull request: [SPARK-4006] In long running contexts, we enco...

2014-12-10 Thread tsliwowicz
Github user tsliwowicz commented on the pull request:

https://github.com/apache/spark/pull/2914#issuecomment-66446290
  
No problem. Glad to help :-)

On Wed, Dec 10, 2014 at 4:44 AM, andrewor14 notificati...@github.com
wrote:

 Hey sorry @tsliwowicz https://github.com/tsliwowicz for using your PRs
 as the battleground in fixing our builds against older branches. There
 aren't a lot of PRs opened against older branches so these tests aren't 
run
 in this context very often. So far I think all of these test failures have
 nothing to do with your patch so there is no action needed on your side. 
On
 our side, we'll keep investigating why the tests are failing all the time.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2914#issuecomment-66396333.






[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66446898
  
I do know that `finalize` can close the wrapped stream. I did not say it would 
not. But it only can if you implement it that way.

There is no such implicit contract as far as I know. As the discussions I included 
in the previous comment show, some InputStreams implement `finalize` and some do 
not. You cannot rely on a specific implementation found in a few InputStream types 
to generalize the behavior to all InputStream types. And there is an obvious 
counterexample, `DeserializationStream`, which does not implement the implicit 
contract.

If this PR would cause a problem, I just want to know why and where. You said 
`DeserializationStream` would cause a problem if it goes out of scope; I just 
showed you that the problem you mentioned is not the case, as the code shows.








[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66448444
  
Apart from streams associated with files and network connections, not all 
streams must always be closed when you're done with them. That is my 
understanding. Maybe that is why `DeserializationStream` does not implement 
`finalize` to close its input stream.

I think it is unnecessary to have such a long discussion for a small 
modification. I will close this PR later.





[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya closed the pull request at:

https://github.com/apache/spark/pull/3600





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66448735
  
  [Test build #24309 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24309/consoleFull)
 for   PR 1269 at commit 
[`af9bcc8`](https://github.com/apache/spark/commit/af9bcc87df561f920226342d25ca4203639bacf9).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66450681
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24308/
Test FAILed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66450671
  
  [Test build #24308 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24308/consoleFull)
 for   PR 1269 at commit 
[`7f9b7c3`](https://github.com/apache/spark/commit/7f9b7c35c28e3399a8c34d494064a3bbd238d9c2).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66451641
  
I think I did say this would not go into Spark at the very beginning of my 
review :-)
On the assumption that you would want to continue improving Spark IO, I 
wanted to clarify why it won't go in. This part of Spark core is critical to the 
correctness of IO - hence the additional scrutiny (when I get time) to ensure 
no bugs are introduced. We have fixed quite a lot of issues here, and relying 
on (existing) implementation details is asking for trouble.





[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66453822
  
Thanks. But in the end, you still have not provided a rational explanation for 
why it would fail. At least, it is not convincing to me. :-) Anyway, thanks 
again for your comments and the time spent replying.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-12-10 Thread bgreeven
Github user bgreeven commented on a diff in the pull request:

https://github.com/apache/spark/pull/1290#discussion_r21603916
  
--- Diff: docs/mllib-ann.md ---
@@ -0,0 +1,239 @@
+---
+layout: global
+title: Artificial Neural Networks - MLlib
+displayTitle: <a href="mllib-guide.html">MLlib</a> - Artificial Neural Networks
+---
+
+# Introduction
+
+This document describes MLlib's Artificial Neural Network (ANN) implementation.
+
+The implementation currently consists of the following files:
+
+* 'ArtificialNeuralNetwork.scala': implements the ANN
+* 'ANNSuite': implements automated tests for the ANN and its gradient
+* 'ANNDemo': a demo that approximates three functions and shows a 
graphical representation of
+the result
+
+# Summary of usage
+
+The ArtificialNeuralNetwork object is used as an interface to the neural 
network. It is
+called as follows:
+
+```
+val annModel = ArtificialNeuralNetwork.train(rdd, hiddenLayersTopology, 
maxNumIterations)
+```
+
+where
+
+* `rdd` is an RDD of type (Vector,Vector), the first element containing 
the input vector and
+the second the associated output vector.
+* `hiddenLayersTopology` is an array of integers (Array[Int]), which 
contains the number of
+nodes per hidden layer, starting with the layer that takes inputs from the 
input layer, and
+finishing with the layer that outputs to the output layer. The bias nodes 
are not counted.
+* `maxNumIterations` is an upper bound to the number of iterations to be 
performed.
+* `annModel` contains the trained ANN parameters, and can be used to calculate the ANN's
+approximation to arbitrary input values.
+
+The approximations can be calculated as follows:
+
+```
+val v_out = annModel.predict(v_in)
+```
+
+where `v_in` is either a Vector or an RDD of Vectors, and `v_out` is, respectively, a
+Vector or an RDD of (Vector,Vector) pairs corresponding to input and output values.
+
+Further details and other calling options will be elaborated upon below.
+
+# Architecture and Notation
+
+The file ArtificialNeuralNetwork.scala implements the ANN. The following 
picture shows the
+architecture of a 3-layer ANN:
+
+```
+ +---+
+ |   |
+ | N_0,0 |
+ |   | 
+ +---++---+
+  |   |
+ +---+| N_0,1 |   +---+
+ |   ||   |   |   |
+ | N_1,0 |-   +---+ -| N_0,2 |
+ |   | \ Wij1  /  |   |
+ +---+  --+---+  --   +---+
+   \  |   | / Wjk2
+ :  -| N_1,1 |-  +---+
+ :|   |   |   |
+ :+---+   | N_1,2 |
+ :|   |
+ ::   +---+
+ ::
+ :::
+ :: 
+ ::   +---+
+ ::   |   |
+ ::   |N_K-1,2|
+ :|   |
+ :+---+   +---+
+ :|   |
+ :|N_J-1,1|
+  |   |
+ +---++---+
+ |   | 
+ |N_I-1,0|  
+ |   |
+ +---+
+
+ +---+++
+ |   |||
+ |   -1  ||   -1   |
+ |   |||
+ +---+++
+
+INPUT LAYER  HIDDEN LAYEROUTPUT LAYER
+```
+
+The i-th node in layer l is denoted by N_{i,l}, both i and l starting with 
0. The weight
+between node i in layer l-1 and node j in layer l is denoted by Wijl. 
Layer 0 is the input
+layer, whereas layer L is the output layer.
+
+The ANN also implements bias units. These are nodes that always output the 
value -1. The bias
+units are in all layers except the output layer. They act similar to other 
nodes, but do not
+have input.
+
+The value of node N_{j,l} is calculated as follows:
+
+`$N_{j,l} = g( \sum_{i=0}^{topology_l} W_{i,j,l} * N_{i,l-1} )$`
+
+Where g is the sigmoid function
+
+`$g(t) = \frac{e^{\beta t} }{1+e^{\beta t}}$`
+
+# LBFGS
+
+MLlib's ANN implementation uses the LBFGS optimisation algorithm for 
training. It minimises the
+following error function:
+
+`$E = \sum_{k=0}^{K-1} (N_{k,L} - Y_k)^2$`
+
+where Y_k is the target output given inputs N_{0,0} ... N_{I-1,0}.
+
+# Implementation Details
+
+## The ArtificialNeuralNetwork class
+
+The ArtificialNeuralNetwork class has the following constructor:
+
+```
+class 

[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-66454787
  
  [Test build #24310 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24310/consoleFull)
 for   PR 1290 at commit 
[`5e86c5e`](https://github.com/apache/spark/commit/5e86c5edab4c58fee55ddae841f29105f62ceec4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4741] Do not destroy FileInputStream an...

2014-12-10 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/3600#issuecomment-66455203
  
Anyway, thanks again for your comments and the time spent replying.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66456752
  
  [Test build #24311 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24311/consoleFull)
 for   PR 1269 at commit 
[`b3f7a0d`](https://github.com/apache/spark/commit/b3f7a0de47497ca88a0815656451a4379fe180dc).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66458371
  
@JoshRosen @pwendell @andrewor14





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66458584
  
  [Test build #24309 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24309/consoleFull)
 for   PR 1269 at commit 
[`af9bcc8`](https://github.com/apache/spark/commit/af9bcc87df561f920226342d25ca4203639bacf9).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66458603
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24309/
Test FAILed.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-6645
  
  [Test build #24312 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull)
 for   PR 3653 at commit 
[`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66459924
  
  [Test build #24313 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull)
 for   PR 3653 at commit 
[`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-66461883
  
  [Test build #24310 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24310/consoleFull)
 for   PR 1290 at commit 
[`5e86c5e`](https://github.com/apache/spark/commit/5e86c5edab4c58fee55ddae841f29105f62ceec4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class OutputCanvas2D(wd: Int, ht: Int) extends Canvas `
  * `class OutputFrame2D( title: String ) extends Frame( title ) `
  * `class OutputCanvas3D(wd: Int, ht: Int, shadowFrac: Double) extends 
Canvas `
  * `class OutputFrame3D(title: String, shadowFrac: Double) extends 
Frame(title) `






[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an 1-hi...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1290#issuecomment-66461896
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24310/
Test FAILed.





[GitHub] spark pull request: [SPARK-4461][YARN] pass extra java options to ...

2014-12-10 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/3409#issuecomment-66464201
  
I'm in favor of spark.yarn.am.* and then documenting that it only applies to 
client mode, if that is the case. @andrewor14 @sryza votes? Let's try to resolve 
this today.





[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

2014-12-10 Thread koeninger
Github user koeninger commented on a diff in the pull request:

https://github.com/apache/spark/pull/3543#discussion_r21610025
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
   def createParquetFile[A <: Product : TypeTag](
   path: String,
   allowExisting: Boolean = true,
-  conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --

I seem to recall there being potential thread safety issues related to
hadoop configuration objects, resulting in the need to create / clone them.

Quick search turned up e.g.

https://issues.apache.org/jira/browse/SPARK-2546

I'm not sure how relevant that is to all of these existing situations where
new Configuration() is being called.

On Tue, Dec 9, 2014 at 5:07 PM, Tathagata Das notificati...@github.com
wrote:

 In sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
 https://github.com/apache/spark/pull/3543#discussion-diff-21571141:

  @@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
     def createParquetFile[A <: Product : TypeTag](
 path: String,
 allowExisting: Boolean = true,
  -  conf: Configuration = new Configuration()): SchemaRDD = {

 I think this should be using the hadoopConfiguration object in the
 SparkContext. That has all the hadoop related configuration already setup
 and should be what is automatically used. @marmbrus
 https://github.com/marmbrus should have a better idea.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/3543/files#r21571141.






[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66473044
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24312/
Test PASSed.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66473033
  
  [Test build #24312 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24312/consoleFull)
 for   PR 3653 at commit 
[`195852c`](https://github.com/apache/spark/commit/195852c8bf3a36bfcebff54b3188eac152b010b7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66473682
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24313/
Test PASSed.





[GitHub] spark pull request: [SPARK-4806] Streaming doc update for 1.2

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3653#issuecomment-66473670
  
  [Test build #24313 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24313/consoleFull)
 for   PR 3653 at commit 
[`aa8bb87`](https://github.com/apache/spark/commit/aa8bb8771d08968d5564be51732c5062b2a7883a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66475387
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24311/
Test PASSed.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66475370
  
  [Test build #24311 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24311/consoleFull)
 for   PR 1269 at commit 
[`b3f7a0d`](https://github.com/apache/spark/commit/b3f7a0de47497ca88a0815656451a4379fe180dc).
 * This patch **passes all tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66478011
  
Succeeded on the third attempt. 

(5) Enumerator
@jkbradley, as you can see, I moved `Enumerator` to the `mllib/features` folder 
and renamed it to `TokenIndexer`. You said I should write a setter method 
`setRareTokenThreshold` -- I see no need for this, since it is the only field. 
(If a setter method is a code-style and/or API requirement, I'm ready to add it; 
a sketch follows below.)

(6) move Dirichlet to stats

I like the idea of moving the Dirichlet pdf to stats so that everyone can use 
it. But I see no classes computing a pdf in the mllib/stats folder, so I have no 
idea what API should be implemented. 

Any other remarks on code structure and/or API?
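
For reference, a minimal sketch of the chained-setter style mentioned under (5) 
above (hypothetical code, not the actual TokenEnumerator from this PR; the 
default value is made up):

```
// Hypothetical: an MLlib-style chained setter for the single threshold field.
class TokenEnumerator extends Serializable {
  private var rareTokenThreshold: Int = 2 // made-up default

  /** Sets the count below which a token is treated as rare; returns this for chaining. */
  def setRareTokenThreshold(threshold: Int): this.type = {
    rareTokenThreshold = threshold
    this
  }
}
```

Usage would then follow the `new TokenEnumerator().setRareTokenThreshold(3)` 
pattern used by other MLlib components.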





[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-10 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66488488
  
When collecting data from a Parquet-based SchemaRDD, the underlying Parquet 
splits may be returned out of order, which caused occasional test failures.
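
One way to make such checks order-insensitive, as a hedged sketch (not 
necessarily the exact helper used in the new suites): sort both the collected 
rows and the expected rows on a common key before comparing.

```
object OrderInsensitiveCheck {
  // Compare two row sets ignoring order, so that Parquet split ordering
  // cannot flip the outcome. Rows are modeled as Seq[Any] for simplicity.
  def assertSameRows(actual: Seq[Seq[Any]], expected: Seq[Seq[Any]]): Unit = {
    val key = (row: Seq[Any]) => row.mkString("|")
    assert(actual.sortBy(key) == expected.sortBy(key),
      s"row sets differ: ${actual.sortBy(key)} vs ${expected.sortBy(key)}")
  }
}
```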





[GitHub] spark pull request: [SPARK-4453][SPARK-4213][SQL] Additional test ...

2014-12-10 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/#issuecomment-66488779
  
Hi @sarutak, I added a new set of Parquet test suites in #3644, which aim 
to replace the old `ParquetQuerySuite`. I believe Parquet filters have been 
tested thoroughly there.





[GitHub] spark pull request: [SPARK-4798][SQL] A new set of Parquet testing...

2014-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3644#issuecomment-66488841
  
  [Test build #24314 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24314/consoleFull)
 for   PR 3644 at commit 
[`800e745`](https://github.com/apache/spark/commit/800e7459a9261281c35e48c837dbb7de5643e4b2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-2309][MLlib] Generalize the binary logi...

2014-12-10 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/1379#issuecomment-66490270
  
@dbtsai Thank you, I look forward to your code so I can run benchmarks. 
Thanks again for the video! I enjoyed it, especially the Q&A after the talk. At 
51:23 Prof. CJ Lin mentions that they released a dataset of about 600 gigabytes. 
Do you know where I can download it? It should be quite a challenging workload 
for classification in Spark!





[GitHub] spark pull request: Updated documentation and refactored code to e...

2014-12-10 Thread ilganeli
GitHub user ilganeli opened a pull request:

https://github.com/apache/spark/pull/3664

Updated documentation and refactored code to extract shared variables

Hi all - cleaned up the code to get rid of the unused parameter and added 
some discussion of the ThreadPoolExecutor parameters to explain why we can use 
a single threadCount instead of providing a min/max. 
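
To illustrate the point about the parameters (a hedged sketch, not the 
ConnectionManager code itself; the names are mine): when corePoolSize equals 
maximumPoolSize a ThreadPoolExecutor is effectively a fixed-size pool, and with 
an unbounded work queue the maximum is never consulted anyway, so a single 
threadCount setting is enough.

```
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

object FixedSizePool {
  // Single threadCount: core == max makes the pool fixed-size, and the unbounded
  // LinkedBlockingQueue means the pool never grows past corePoolSize anyway.
  def apply(threadCount: Int): ThreadPoolExecutor =
    new ThreadPoolExecutor(
      threadCount,            // corePoolSize
      threadCount,            // maximumPoolSize (== corePoolSize)
      60L, TimeUnit.SECONDS,  // keep-alive, irrelevant while core == max
      new LinkedBlockingQueue[Runnable]())
}
```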

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilganeli/spark SPARK-3607C

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3664.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3664


commit 3c056904570fdd97d429c10895590850bb81e759
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:35:02Z

Updated documentation and refactored code to extract shared variables







[GitHub] spark pull request: [SPARK-3607] ConnectionManager threads.max con...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3664#issuecomment-66491450
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-1037] The name of findTaskFromList fi...

2014-12-10 Thread ilganeli
GitHub user ilganeli opened a pull request:

https://github.com/apache/spark/pull/3665

[SPARK-1037] The name of findTaskFromList & findTask in 
TaskSetManager.scala is confusing

Hi all - I've renamed the methods referenced in this JIRA to clarify that 
they modify the provided arrays (find vs. dequeue).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilganeli/spark SPARK-1037B

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3665.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3665


commit 683482afddd2ab45626fa57ccac6711314669dd1
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:43:08Z

Renamed private methods to clarify that they modify the provided parameters

commit f27d85ebdbe1355039c80f236c9075a446e3018c
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:46:12Z

Renamed private methods to clarify that they modify the provided parameters







[GitHub] spark pull request: [SPARK-1037] The name of findTaskFromList fi...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3665#issuecomment-66493048
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4569] Rename 'externalSorting' in Aggre...

2014-12-10 Thread ilganeli
GitHub user ilganeli opened a pull request:

https://github.com/apache/spark/pull/3666

[SPARK-4569] Rename 'externalSorting' in Aggregator

Hi all - I've renamed the unhelpfully named variable and added a comment 
clarifying what's actually happening. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ilganeli/spark SPARK-4569B

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3666


commit 5b3f39cf4f1475a4b656eb24d563af80e4a953c9
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:51:42Z

[SPARK-4569] Rename  in Aggregator

commit d7cefec06e0e3b235ee67bcdf8bf115c92a1cbed
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:52:40Z

[SPARK-4569] Rename 'externalSorting' in Aggregator

commit e2d20929b043ed4dbe1001bb38e3e441c8450992
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:53:53Z

[SPARK-4569] Rename 'externalSorting' in Aggregator

commit 18103943e4b2584ce3079f466cdd7e3253675fac
Author: Ilya Ganelin ilya.gane...@capitalone.com
Date:   2014-12-10T17:54:39Z

[SPARK-4569] Rename 'externalSorting' in Aggregator







[GitHub] spark pull request: [SPARK-4569] Rename 'externalSorting' in Aggre...

2014-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3666#issuecomment-66493838
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4229] Create hadoop configuration in a ...

2014-12-10 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3543#discussion_r21622115
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
   def createParquetFile[A <: Product : TypeTag](
   path: String,
   allowExisting: Boolean = true,
-  conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --

@koeninger The issue that you linked is concerned with thread-safety issues 
when multiple threads concurrently modify the same `Configuration` instance.

It turns out that there's another, older thread-safety issue related to 
`Configuration`'s constructor not being thread-safe due to non-thread-safe 
static state: https://issues.apache.org/jira/browse/HADOOP-10456.  This has 
been fixed in some newer Hadoop releases, but since it was only reported in 
April I don't think we can ignore it.  As a result, 
https://issues.apache.org/jira/browse/SPARK-1097 implements a workaround which 
synchronizes on an object before calling `new Configuration`.  Currently, I 
think the extra synchronization logic is only implemented in `HadoopRDD`, but 
it should probably be used everywhere just to be safe.  I think that 
`HadoopRDD` was the highest-risk place where we might have many threads 
creating Configurations at the same time, which is probably why that patch's 
author didn't add the synchronization everywhere.
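
A hedged sketch of that workaround (illustrative names; in Spark the actual lock 
lives in the HadoopRDD object): construct every Configuration inside one global 
lock so the non-thread-safe static initialization from HADOOP-10456 is never hit 
concurrently.

```
import org.apache.hadoop.conf.Configuration

object ConfigurationFactory {
  // Global lock guarding Configuration construction (see HADOOP-10456).
  private val constructorLock = new Object

  def newConfiguration(): Configuration = constructorLock.synchronized {
    new Configuration()
  }
}
```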





[GitHub] spark pull request: [SPARK-2199] [mllib] topic modeling

2014-12-10 Thread akopich
Github user akopich commented on the pull request:

https://github.com/apache/spark/pull/1269#issuecomment-66498633
  
(5) Enumerator

BTW, the names `TokenIndexer` and `TokenIndex` look confusing (though these 
classes rely on `breeze.util.Index`). 

So I renamed it to `TokenEnumerator`.




