[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-10-18 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at 
this?





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-08-23 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@HyukjinKwon @thunterdb can you all take a look at this?  It's been under 
review for quite a long time!





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Thanks for the feedback, @srowen!  I think 18e6bfe addresses everything from 
a code perspective, but it missed removing the comment about assuming that 
distinct words have distinct vector representations, so I've pushed another 
commit that removes that comment.





[GitHub] spark pull request #15150: Use a bounded priority queue to find synonyms in ...

2016-09-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/15150

Use a bounded priority queue to find synonyms in Word2VecModel

## What changes were proposed in this pull request?

The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements 
with the highest similarity to the query vector currently sorts the collection 
of similarities for every vocabulary element. This involves making multiple 
copies of the collection of similarities while doing a (relatively) expensive 
sort. It would be more efficient to find the best matches by maintaining a 
bounded priority queue and populating it with a single pass over the 
vocabulary, and that is exactly what this patch does.
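
In outline, the change keeps only the top candidates while scanning once. A minimal 
sketch of the idea (not the exact patch; it assumes access to Spark's internal 
`BoundedPriorityQueue`, which is `private[spark]`, so any bounded max-heap could stand in):

```
// Sketch: retain only the `num + 1` highest-similarity (word, score) pairs
// in a single pass, instead of sorting every similarity in the vocabulary.
import org.apache.spark.util.BoundedPriorityQueue

def topMatches(wordList: Array[String], cosVec: Array[Double], num: Int): Seq[(String, Double)] = {
  val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
  pq ++= wordList.zip(cosVec)   // the queue evicts low-scoring entries as it fills
  pq.toSeq.sortBy(-_._2)        // only the surviving candidates are sorted
}
```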

## How was this patch tested?

This patch adds no user-visible functionality and its correctness should be 
exercised by existing tests.  To ensure that this approach is actually faster, 
I made a microbenchmark for `findSynonyms`:

```
object W2VTiming {
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.mllib.feature.Word2VecModel

  def run(modelPath: String, scOpt: Option[SparkContext] = None) {
    val sc = scOpt.getOrElse(
      new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
    val model = Word2VecModel.load(sc, modelPath)
    val keys = model.getVectors.keys
    val start = System.currentTimeMillis
    for (key <- keys) {
      model.findSynonyms(key, 5)
      model.findSynonyms(key, 10)
      model.findSynonyms(key, 25)
      model.findSynonyms(key, 50)
    }
    val finish = System.currentTimeMillis
    println("run completed in " + (finish - start) + "ms")
  }
}
```

I ran this test on a model generated from the complete works of Jane Austen 
and found that the new approach was over 3x faster than the old approach.  (As 
the `num` argument to `findSynonyms` approaches the vocabulary size, the new 
approach will have less of an advantage over the old one.)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark SPARK-17595

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15150


commit ddba6577a6ece67f77b3757da859c88fa9065c04
Author: William Benton 
Date:   2016-09-19T14:35:57Z

Use a bounded priority queue to find synonyms in Word2VecModel







[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...

2016-09-19 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15150
  
Thanks for the feedback, @srowen!  I've made the changes.





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79600882
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,7 +581,11 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
+
+val scored = pq.toSeq.sortBy(-_._2)
 
 val filtered = wordOpt match {
   case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
--- End diff --

Thanks, @hhbyyh!





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79637489
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
--- End diff --

Yeah, I guess I figured that since we were allocating the tuples anyway, a 
single copy of the array wasn't a lot of extra overhead compared to having slightly 
cleaner code.  But I'm happy to make the change if you think it's a good idea; 
I agree that allocating an array just to iterate through it isn't ideal.

(I'm ambivalent, partially because I don't have a great sense for the 
vocabulary sizes people typically use this code for in the wild.  For my 
example corpus, my patch as-is, zipping collection iterators, and explicit 
iteration over indices are all more or less equivalent in time performance.  My 
intuition is that allocating even the single array from `zip` is a bad deal if 
we're dealing with a very large vocabulary but probably not if the typical case 
is on the order of 10^5 words or less.)
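
For reference, the iterator-based alternative mentioned above would look roughly like 
this (sketch only; it avoids materializing the zipped array before feeding the queue):

```
// Sketch: zip the iterators rather than the arrays, so no intermediate
// Array[(String, Double)] is allocated before populating the queue.
pq ++= wordList.iterator.zip(cosVec.iterator)
```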





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79641840
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
--- End diff --

Agreed.  I'll push as soon as I finish running tests locally.





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-07-21 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@thunterdb can you take a look at this now that 2.2 is out?





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2018-09-17 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@srowen @thunterdb this PR passes all tests and merges cleanly -- can you 
take another look?  It's been open for quite a while now.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-14 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/15105

[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously 
rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so 
that it will not spuriously reject the best match when invoked with a vector 
that does not correspond to a word in the model's vocabulary.  Instead of 
blindly discarding the best match, the changed implementation discards a match 
that corresponds to the query word (in cases where `findSynonyms` is invoked 
with a word) or that has an identical angle to the query vector.
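
Roughly, the filtering step changes along these lines (a simplified sketch of the idea, 
not the exact diff; `scored` stands for the similarity-sorted (word, similarity) pairs 
and `wordOpt` holds the query word when `findSynonyms` was called with a word):

```
// Before: blindly drop the top-ranked candidate.
scored.take(num + 1).tail

// After (sketch): keep the best match unless it is the query word itself
// or has an identical angle (cosine similarity of exactly 1.0) to the query vector.
scored
  .filter { case (word, similarity) =>
    wordOpt.forall(_ != word) && similarity != 1.0
  }
  .take(num)
```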

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most 
similar vector from a supplied vector would not be spuriously rejected.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark fix/findSynonyms

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15105


commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9
Author: William Benton 
Date:   2016-09-14T17:07:14Z

test for spurious rejection of similar vectors in findSynonyms

commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8
Author: William Benton 
Date:   2016-09-14T21:35:29Z

`Word2VecModel.findSynonyms` no longer spuriously rejects the best match

Previously, the `findSynonyms` method in `Word2VecModel` rejected the
closest-matching vector.  This was typically correct in cases where we
were searching for synonyms of a word, but was incorrect in cases
where we were searching for words most similar to a given vector,
since the given vector might not correspond to a word in the model's
vocabulary.

With this commit, `findSynonyms` will not discard the best matching
term unless the best matching word is also the query word (or is
maximally similar to the query vector).







[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-14 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Thanks, @hhbyyh.  This is what I get for running `sbt "testOnly ..."`  I'll 
push a fix.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78973934
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@hhbyyh So [one of the python 
doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575)
 depends on a word not being a synonym of its vector representation.  I think 
since this is the documented behavior now, that's the direction the fix should 
go as well, but I'll use `Vector.equals` instead of checking similarity in any 
case.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981066
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen This code in general kind of bothers me (I'd rather see a single 
pass through the tuples with a bounded priority queue keeping track of the `num 
+ 1` candidates than converting to a sequence and then allocating an array to 
sort in place).  But I'm inclined to get some numbers showing that that is a 
good idea and make it a separate PR unless this is a good time to fold it in 
(so to speak).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981374
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(syms(1)._1 == "japan")
   }
 
+  test("findSynonyms doesn't reject similar word vectors when called with 
a vector") {
+val num = 2
+val word2VecMap = Map(
+  ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+  ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+  ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+  ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+)
+val model = new Word2VecModel(word2VecMap)
+val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
--- End diff --

Yes to both (the two significant digits were simply following the style 
earlier in the test).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78983451
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

I wasn't sure if the todo was referring to the sorting or to earlier 
optimizations.  I'll get it started as a separate issue; thanks!





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78995496
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen So actually reverting to the old code but filtering only if 
`wordOpt` is defined doesn't handle the original case I was considering here, 
where you pass in a vector that is very similar to the representation of a word 
in the vocabulary but that is not itself the representation of a word in the 
vocabulary.





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-15 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Can we assume in general that distinct words will not have identical (viz., 
by `.equals`) vector representations?  I ask because the behavior in the 
current documentation (in that pyspark test I linked above) assumes that you 
don't want a word to turn up as a synonym for its own vector representation.  
But this is impossible to enforce if we aren't guaranteed that distinct words 
won't have identical vector representations.

This seems like a mostly reasonable assumption but it is definitely a 
corner case that we might want to be robust to (or at least take note of).  If 
we can't accept this assumption, then we should figure out whether or not 
changing the expected behavior in the documentation is acceptable.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79005872
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

Consider the case where you're passing in a vector that is extremely 
similar to the representation of a word in the vocabulary but that is not 
itself the representation of a word in the vocabulary.  (A concrete example is 
[this test I 
added](https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72).)
  In this case, `wordOpt` is not defined (because you are querying for words 
whose vector representations are similar to a given vector) but you nonetheless 
are not interested in discarding the best match because it is not a trivial 
match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I 
think we could make a case for changing the documented behavior (especially 
since it is only documented as such in pyspark) in the case where 
`findSynonyms` is invoked with the vector representation of a word that is _in_ 
the vocabulary.  Instead of rejecting the best match in that case, we could 
return it. The argument there is that such a result is telling you something 
you didn't necessarily know (otherwise, you'd probably be querying for a word 
and not a vector) and that there is an easy way to identify that such a match 
is trivially the best match.  I recognize that changing documented behavior is 
a big deal, but in this case it seems like it could be the best way to address 
a minor bug.





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-15 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
@srowen Yes.  But if we're querying for words similar to a given vector, 
then there's no word to filter out (and, indeed, no way to know which word we 
might want to filter out if multiple words might map to the same vector 
representation).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79157601
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

In this case (and similarly in 
[`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala))
 I opted to call the three-argument version because the wrappers both 
explicitly convert their argument to a vector before calling `findSynonyms` on 
the underlying model (and so `wordOpt` would not be defined if the wrapper were 
invoked with a word).  If we were to make the three-argument `findSynonyms` 
private we wouldn't be able to share a code path in the wrapper classes and 
would need to duplicate the code to tidy and reformat results in both methods 
(data frame creation in this case, unzipping and `asJava` in the Python model 
wrapper) or factor it out to a separate method.  Let me know how you want me to 
proceed here.

I agree that updating the docs makes sense and will make it clearer to 
future maintainers as well.





[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46677366
  
Thanks for the quick review and patch, @rxin!




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46724443
  
@rxin, re: the former, it seems like most implementations signal this as an 
error.




[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46725581
  
OK, I wasn't sure if strict Hive compatibility was the goal.  I'm happy to 
take these tickets.  Thanks again!




[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1161#issuecomment-46727408
  
LGTM; this is basically exactly what I did 
(https://github.com/willb/spark/commit/b272f6be925ba50741e0a5093244926ea4a7a9a8)




[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-46860741
  
Can someone take another look at this PR?




[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-29 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/143#discussion_r14329726
  
--- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala ---
@@ -153,6 +153,18 @@ private[spark] object ClosureCleaner extends Logging {
   field.setAccessible(true)
   field.set(func, outer)
 }
+
+if (checkSerializable) {
+  ensureSerializable(func)
+}
+  }
+
+  private def ensureSerializable(func: AnyRef) {
+try {
+  SparkEnv.get.closureSerializer.newInstance().serialize(func)
+} catch {
+  case ex: Exception => throw new SparkException("Task not serializable: " + ex.toString)
--- End diff --

I agree that it is better to wrap the underlying exception but was 
following the style of this error in DAGScheduler.  I'll make the change and 
update that as well.




[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-47459912
  
Sorry, I missed FailureSuite.  I have a fix but ran out of battery before I 
could push.




[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-07 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1322

spark-729:  predictable closure capture

SPARK-729 concerns when free variables in closure arguments to 
transformations are captured. Currently, it is possible for closures to get the 
environment in which they are serialized (not the environment in which they are 
created).  This PR causes free variables in closure arguments to RDD 
transformations to be captured at closure creation time by modifying 
`ClosureCleaner` to serialize and deserialize its argument.

This PR is based on #189 (which is closed) but has fixes to work with some 
changes in 1.0.  In particular, it ensures that the cloned `Broadcast` objects 
produced by closure capture are registered with `ContextCleaner` so that 
broadcast variables won't become invalid simply because variable capture 
(implemented this way) causes strong references to the original broadcast 
variables to go away.

(See #189 for additional discussion and background.)
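
Conceptually, the capture step is a serialize/deserialize round trip over the cleaned 
closure (an illustrative sketch of the idea, not the actual `ClosureCleaner` code; 
`capture` is a hypothetical helper name):

```
import org.apache.spark.SparkEnv

// Sketch: round-trip the closure through the closure serializer so the
// returned copy carries its own frozen environment, captured right now.
def capture(f: AnyRef): AnyRef = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  ser.deserialize[AnyRef](ser.serialize(f))
}
```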

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1322


commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton 
Date:   2014-07-04T16:26:59Z

Added reference counting for Broadcasts.

commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton 
Date:   2014-07-07T17:52:26Z

Added ContextCleaner.withCurrentCleaner

This method allows code that needs access to the currently-active
ContextCleaner to access it via a DynamicVariable.

commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton 
Date:   2014-03-18T14:55:57Z

Added tests for variable capture in closures

The two tests added to ClosureCleanerSuite ensure that variable values
are captured at RDD definition time, not at job-execution time.

commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton 
Date:   2014-03-20T15:48:17Z

Predictable closure environment capture

The environments of serializable closures are now captured as
part of closure cleaning. Since we already proactively check most
closures for serializability, ClosureCleaner.clean now returns
the result of deserializing the serialized version of the cleaned
closure.

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton 
Date:   2014-03-26T04:45:45Z

Skip proactive closure capture for runJob

There are two possible cases for runJob calls: either they are called
by RDD action methods from inside Spark or they are called from client
code. There's no need to proactively check the closure argument to
runJob for serializability or force variable capture in either case:

1. if they are called by RDD actions, their closure arguments consist
of mapping an already-serializable closure (with an already-frozen
environment) to each element in the RDD;

2. in both cases, the closure is about to execute and thus the benefit
of proactively checking for serializability (or ensuring immediate
variable capture) is nonexistent.

(Note that ensuring capture via serializability on closure arguments to
runJob also causes pyspark accumulators to fail to update.)

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton 
Date:   2014-03-26T16:31:56Z

Split closure-serializability failure tests

This splits the test identifying expected failures due to
closure serializability into three cases.

commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton 
Date:   2014-04-04T21:39:55Z

Fixed style issues in tests

Conflicts:

core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala

commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton 
Date:   2014-04-04T22:15:50Z

Stylistic changes and cleanups

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton 
Date:   2014-05-02T14:23:49Z

Removed proactive closure serialization from DStream

Conflicts:

streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton 
Date:   2014-07-07T20:47:41Z

Support tracking clones of broadcast variables.





[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-07 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1323

Fix (some of the) warnings in the test suite

This PR fixes three classes of compiler and deprecation warnings in the 
test suite:

* `expectResult` is currently deprecated and has been replaced with 
`assertResult`;
* assertions of the form `should be ===` are deprecated in favor of the 
form `shouldEqual`; and
* `scala.language.postfixOps` was not in scope within the test classes in 
which postfix operations were actually used.

The fixes for these issues were almost entirely mechanical, but they 
eliminate many lines of warning output.
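
For illustration, the three rewrites look roughly like this (a minimal ScalaTest sketch 
under the versions in use at the time; `ExampleSuite` is hypothetical):

```
import org.scalatest.{FunSuite, Matchers}
import scala.concurrent.duration._

class ExampleSuite extends FunSuite with Matchers {
  // Imported inside the class that actually uses postfix syntax,
  // so the warning is suppressed where it would be emitted.
  import scala.language.postfixOps

  test("deprecated assertion forms replaced") {
    assertResult(3) { 1 + 2 }                 // was: expectResult(3) { 1 + 2 }
    (1 + 2) shouldEqual 3                     // was: (1 + 2) should be === 3
    assert((10 seconds).toMillis == 10000L)   // postfix op needing the import above
  }
}
```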

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark testCleanups

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1323.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1323


commit b05304666eb61d7d1d1849772c27cf9711699dae
Author: William Benton 
Date:   2014-07-07T20:31:27Z

Replace expectResult with assertResult

Assertions.expectResult is deprecated.  This commit replaces
expectResult invocations with the assertResult invocations.

commit 5377534fdfe741aab7d6d539573ab78636bfd554
Author: William Benton 
Date:   2014-07-07T20:37:53Z

Replaced deprecated 'be ===' assertions

"should be ===" assertions have been deprecated in favor of
"shouldEqual" assertions.  This commit replaces the deprecated
form with the supported form.

commit a672a9b5d579c4946a20fe4c6f88c4582949ef3f
Author: William Benton 
Date:   2014-07-07T21:00:58Z

Ensure language.postfixOps is in scope where used

Previously, language.postfixOps was imported at toplevel, which meant
compiler warnings since it wasn't visible inside the classes that used
postfix operations.  This commit moves the import to suppress these
warnings.






[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-07 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-48243538
  
@srowen sorry, I hadn't noticed the other PR!  I think the `postfixOps` 
change in mine is disjoint, though.




[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-08 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-48326403
  
@rxin Done; I also updated the comment to reflect the narrower focus after 
eliminating overlap with #1153.  Thanks!




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1359

SPARK-2407: Added internal implementation of SQL SUBSTR()

This replaces the Hive UDF for SUBSTR(ING) with an implementation in 
Catalyst
and adds tests to verify correct operation.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark internalSqlSubstring

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1359.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1359


commit 7d7b2631394d2b46f9753c0cf53e674b640eb82c
Author: William Benton 
Date:   2014-07-08T18:06:15Z

Added internal implementation of SQL SUBSTR()

This replaces the Hive UDF for SUBSTR(ING) with an implementation in 
Catalyst
and adds tests to verify correct operation.

Squashes 8969d038 and e6419b4e.






[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782197
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
+// negative indices for start positions. If a start index i is greater than 0, it
+// refers to element i-1 in the sequence. If a start index i is less than 0, it refers
+// to the -ith element before the end of the sequence. If a start index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos > 0 => pos - 1
+  case neg if neg < 0 => len + neg
+  case _ => 0
+}
+
+val end = sliceLen match {
+  case max if max == Integer.MAX_VALUE => max
+  case x => start + x
+}
+  
+str.slice(start, end)
+  }
+  
+  override def eval(input: Row): Any = {
+val string = str.eval(input)
+
+val po = pos.eval(input)
+val ln = len.eval(input)
+
+if ((string == null) || (po == null) || (ln == null)) {
+  null
+} else {
+  val start = po.asInstanceOf[Int]
+  val length = ln.asInstanceOf[Int] 
+  
+  string match {
+case ba: Array[Byte] => slice(ba, start, length)
+case other => slice(other.toString, start, length)
+  }
+}
+  }
+  
+  override def toString = s"SUBSTR($str, $pos, $len)"
--- End diff --

Makes sense!




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782431
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = "(?i)BETWEEN".r
   val WHEN = "(?i)WHEN".r
   val CASE = "(?i)CASE".r
+  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
--- End diff --

Good catch -- the `I_` was spuriously committed.  (I had that in my working 
tree so that I could easily compare Catalyst and Hive `substr` from the 
console.)




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48643276
  
Thanks for the comments, @concretevitamin.  I've made the changes you suggested 
and will think about reducing branch overhead before pushing an update.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783468
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
--- End diff --

Hive supports 0-based indexing in the same way as this patch.  I agree that 
supporting both in this way is ugly (both from an interface and from an 
implementation perspective), but it seems likely that people are depending on 
this behavior in the wild, doesn't it?




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783626
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
+// negative indices for start positions. If a start index i is greater than 0, it
+// refers to element i-1 in the sequence. If a start index i is less than 0, it refers
+// to the -ith element before the end of the sequence. If a start index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos > 0 => pos - 1
+  case neg if neg < 0 => len + neg
+  case _ => 0
+}
+
+val end = sliceLen match {
--- End diff --

This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus this 
patch) as well as of Hive, too.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48648670
  
@concretevitamin Inlining `Substring.slice` seems to make a nontrivial 
difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14787712
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
--- End diff --

(Another point is that we need to branch on the index value even if we 
eliminate 0-based indexing in order to handle substrings taken from the end of 
the string.)




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14802394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (!resolved) {
+  throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
+}
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  @inline
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
--- End diff --

Yes, I think I can do it with implicit parameters.  I'll push an update 
once I've run the Catalyst suite against my change.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48971559
  
Thanks, Michael.  I've fixed the conflict.




[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-14 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1413

SPARK-2486: Utils.getCallSite is now resilient to bogus frames

When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2486

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1413


commit 0f0c1ae825d53d773ef12a02d74072b88f65d91a
Author: William Benton 
Date:   2014-07-15T03:40:51Z

Utils.getCallSite is now resilient to bogus frames

When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-14 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1413#discussion_r14916362
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -809,7 +809,11 @@ private[spark] object Utils extends Logging {
*/
   def getCallSite: CallSite = {
 val trace = Thread.currentThread.getStackTrace()
-  .filterNot(_.getMethodName.contains("getStackTrace"))
+  .filterNot((ste:StackTraceElement) =>
+    // When running under some profilers, the current stack trace might contain some bogus
+    // frames. This Try is intended to ensure that we don't crash in these situations by
+    // ignoring any frames that we can't examine.
+    Try(ste.getMethodName.contains("getStackTrace")).getOrElse(true))
--- End diff --

So I was thinking that either `ste` or the result of `ste.getMethodName` 
could be `null` when running under instrumentation.  I'll rewrite to use checks 
instead of `Try`, though.
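
Roughly, what I have in mind is the following (a sketch, not the eventual patch):

```scala
// Sketch: drop frames defensively with explicit null checks instead of Try,
// assuming either the element or its method name can be null under instrumentation.
val trace = Thread.currentThread.getStackTrace()
  .filterNot { ste: StackTraceElement =>
    // Drop the getStackTrace frame itself and any frame we can't examine safely.
    ste == null || ste.getMethodName == null || ste.getMethodName.contains("getStackTrace")
  }
```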


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-15 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1413#issuecomment-49023679
  
@aarondav @pwendell Yes, with this patch I'm able to enable the YourKit 
features that were causing crashes before.  I'll submit an update to fix the 
bracket style and cc you both.  Thanks for the quick review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reformat multi-line closure argument.

2014-07-15 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1419

Reformat multi-line closure argument.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark reformat-2486

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1419


commit 26762310ddf0ea88a418c506ed0e86892fe6e4d5
Author: William Benton 
Date:   2014-07-15T12:35:13Z

Reformat multi-line closure argument.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reformat multi-line closure argument.

2014-07-15 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1419#issuecomment-49024982
  
(See discussion on #1413; cc @aarondav and @pwendell.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49159849
  
That was my thought as well, @egraldlo.  Thanks for submitting this, 
@chutium!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49160982
  
@egraldlo, couldn't it be `(SUBSTR | SUBSTRING) ~> // ... ` in that case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...

2014-07-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1497

SPARK-2226:  transform HAVING clauses with unresolvable attributes

This commit adds an analyzer rule to
  1. find expressions in `HAVING` clause filters that depend on unresolved 
attributes, 
  2. push these expressions down to the underlying aggregates, and then
  3. project them away above the filter.

It also enables the `HAVING` queries in the Hive compatibility suite.
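
For concreteness, this is the shape of query the rule targets (table and column names here are illustrative, not taken from the compatibility suite; `sqlContext` is assumed to be in scope): the aggregate in the `HAVING` predicate does not appear in the `SELECT` list, so its attribute can't be resolved until the aggregate expression is pushed down into the underlying `Aggregate`.

```scala
// Illustrative only: count(value) appears in HAVING but not in the SELECT list.
sqlContext.sql("""
  SELECT key
  FROM src
  GROUP BY key
  HAVING count(value) > 5
""")
```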

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2226

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1497


commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976
Author: William Benton 
Date:   2014-07-17T15:36:37Z

Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)

commit c7f2b2c8a19b09ec095a316cb965f18d474d7144
Author: William Benton 
Date:   2014-07-17T17:16:18Z

Whitelist HAVING queries.

Also adds golden outputs for HAVING tests.

commit 5a12647c169ee06bba5355c3956a158699247e43
Author: William Benton 
Date:   2014-07-19T17:08:17Z

Explanatory comments and stylistic cleanups.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49542704
  
Thanks, @rxin!  I've made these changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-21 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1497#discussion_r15172123
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool
   }
 
   /**
+   * This rule finds expressions in HAVING clause filters that depend on
+   * unresolved attributes.  It pushes these expressions down to the underlying
+   * aggregates and then projects them away above the filter.
+   */
+  object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => {
+        val alias = Alias(fexp, makeTmp())()
+        val aggExprs = Seq(alias) ++ ae
+
+        val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType))
+
+        val newFilter = ResolveReferences(pl.copy(condition = newCond,
+          child = agg.copy(aggregateExpressions = aggExprs)))
+
+        Project(pl.output, newFilter)
+      }
+    }
+
+    private val curId = new java.util.concurrent.atomic.AtomicLong()
+
+    private def makeTmp() = {
+      val id = curId.getAndIncrement()
+      s"tmp_cond_$id"
--- End diff --

@marmbrus Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726353
  
@pwendell Sure, and I'll do this in the future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726591
  
@marmbrus I've made the changes you suggested and moved the rule to the 
Resolution batch but now the newly-inserted attribute references remain 
unresolved after analysis.  I'm going to try and figure out where things are 
going wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49783600
  
@marmbrus I've pushed it to 
https://github.com/willb/spark/tree/spark-2226-broken -- if I split out the 
`UnresolvedHavingClauseAttributes` to a separate batch and add a 
`ResolveReferences` (as in willb@16b3dfc), things seem to work, but if I put 
`UnresolvedHavingClauseAttributes` in Resolution (as in willb@76ae5c1), things 
fail with

 No function to evaluate expression. type: AttributeReference

I'm sure I'm missing something simple here.  Thanks for taking a look!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49876099
  
Thanks, @marmbrus; this works now and I understand where things were going 
wrong (and also, hopefully, how to use Catalyst more idiomatically).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49912686
  
@marmbrus, thanks again; sorry to have missed those earlier!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-49913436
  
@mateiz, I think there's a memory blowup somewhere in this patch as it is 
and am trying to track it down.  Coincidentally, it's what I was switching 
context back to when I saw this comment.

@rxin, can you point me to the broadcast change so I can track it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-49926246
  
Thanks, @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-11 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2768

SPARK-2863: [SQL] Add facilities for function-argument coercion

This commit adds the `SignedFunction` trait and modifies the `Sqrt` 
expression
class to use it for coercing its argument to `DoubleType`.

`SignedFunction` represents a fixed-arity function whose arguments should be
cast to particular types. Expression classes extending SignedFunction
must provide `formalTypes`, a List of expected types for formal parameters,
`actualParams`, a list of Expressions corresponding to actual parameters,
and create, which creates an instance of that expression class from a list
of expressions corresponding to actuals. The type parameter for
SignedFunction should be the expression class extending it.

See the Sqrt class for a concrete example.

This trait (or one or several abstract classes extending this trait) could
be exposed to code outside `sql` in the future.
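
To make the shape concrete, here is a rough sketch of the trait as described above (not the exact code in this PR; it only assumes Catalyst's `Expression` and `DataType` types are in scope):

```scala
// Sketch: a fixed-arity function whose actual arguments should be cast to the
// types of its formal parameters.  The type parameter is the extending class.
trait SignedFunction[T <: Expression] {
  self: T =>

  /** Expected types for the formal parameters, in positional order. */
  def formalTypes: Seq[DataType]

  /** The actual argument expressions supplied at the call site. */
  def actualParams: Seq[Expression]

  /** Rebuilds an instance of the extending class from (possibly coerced) actuals. */
  def create(actuals: Seq[Expression]): T
}
```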

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2863

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2768.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2768


commit 4f9517a2c11d13f439f3ed7ea447a4559f9e9088
Author: William Benton 
Date:   2014-10-11T12:40:10Z

Adds SignedFunction trait and type coercion rules

SignedFunction represents a fixed-arity function whose arguments should be
cast to particular types. Expression classes extending SignedFunction
must provide `formalTypes`, a List of expected types for formal parameters,
`actualParams`, a list of Expressions corresponding to actual parameters,
and create, which creates an instance of that expression class from a list
of expressions corresponding to actuals. The type parameter for
SignedFunction should be the expression class extending it.

See the Sqrt class for a concrete example.

This trait (or one or several abstract classes extending this trait) could
be exposed to code outside `sql` in the future.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-13 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2768#issuecomment-58893902
  
Two open questions, the latter being more relevant:  is requiring that 
actuals be cast to the types of formals too restrictive?  Is it likely to 
lead to type-coercion rules oscillating?  (Obviously, it should be possible to 
pass, e.g., a value of a narrow numeric type where a wider one is expected.  
But if all the type-coercion rules we can anticipate ultimately widen types or 
convert from strings into other values, then the rules will still make 
progress.)

(cc @marmbrus)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-27 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r16815625
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

Hey @liancheng, I agree that it is possible to special-case this function 
(and can certainly do so for this PR).  I recently [wrote up what Hive 
does](http://chapeau.freevariable.com/2014/08/existing-system-coercion.html) 
and have some work in progress on a more general solution that should capture 
what Hive does.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-28 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-53753423
  
@marmbrus I've rebased this atop the current master and added support for 
casting string-valued SQRT arguments to double and a test case in SQLQuerySuite 
for the same.  (These will be unnecessary when I finish general support for 
string coercion in numeric functions, but they make this patch mergeable now.)
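
The string-handling change amounts to a coercion of roughly this shape (hypothetical rule name, and assuming Catalyst's `Rule`, `LogicalPlan`, `Sqrt`, `Cast`, and type objects are in scope; this is not the exact code in the commit):

```scala
// Hypothetical sketch: cast a string-typed SQRT argument to double before evaluation.
object SqrtStringCoercion extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Sqrt(child) if child.dataType == StringType =>
      Sqrt(Cast(child, DoubleType))
  }
}
```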


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: spark-729: predictable closure capture

2014-08-29 Thread willb
Github user willb closed the pull request at:

https://github.com/apache/spark/pull/1322


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: spark-729: predictable closure capture

2014-08-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-53903366
  
@mateiz sure; I've tracked down the problem but am a bit stumped by how to 
fix it.  I'll reopen when I have a solution.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2220

SPARK-3329:  [SQL] Don't depend on Hive SET pair ordering.

This fixes some possible spurious test failures in `HiveQuerySuite` by 
comparing sets of key-value pairs as sets, rather than as lists.
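
The comparison change boils down to something like this (illustrative values only):

```scala
// Compare SET output as an unordered collection of key=value pairs, so the test
// does not depend on the order in which Hive happens to emit them.
val expected = Seq("spark.sql.key.a=1", "spark.sql.key.b=2")
val actual   = Seq("spark.sql.key.b=2", "spark.sql.key.a=1")
assert(expected.toSet == actual.toSet)
```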

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-3329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2220


commit 36ff52aee8ac2d8da336a56df195e6093e8f7807
Author: William Benton 
Date:   2014-08-31T17:08:00Z

Don't depend on Hive SET pair ordering.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-53996991
  
Thanks, @concretevitamin!  I'll close this one then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-09-02 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-54165469
  
@concretevitamin I cherry-picked @aarondav's fix (and added a very simple 
fix to handle the cases it didn't cover).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-09-03 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-54303443
  
This failure (in `SparkSubmitSuite`) appears unrelated to my patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3423: [SQL] Implement BETWEEN for SQLPar...

2014-09-05 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2295

SPARK-3423:  [SQL] Implement BETWEEN for SQLParser



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark sql-between

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2295


commit 0016d300a77e8ef80899a679f54e7451544361dc
Author: William Benton 
Date:   2014-09-05T23:00:29Z

Implement BETWEEN for SQLParser




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-55345856
  
Hi @jkbradley; thanks for taking a look.  Here are the warnings as I see 
them when compiling tests on the immediate ancestor of my branch, which is 
56e009d (I'm running on OS X 10.9 in this case):

```
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:38: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
[warn] or by setting the compiler option -language:postfixOps.
[warn] See the Scala docs for value scala.language.postfixOps for a discussion
[warn] why the feature should be explicitly enabled.
[warn]   implicit val defaultTimeout = timeout(1 millis)
[warn]   ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:99: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:117: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:134: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:156: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:187: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:290: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(waitTimeout, interval(100 millis)) {
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:138: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] failAfter(10 seconds) {
[warn]  ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:174: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] failAfter(10 seconds) {
[warn]  ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 milliseconds)) {
[warn] ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator milliseconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 milliseconds)) {
[warn]   ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:56: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 millisecond

[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-55351655
  
Hey @andrewor14, thanks for the reply.  First off, I absolutely agree with 
@srowen's comment on #1330 that `import`s (not compiler flags) are the right 
way to handle enabling these language features.  It looks to me like 
`SpanSugar` pulls in `postfixOps` -- and that it's the only thing in those 
files that uses `postfixOps`.  (I guess having both `SpanSugar` and 
`postfixOps` imported at toplevel was causing some implicit resolution 
confusion?)
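
(For the record, the import-based fix is just a file-local language import; the snippet below uses `scala.concurrent.duration` rather than ScalaTest's `SpanSugar`, purely to illustrate the mechanism.)

```scala
import scala.concurrent.duration._
import scala.language.postfixOps  // enables the postfix `millis` call without a feature warning

val defaultTimeout = 1000 millis
```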

In any case, the approach in #1330 is probably the way to go since 
explicitly importing `postfixOps` seems unnecessary and removing the compiler 
flag is a good idea.  Thanks again for taking a look!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb closed the pull request at:

https://github.com/apache/spark/pull/1323


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-17 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-55902476
  
Here's a somewhat-related concern:  it seems like JVM overhead is unlikely 
to scale linearly with requested heap size, so maybe a straight-up 15% isn't a 
great default?  (If you have hard data on how heap requirements grow with job 
size, I'd be interested in seeing it.)  Perhaps it would make more sense to 
reserve the smaller of 15% or some fixed but reasonable amount.
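
Concretely, I mean something like the following (the numbers are placeholders, not proposed defaults):

```scala
// Reserve the smaller of a fractional overhead and a fixed ceiling, instead of a
// flat 15% that keeps growing for very large executors.
def reservedOverheadMb(requestedMb: Int,
                       fraction: Double = 0.15,
                       capMb: Int = 512): Int =
  math.min((requestedMb * fraction).toInt, capMb)

// reservedOverheadMb(2048)  == 307
// reservedOverheadMb(16384) == 512
```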


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56121260
  
@andrewor14 This bothers me too, but in a slightly different way:  calling 
the parameter “overhead” when it really refers to how to scale requested 
memory to accommodate anticipated overhead seems wrong.  (115% overhead would 
be appalling indeed!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2547

SPARK-3699:  SQL and Hive console tasks now clean up appropriately

The sbt tasks sql/console and hive/console will now `stop()`
the `SparkContext` upon exit.  Previously, they left an ugly stack
trace when quitting.
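
In sbt terms the change is essentially of this shape (a sketch rather than the exact diff; `cleanupCommands` is sbt's standard hook for code run when the console exits, and the context variable name is assumed to match whatever the console's `initialCommands` bind it to):

```scala
// In the sql/console and hive/console project settings: stop the context on exit.
cleanupCommands in console := "sparkContext.stop()"
```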

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark consoleCleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2547.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2547


commit d5e431f0a1b9047a5afc27cb371dbfb7014fb6e0
Author: William Benton 
Date:   2014-09-26T18:45:20Z

SQL and Hive console tasks now clean up.

The sbt tasks sql/console and hive/console will now clean up
the SparkContext upon exit.  Previously, they left an ugly stack
trace when quitting.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2547#issuecomment-57006526
  
Looks like Jenkins and GitHub are having a minor disagreement:

FATAL: Failed to fetch from https://github.com/apache/spark.git


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2547#issuecomment-57032615
  
This patch only changes the sbt build file; it adds no classes and should 
not have caused any tests to fail. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-02 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1750

SPARK-2813:  [SQL] Implement SQRT() directly in Catalyst

This PR adds a native implementation for SQL SQRT() and thus avoids 
delegating this function to Hive.
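
For reviewers skimming the thread, the expression is roughly of this shape (a simplified sketch rather than the exact code in the commits; it assumes the era's `UnaryExpression`/`Row` API and a double-typed child):

```scala
// Simplified sketch of a native square-root expression.
case class Sqrt(child: Expression) extends UnaryExpression {
  type EvaluatedType = Any

  def dataType: DataType = DoubleType
  def nullable: Boolean = child.nullable
  override def toString: String = s"SQRT($child)"

  override def eval(input: Row): Any = {
    val value = child.eval(input)
    if (value == null) null else math.sqrt(value.asInstanceOf[Double])
  }
}
```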

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2813

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1750


commit 18d63f93316e56b9f0e137e272981b5a2eb84074
Author: William Benton 
Date:   2014-08-02T15:30:26Z

Added native SQRT implementation

commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36
Author: William Benton 
Date:   2014-08-02T16:22:40Z

added SQRT test to SqlQuerySuite




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-03 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-51012136
  
@marmbrus I've added a test in ExpressionEvaluationSuite and can push that 
whenever. As far as type goes, the code only produces double-precision results, 
but I could certainly approximate the square root of an arbitrary-precision 
decimal if that makes sense to do.  (Hive 0.12 only handles double-precision 
square roots AFAICT.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r15791823
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

It does.  What's the best way to handle this in general?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-51143752
  
@marmbrus I'll file a JIRA for that and am happy to put it at the front of 
my plate; sounds like a fun problem!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-06 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r15894307
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

@marmbrus to clarify, should I have a solution for the new type-coercion 
JIRA before doing any other reimplementations, or can I keep submitting 
reimplementations as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-08-06 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-51378794
  
Can someone take a look at this again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2768#issuecomment-60934954
  
I've rebased this to fix merge conflicts.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-03-17 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-37854483
  
Right now, the cases where preemptive checking isn't happening are in 
`DStream.transform` and `DStream.transformWith` invocations, since a DStream 
object is accessible from the closure argument to these in but one invocation 
in the test suite.  (I've been trying to figure out why preemptive checking 
fails here even though these closures aren't failing when the corresponding actions run.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-03-17 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-37890968
  
The code in this branch now doesn't check for closure serializability in 
DAGScheduler.  (If checking at transformation- and action-invocation time 
becomes optional via a config knob, then that check will have to go back in, of 
course.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-03-17 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-37896667
  
I cleaned up the branch a little by squashing all of the test-case commits 
together and moving the fixes to `GraphSuite` before the preemptive 
serialization commits.  The latter are now in 0561f539.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-03-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-38189119
  
Thanks for the quick review, @kayousterhout!  I think that the current 
state of the branch addresses all of the concerns you raised; sorry that I had 
to dig in a little more than I expected to verify how things were working in 
`DStream`.

If anyone has further thoughts about making this a configurable option, 
please let me know.  The approach I'm currently taking to 
[SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) depends on 
serializability checking happening in `ClosureCleaner.clean`, so if we did 
decide to make this checking optional it would affect when variable capture 
occurred if the option were disabled (as well as requiring that the 
serializability check go back in to `DAGScheduler`, at least conditionally, as 
mentioned above).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-20 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/189

SPARK-729:  Closures not always serialized at capture time

[SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns 
when free variables in closure arguments to transformations are captured.  
Currently, it is possible for closures to get the environment in which they are 
serialized (not the environment in which they are created).  There are a few 
possible approaches to solving this problem and this PR will discuss some of 
them.  The approach I took has the advantage of being simple, obviously 
correct, and minimally-invasive, but it preserves something that has been 
bothering me about Spark's closure handling, so I'd like to discuss an 
alternative and get some feedback on whether or not it is worth pursuing.

## What I did

The basic approach I took depends on the work I did for #143, and so this 
PR is based atop that.  Specifically: #143 modifies `ClosureCleaner.clean` to 
preemptively determine whether or not closures are serializable immediately 
upon closure cleaning (rather than waiting for a job involving that closure to 
be scheduled).  Thus non-serializable closure exceptions will be triggered by 
the line defining the closure rather than triggered where the closure is used.

Since the easiest way to determine whether or not a closure is serializable 
is to attempt to serialize it, the code in #143 is creating a serialized 
closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its 
argument, but the method in `SparkContext` that wraps it returns a value (a 
reference to the modified-in-place argument).  This branch modifies 
`ClosureCleaner.clean` so that it returns a value:  if it is cleaning a 
serializable closure, it returns the result of deserializing its serialized 
argument; therefore it is returning a closure with an environment captured at 
cleaning time.  `SparkContext.clean` then returns the result of 
`ClosureCleaner.clean`, rather than a reference to its modified-in-place 
argument.
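
The essence of the change, compressed into a few lines (a sketch using plain Java serialization; the branch itself goes through Spark's closure serializer):

```scala
import java.io._

// Serializing the cleaned closure checks serializability eagerly; deserializing the
// bytes yields a copy whose free variables were captured at cleaning time.
def cleanAndCapture[F <: AnyRef](f: F): F = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(f)  // throws NotSerializableException here, at definition time
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[F]
}
```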

I've added tests for this behavior (777a1bc).  The pull request as it 
stands, given the changes in #143, is nearly trivial.  There is some overhead 
from deserializing the closure, but it is minimal and the benefit of obvious 
operational correctness (vs. a more sophisticated but harder-to-validate 
transformation in `ClosureCleaner`) seems pretty important.  I think this is a 
fine way to solve this problem, but it's not perfect.

## What we might want to do

The thing that has been bothering me about Spark's handling of closures is 
that it seems like we should be able to statically ensure that cleaning and 
serialization happen exactly once for a given closure.  If we serialize a 
closure in order to determine whether or not it is serializable, we should be 
able to hang on to the generated byte buffer and use it instead of 
re-serializing the closure later.  By replacing closures with instances of a 
sum type that encodes whether or not a closure has been cleaned or serialized, 
we could handle clean, to-be-cleaned, and serialized closures separately with 
case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of 
what this might look like:

```scala
package org.apache.spark.util
 
import java.nio.ByteBuffer
import scala.reflect.ClassManifest
 
sealed abstract class ClosureBox[T] { def func: T }
final case class RawClosure[T](func: T) extends ClosureBox[T] {}
final case class CleanedClosure[T](func: T) extends ClosureBox[T] {}
final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {}
 
object ClosureBoxImplicits {
  implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun)
}
```

With these types declared, we'd be able to change `ClosureCleaner.clean` to 
take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and 
return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a 
`SerializedClosure[T=>U]`, depending on whether or not serializability-checking 
was enabled) instead of a `T=>U`.  A case match could thus short-circuit 
cleaning or serializing closures that had already been cleaned or serialized 
(both in `ClosureCleaner` and in the closure serializer).  
Cleaned-and-serialized closures would be represented by a boxed tuple of the 
original closure and a serialized copy (complete with an environment quiesced 
at transformation time).  Additional implicit conversions could convert from 
`ClosureBox` instances to the underlying function type where appropriate.  
Tracking this sort of state in the type system seems like the right thing to do 
to me.

### Why we might not want to do that

_It's pretty invasive._  Every function type used by every `RDD` subclass 
would ha

[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-21 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-38330588
  
@mateiz, I was pretty confused about this but it looks like the python 
accumulator tests are what is failing.  I'm not super-familiar with pyspark yet 
but am trying to figure out where things are going wrong.  (If you have any 
advice for where I should start looking, I'd appreciate it!)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-21 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-38340085
  
The Python accumulators still work on master and with the changes from #143 
(which serializes the closure at closure cleaning but doesn't use the 
serialized value for anything).

As far as I can tell, accumulators in general still work after this change; 
at least all of the non-pyspark tests that use accumulators all still work 
after this change (I have been re-running `JavaAPISuite`, `AccumulatorSuite`, 
`DistributedSuite`, `ReplSuite`, and `AsyncRDDActionsSuite` individually while 
experimenting with this).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-26 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-38738160
  
In the discussion on #143, we were talking about whether or not it made 
sense to omit the closure serializability check in `DAGScheduler`.  Since this 
branch, which combines serializability checking and variable capture, removes 
serializability checking for the closure argument to `SparkContext.runJob`, I 
think it makes sense to either (1) re-introduce the check in `DAGScheduler`, or 
(2) add another option to the internal `clean` methods (in `SparkContext` and 
`ClosureCleaner`) allowing checking for serializability without forcing 
variable capture.

I am inclined to prefer the first option but am interested in other 
opinions on the matter.
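
To make option (2) concrete, here is a rough sketch of what the extra flag 
might look like; the parameter names and the plain-Java-serialization check are 
placeholders rather than the real `ClosureCleaner` internals:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CleanOptionSketch {
  // Sketch of option (2): callers can ask for a serializability check without
  // forcing variable capture.  The real check would go through the configured
  // closure serializer rather than plain Java serialization.
  def clean[F <: AnyRef](f: F,
                         checkSerializable: Boolean = true,
                         captureVariables: Boolean = true): F = {
    if (captureVariables) {
      // ... the existing closure-cleaning and capture logic would run here ...
    }
    if (checkSerializable) {
      try {
        new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      } catch {
        case e: NotSerializableException =>
          throw new IllegalArgumentException(s"Closure $f is not serializable", e)
      }
    }
    f
  }
}
```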


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-27 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-38839409
  
I have rebased this branch to remove the commit that took out the 
serializability check in `DAGScheduler`.


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-03-31 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39144319
  
It looks like this Travis error is the same one others have seen on the dev 
list -- that is, the hive test is timing out.


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-03 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39490073
  
Thanks for taking another look, Matei!  I know there's a lot of stuff to 
get in before the merge window closes and appreciate the update.


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39610852
  
So I just added this to `ClosureCleanerSuite` and I'm not seeing the same 
behavior.  (I had already added a captured-field test to `ClosureCleanerSuite`.)  
I don't have a test that depends on deep copying of referenced objects with 
mutable state, though; all of my tests have objects whose mutable state is of 
JVM primitive types.
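
For what it's worth, the distinction I mean is roughly the following (purely 
illustrative class names, not tests from the suite):

```scala
// Mutable state of a JVM primitive type: my existing tests cover cases like this.
class PrimitiveState { var counter: Int = 0 }

// Mutable state held behind a reference: covering this would exercise whether
// referenced objects are deep-copied when the closure's environment is captured.
class ReferenceState { var items: List[Int] = Nil }
```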


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39610969
  
(e.g. here:  
https://github.com/willb/spark/commit/12c63a7e03bce359fd7eb7faf0a054bd32f85824#diff-f949ef08cc8a2b36861af3beb4309a88R161)


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39613329
  
@mateiz, these naming and stylistic suggestions make sense; thanks!

Cloning the closure in `runJob` is what caused Python accumulators to stop 
working.  I will have to check over my notes to see if I can remember why.


---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-39613399
  
(BTW, as a matter of style, is it better to rebase my branch or add another 
commit that makes the change?)


---

