[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time!
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Thanks for the feedback @srowen! I think 18e6bfe addresses everything from a code perspective, but it missed removing the comment about assuming that distinct words have distinct vector representations, so I've pushed another commit that removes that comment.
[GitHub] spark pull request #15150: Use a bounded priority queue to find synonyms in ...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/15150

Use a bounded priority queue to find synonyms in Word2VecModel

## What changes were proposed in this pull request?

The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.

## How was this patch tested?

This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:

```
object W2VTiming {
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.mllib.feature.Word2VecModel

  def run(modelPath: String, scOpt: Option[SparkContext] = None) {
    val sc = scOpt.getOrElse(
      new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
    val model = Word2VecModel.load(sc, modelPath)
    val keys = model.getVectors.keys
    val start = System.currentTimeMillis
    for (key <- keys) {
      model.findSynonyms(key, 5)
      model.findSynonyms(key, 10)
      model.findSynonyms(key, 25)
      model.findSynonyms(key, 50)
    }
    val finish = System.currentTimeMillis
    println("run completed in " + (finish - start) + "ms")
  }
}
```

I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (As the `num` argument to `findSynonyms` approaches the vocabulary size, the new approach will have less of an advantage over the old one.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark SPARK-17595

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15150.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15150

commit ddba6577a6ece67f77b3757da859c88fa9065c04
Author: William Benton
Date: 2016-09-19T14:35:57Z

    Use a bounded priority queue to find synonyms in Word2VecModel
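The heart of the patch is single-pass top-k selection. For illustration, here is a self-contained sketch of the same idea using `scala.collection.mutable.PriorityQueue` in place of Spark's internal `BoundedPriorityQueue` (the name `topSynonyms` and the standalone shape are illustrative, not from the patch):

```
import scala.collection.mutable.PriorityQueue

// Keep only the `num` highest-scoring (word, similarity) pairs in one pass,
// evicting the current weakest candidate whenever the queue is full.
def topSynonyms(scored: Iterator[(String, Double)], num: Int): Seq[(String, Double)] = {
  // Reversed ordering makes this a min-heap by similarity: the head is the
  // weakest candidate currently retained.
  val pq = PriorityQueue.empty[(String, Double)](
    Ordering.by[(String, Double), Double](_._2).reverse)
  for (pair <- scored) {
    if (pq.size < num) {
      pq.enqueue(pair)
    } else if (pair._2 > pq.head._2) {
      pq.dequeue()
      pq.enqueue(pair)
    }
  }
  // dequeueAll yields weakest-first; reverse to get best-first results.
  pq.dequeueAll.reverse
}

// topSynonyms(Iterator(("china", 0.9), ("japan", 0.8), ("taiwan", 0.7)), 2)
//   == Seq(("china", 0.9), ("japan", 0.8))
```

Populating a size-bounded heap costs O(n log k) rather than the O(n log n) of a full sort, which is also why the advantage shrinks as `num` approaches the vocabulary size.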
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15150 Thanks for the feedback, @srowen! I've made the changes.
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79600882

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -580,7 +581,11 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
+
+    val scored = pq.toSeq.sortBy(-_._2)
     val filtered = wordOpt match {
       case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
```
--- End diff --

Thanks, @hhbyyh!
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79637489

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
```
--- End diff --

Yeah, I guess I figured that since we were allocating the tuples anyway a single copy of the array wasn't a lot of extra overhead vs. having slightly cleaner code. But I'm happy to make the change if you think it's a good idea. I agree that allocating an array just to iterate through it isn't ideal.

(I'm ambivalent, partially because I don't have a great sense for the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in time performance. My intuition is that allocating even the single array from `zip` is a bad deal if we're dealing with a very large vocabulary but probably not if the typical case is on the order of 10^5 words or less.)
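To make the trade-off concrete, here is an illustrative standalone comparison of the two population strategies under discussion (toy data and names, not code from the patch; the two blocks are alternatives, so running both would enqueue everything twice):

```
import scala.collection.mutable.PriorityQueue

val wordList: Array[String] = Array("china", "japan", "taiwan")
val cosVec: Array[Double] = Array(0.9, 0.8, 0.7)
val pq = PriorityQueue.empty[(String, Double)](Ordering.by(_._2))

// Strategy 1: zipping allocates an intermediate collection of tuples
// before anything reaches the queue.
pq ++= wordList.zip(cosVec)

// Strategy 2: explicit iteration over indices allocates only the tuples
// that are actually enqueued -- no intermediate collection.
var i = 0
while (i < wordList.length) {
  pq += ((wordList(i), cosVec(i)))
  i += 1
}
```

As the comment above suggests, for vocabularies around 10^5 words or smaller the difference is likely in the noise; the intermediate array from `zip` only starts to matter for very large vocabularies.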
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79641840

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
```
--- End diff --

Agreed. I'll push as soon as I finish running tests locally.
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @thunterdb can you take a look at this now that 2.2 is out?
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/15105

[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark fix/findSynonyms

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15105.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15105

commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9
Author: William Benton
Date: 2016-09-14T17:07:14Z

    test for spurious rejection of similar vectors in findSynonyms

commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8
Author: William Benton
Date: 2016-09-14T21:35:29Z

    `Word2VecModel.findSynonyms` no longer spuriously rejects the best match

    Previously, the `findSynonyms` method in `Word2VecModel` rejected the closest-matching vector. This was typically correct in cases where we were searching for synonyms of a word, but was incorrect in cases where we were searching for words most similar to a given vector, since the given vector might not correspond to a word in the model's vocabulary. With this commit, `findSynonyms` will not discard the best matching term unless the best matching word is also the query word (or is maximally similar to the query vector).
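To make the new discard rule concrete, here is a hedged standalone sketch of the filtering step (the name `filterSynonyms` is illustrative; the real change lives inside `Word2VecModel.findSynonyms`, as the diffs below show):

```
// Given candidates sorted best-first as (word, cosine similarity) pairs, keep
// the top `num` non-trivial matches. A match is trivial only if it is the
// query word itself (word queries) or a perfect-similarity hit (vector
// queries built from an in-vocabulary word's own representation).
def filterSynonyms(
    scored: Seq[(String, Double)],
    num: Int,
    wordOpt: Option[String]): Seq[(String, Double)] = {
  scored
    .take(num + 1) // keep one spare in case we drop a trivial match
    .filter { case (word, similarity) =>
      wordOpt.forall(_ != word) && similarity != 1.0
    }
    .take(num)
}

// A vector query close to (but not equal to) "china"'s representation keeps
// "china" as its best match instead of spuriously rejecting it:
// filterSynonyms(Seq(("china", 0.998), ("taiwan", 0.97)), 2, None)
//   == Seq(("china", 0.998), ("taiwan", 0.97))
```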
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Thanks, @hhbyyh. This is what I get for running `sbt "testOnly ..."`; I'll push a fix.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78973934

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
--- End diff --

@hhbyyh So [one of the python doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575) depends on a word not being a synonym of its vector representation. I think since this is the documented behavior now, that's the direction the fix should go as well, but I'll use `Vector.equals` instead of checking similarity in any case.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981066

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
--- End diff --

@srowen This code in general kind of bothers me (I'd rather see a single pass through the tuples with a bounded priority queue keeping track of the `num + 1` candidates than converting to a sequence and then allocating an array to sort in place). But I'm inclined to get some numbers showing that that is a good idea and make it a separate PR unless this is a good time to fold it in (so to speak).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981374

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
```
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(syms(1)._1 == "japan")
   }
 
+  test("findSynonyms doesn't reject similar word vectors when called with a vector") {
+    val num = 2
+    val word2VecMap = Map(
+      ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+      ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+      ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+      ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+    )
+    val model = new Word2VecModel(word2VecMap)
+    val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
```
--- End diff --

Yes to both (using two significant digits was simply following the style earlier in the test).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78983451

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
--- End diff --

I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks!
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78995496

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
--- End diff --

@srowen So actually reverting to the old code but filtering only if `wordOpt` is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary.
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Can we assume in general that distinct words will not have identical (viz., by `.equals`) vector representations? I ask because the behavior in the current documentation (in that pyspark test I linked above) assumes that you don't want a word to turn up as a synonym for its own vector representation. But this is impossible to enforce if we aren't guaranteed that distinct words won't have identical vector representations. This seems like a mostly reasonable assumption but it is definitely a corner case that we might want to be robust to (or at least take note of). If we can't accept this assumption, then we should figure out whether or not changing the expected behavior in the documentation is acceptable.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79005872

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
```
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
```
--- End diff --

Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is [this test I added](https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72).) In this case, `wordOpt` is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where `findSynonyms` is invoked with the vector representation of a word that is _in_ the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 @srowen Yes. But if we're querying for words similar to a given vector, then there's no word to filter out (and, indeed, no way to know which word we might want to filter out if multiple words might map to the same vector representation).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79157601

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
```
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
    */
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-    findSynonyms(wordVectors.transform(word), num)
+    findSynonyms(wordVectors.transform(word), num, Some(word))
```
--- End diff --

In this case (and similarly in [`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala)) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling `findSynonyms` on the underlying model (and so `wordOpt` would not be defined if the wrapper were invoked with a word).

If we were to make the three-argument `findSynonyms` private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and `asJava` in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here. I agree that updating the docs makes sense and will make it clearer to future maintainers as well.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46677366 Thanks for the quick review and patch, @rxin!
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46724443 @rxin, re: the former, seems like most implementations signal this as an error.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46725581 OK, I wasn't sure if strict Hive compatibility was the goal. I'm happy to take these tickets. Thanks again!
[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1161#issuecomment-46727408 LGTM; this is basically exactly what I did (https://github.com/willb/spark/commit/b272f6be925ba50741e0a5093244926ea4a7a9a8)
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-46860741 Can someone take another look at this PR?
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/143#discussion_r14329726

--- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala ---
```
@@ -153,6 +153,18 @@ private[spark] object ClosureCleaner extends Logging {
       field.setAccessible(true)
       field.set(func, outer)
     }
+
+    if (checkSerializable) {
+      ensureSerializable(func)
+    }
+  }
+
+  private def ensureSerializable(func: AnyRef) {
+    try {
+      SparkEnv.get.closureSerializer.newInstance().serialize(func)
+    } catch {
+      case ex: Exception => throw new SparkException("Task not serializable: " + ex.toString)
```
--- End diff --

I agree that it is better to wrap the underlying exception but was following the style of this error in DAGScheduler. I'll make the change and update that as well.
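For reference, a hedged sketch of the wrapped-cause variant being agreed to here (`SparkException` does provide a two-argument `(message, cause)` constructor; the snippet assumes the imports already available in `ClosureCleaner`):

```
private def ensureSerializable(func: AnyRef) {
  try {
    SparkEnv.get.closureSerializer.newInstance().serialize(func)
  } catch {
    // Pass the original exception along as the cause so callers keep its
    // stack trace, rather than flattening it into the message string.
    case ex: Exception =>
      throw new SparkException("Task not serializable: " + ex.toString, ex)
  }
}
```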
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-47459912 Sorry, I missed FailureSuite. I have a fix but ran out of battery before I could push.
[GitHub] spark pull request: spark-729: predictable closure capture
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1322

spark-729: predictable closure capture

SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying `ClosureCleaner` to serialize and deserialize its argument.

This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0. In particular, it ensures that the cloned `Broadcast` objects produced by closure capture are registered with `ContextCleaner` so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away. (See #189 for additional discussion and background.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1322.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1322

commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton
Date: 2014-07-04T16:26:59Z

    Added reference counting for Broadcasts.

commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton
Date: 2014-07-07T17:52:26Z

    Added ContextCleaner.withCurrentCleaner

    This method allows code that needs access to the currently-active ContextCleaner to access it via a DynamicVariable.

commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton
Date: 2014-03-18T14:55:57Z

    Added tests for variable capture in closures

    The two tests added to ClosureCleanerSuite ensure that variable values are captured at RDD definition time, not at job-execution time.

commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton
Date: 2014-03-20T15:48:17Z

    Predictable closure environment capture

    The environments of serializable closures are now captured as part of closure cleaning. Since we already proactively check most closures for serializability, ClosureCleaner.clean now returns the result of deserializing the serialized version of the cleaned closure.

    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala

commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton
Date: 2014-03-26T04:45:45Z

    Skip proactive closure capture for runJob

    There are two possible cases for runJob calls: either they are called by RDD action methods from inside Spark or they are called from client code. There's no need to proactively check the closure argument to runJob for serializability or force variable capture in either case:

    1. if they are called by RDD actions, their closure arguments consist of mapping an already-serializable closure (with an already-frozen environment) to each element in the RDD;
    2. in both cases, the closure is about to execute and thus the benefit of proactively checking for serializability (or ensuring immediate variable capture) is nonexistent.

    (Note that ensuring capture via serializability on closure arguments to runJob also causes pyspark accumulators to fail to update.)

    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala

commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton
Date: 2014-03-26T16:31:56Z

    Split closure-serializability failure tests

    This splits the test identifying expected failures due to closure serializability into three cases.

commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton
Date: 2014-04-04T21:39:55Z

    Fixed style issues in tests

    Conflicts:
        core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala

commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton
Date: 2014-04-04T22:15:50Z

    Stylistic changes and cleanups

    Conflicts:
        core/src/main/scala/org/apache/spark/SparkContext.scala

commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton
Date: 2014-05-02T14:23:49Z

    Removed proactive closure serialization from DStream

    Conflicts:
        streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton
Date: 2014-07-07T20:47:41Z

    Support tracking clones of broadcast variables.
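As an illustration of the capture bug SPARK-729 describes, consider this hypothetical driver program (illustrative names, not code from the PR):

```
import org.apache.spark.{SparkConf, SparkContext}

object CaptureExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("capture-example"))
    val rdd = sc.parallelize(1 to 3)

    var offset = 0
    val shifted = rdd.map(_ + offset) // free variable `offset` captured... when?
    offset = 100                      // mutated after the transformation is defined

    // If the closure's environment is captured at serialization (job-submission)
    // time, this prints 101, 102, 103; with capture at closure creation time,
    // as this PR implements, it prints 1, 2, 3.
    shifted.collect().foreach(println)
    sc.stop()
  }
}
```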
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1323

Fix (some of the) warnings in the test suite

This PR fixes three classes of compiler and deprecation warnings in the test suite:

* `expectResult` is currently deprecated and has been replaced with `assertResult`;
* assertions of the form `should be ===` are deprecated in favor of the form `shouldEqual`; and
* `scala.language.postfixOps` was not in scope within the test classes in which postfix operations were actually used.

The fixes for these issues were almost entirely mechanical, but they eliminate many lines of warning output. (See the sketch after this description for a before/after illustration.)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark testCleanups

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1323.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1323

commit b05304666eb61d7d1d1849772c27cf9711699dae
Author: William Benton
Date: 2014-07-07T20:31:27Z

    Replace expectResult with assertResult

    Assertions.expectResult is deprecated. This commit replaces expectResult invocations with assertResult invocations.

commit 5377534fdfe741aab7d6d539573ab78636bfd554
Author: William Benton
Date: 2014-07-07T20:37:53Z

    Replaced deprecated 'be ===' assertions

    "should be ===" assertions have been deprecated in favor of "shouldEqual" assertions. This commit replaces the deprecated form with the supported form.

commit a672a9b5d579c4946a20fe4c6f88c4582949ef3f
Author: William Benton
Date: 2014-07-07T21:00:58Z

    Ensure language.postfixOps is in scope where used

    Previously, language.postfixOps was imported at toplevel, which meant compiler warnings since it wasn't visible inside the classes that used postfix operations. This commit moves the import to suppress these warnings.
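A before/after sketch of the three mechanical rewrites, in era-appropriate ScalaTest 2.x style (the suite and values are hypothetical, not lines from the actual diff):

```
import org.scalatest.{FunSuite, Matchers}

class ExampleSuite extends FunSuite with Matchers {
  // Importing postfixOps inside the class that uses postfix syntax keeps the
  // feature visible to the compiler, avoiding the warning described above.
  import scala.language.postfixOps

  test("deprecated assertion forms and their replacements") {
    // before: expectResult(4) { 2 + 2 }   (deprecated)
    assertResult(4) { 2 + 2 }

    // before: (2 + 2) should be === 4     (deprecated)
    (2 + 2) shouldEqual 4
  }
}
```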
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-48243538 @srowen sorry, I hadn't noticed the other PR! I think the `postfixOps` change in mine is disjoint, though.
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-48326403 @rxin Done; I also updated the comment to reflect the narrower focus after eliminating overlap with #1153. Thanks!
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1359

SPARK-2407: Added internal implementation of SQL SUBSTR()

This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst and adds tests to verify correct operation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark internalSqlSubstring

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1359.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1359

commit 7d7b2631394d2b46f9753c0cf53e674b640eb82c
Author: William Benton
Date: 2014-07-08T18:06:15Z

    Added internal implementation of SQL SUBSTR()

    This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst and adds tests to verify correct operation. Squashes 8969d038 and e6419b4e.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782197

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
```
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+    val len = str.length
+    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
+    // negative indices for start positions. If a start index i is greater than 0, it
+    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
+    // to the -ith element before the end of the sequence. If a start index i is 0, it
+    // refers to the first element.
+
+    val start = startPos match {
+      case pos if pos > 0 => pos - 1
+      case neg if neg < 0 => len + neg
+      case _ => 0
+    }
+
+    val end = sliceLen match {
+      case max if max == Integer.MAX_VALUE => max
+      case x => start + x
+    }
+
+    str.slice(start, end)
+  }
+
+  override def eval(input: Row): Any = {
+    val string = str.eval(input)
+
+    val po = pos.eval(input)
+    val ln = len.eval(input)
+
+    if ((string == null) || (po == null) || (ln == null)) {
+      null
+    } else {
+      val start = po.asInstanceOf[Int]
+      val length = ln.asInstanceOf[Int]
+
+      string match {
+        case ba: Array[Byte] => slice(ba, start, length)
+        case other => slice(other.toString, start, length)
+      }
+    }
+  }
+
+  override def toString = s"SUBSTR($str, $pos, $len)"
```
--- End diff --

Makes sense!
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782431

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
```
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = "(?i)BETWEEN".r
   val WHEN = "(?i)WHEN".r
   val CASE = "(?i)CASE".r
+  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
```
--- End diff --

Good catch -- the `I_` was spuriously committed. (I had that in my working tree so that I could easily compare Catalyst and Hive `substr` from the console.)
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48643276 Thanks for the comments, @concretevitamin. I've made the changes from your comments and will think about reducing branch overhead before pushing an update.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14783468

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
```
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+    val len = str.length
+    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
```
--- End diff --

Hive supports 0-based indexing in the same way as this patch. I agree that supporting both in this way is ugly (both from an interface and from an implementation perspective), but it seems likely that people are depending on this behavior in the wild, doesn't it?
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14783626

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
```
    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
    // negative indices for start positions. If a start index i is greater than 0, it
    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
    // to the -ith element before the end of the sequence. If a start index i is 0, it
    // refers to the first element.

    val start = startPos match {
      case pos if pos > 0 => pos - 1
      case neg if neg < 0 => len + neg
      case _ => 0
    }

    val end = sliceLen match {
```
--- End diff --

This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus of this patch), as well as of Hive.
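To make the indexing rules concrete, here is a standalone transcription of the start/end computation above, with a few worked cases (the helper name `substr` is illustrative, not from the patch):

```
def substr(s: String, startPos: Int, sliceLen: Int): String = {
  val len = s.length
  val start = startPos match {
    case pos if pos > 0 => pos - 1   // one-based: SUBSTR(s, 1, ...) starts at s(0)
    case neg if neg < 0 => len + neg // negative: count back from the end
    case _ => 0                      // zero is treated like one
  }
  val end = if (sliceLen == Integer.MAX_VALUE) sliceLen else start + sliceLen
  s.slice(start, end)
}

assert(substr("hello", 2, 3) == "ell") // one-based start position
assert(substr("hello", 0, 2) == "he")  // zero behaves like one
assert(substr("hello", -3, 2) == "ll") // start three characters from the end
```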
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48648670 @concretevitamin Inlining `Substring.slice` seems to make a nontrivial difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14787712

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
```
  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
    val len = str.length
    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
```
--- End diff --

(Another point is that we need to branch on the index value even if we eliminate 0-based indexing in order to handle substrings taken from the end of the string.)
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14802394

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
```
@@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (!resolved) {
+      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
+    }
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  @inline
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
```
--- End diff --

Yes, I think I can do it with implicit parameters. I'll push an update once I've run the Catalyst suite against my change.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48971559 Thanks, Michael. I've fixed the conflict.
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1413

SPARK-2486: Utils.getCallSite is now resilient to bogus frames

When running Spark under certain instrumenting profilers, Utils.getCallSite could crash with an NPE. This commit makes it more resilient to failures occurring while inspecting stack frames.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark spark-2486

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1413.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1413

commit 0f0c1ae825d53d773ef12a02d74072b88f65d91a
Author: William Benton
Date: 2014-07-15T03:40:51Z

    Utils.getCallSite is now resilient to bogus frames

    When running Spark under certain instrumenting profilers, Utils.getCallSite could crash with an NPE. This commit makes it more resilient to failures occurring while inspecting stack frames.
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1413#discussion_r14916362

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
```
@@ -809,7 +809,11 @@ private[spark] object Utils extends Logging {
    */
   def getCallSite: CallSite = {
     val trace = Thread.currentThread.getStackTrace()
-      .filterNot(_.getMethodName.contains("getStackTrace"))
+      .filterNot((ste: StackTraceElement) =>
+        // When running under some profilers, the current stack trace might contain some bogus
+        // frames. This Try is intended to ensure that we don't crash in these situations by
+        // ignoring any frames that we can't examine.
+        Try(ste.getMethodName.contains("getStackTrace")).getOrElse(true))
```
--- End diff --

So I was thinking that either `ste` or the result of `ste.getMethodName` could be `null` when running under instrumentation. I'll rewrite to use checks instead of `Try`, though.
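For reference, a minimal sketch of the explicit-null-check rewrite described here (illustrative shape and names, not the final commit):

```
object CallSiteUtil {
  /** True for frames we can't examine or that belong to getStackTrace itself. */
  private def isBogusOrInternal(ste: StackTraceElement): Boolean =
    ste == null || ste.getMethodName == null ||
      ste.getMethodName.contains("getStackTrace")

  // Profilers can inject frames whose element or method name is null, so test
  // for null explicitly instead of catching the resulting NPE with Try.
  def examinableTrace(): Array[StackTraceElement] =
    Thread.currentThread.getStackTrace().filterNot(isBogusOrInternal)
}
```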
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1413#issuecomment-49023679 @aarondav @pwendell Yes, with this patch I'm able to enable the YourKit features that were causing crashes before. I'll submit an update to fix the bracket style and cc you both. Thanks for the quick review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Reformat multi-line closure argument.
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1419 Reformat multi-line closure argument. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark reformat-2486 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1419.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1419 commit 26762310ddf0ea88a418c506ed0e86892fe6e4d5 Author: William Benton Date: 2014-07-15T12:35:13Z Reformat multi-line closure argument. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Reformat multi-line closure argument.
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1419#issuecomment-49024982 (See discussion on #1413; cc @aarondav and @pwendell.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1442#issuecomment-49159849 That was my thought as well, @egraldlo. Thanks for submitting this, @chutium! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1442#issuecomment-49160982 @egraldlo, couldn't it be `(SUBSTR | SUBSTRING) ~> // ... ` in that case? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
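For illustration, a hedged sketch of that alternation in the style of the combinator-based parser of that era; the `Keyword` vals and the surrounding `expression` production are assumptions, not quotes from the PR:

```scala
protected val SUBSTR = Keyword("SUBSTR")
protected val SUBSTRING = Keyword("SUBSTRING")

// Accept either spelling, then parse the three arguments.
protected lazy val substring: Parser[Expression] =
  (SUBSTR | SUBSTRING) ~> "(" ~> expression ~ ("," ~> expression) ~
      ("," ~> expression) <~ ")" ^^ {
    case str ~ pos ~ len => Substring(str, pos, len)
  }
```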
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1497 SPARK-2226: transform HAVING clauses with unresolvable attributes This commit adds an analyzer rule to 1. find expressions in `HAVING` clause filters that depend on unresolved attributes, 2. push these expressions down to the underlying aggregates, and then 3. project them away above the filter. It also enables the `HAVING` queries in the Hive compatibility suite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2226 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1497.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1497 commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976 Author: William Benton Date: 2014-07-17T15:36:37Z Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226) commit c7f2b2c8a19b09ec095a316cb965f18d474d7144 Author: William Benton Date: 2014-07-17T17:16:18Z Whitelist HAVING queries. Also adds golden outputs for HAVING tests. commit 5a12647c169ee06bba5355c3956a158699247e43 Author: William Benton Date: 2014-07-19T17:08:17Z Explanatory comments and stylistic cleanups. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
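As a concrete, hypothetical example of the kind of query the rule targets (table and column names invented): the aggregate in the `HAVING` condition below does not appear in the `SELECT` list, so the filter's attributes stay unresolved until they are pushed into the underlying aggregate and projected away above the filter:

```scala
// `max(value)` is referenced only by HAVING, not by the SELECT list.
val results = hiveContext.sql(
  "SELECT key FROM src GROUP BY key HAVING max(value) > 10")
```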
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49542704 Thanks, @rxin! I've made these changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1497#discussion_r15172123 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool } /** + * This rule finds expressions in HAVING clause filters that depend on + * unresolved attributes. It pushes these expressions down to the underlying + * aggregates and then projects them away above the filter. + */ + object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => { +val alias = Alias(fexp, makeTmp())() +val aggExprs = Seq(alias) ++ ae + +val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType)) + +val newFilter = ResolveReferences(pl.copy(condition = newCond, + child = agg.copy(aggregateExpressions = aggExprs))) + +Project(pl.output, newFilter) + } +} + +private val curId = new java.util.concurrent.atomic.AtomicLong() + +private def makeTmp() = { + val id = curId.getAndIncrement() + s"tmp_cond_$id" --- End diff -- @marmbrus Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726353 @pwendell Sure, and I'll do this in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726591 @marmbrus I've made the changes you suggested and moved the rule to the Resolution batch, but now the newly inserted attribute references remain unresolved after analysis. I'm going to try to figure out where things are going wrong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49783600 @marmbrus I've pushed it to https://github.com/willb/spark/tree/spark-2226-broken -- if I split out the `UnresolvedHavingClauseAttributes` to a separate batch and add a `ResolveReferences` (as in willb@16b3dfc), things seem to work, but if I put `UnresolvedHavingClauseAttributes` in Resolution (as in willb@76ae5c1), things fail with No function to evaluate expression. type: AttributeReference I'm sure I'm missing something simple here. Thanks for taking a look! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49876099 Thanks, @marmbrus; this works now and I understand where things were going wrong (and also, hopefully, how to use Catalyst more idiomatically). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49912686 @marmbrus, thanks again; sorry to have missed those earlier! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-49913436 @mateiz, I think there's a memory blowup somewhere in this patch as it stands, and I'm trying to track it down. Coincidentally, it's what I was switching context back to when I saw this comment. @rxin, can you point me to the broadcast change so I can track it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-49926246 Thanks, @rxin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2768 SPARK-2863: [SQL] Add facilities for function-argument coercion This commit adds the `SignedFunction` trait and modifies the `Sqrt` expression class to use it for coercing its argument to `DoubleType`. `SignedFunction` represents a fixed-arity function whose arguments should be cast to particular types. Expression classes extending `SignedFunction` must provide `formalTypes`, a List of expected types for formal parameters; `actualParams`, a list of Expressions corresponding to actual parameters; and `create`, which creates an instance of that expression class from a list of expressions corresponding to actuals. The type parameter for `SignedFunction` should be the expression class extending it. See the `Sqrt` class for a concrete example. This trait (or one or several abstract classes extending this trait) could be exposed to code outside `sql` in the future. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2863 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2768.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2768 commit 4f9517a2c11d13f439f3ed7ea447a4559f9e9088 Author: William Benton Date: 2014-10-11T12:40:10Z Adds SignedFunction trait and type coercion rules SignedFunction represents a fixed-arity function whose arguments should be cast to particular types. Expression classes extending `SignedFunction` must provide `formalTypes`, a List of expected types for formal parameters; `actualParams`, a list of Expressions corresponding to actual parameters; and `create`, which creates an instance of that expression class from a list of expressions corresponding to actuals. The type parameter for `SignedFunction` should be the expression class extending it. See the `Sqrt` class for a concrete example. This trait (or one or several abstract classes extending this trait) could be exposed to code outside `sql` in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
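A sketch of the trait shape that description implies; the variance, self-type, and imports are guesses rather than excerpts from the patch:

```scala
// Assumes 1.x-era Catalyst imports, e.g.:
// import org.apache.spark.sql.catalyst.expressions.Expression
// import org.apache.spark.sql.catalyst.types.DataType
trait SignedFunction[E <: Expression] {
  self: E =>
  def formalTypes: List[DataType]          // expected types of the formals
  def actualParams: List[Expression]       // the actual argument expressions
  def create(actuals: List[Expression]): E // rebuild from coerced actuals
}
```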
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2768#issuecomment-58893902 Two open questions, and the latter is more relevant: is requiring that actuals are cast to the types of formals too restrictive? Is it likely to lead to type-coercion rules oscillating? (Obviously, it should be possible to pass, e.g., a value of a narrow numeric type where a wider one is expected. But if all the type-coercion rules we can anticipate ultimately widen types or convert from strings into other values, then the rules will still make progress.) (cc @marmbrus) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r16815625 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- Hey @liancheng, I agree that it is possible to special-case this function (and can certainly do so for this PR). I recently [wrote up what Hive does](http://chapeau.freevariable.com/2014/08/existing-system-coercion.html) and have some work in progress on a more general solution that should capture what Hive does. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-53753423 @marmbrus I've rebased this atop the current master and added support for casting string-valued SQRT arguments to double and a test case in SQLQuerySuite for the same. (These will be unnecessary when I finish general support for string coercion in numeric functions, but they make this patch mergeable now.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb closed the pull request at: https://github.com/apache/spark/pull/1322 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-53903366 @mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it. I'll reopen when I have a solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2220 SPARK-3329: [SQL] Don't depend on Hive SET pair ordering. This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-3329 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2220.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2220 commit 36ff52aee8ac2d8da336a56df195e6093e8f7807 Author: William Benton Date: 2014-08-31T17:08:00Z Don't depend on Hive SET pair ordering. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
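A minimal sketch of the idea with hypothetical helper and variable names (this is not the test-suite diff itself): parse each `key=value` line into a pair and compare the results as sets, so ordering cannot matter:

```scala
def toPairSet(lines: Seq[String]): Set[(String, String)] =
  lines.map { line =>
    val Array(k, v) = line.split("=", 2) // split on the first '=' only
    (k, v)
  }.toSet

// Order-insensitive comparison of SET output:
assert(toPairSet(expectedLines) == toPairSet(actualLines))
```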
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-53996991 Thanks, @concretevitamin! I'll close this one then. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-54165469 @concretevitamin I cherry-picked @aarondav's fix (and added a very simple fix to handle cases that it didn't). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-54303443 This failure (in `SparkSubmitSuite`) appears unrelated to my patch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3423: [SQL] Implement BETWEEN for SQLPar...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2295 SPARK-3423: [SQL] Implement BETWEEN for SQLParser You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark sql-between Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2295.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2295 commit 0016d300a77e8ef80899a679f54e7451544361dc Author: William Benton Date: 2014-09-05T23:00:29Z Implement BETWEEN for SQLParser --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
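For illustration, a hedged sketch of how `BETWEEN` can be desugared into a conjunction of comparisons in a combinator-based parser; the production and keyword names are assumptions, not the patch itself:

```scala
protected lazy val between: Parser[Expression] =
  termExpression ~ (BETWEEN ~> termExpression) ~ (AND ~> termExpression) ^^ {
    // `e BETWEEN lo AND hi` becomes `e >= lo && e <= hi`
    case e ~ lo ~ hi => And(GreaterThanOrEqual(e, lo), LessThanOrEqual(e, hi))
  }
```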
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-55345856 Hi @jkbradley; thanks for taking a look. Here are the warnings as I see them when compiling tests on the immediate ancestor of my branch, which is 56e009d (I'm running on OS X 10.9 in this case): ``` [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:38: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] This can be achieved by adding the import clause 'import scala.language.postfixOps' [warn] or by setting the compiler option -language:postfixOps. [warn] See the Scala docs for value scala.language.postfixOps for a discussion [warn] why the feature should be explicitly enabled. [warn] implicit val defaultTimeout = timeout(1 millis) [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:99: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:117: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:134: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:156: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:187: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:290: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(waitTimeout, interval(100 millis)) { [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:138: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] failAfter(10 seconds) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:174: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] failAfter(10 seconds) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. 
[warn] eventually(timeout(10 seconds), interval(50 milliseconds)) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator milliseconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(timeout(10 seconds), interval(50 milliseconds)) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:56: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(timeout(10 seconds), interval(50 millisecond
```
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-55351655 Hey @andrewor14, thanks for the reply. First off, I absolutely agree with @srowen's comment on #1330 that `import`s (not compiler flags) are the right way to handle enabling these language features. It looks to me like `SpanSugar` pulls in `postfixOps` -- and that it's the only thing in those files that uses `postfixOps`. (I guess having both `SpanSugar` and `postfixOps` imported at toplevel was causing some implicit resolution confusion?) In any case, the approach in #1330 is probably the way to go since explicitly importing `postfixOps` seems unnecessary and removing the compiler flag is a good idea. Thanks again for taking a look! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
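For reference, the remedy named in the warning text itself looks like this in an affected test file (imports shown are illustrative, using ScalaTest's `SpanSugar` and `Eventually`):

```scala
import scala.language.postfixOps
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// With postfixOps in scope, the postfix duration syntax no longer warns.
eventually(timeout(10 seconds), interval(50 milliseconds)) {
  // assertions...
}
```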
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb closed the pull request at: https://github.com/apache/spark/pull/1323 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-55902476 Here's a somewhat-related concern: it seems like JVM overhead is unlikely to scale linearly with requested heap size, so maybe a straight-up 15% isn't a great default? (If you have hard data on how heap requirements grow with job size, I'd be interested in seeing it.) Perhaps it would make more sense to reserve the smaller of 15% or some fixed but reasonable amount. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
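A sketch of the suggested alternative, with an invented 512 MB cap purely for illustration:

```scala
// Reserve the smaller of 15% of the requested heap or a fixed cap, so the
// reservation doesn't grow without bound on very large executors.
def reservedOverheadMB(requestedHeapMB: Int, capMB: Int = 512): Int =
  math.min((requestedHeapMB * 0.15).toInt, capMB)
```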
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56121260 @andrewor14 This bothers me too, but in a slightly different way: calling the parameter "overhead" when it really refers to how to scale requested memory to accommodate anticipated overhead seems wrong. (115% overhead would be appalling indeed!) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2547 SPARK-3699: SQL and Hive console tasks now clean up appropriately The sbt tasks sql/console and hive/console will now `stop()` the `SparkContext` upon exit. Previously, they left an ugly stack trace when quitting. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark consoleCleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2547 commit d5e431f0a1b9047a5afc27cb371dbfb7014fb6e0 Author: William Benton Date: 2014-09-26T18:45:20Z SQL and Hive console tasks now clean up. The sbt tasks sql/console and hive/console will now clean up the SparkContext upon exit. Previously, they left an ugly stack trace when quitting. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
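A minimal sketch of the sbt mechanism involved, using sbt's standard console hooks (the actual wiring in Spark's build may differ):

```scala
// Create the context when the console starts...
initialCommands in console :=
  """val sparkContext = new org.apache.spark.SparkContext("local[*]", "console")"""

// ...and stop it cleanly when the console session ends.
cleanupCommands in console := "sparkContext.stop()"
```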
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2547#issuecomment-57006526 Looks like Jenkins and GitHub are having a minor disagreement: FATAL: Failed to fetch from https://github.com/apache/spark.git --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2547#issuecomment-57032615 This patch only changes the sbt build file; it adds no classes and should not have caused any tests to fail. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1750 SPARK-2813: [SQL] Implement SQRT() directly in Catalyst This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2813 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1750 commit 18d63f93316e56b9f0e137e272981b5a2eb84074 Author: William Benton Date: 2014-08-02T15:30:26Z Added native SQRT implementation commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36 Author: William Benton Date: 2014-08-02T16:22:40Z added SQRT test to SqlQuerySuite --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
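A hedged approximation of what such a native expression looks like in 1.x-era Catalyst; the base class and `eval` signature follow the style of the `Substring` diff earlier in this digest and may not match the merged patch exactly:

```scala
// Assumes 1.x-era Catalyst imports, e.g.:
// import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression, Row}
// import org.apache.spark.sql.catalyst.types.{DataType, DoubleType}
case class Sqrt(child: Expression) extends UnaryExpression {
  type EvaluatedType = Any

  def dataType: DataType = DoubleType
  override def nullable: Boolean = true
  override def toString: String = s"SQRT($child)"

  override def eval(input: Row): Any = {
    val value = child.eval(input)
    if (value == null) null else math.sqrt(value.asInstanceOf[Double])
  }
}
```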
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-51012136 @marmbrus I've added a test in ExpressionEvaluationSuite and can push that whenever. As far as type goes, the code only produces double-precision results, but I could certainly approximate the square root of an arbitrary-precision decimal if that makes sense to do. (Hive 0.12 only handles double-precision square roots AFAICT.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r15791823 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- It does. What's the best way to handle this in general? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-51143752 @marmbrus I'll file a JIRA for that and am happy to put it at the front of my plate; sounds like a fun problem! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r15894307 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- @marmbrus to clarify, should I have a solution for the new type-coercion JIRA before doing any other reimplementations, or can I keep submitting reimplementations as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-51378794 Can someone take a look at this again? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2768#issuecomment-60934954 I've rebased this to fix merge conflicts. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-37854483 Right now, the cases where preemptive checking isn't happening are in `DStream.transform` and `DStream.transformWith` invocations, since a DStream object is accessible from the closure argument to these in but one invocation in the test suite. (I've been trying to figure out why preemptive checking fails here but why these aren't failing on actions.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-37890968 The code in this branch now doesn't check for closure serializability in DAGScheduler. (If checking at transformation- and action-invocation time becomes optional via a config knob, then that check will have to go back in, of course.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-37896667 I cleaned up the branch a little by squashing all of the test-case commits together and moving the fixes to `GraphSuite` before the preemptive serialization commits. The latter are now in 0561f539. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-38189119 Thanks for the quick review, @kayousterhout! I think that the current state of the branch addresses all of the concerns you raised; sorry that I had to dig in a little more than I expected to verify how things were working in `DStream`. If anyone has further thoughts about making this a configurable option, please let me know. The approach I'm currently taking to [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) depends on serializability checking happening in `ClosureCleaner.clean`, so if we did decide to make this checking optional it would affect when variable capture occurred if the option were disabled (as well as requiring that the serializability check go back in to `DAGScheduler`, at least conditionally, as mentioned above). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/189 SPARK-729: Closures not always serialized at capture time [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). There are a few possible approaches to solving this problem and this PR will discuss some of them. The approach I took has the advantage of being simple, obviously correct, and minimally invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing. ## What I did The basic approach I took depends on the work I did for #143, and so this PR is based atop that. Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled). Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than triggered where the closure is used. Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 is creating a serialized closure as part of `ClosureCleaner.clean`. `clean` currently modifies its argument, but the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument). This branch modifies `ClosureCleaner.clean` so that it returns a value: if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time. `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument. I've added tests for this behavior (777a1bc). The pull request as it stands, given the changes in #143, is nearly trivial. There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important. I think this is a fine way to solve this problem, but it's not perfect. ## What we might want to do The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure. If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later. By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.
Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like: ```scala package org.apache.spark.util import java.nio.ByteBuffer import scala.reflect.ClassManifest sealed abstract class ClosureBox[T] { def func: T } final case class RawClosure[T](func: T) extends ClosureBox[T] {} final case class CleanedClosure[T](func: T) extends ClosureBox[T] {} final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {} object ClosureBoxImplicits { implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun) } ``` With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`. A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer). Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time). Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate. Tracking this sort of state in the type system seems like the right thing to do to me. ### Why we might not want to do that _It's pretty invasive._ Every function type used by every `RDD` subclass would ha
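To make the proposed case-match handling concrete, here is a hypothetical use of the `ClosureBox` sketch above (not part of the PR):

```scala
// A clean() written against ClosureBox can statically skip stages that
// have already run, which is the guarantee being argued for.
def clean[T](boxed: ClosureBox[T]): ClosureBox[T] = boxed match {
  case RawClosure(f)                 => CleanedClosure(f) // clean + verify here
  case cleaned @ CleanedClosure(_)   => cleaned           // already cleaned
  case ser @ SerializedClosure(_, _) => ser               // reuse the bytes
}
```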
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-38330588 @mateiz, I was pretty confused about this but it looks like the python accumulator tests are what is failing. I'm not super-familiar with pyspark yet but am trying to figure out where things are going wrong. (If you have any advice for where I should start looking, I'd appreciate it!) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-38340085 The Python accumulators still work on master and with the changes from #143 (which serializes the closure at closure cleaning but doesn't use the serialized value for anything). As far as I can tell, accumulators in general still work after this change; at least, all of the non-pyspark tests that use accumulators still work after this change (I have been re-running `JavaAPISuite`, `AccumulatorSuite`, `DistributedSuite`, `ReplSuite`, and `AsyncRDDActionsSuite` individually while experimenting with this). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-38738160 In the discussion on #143, we were talking about whether or not it made sense to omit the closure serializability check in `DAGScheduler`. Since this branch, which combines serializability checking and variable capture, removes serializability checking for the closure argument to `SparkContext.runJob`, I think it makes sense to either (1) re-introduce the check in `DAGScheduler`, or (2) add another option to the internal `clean` methods (in `SparkContext` and `ClosureCleaner`) allowing checking for serializability without forcing variable capture. I am inclined to prefer the first option but am interested in other opinions on the matter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-38839409 I have rebased this branch to remove the commit that took out the serializability check in `DAGScheduler`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39144319 It looks like this Travis error is the same one others have seen on the dev list -- that is, the hive test is timing out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39490073 Thanks for taking another look, Matei! I know there's a lot of stuff to get in before the merge window closes and appreciate the update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39610852 So I just added this to `ClosureCleanerSuite` and I'm not seeing the same behavior. (I already had added a captured-field test to `ClosureCleanerSuite`.) I don't have a test that depends on deep copying of referenced objects with mutable state, though; all of my tests have objects with mutable state that is of JVM primitive types. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39610969 (e.g. here: https://github.com/willb/spark/commit/12c63a7e03bce359fd7eb7faf0a054bd32f85824#diff-f949ef08cc8a2b36861af3beb4309a88R161) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39613329 @mateiz, these naming and stylistic suggestions make sense; thanks! Cloning the closure in `runJob` is what caused Python accumulators to stop working. I will have to check over my notes to see if I can remember why. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-39613399 (BTW, as a matter of style, is it better to rebase my branch or add another commit that makes the change?) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---