[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-02 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-127132676 Same compile error, unfortunately. At this point it is probably best to pass this off. The only unpushed change is: ```scala arr.asInstanceOf[ArrayData].toArray```

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-02 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-127132250 I will give that workaround a shot. Don't want to push the code above without running the test suite. If it passes I will push it. As for being away

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-02 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-127131369 @davies, I just pushed updates except for the fix for `ArrayData`. I rebased to master and am getting unrelated compile errors, so will have to wait for that to be

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-02 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r36056386 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +37,62 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-08-02 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r36056381 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +37,62 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-31 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-126777044 Thanks for update. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-31 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-126749986 Another ping for a review since feature freeze is today/Monday @rxin and @chenghao-intel.

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-29 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125874527 Alright, now that tests pass, can you review @chenghao-intel

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125487277 jenkins retest this please

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125486849 jenkins retest this please

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-27 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125385186 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-27 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125344394 Is this perhaps failing because Jenkins did not pick up the rebase and so did not update the repository in some way, @rxin?

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-27 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125342145 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-27 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125326533 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-27 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-125323912 I went ahead and modified the unit test to use `1L` so that it will pass the test. Based on the above points, I think it would be best to not deal with that issue on
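The `1L` change matters because a Scala `Int` literal and a `Long` literal box to different JVM classes, and a column Spark reports as `bigint` corresponds to `Long`. A minimal plain-Scala illustration (no Spark required):

```scala
// An Int literal and a Long literal box to different JVM classes, which is
// why a column typed bigint (LongType) needs the literal 1L rather than 1.
object IntVsLong {
  def main(args: Array[String]): Unit = {
    val asInt: Any = 1
    val asLong: Any = 1L
    println(asInt.getClass)  // class java.lang.Integer
    println(asLong.getClass) // class java.lang.Long
  }
}
```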

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-24 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124722152 Any thoughts on the python error @rxin or @chenghao-intel? I am going to start digging a little bit to see if I can find where this is occurring. Primary questions are

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124259173 Looks like we are back to the issue where the Python DataFrames side has a mismatch between `bigint` and `int`. I am pretty sure this is an issue in the Python portion, since all

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124223602 Just pushed code for review. Based on the unit tests I wrote, it matches the hive behavior described above.

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124215462 Makes sense, so it shouldn't be a problem. The prior question is still open though: given the unhelpful error message, should I explicitly write a `checkInputTypes` method

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124214142 Based on Hive behavior of this: ```sql # Returns false array_contains(array(1, 0), array(1, null)[1]) ``` It would seem that this should
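A minimal sketch of the semantics in the quoted Hive example, assuming a null lookup value simply yields false (`arrayContains` here is a hypothetical helper for illustration, not Spark's actual implementation):

```scala
// Hypothetical helper mirroring the quoted Hive behavior: looking up a null
// value in an array returns false rather than throwing.
object ArrayContainsSketch {
  def arrayContains(arr: Seq[Any], value: Any): Boolean =
    if (value == null) false else arr.contains(value)

  def main(args: Array[String]): Unit = {
    // array(1, null)[1] in the Hive example evaluates to null
    println(arrayContains(Seq(1, 0), null)) // false
    println(arrayContains(Seq(1, 0), 1))    // true
  }
}
```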

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124211814 I will include that change as well. Separate question, is there a better way to signal that there are no legal input types? If I leave it how it is, the error

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124209251 @rxin and @chenghao-intel, I think I may have found a bug in `ExpectsInputTypes.scala`. For reference, this is what scala does when zipping non-equal length
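The zip behavior in question can be shown in plain Scala: zipping sequences of unequal length silently truncates to the shorter one, so a check that zips expected input types against actual ones would silently skip the unmatched entries instead of flagging a mismatch. A standalone illustration (not the Spark code itself):

```scala
// Zipping unequal-length sequences silently drops the unmatched tail,
// so a check pairing expected with actual types would miss extra entries.
object ZipTruncation {
  def main(args: Array[String]): Unit = {
    val expected = Seq("ArrayType", "IntegerType")
    val actual = Seq("ArrayType") // one argument fewer than expected
    println(expected.zip(actual)) // List((ArrayType,ArrayType)) -- IntegerType dropped
  }
}
```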

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124183259 There seems to be a number of different cases, so I am going to refer to https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-124005375 Pushed code with scala tests, seeing if python breaks still

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35295692 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35295682 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35295129 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35294723 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35294649 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-23 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35294529 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35294183 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +36,47 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35293144 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -298,4 +298,28 @@ class DataFrameFunctionsSuite extends

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35281045 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -298,4 +298,28 @@ class DataFrameFunctionsSuite extends

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-123805715 Few things: I added the Python API, but I am getting a runtime exception, which is causing it to fall back to something else and then succeed; stack trace below

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35239410 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +37,57 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35239110 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -298,4 +298,32 @@ class DataFrameFunctionsSuite extends

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35238891 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2081,6 +2081,22 @@ object functions { */ def size(column

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35238594 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +37,57 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL][WIP] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35187368 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -2081,6 +2081,22 @@ object functions { */ def size(column

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35187090 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala --- @@ -298,4 +298,32 @@ class DataFrameFunctionsSuite extends

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-22 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7580#discussion_r35187063 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -35,3 +37,57 @@ case class Size(child

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-21 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-123521163 Consolidating all my questions here, then deleting prior comments: 1. What is incorrect about my codegen which is returning an error? Secondary to that, does it

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-21 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-123513491 For type checking, should this be done in `eval` or in `inputTypes`? If the second, is there a way to constrain the type of the right by the element type of the left
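One way to express the second option, sketched here with hypothetical type names (this mirrors the idea, not Spark's actual `inputTypes` API): have the type check compare the right argument's type against the element type of the left array argument.

```scala
// Hypothetical sketch: constrain the right argument's type by the element
// type of the left array argument during a checkInputTypes-style pass.
object TypeCheckSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  case class ArrayType(elementType: DataType) extends DataType

  def checkInputTypes(left: DataType, right: DataType): Either[String, Unit] = left match {
    case ArrayType(elem) if elem == right => Right(())
    case ArrayType(elem) => Left(s"value type $right does not match element type $elem")
    case _ => Left("left argument must be an array")
  }

  def main(args: Array[String]): Unit = {
    println(checkInputTypes(ArrayType(IntType), IntType))    // Right(())
    println(checkInputTypes(ArrayType(IntType), StringType)) // Left(...)
  }
}
```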

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-21 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-123511627 Are the primitives declared somewhere else, or do I need to declare their type?

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-21 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7580#issuecomment-123511552 Stack trace for failed codegen: ```scala [info] - Array contains *** FAILED *** (25 milliseconds) [info] Code generation of array_contains(List(1, 2

[GitHub] spark pull request: [SPARK-8231][SQL] Add array_contains

2015-07-21 Thread EntilZha
GitHub user EntilZha opened a pull request: https://github.com/apache/spark/pull/7580 [SPARK-8231][SQL] Add array_contains PR for work on https://issues.apache.org/jira/browse/SPARK-8231 Currently, I have an initial implementation for contains. Based on discussion on JIRA

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123175469 @chenghao-intel, last commit should fix it.

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35071716 --- Diff: python/pyspark/sql/functions.py --- @@ -795,6 +796,22 @@ def weekofyear(col): return Column(sc._jvm.functions.weekofyear(col

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35071705 --- Diff: python/pyspark/sql/functions.py --- @@ -795,6 +796,22 @@ def weekofyear(col): return Column(sc._jvm.functions.weekofyear(col

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055968 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala --- @@ -0,0 +1,43

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123087882 On a side note, why is Jenkins saying that there are extra classes being exposed which are not part of the PR diff?

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055709 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35055572 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r35054813 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123038452 Merge/rebase is done, now ready for review

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123032023 Working on rebasing correctly; I think I understand what I did and how to fix it

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123003936 Actually, I just looked at the diff; I did do something incorrect when I fixed the merge conflicts/rebased. What should I do to fix that?

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-123003528 @rxin, I fixed those python issues and rebased to master/resolved merge conflicts. Hopefully didn't break anything with git. Tests also should be passing. Any

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-20 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122784338 Any other comments on code @rxin or @chenghao-intel? If it looks good, should I fix the merge conflict/rebase to current master or would you do that when you merge the

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122764068 Fixed tests using feedback from @chenghao-intel

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-19 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/7462#discussion_r34968615 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala --- @@ -0,0 +1,43

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122759346 I pushed updates to the code based on the comments above. I wrote an expressions test, but can't figure out why it's failing. I get the following stack

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-17 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/7462#issuecomment-122207728 Thanks @rxin for feedback. Will make those changes tomorrow and repost when PR is ready for review again.

[GitHub] spark pull request: [SPARK-8230][SQL] Add array/map size method

2015-07-17 Thread EntilZha
GitHub user EntilZha opened a pull request: https://github.com/apache/spark/pull/7462 [SPARK-8230][SQL] Add array/map size method Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230 Primary issue resolved is to implement array/map size for Spark SQL. Code is

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-04-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4807#issuecomment-97214197 Commenting here and then on ticket. If there is interest in using the Gibbs implementation I wrote for next release using the interface/Refactor from that PR I am open

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-04-02 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4807#issuecomment-89189465 Just double checking, your suggestion would be to rebase from master, implement those general changes, then commit/push the modified branch? Primary reason I

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-04-01 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4807#issuecomment-88657523 Sounds good. I think it's reasonable that this PR only includes refactoring, not Gibbs. Then evaluate LightLDA vs FastLDA and choose which one makes sense. If changes

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-12 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4807#issuecomment-78090060 As is, it can be merged (as far as work on refactoring goes). I had actually considered having separate PRs for the Refactor and Gibbs anyway.

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4807#discussion_r26189764 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -311,165 +319,319 @@ private[clustering] object LDA { private

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4807#discussion_r26189097 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -311,165 +319,319 @@ private[clustering] object LDA { private

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4807#discussion_r26188892 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -311,165 +319,319 @@ private[clustering] object LDA { private

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4807#discussion_r26187620 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -311,165 +319,319 @@ private[clustering] object LDA { private

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-03-10 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4807#discussion_r26187624 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -311,165 +319,319 @@ private[clustering] object LDA { private

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

2015-02-26 Thread EntilZha
GitHub user EntilZha opened a pull request: https://github.com/apache/spark/pull/4807 [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA algorithms (EM+Gibbs) JIRA: https://issues.apache.org/jira/browse/SPARK-5556 As discussed in that issue, it would be

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-02-02 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-72584774 Sounds great, once it's merged in I will start working on Gibbs using that as a starting point. Great work!

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-71946931 @mengxr, correct, it's not public, but it would be helpful to at least get a minimal Trait which I can base my PR on (something similar to what I proposed earlier). That

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-71934508 @jkbradley, probably the more important part isn't the commit/tag itself, but agreement on the relatively short LearningState traits.

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-28 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-71932163 It would be very helpful if we could stabilize the LearningState API/Trait and you could point to a commit/tag for that. This would help me finish Gibbs in parallel with EM

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23502479 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23502411 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501399 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501392 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501230 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501143 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501109 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala --- @@ -0,0 +1,265 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501107 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala --- @@ -0,0 +1,265 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501104 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala --- @@ -0,0 +1,265 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501095 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501086 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501071 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23501018 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-24 Thread EntilZha
Github user EntilZha commented on a diff in the pull request: https://github.com/apache/spark/pull/4047#discussion_r23500980 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-22 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-71146326 Thoughts on the API @jkbradley? I am working on implementing those last few functions and performance tests. I am aiming to have the Gibbs version ready (if nothing

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70584023 Just finished refactoring; here is the combined API/LDAModel code plus the Gibbs implementation of it. It should give a pretty good idea of what I was thinking about using

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70572184 Nice, I am not too far from completing my refactoring. How would it be best to share it: open a different PR, or link to it here? Eventually, will it be two PRs... not

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70542992 What might be the best way to have the EM and Gibbs LDA implementations play well with each other? If the aim is to not have separate LDA classes, on first
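One way the two implementations could avoid separate LDA classes is a single front end that delegates to a pluggable inference back end. A hypothetical sketch of that shape (all names here — `LDAOptimizer`, `getModel`, the document encoding — are illustrative, not the API that was actually merged):

```scala
// Hypothetical shared interface: one LDA entry point, interchangeable back ends.
trait LDAModel {
  def k: Int
  def topicsMatrix: Array[Array[Double]] // term-by-topic weights
}

trait LDAOptimizer {
  /** Set up internal state from a corpus of (docId -> termId -> count). */
  def initialize(docs: Seq[(Long, Map[Int, Int])], k: Int): this.type
  /** Advance one iteration: an EM step for the EM back end, a sweep for Gibbs. */
  def next(): this.type
  /** Extract a model once iteration finishes. */
  def getModel: LDAModel
}

// The public class never mentions EM or Gibbs; callers pick the optimizer.
class LDA(optimizer: LDAOptimizer, maxIterations: Int = 20) {
  def run(docs: Seq[(Long, Map[Int, Int])], k: Int): LDAModel = {
    var state = optimizer.initialize(docs, k)
    for (_ <- 0 until maxIterations) state = state.next()
    state.getModel
  }
}
```

Under this shape, adding a third inference method later is a new `LDAOptimizer` implementation rather than a new public LDA class.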

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-19 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70537930 I've had a good chance to look at the PR while making changes to my own code. I really liked the Graph initialization code (especially the partition strategy); I was
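For context, the graph-initialization approach being praised builds a bipartite term-document graph whose edges carry token counts, then applies an explicit GraphX partition strategy. A minimal sketch of that idea (assumes Spark/GraphX on the classpath; the function name and the doc-ID offset convention are illustrative, not the PR's exact code):

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Bipartite term-document graph for LDA: term vertices get non-negative IDs,
// document vertices get negative IDs so the two sides never collide.
// Each edge attribute is the token count of that term in that document.
def initTermDocGraph(docs: RDD[(Long, Array[(Int, Int)])]): Graph[Double, Int] = {
  val edges: RDD[Edge[Int]] = docs.flatMap { case (docId, termCounts) =>
    termCounts.map { case (termId, count) =>
      Edge(termId.toLong, -(docId + 1L), count) // shift doc IDs below zero
    }
  }
  Graph.fromEdges(edges, defaultValue = 0.0)
    // Co-locating edges with a 2D partition strategy bounds vertex replication,
    // which keeps shuffle traffic down during the per-iteration count updates.
    .partitionBy(PartitionStrategy.EdgePartition2D)
}
```

The choice of partition strategy matters here because every LDA iteration aggregates counts along all edges; `EdgePartition2D` caps how many partitions each vertex is replicated to.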

[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-01-14 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/4047#issuecomment-70029618 Mind posting the data set size (vocab, docs, etc.) and the type of cluster? About to start some performance tests, and it would be cool to hit both the Chinese dataset size and

[GitHub] spark pull request: [SPARK-4543] Javadoc failure for network-commo...

2014-11-29 Thread EntilZha
Github user EntilZha closed the pull request at: https://github.com/apache/spark/pull/3405 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request: [SPARK-4543] Javadoc failure for network-commo...

2014-11-29 Thread EntilZha
Github user EntilZha commented on the pull request: https://github.com/apache/spark/pull/3405#issuecomment-64969668 Works for me when building against the master branch. Close the PR and close the issue with a reference to #3058?
