Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-127132676
Same compile error, unfortunately. At this point it's probably best to pass this off.
The only unpushed change is:
```scala
arr.asInstanceOf[ArrayData].toArray
```
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-127132250
I will give that workaround a shot. I don't want to push the code above
without running the test suite; if it passes, I will push it.
As for being away
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-127131369
@davies, I just pushed updates except for the fix for `ArrayData`. I
rebased to master and am getting unrelated compile errors, so I will have to
wait for that to be
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r36056386
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +37,62 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r36056381
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +37,62 @@ case class Size(child
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-126777044
Thanks for update.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-126749986
Another ping for a review since feature freeze is today/Monday @rxin and
@chenghao-intel.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125874527
Alright, now that tests pass, can you review, @chenghao-intel?
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125487277
jenkins retest this please
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125486849
jenkins retest this please
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125385186
Jenkins, retest this please.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125344394
Could this be failing because Jenkins didn't pick up the rebase and
somehow didn't update the repository, @rxin?
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125342145
Jenkins, retest this please.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125326533
Jenkins, retest this please.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-125323912
I went ahead and modified the unit test to use `1L` so that it passes.
Based on the above points, I think it would be best not to deal with
that issue on
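For context on why `1L` matters: Scala integer literals default to `Int`, which Spark SQL reports as `int`, while `Long` literals map to `bigint`. A minimal standalone illustration (the mapping to Spark SQL type names is my gloss on the discussion, not code from the PR):

```scala
// An Int literal and a Long literal are numerically equal but box to
// distinct runtime classes, which is exactly the kind of difference a
// strict int/bigint type check will reject.
val asInt: Any = 1    // boxes to java.lang.Integer (Spark SQL: int)
val asLong: Any = 1L  // boxes to java.lang.Long (Spark SQL: bigint)

// Scala's == still compares them as equal numbers, but their runtime
// classes differ, which is what a strict type check sees.
val sameType = asInt.getClass == asLong.getClass  // false
```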
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124722152
Any thoughts on the Python error, @rxin or @chenghao-intel? I am going to
start digging a little to see if I can find where this is occurring.
Primary questions are
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124259173
Looks like we're back to the issue where Python DataFrames have a mismatch
between `bigint` and `int`. I am pretty sure that this is an issue with the
Python portion, since all
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124223602
Just pushed code for review. Based on the unit tests I wrote, it matches
the Hive behavior described above.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124215462
Makes sense, so this shouldn't be a problem. The prior question is still
open, though: given the unhelpful error message, should I explicitly write a
`checkInputTypes` m
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124214142
Based on Hive behavior of this:
```sql
-- Returns false
array_contains(array(1, 0), array(1, null)[1])
```
It would seem that this should
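Since `array(1, null)[1]` evaluates to `null`, the quoted result suggests Hive treats a null search value as simply not found. A plain-Scala sketch of those semantics (my reading of the quoted behavior, not the PR's Catalyst implementation):

```scala
// Sketch: array_contains semantics consistent with the Hive result quoted
// above, where a null search value yields false rather than null.
def arrayContainsSketch(arr: Seq[Any], value: Any): Boolean =
  if (value == null) false                       // assumed from the quoted result
  else arr.exists(e => e != null && e == value)  // nulls in the array never match

val found    = arrayContainsSketch(Seq(1, 0), 1)     // true
val nullCase = arrayContainsSketch(Seq(1, 0), null)  // false, matching the quote
```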
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124211814
I will include that change as well.
Separate question: is there a better way to signal that there are no legal
input types? If I leave it as is, the error
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124209251
@rxin and @chenghao-intel, I think I may have found a bug in
`ExpectsInputTypes.scala`. For reference, this is what Scala does when zipping
non-equal-length
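The zip pitfall referenced above is easy to reproduce in plain Scala: `zip` truncates to the shorter collection instead of failing, so a length mismatch between expected input types and actual children can pass silently (a standalone illustration, not the actual `ExpectsInputTypes` code):

```scala
val expectedTypes = Seq("IntegerType", "ArrayType")
val children      = Seq("left", "right", "extra")

// zip stops at the shorter sequence: the third child is silently dropped,
// so a type check that iterates over the zipped pairs never sees it.
val pairs = children.zip(expectedTypes)
// pairs == Seq(("left", "IntegerType"), ("right", "ArrayType"))
```

A `zipAll` or an explicit length comparison avoids the silent drop.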
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124183259
There seem to be a number of different cases, so I am going to refer to
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-124005375
Pushed code with Scala tests; seeing if Python still breaks.
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35295692
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35295682
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35295129
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35294723
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35294649
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35294529
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35294183
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +36,47 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35293144
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -298,4 +298,28 @@ class DataFrameFunctionsSuite extends
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35281045
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -298,4 +298,28 @@ class DataFrameFunctionsSuite extends
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-123805715
A few things:
I added the Python API, but I am getting a runtime exception, which causes
it to fall back to something else and then succeed; stack trace below
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35239410
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +37,57 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35239110
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -298,4 +298,32 @@ class DataFrameFunctionsSuite extends
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35238891
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2081,6 +2081,22 @@ object functions {
*/
def size(column
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35238594
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +37,57 @@ case class Size(child
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35187368
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2081,6 +2081,22 @@ object functions {
*/
def size(column
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35187090
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala ---
@@ -298,4 +298,32 @@ class DataFrameFunctionsSuite extends
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7580#discussion_r35187063
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -35,3 +37,57 @@ case class Size(child
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-123521163
Consolidating all my questions here, then deleting prior comments:
1. What is incorrect about my codegen such that it returns an error?
Secondary to that, does it
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-123513491
For type checking, should this be done in `eval` or in `inputTypes`? If the
latter, is there a way to constrain the type of the right operand by the
element type of the left
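One way to express "the right operand must match the element type of the left array" at analysis time rather than in `eval` can be sketched as follows (the type classes here are simplified stand-ins, not Catalyst's real `DataType` hierarchy):

```scala
// Simplified stand-ins for Catalyst data types (assumptions for illustration)
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

// Succeeds only when `right` equals the element type of the `left` array,
// producing a readable message instead of a late runtime failure.
def checkInputTypes(left: DataType, right: DataType): Either[String, Unit] =
  left match {
    case ArrayType(elem) if elem == right => Right(())
    case ArrayType(elem) =>
      Left(s"right operand is $right but array elements are $elem")
    case other =>
      Left(s"left operand must be an array, got $other")
  }

val ok  = checkInputTypes(ArrayType(IntType), IntType)     // Right(())
val bad = checkInputTypes(ArrayType(IntType), StringType)  // Left(...)
```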
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-123511627
Are the primitives declared somewhere else, or do I need to declare their
type?
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7580#issuecomment-123511552
Stack trace for failed codegen:
```scala
[info] - Array contains *** FAILED *** (25 milliseconds)
[info] Code generation of array_contains(List(1, 2
```
GitHub user EntilZha opened a pull request:
https://github.com/apache/spark/pull/7580
[SPARK-8231][SQL] Add array_contains
PR for work on https://issues.apache.org/jira/browse/SPARK-8231
Currently, I have an initial implementation for contains. Based on
discussion on JIRA
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123175469
@chenghao-intel, last commit should fix it.
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35071716
--- Diff: python/pyspark/sql/functions.py ---
@@ -795,6 +796,22 @@ def weekofyear(col):
return Column(sc._jvm.functions.weekofyear(col
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35071705
--- Diff: python/pyspark/sql/functions.py ---
@@ -795,6 +796,22 @@ def weekofyear(col):
return Column(sc._jvm.functions.weekofyear(col
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35055968
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala
---
@@ -0,0 +1,43
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123087882
On a side note, why is Jenkins saying that extra classes are being
exposed that are not part of the PR diff?
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35055709
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35055572
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r35054813
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123038452
Merge/rebase is done, now ready for review
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123032023
Working on rebasing correctly; I think I understand what I did and how to fix it.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123003936
Actually, I just looked at the diff: I did do something incorrect when I
fixed the merge conflicts/rebased. What should I do to fix that?
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-123003528
@rxin, I fixed those Python issues and rebased to master/resolved merge
conflicts. Hopefully I didn't break anything with git. Tests should also be
passing. Any
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-122784338
Any other comments on the code, @rxin or @chenghao-intel? If it looks good,
should I fix the merge conflict/rebase to current master, or would you do that
when you merge the
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-122764068
Fixed tests using feedback from @chenghao-intel
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/7462#discussion_r34968615
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionFunctionsSuite.scala
---
@@ -0,0 +1,43
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-122759346
I pushed updates to the code based on the comments above. I wrote an
expressions test, but can't figure out why it's failing. I get the following
stack
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/7462#issuecomment-122207728
Thanks @rxin for the feedback. Will make those changes tomorrow and repost
when the PR is ready for review again.
GitHub user EntilZha opened a pull request:
https://github.com/apache/spark/pull/7462
[SPARK-8230][SQL] Add array/map size method
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230
The primary issue resolved is implementing array/map size for Spark SQL. Code
is
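The semantics such a `size` function typically follows can be sketched in plain Scala; this is an illustration of the behavior under discussion, not the PR's Catalyst code, and the -1-for-null convention is an assumption borrowed from Hive's `size()` UDF:

```scala
// Sketch of array/map size semantics; the -1-for-null case follows Hive's
// size() convention and is an assumption about the intended behavior here.
def sizeSketch(value: Any): Int = value match {
  case null         => -1
  case s: Seq[_]    => s.length
  case m: Map[_, _] => m.size
  case other =>
    throw new IllegalArgumentException(s"size() expects an array or map, got $other")
}

val arraySize = sizeSketch(Seq(1, 2, 3))   // 3
val mapSize   = sizeSketch(Map("a" -> 1))  // 1
val nullSize  = sizeSketch(null)           // -1
```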
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-97214197
Commenting here and then on the ticket. If there is interest in using the Gibbs
implementation I wrote for the next release, using the interface/refactor from
that PR, I am open
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-89189465
Just double-checking: your suggestion would be to rebase from master,
implement those general changes, then commit/push the modified branch?
The primary reason I
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-88657523
Sounds good. I think it's reasonable that this PR only includes the refactoring,
not Gibbs. Then evaluate LightLDA vs. FastLDA and choose which one makes sense.
If changes
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-78090060
As is, it can be merged (as far as the refactoring work goes). I had
actually considered having separate PRs for the refactor and Gibbs anyway.
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4807#discussion_r26189764
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
private
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4807#discussion_r26189097
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
private
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4807#discussion_r26188892
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
private
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4807#discussion_r26187620
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
private
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4807#discussion_r26187624
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -311,165 +319,319 @@ private[clustering] object LDA {
private
GitHub user EntilZha opened a pull request:
https://github.com/apache/spark/pull/4807
[SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA
algorithms (EM+Gibbs)
JIRA: https://issues.apache.org/jira/browse/SPARK-5556
As discussed in that issue, it would be
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-72584774
Sounds great; once it's merged in, I will start working on Gibbs using that
as a starting point. Great work!
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-71946931
@mengxr, correct, it's not public, but it would be helpful to at least get a
minimal trait which I can base my PR on (something similar to what I proposed
earlier). That
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-71934508
@jkbradley, probably the more important part isn't the commit/tag itself,
but agreement on the relatively short LearningState traits.
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-71932163
It would be very helpful if we could stabilize the LearningState API/trait,
and you could point to a commit/tag for that. This would help me finish Gibbs
in parallel with EM
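For readers following along, the kind of small trait being discussed might look roughly like this (entirely hypothetical names and signatures; the real LearningState shape was still being negotiated in this thread):

```scala
// Hypothetical minimal interface letting EM and Gibbs LDA share one driver:
// each optimizer advances its own state one iteration at a time.
trait LearningState {
  def next(): LearningState   // one EM step or one Gibbs sweep
  def logLikelihood: Double   // current model fit, for convergence checks
}

// Toy implementation showing only the shape of the contract.
final case class ToyState(iteration: Int) extends LearningState {
  def next(): LearningState = ToyState(iteration + 1)
  def logLikelihood: Double = -1.0 / (iteration + 1)  // improves each step
}
```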
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23502479
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23502411
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501399
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501392
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501230
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501143
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501109
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501107
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501104
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -0,0 +1,265 @@
+/*
+ * Licensed to the Apache Software
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501095
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501086
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501071
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23501018
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on a diff in the pull request:
https://github.com/apache/spark/pull/4047#discussion_r23500980
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-71146326
Thoughts on the API, @jkbradley? I am working on implementing those last few
functions and performance tests. I am aiming to have the Gibbs version ready
(if nothing
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70584023
Just finished refactoring; here is the combined API/LDAModel code plus the
Gibbs implementation of it. It should give a pretty good idea of what I was
thinking about using
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70572184
Nice, I am not too far from completing my refactoring. How would it be best
to share it: open a different PR or link to it here? Eventually, will it be two
PRs... not
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70542992
What might be the best way to have the EM and Gibbs LDA implementations
play well with each other?
If the aim is to not have separate LDA classes, on first
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70537930
I've had a good chance to look at the PR while making changes to my own code.
I really liked the Graph initialization code (especially the partition
strategy); I was
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/4047#issuecomment-70029618
Mind posting the dataset size (vocab, docs, etc.) and the type of cluster?
I'm about to start some performance tests and it would be cool to hit both the
Chinese dataset size and
Github user EntilZha closed the pull request at:
https://github.com/apache/spark/pull/3405
Github user EntilZha commented on the pull request:
https://github.com/apache/spark/pull/3405#issuecomment-64969668
Works for me when building against the master branch. Close the PR and close
the issue with a reference to #3058?