[GitHub] spark issue #13735: [SPARK-15328][MLLIB][ML] Word2Vec import for original bi...

2016-09-15 Thread insidedctm
Github user insidedctm commented on the issue:

https://github.com/apache/spark/pull/13735
  
This seems to work fine with a small model, such as the one produced by 
demo_word.sh in the word2vec code repository. However, I run into problems 
when trying a large model such as 
[GoogleNews-vectors-negative300.bin](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing).

I can load the model successfully with this code (although I had to give 
the driver 12 GB of memory):
`import org.apache.spark.ml.feature.Word2VecModel`
`val path = "file:///Downloads/GoogleNews-vectors-negative300.bin"`
`val model = Word2VecModel.loadGoogleModel(path)`

However, synonyms are not found for a typical lookup, e.g.
`model.findSynonyms("spark",20).show`
responds with
`java.lang.IllegalStateException: spark not in vocabulary`

However, the distance tool from the word2vec toolkit, loading the same model, 
gives:

https://cloud.githubusercontent.com/assets/5909684/18549055/0a60f9da-7b44-11e6-895c-88ee018ed1a9.png
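
One way to narrow this down is to read the vocabulary straight from the 
binary file, independently of Spark, and check whether the word is present. 
The sketch below is not the loadGoogleModel implementation from this PR. It 
assumes the layout written by the original word2vec C tool: a text header 
"vocabSize dim" ending in a newline, then for each word the word bytes, a 
space, and dim 4-byte floats, optionally followed by a newline. The object 
and method names are hypothetical.

```scala
import java.io.{BufferedInputStream, ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

object Word2VecBinVocab {
  // Reads bytes up to (not including) the delimiter and decodes them as UTF-8.
  private def readToken(in: DataInputStream, delimiter: Char): String = {
    val buf = new ByteArrayOutputStream()
    var b = in.readByte()
    while (b != delimiter.toByte) {
      buf.write(b)
      b = in.readByte()
    }
    new String(buf.toByteArray, StandardCharsets.UTF_8)
  }

  // Returns the words stored in a word2vec binary stream, skipping the vectors.
  def readVocabulary(stream: InputStream): Vector[String] = {
    val in = new DataInputStream(new BufferedInputStream(stream))
    try {
      val vocabSize = readToken(in, ' ').trim.toInt
      val dim = readToken(in, '\n').trim.toInt
      val vectorBytes = new Array[Byte](dim * 4) // scratch buffer, contents ignored
      val words = Vector.newBuilder[String]
      var i = 0
      while (i < vocabSize) {
        // trim drops the newline that may follow the previous word's vector
        words += readToken(in, ' ').trim
        in.readFully(vectorBytes)
        i += 1
      }
      words.result()
    } finally {
      in.close()
    }
  }
}
```

For example, 
`Word2VecBinVocab.readVocabulary(new java.io.FileInputStream(localPath)).contains("spark")` 
shows whether the file itself holds the word. Here `localPath` is a plain 
filesystem path rather than a `file:///` URI.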




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

2016-02-21 Thread insidedctm
Github user insidedctm commented on the pull request:

https://github.com/apache/spark/pull/11290#issuecomment-186838624
  
@srowen good points. I've updated and pushed changes in line with your 
comments.





[GitHub] spark pull request: [SPARK-3650][GraphX] Triangle Count handles re...

2016-02-21 Thread insidedctm
GitHub user insidedctm opened a pull request:

https://github.com/apache/spark/pull/11290

[SPARK-3650][GraphX] Triangle Count handles reverse edges incorrectly

## What changes were proposed in this pull request?

This is a reworking of @jegonzal's PR #2495 to address the issue identified 
in SPARK-3650. The code has been amended to use the convertToCanonicalEdges 
method.
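
To illustrate why canonical edges matter here, the following is a minimal 
plain-Scala sketch, not the GraphX code in this patch. It orients every edge 
so that the smaller vertex id comes first and drops duplicates, so a pair of 
reverse edges such as (1, 2) and (2, 1) counts as a single undirected edge 
before triangles are counted. All names in it are hypothetical.

```scala
object CanonicalTriangles {
  // Orient every edge so that src < dst, dropping self-loops and duplicates.
  def canonicalEdges(edges: Seq[(Long, Long)]): Set[(Long, Long)] =
    edges.collect { case (a, b) if a != b => if (a < b) (a, b) else (b, a) }.toSet

  // Counts, for every vertex, the triangles it takes part in.
  def triangleCounts(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    val canonical = canonicalEdges(edges).toSeq

    // Undirected neighbour sets built from the canonical edges.
    val neighbours: Map[Long, Set[Long]] =
      canonical
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // A triangle a < b < c is found exactly once: at edge (a, b) with c > b.
    val triangles = for {
      (a, b) <- canonical
      c <- neighbours(a) intersect neighbours(b)
      if c > b
    } yield (a, b, c)

    val counts = triangles
      .flatMap { case (a, b, c) => Seq(a, b, c) }
      .groupBy(identity)
      .map { case (v, vs) => v -> vs.size }

    // Vertices that are in no triangle are reported with a count of 0.
    neighbours.keys.map(v => v -> counts.getOrElse(v, 0)).toMap
  }
}
```

With the input `Seq((1L, 2L), (2L, 1L), (2L, 3L), (3L, 1L))` the reverse edge 
(2, 1) collapses into (1, 2), and each of the three vertices is reported as 
being in one triangle.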


## How was this patch tested?

The patch was tested using the unit tests created in PR #2495.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/insidedctm/spark spark-3650

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11290


commit 428fa26880bb32f04d0799d2c227e52defb99428
Author: Robin East 
Date:   2015-09-14T21:09:26Z

Change bytes to bits in RoutingTablePartition.toMessage

commit cf66402fb77855711ffd17ddb3efa58c7d44296e
Author: Robin East 
Date:   2016-02-18T18:38:37Z

Merge remote-tracking branch 'upstream/master'

commit 96fcc0aae84450d6cc3edf046807048b2d8c2db1
Author: Joseph E. Gonzalez 
Date:   2014-09-22T21:57:28Z

Improving Triangle Count

commit 1edc09df8e32b6717aa300fe62636a9613bcbc27
Author: Joseph E. Gonzalez 
Date:   2014-09-22T22:16:46Z

fixing bug in unit tests where bi-directed edges lead to duplicate 
triangles.

commit 47673cadc957eb35dbab01cdcbbe21382987e691
Author: Joseph E. Gonzalez 
Date:   2014-11-13T07:18:58Z

factored out code for canonicalization

commit c6cd74792d4f82e562d1c792d322f17b1877d4af
Author: Robin East 
Date:   2016-02-20T21:46:49Z

SPARK-3650 updates to PR 2495 to work with current master

commit c8ad0bd4ed998b86a465bc36ec59ddc5dcceef5e
Author: Robin East 
Date:   2016-02-21T11:27:10Z

revert unexpected changes to R/pkg/DESCRIPTION







[GitHub] spark pull request: [SPARK-3650] Fix TriangleCount handling of rev...

2015-09-15 Thread insidedctm
Github user insidedctm commented on the pull request:

https://github.com/apache/spark/pull/2495#issuecomment-140515615
  
@pwendell can this be opened again? As per my discussion on the JIRA 
ticket, this is an issue that came up on the mailing list recently.





[GitHub] spark pull request: [SPARK-10598][DOCS]

2015-09-14 Thread insidedctm
GitHub user insidedctm opened a pull request:

https://github.com/apache/spark/pull/8756

[SPARK-10598][DOCS]

The comments preceding the toMessage method state: "The edge partition is 
encoded in the lower 30 bytes of the Int, and the position is encoded in the 
upper 2 bytes of the Int." References to bytes should be changed to bits.
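
A Scala Int is 32 bits (4 bytes), so a 30 + 2 split can only refer to bits. 
As a minimal sketch of such an encoding (hypothetical names, not the 
RoutingTablePartition source), packing and unpacking could look like this:

```scala
object RoutingMessageBits {
  // Mask with the lower 30 bits set.
  private val PidMask = 0x3FFFFFFF

  // Packs a partition id (lower 30 bits) and a position (upper 2 bits) into one Int.
  def pack(pid: Int, position: Int): Int = {
    require(pid >= 0 && pid <= PidMask, "pid must fit in 30 bits")
    require(position >= 0 && position < 4, "position must fit in 2 bits")
    (position << 30) | (pid & PidMask)
  }

  // Recovers the partition id from the lower 30 bits.
  def pid(packed: Int): Int = packed & PidMask

  // >>> is an unsigned shift, so the upper 2 bits come back as a value in 0..3.
  def position(packed: Int): Int = packed >>> 30
}
```

A round trip such as `RoutingMessageBits.pack(5, 3)` yields a negative Int 
because the sign bit is set, yet `pid` and `position` still recover 5 and 3.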

This contribution is my original work and I license the work to the Spark 
project under its open source license.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/insidedctm/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/8756.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #8756


commit 428fa26880bb32f04d0799d2c227e52defb99428
Author: Robin East 
Date:   2015-09-14T21:09:26Z

Change bytes to bits in RoutingTablePartition.toMessage



