I ran across DIMSUM a while ago but never used it.
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
Annoy is wonderful if you want to make queries.
If you want to do the "self similarity join" you might look at DIMSUM or
preferably if at all
This is great! Pretty sure I have a use for it involving entity resolution of
text records.
How does this compare to the DIMSUM similarity join implementation in MLlib
performance-wise, out of curiosity?
Thanks,
Charlie
On Wednesday, Sep 23, 2015 at 09:25, Nick
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / min-hashing.
There's also an algorithm implemented in MLlib called DIMSUM which was
developed at Twitter for this purpose. I've been meaning to try it and would be
interested to hear about results you get.
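
For anyone who hasn't seen min-hashing before, here is a minimal pure-Python sketch of the idea (no Spark involved). It is only an illustration under some assumptions of mine: character 3-gram shingles, 128 hash functions built from salted BLAKE2b, and the helper names `shingles`, `minhash_signature`, `estimated_jaccard` and `exact_jaccard` are all made up for this example rather than taken from MLlib or any other library.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of lowercased character k-grams in `text`."""
    text = text.lower()
    # Texts shorter than k yield a single (short) shingle instead of an empty set.
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_hashes=128):
    """Build a MinHash signature: one minimum per salted hash function."""
    signature = []
    for seed in range(num_hashes):
        # Each seed becomes a different BLAKE2b salt, i.e. a different hash function.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    left = shingles("Acme Corporation, New York")
    right = shingles("ACME Corp., New York")
    print("exact:    ", exact_jaccard(left, right))
    print("estimated:", estimated_jaccard(minhash_signature(left),
                                          minhash_signature(right)))
```

The signatures are fixed-length, so they can be compared (or bucketed for LSH) without keeping the full shingle sets around. More hash functions give a tighter estimate at the cost of more work per record.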
Hi,
I'm new to Spark and am trying to create a Spark df from a pandas df with
~5 million rows. I'm using Spark 1.4.1.
When I type:
df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))
(the df.where is a hack I found on the Spark JIRA to avoid a problem with
NaN values making
/jdk1.7.0_79.jdk/Contents/Home
On Mon, Aug 17, 2015 at 11:11 PM, Alun Champion a...@achampion.net wrote:
Yes, they both are set. Just recompiled and still no success, silent
failure.
Which versions of Java and Scala are you using?
On 17 August 2015 at 19:59, Charlie Hack charles.t.h...@gmail.com
I had success earlier today on OS X Yosemite 10.10.4 building Spark 1.4.1
using these instructions
http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
(using
`$ sbt/sbt clean assembly`, with the additional step of downloading the
proper sbt-launch.jar (0.13.7) from