Re: Computing hamming distance over large data set

2016-02-12 Thread Charlie Hack
I ran across DIMSUM a while ago but never used it. https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html Annoy is wonderful if you want to make queries. If you want to do the "self similarity join" you might look at DIMSUM or preferably if at all

Re: Cosine LSH Join

2015-09-23 Thread Charlie Hack
This is great! Pretty sure I have a use for it involving entity resolution of text records. How does this compare to the DIMSUM similarity join implementation in MLlib performance wise, out of curiosity? Thanks, Charlie On Wednesday, Sep 23, 2015 at 09:25, Nick

Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above, esp. dimensionality reduction and locality-sensitive hashing / min hashing. There's also an algorithm implemented in MLlib called DIMSUM, which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about the results you get.
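
For readers unfamiliar with the min hashing mentioned above, here is a minimal pure-Python sketch of the idea. It is not DIMSUM and not any MLlib API; the function names (`minhash_signature`, `estimate_jaccard`) and the choice of seeded BLAKE2b as the hash family are assumptions made for illustration only. Each record is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the original token sets.

```python
import hashlib


def _seeded_hash(seed, token):
    # Hypothetical hash family: one 64-bit BLAKE2b digest per (seed, token) pair.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=256):
    """Return a MinHash signature: the minimum hash value under each seed."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_seeded_hash(seed, t) for t in token_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity, for comparison against the estimate."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown cat jumps over the lazy dog".split()
    sig_a = minhash_signature(doc_a)
    sig_b = minhash_signature(doc_b)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("exact:    ", exact_jaccard(doc_a, doc_b))
```

In a real k-NN pipeline the signatures would typically be split into bands and bucketed so that only records sharing a bucket are compared; that step is omitted here.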

Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Charlie Hack
Hi, I'm new to Spark and am trying to create a Spark df from a pandas df with ~5 million rows. Using Spark 1.4.1. When I type: df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(didf), None)) (the df.where is a hack I found on the Spark JIRA to avoid a problem with NaN values making

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-18 Thread Charlie Hack
/jdk1.7.0_79.jdk/Contents/Home On Mon, Aug 17, 2015 at 11:11 PM, Alun Champion a...@achampion.net wrote: Yes, they both are set. Just recompiled and still no success, silent failure. Which versions of java and scala are you using? On 17 August 2015 at 19:59, Charlie Hack charles.t.h...@gmail.com

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-17 Thread Charlie Hack
I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1 using these instructions http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html (using `$ sbt/sbt clean assembly`, with the additional step of downloading the proper sbt-launch.jar (0.13.7) from