Karl,

This is tremendously useful.  Thanks very much for your insight.

Brian

On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley <kmhig...@gmail.com> wrote:

> Hi,
>
> It sounds like you're trying to solve the approximate nearest neighbor
> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
> approach isn't likely to help much, because the number of pairwise
> comparisons grows quadratically with the size of the dataset. I'd look at
> ways to avoid computing the similarity between all pairs, like
> locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
> it's currently slated for the Spark 2.0.0 release, but AFAIK development on
> it hasn't started yet.)
>
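> The core idea is cheap to prototype outside Spark. Here's a rough sketch
> of bit-sampling LSH for Hamming distance (untested, with made-up
> parameters -- you'd want to tune k and the number of tables):
>
>     import random
>     from collections import defaultdict
>
>     def build_tables(hashes, nbits=64, k=16, ntables=8, seed=1):
>         """Index integer hashes into ntables tables, each keyed by the
>         values of k randomly sampled bit positions."""
>         rng = random.Random(seed)
>         tables = []
>         for _ in range(ntables):
>             positions = rng.sample(range(nbits), k)
>             buckets = defaultdict(list)
>             for idx, h in enumerate(hashes):
>                 key = tuple((h >> p) & 1 for p in positions)
>                 buckets[key].append(idx)
>             tables.append((positions, buckets))
>         return tables
>
>     def candidates(tables, query):
>         """Union of the buckets the query lands in; verify exact
>         Hamming distance on this (much smaller) candidate set."""
>         cands = set()
>         for positions, buckets in tables:
>             key = tuple((query >> p) & 1 for p in positions)
>             cands.update(buckets.get(key, ()))
>         return cands
>
> Similar items collide in at least one table with high probability, so you
> only compute exact distances against the candidates instead of all n rows.
>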
> There are a bunch of Python libraries that support various approaches to
> the ANN problem (including LSH), though. It sounds like you need fast
> lookups, so you might check out https://github.com/spotify/annoy. For
> other alternatives, see this performance comparison of Python ANN
> libraries: https://github.com/erikbern/ann-benchmarks.
>
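> Basic Annoy usage looks roughly like this (from memory, so double-check
> against the README; the dimensionality, metric, and toy data are just
> placeholders):
>
>     import random
>     from annoy import AnnoyIndex
>
>     f = 64  # vector dimensionality (placeholder)
>     index = AnnoyIndex(f, 'angular')
>
>     # Toy data standing in for your real vectors
>     vectors = [[random.gauss(0, 1) for _ in range(f)] for _ in range(1000)]
>     for i, vec in enumerate(vectors):
>         index.add_item(i, vec)
>
>     index.build(10)  # 10 trees; more trees -> better recall, bigger index
>     nearest = index.get_nns_by_vector(vectors[0], 10)  # 10 approx. neighbors
>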
> Hope that helps,
> Karl
>
> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com> wrote:
>
>> Hi everyone, I'm new to this list and to Spark, so I'm hoping someone can
>> point me in the right direction.
>>
>> I'm trying to perform this same sort of task:
>>
>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>
>> and I'm running into the same problem - it doesn't scale.  Even on a very
>> fast processor, MySQL pegs one CPU core at 100% and takes 8 hours to find
>> a match with 30 million+ rows.
>>
>> What I'd like to do is load this data set from MySQL into Spark, compute
>> the Hamming distance to a query hash using all available cores, and then
>> select the rows within a maximum distance.  I'm most familiar with Python,
>> so I'd prefer to use that.
>>
>> I found an example of loading data from MySQL:
>>
>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>
>> I found a related DataFrame commit and docs, but I'm not exactly sure how
>> to put it all together; I've sketched what I'm imagining below.
>>
>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>>
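>> Putting those together, I imagine something like this (untested, and the
>> connection URL, table, and column names are made up):
>>
>>     from pyspark import SparkContext
>>     from pyspark.sql import SQLContext
>>     from pyspark.sql.functions import udf, lit
>>     from pyspark.sql.types import IntegerType
>>
>>     sc = SparkContext(appName="hamming")
>>     sqlContext = SQLContext(sc)
>>
>>     # Load the table over JDBC (made-up URL/table; needs the MySQL
>>     # JDBC driver on the classpath)
>>     df = sqlContext.read.format("jdbc").options(
>>         url="jdbc:mysql://localhost:3306/mydb",
>>         dbtable="images",
>>         driver="com.mysql.jdbc.Driver").load()
>>
>>     # Spark has no built-in popcount, so count set bits in a Python UDF
>>     # (assumes the hashes are stored as non-negative 64-bit integers)
>>     popcount = udf(lambda x: bin(x).count('1'), IntegerType())
>>
>>     query = 1234605616436508552  # the hash I'm matching against (made up)
>>     matches = (df
>>         .withColumn("dist", popcount(df["hash"].bitwiseXOR(lit(query))))
>>         .filter("dist <= 5"))
>>     matches.show()
>>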
>> Could anyone please point me to a similar example I could follow as a Spark
>> newb to try this out?  Is this even worth attempting, or will it similarly
>> fail performance-wise?
>>
>> Thanks!
