Hi,

It sounds like you're trying to solve the approximate nearest neighbor
(ANN) problem. With a large dataset, parallelizing a brute-force O(n^2)
approach isn't likely to help all that much, because the number of pairwise
comparisons grows quadratically with the size of the dataset. I'd look at
ways to avoid computing the similarity between all pairs, like
locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
it's currently slated for the Spark 2.0.0 release, but AFAIK development on
it hasn't started yet.)
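
For Hamming distance in particular, a simple LSH scheme is easy to
hand-roll while you wait: sampling a handful of bit positions is itself a
locality-sensitive hash family for Hamming space. Here's a toy, in-memory
sketch of the idea (all the names and parameter values are mine and purely
illustrative -- you'd tune the table and bit counts for your recall target):

    import random

    HASH_BITS = 64     # assuming 64-bit hashes, as in your SO link
    NUM_TABLES = 10    # more tables -> higher recall, more memory
    BITS_PER_KEY = 16  # fewer bits per key -> bigger candidate buckets

    random.seed(42)
    # Each table keys on a different random sample of bit positions.
    TABLES = [random.sample(range(HASH_BITS), BITS_PER_KEY)
              for _ in range(NUM_TABLES)]

    def key_for(h, positions):
        # Project the hash onto the sampled bit positions.
        return tuple((h >> p) & 1 for p in positions)

    def build_index(hashes):
        index = [{} for _ in range(NUM_TABLES)]
        for i, h in enumerate(hashes):
            for t, positions in enumerate(TABLES):
                index[t].setdefault(key_for(h, positions), []).append(i)
        return index

    def query(index, hashes, q, max_dist):
        # Candidates collide with q in at least one table; verify them
        # with the exact Hamming distance (popcount of the XOR).
        candidates = set()
        for t, positions in enumerate(TABLES):
            candidates.update(index[t].get(key_for(q, positions), ()))
        return [i for i in candidates
                if bin(hashes[i] ^ q).count("1") <= max_dist]

Close hashes agree on most bits, so they collide in at least one table with
high probability, and you only pay the exact distance check on that small
candidate set instead of on all 30M rows.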

There are a bunch of Python libraries that support various approaches to
the ANN problem (including LSH), though. It sounds like you need fast
lookups, so you might check out https://github.com/spotify/annoy. For other
alternatives, see this performance comparison of Python ANN libraries:
https://github.com/erikbern/ann-benchmarks.
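
Annoy indexes vectors, so one way in is to unpack each 64-bit hash into a
64-dimensional 0/1 vector. A rough sketch (the data here is made up, and
whether a 'hamming' metric is available depends on your Annoy version --
check the README; 'angular' over 0/1 vectors is a crude fallback):

    from annoy import AnnoyIndex

    BITS = 64
    hashes = [0x0000FFFF0000FFFF, 0x0000FFFF0000FFFE]  # placeholder data

    def to_vector(h):
        # Unpack an integer hash into a 64-dim 0/1 vector.
        return [(h >> i) & 1 for i in range(BITS)]

    index = AnnoyIndex(BITS, 'hamming')
    for i, h in enumerate(hashes):
        index.add_item(i, to_vector(h))
    index.build(10)           # 10 trees; more trees -> better recall
    index.save('hashes.ann')  # mmap-ed on load, so lookups stay fast

    # Later, possibly in another process:
    index2 = AnnoyIndex(BITS, 'hamming')
    index2.load('hashes.ann')
    nearest = index2.get_nns_by_vector(to_vector(hashes[0]), 100)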

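And if you'd still like to wire up the brute-force scan in Spark just to
see how far it goes -- it will at least spread the work across your cores,
though it still touches every row per query -- the pieces from your links
fit together roughly like this. This is an untested sketch; the connection
settings, table/column names, and the hash literal are all placeholders:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf, lit
    from pyspark.sql.types import IntegerType

    sc = SparkContext(appName="hamming-scan")
    sqlContext = SQLContext(sc)

    # Load the MySQL table through the JDBC data source (the MySQL
    # connector jar has to be on the classpath, e.g. via --jars).
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost:3306/mydb",
        dbtable="images",
        user="...",
        password="...",
        driver="com.mysql.jdbc.Driver").load()

    QUERY_HASH = 0x1234567890ABCDEF  # the hash you're searching near
    MAX_DIST = 5

    # Spark has no built-in popcount, so count bits in a Python UDF.
    # The mask keeps the XOR result well-defined if the column comes
    # back as a signed BIGINT.
    popcount = udf(lambda x: bin(x & 0xFFFFFFFFFFFFFFFF).count("1"),
                   IntegerType())

    matches = df.filter(
        popcount(df["hash"].bitwiseXOR(lit(QUERY_HASH))) <= MAX_DIST)
    matches.show()

Fair warning: the Python UDF will be the bottleneck here, which is part of
why I'd still steer you toward an ANN index.
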
Hope that helps,
Karl

On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com> wrote:

> Hi everyone, new to this list and Spark, so I'm hoping someone can point me
> in the right direction.
>
> I'm trying to perform this same sort of task:
>
> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>
> and I'm running into the same problem - it doesn't scale.  Even on a very
> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
> find a match with 30 million+ rows.
>
> What I would like to do is to load this data set from MySQL into Spark and
> compute the Hamming distance using all available cores, then select the
> rows within a maximum distance.  I'm most familiar with Python, so would
> prefer to use that.
>
> I found an example of loading data from MySQL
>
>
> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>
> I found a related DataFrame commit and docs, but I'm not exactly sure how to
> put this all together.
>
>
> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>
> Could anyone please point me to a similar example I could follow as a Spark
> newb to try this out?  Is this even worth attempting, or will it similarly
> fail performance-wise?
>
> Thanks!
>
