Hi everyone, new to this list and Spark, so I'm hoping someone can point me in the right direction.
I'm trying to perform the same sort of task described here: http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql and I'm running into the same problem: it doesn't scale. Even on a very fast processor, MySQL pegs one CPU core at 100% and takes 8 hours to find a match across 30 million+ rows.

What I would like to do is load this data set from MySQL into Spark, compute the Hamming distance using all available cores, and then select the rows within a maximum distance. I'm most familiar with Python, so I would prefer to use that.

I found an example of loading data from MySQL: http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/

I also found a related DataFrame commit and docs, but I'm not exactly sure how to put this all together:
https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR

Could anyone please point me to a similar example I could follow as a Spark newb to try this out? Is this even worth attempting, or will it similarly fail performance-wise?

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.