Hi everyone, new to this list and Spark, so I'm hoping someone can point me in the right direction.
I'm trying to perform the same sort of task described here: http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql and I'm running into the same problem: it doesn't scale. Even on a very fast processor, MySQL pegs one CPU core at 100% and takes 8 hours to find a match across 30 million+ rows.

What I would like to do is load this data set from MySQL into Spark, compute the Hamming distance using all available cores, and then select the rows within a maximum distance. I'm most familiar with Python, so I would prefer to use that.

I found an example of loading data from MySQL: http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/

I also found a related DataFrame commit and docs, but I'm not exactly sure how to put this all together:
https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR

Could anyone please point me to a similar example I could follow as a Spark newb to try this out? Is this even worth attempting, or will it similarly fail performance-wise?

Thanks!

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.