Re: Computing Hamming distance over large data set
I ran across DIMSUM a while ago but never used it: https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Annoy is wonderful if you want to make queries. If you want to do the "self-similarity join" you might look at DIMSUM, or preferably, if at all possible, see if there's some key on which you can join candidate pairs and then use a similarity metric to filter out non-matches. Does that make sense? In general that's way more efficient than computing n^2 similarities.

HTH,
Charlie

On Fri, Feb 12, 2016 at 20:57 Maciej Szymkiewicz wrote:

> There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
>
> On 02/11/2016 10:12 PM, Brian Morton wrote:
>
> Karl,
>
> This is tremendously useful. Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley wrote:
>
>> Hi,
>>
>> It sounds like you're trying to solve the approximate nearest neighbor
>> (ANN) problem. With a large dataset, parallelizing a brute-force O(n^2)
>> approach isn't likely to help all that much, because the number of
>> pairwise comparisons grows quickly as the size of the dataset increases.
>> I'd look at ways to avoid computing the similarity between all pairs,
>> like locality-sensitive hashing. (Unfortunately Spark doesn't yet
>> support LSH -- it's currently slated for the Spark 2.0.0 release, but
>> AFAIK development on it hasn't started yet.)
>>
>> There are a bunch of Python libraries that support various approaches
>> to the ANN problem (including LSH), though. It sounds like you need fast
>> lookups, so you might check out https://github.com/spotify/annoy. For
>> other alternatives, see this performance comparison of Python ANN
>> libraries: https://github.com/erikbern/ann-benchmarks
>>
>> Hope that helps,
>> Karl
>>
>> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 wrote:
>>
>>> Hi everyone, new to this list and Spark, so I'm hoping someone can
>>> point me in the right direction.
>>>
>>> I'm trying to perform this same sort of task:
>>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>>
>>> and I'm running into the same problem - it doesn't scale. Even on a
>>> very fast processor, MySQL pegs one CPU core at 100% and takes 8 hours
>>> to find a match with 30 million+ rows.
>>>
>>> What I would like to do is to load this data set from MySQL into
>>> Spark, compute the Hamming distance using all available cores, then
>>> select the rows matching a maximum distance. I'm most familiar with
>>> Python, so I'd prefer to use that.
>>>
>>> I found an example of loading data from MySQL:
>>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>>
>>> I also found a related DataFrame commit and docs, but I'm not exactly
>>> sure how to put this all together:
>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>>>
>>> Could anyone please point me to a similar example I could follow as a
>>> Spark newb to try this out? Is this even worth attempting, or will it
>>> similarly fail performance-wise?
>>>
>>> Thanks!
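[Since Annoy comes up twice in this thread, here is a hedged sketch of the index-then-query usage Karl suggests. The dimensions, tree count, and data below are illustrative assumptions, not anything from the thread, and the metric argument follows the current Annoy API.]

```python
# Hypothetical usage sketch of https://github.com/spotify/annoy for fast
# approximate nearest-neighbor lookups; all parameters here are made up.
import random
from annoy import AnnoyIndex

dim = 64                                  # dimensionality of your vectors
index = AnnoyIndex(dim, "angular")        # angular metric ~ cosine similarity

for i in range(1000):                     # stand-in for a real dataset
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                           # 10 trees: more trees, better recall
neighbors = index.get_nns_by_item(0, 5)   # 5 approximate neighbors of item 0
print(neighbors)
```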
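[Pulling the thread's pieces together, a minimal, untested sketch of what rokclimb15 describes: a parallel JDBC load, `bitwiseXOR`, and a popcount UDF. It assumes a Spark version with `Column.bitwiseXOR` (linked above), a MySQL JDBC driver on the classpath, and a hypothetical table `hashes(id BIGINT, phash BIGINT)`; the URL, bounds, and thresholds are placeholders.]

```python
# Hedged sketch: parallel Hamming-distance filter over a MySQL table.
# Table/column names, the JDBC URL, and the thresholds are assumptions.
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

sqlContext = SQLContext(sc)  # sc: an existing SparkContext

df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
    dbtable="hashes",
    partitionColumn="id",    # lets Spark split the read across cores
    lowerBound="1", upperBound="30000000", numPartitions="64",
).load()

# Hamming distance between two 64-bit hashes = popcount of their XOR.
# Mask to 64 bits so negative BIGINTs don't put a '-' in bin()'s output.
popcount = udf(lambda x: bin(x & 0xFFFFFFFFFFFFFFFF).count("1"), IntegerType())

query_hash = 1234605616436508552   # the hash to match against (made up)
max_dist = 8                       # maximum Hamming distance to keep

matches = (df
           .withColumn("dist", popcount(col("phash").bitwiseXOR(query_hash)))
           .filter(col("dist") <= max_dist))
matches.show()
```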
Re: Cosine LSH Join
This is great! Pretty sure I have a use for it involving entity resolution of text records. How does this compare to the DIMSUM similarity join implementation in MLlib performance-wise, out of curiosity?

Thanks,
Charlie

On Wednesday, Sep 23, 2015 at 09:25, Nick Pentreath wrote:

> Looks interesting - I've been trying out a few of the ANN / LSH packages
> on spark-packages.org and elsewhere, e.g.
> http://spark-packages.org/package/tdebatty/spark-knn-graphs and
> https://github.com/marufaytekin/lsh-spark
>
> How does this compare? Perhaps you could put it up on spark-packages to
> get visibility?
>
> On Wed, Sep 23, 2015 at 3:02 PM, Demir wrote:
>
>> We've just open sourced an LSH implementation on Spark. We're using
>> this internally to find top-k neighbors after a matrix factorization.
>>
>> We hope that this might be of use for others:
>> https://github.com/soundcloud/cosine-lsh-join-spark
>>
>> For those wondering: LSH is a technique to quickly find the most
>> similar neighbors in a high-dimensional space. This is a problem faced
>> whenever objects are represented as vectors in a high-dimensional
>> space, e.g. words, items, users...
>>
>> cheers,
>> özgür demir
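[For anyone wondering what is under the hood here: a tiny, self-contained sketch of the random-hyperplane flavor of LSH that cosine-similarity methods typically use. This illustrates the general technique, not the soundcloud library's actual API; everything below is a NumPy-only toy.]

```python
# Minimal random-hyperplane LSH demo: vectors with high cosine similarity
# tend to land in the same bucket, so candidate pairs come from buckets
# instead of all O(n^2) pairs. Purely illustrative.
import numpy as np

rng = np.random.RandomState(42)
dim, n_bits = 100, 16
planes = rng.randn(n_bits, dim)        # one random hyperplane per signature bit

def signature(v):
    # Each bit records which side of a hyperplane v falls on.
    return tuple(planes.dot(v) >= 0)

vectors = rng.randn(1000, dim)
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(signature(v), []).append(i)

# Compare only within buckets; repeating with several independent hash
# tables recovers neighbor pairs that any single table splits apart.
print(max(len(ids) for ids in buckets.values()), "items in largest bucket")
```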
Re: Build k-NN graph for large dataset
+1 to all of the above, esp. dimensionality reduction and locality-sensitive hashing / min-hashing. There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about the results you get. https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum

Charlie

On Wednesday, Aug 26, 2015 at 09:57, Michael Malak wrote:

> Yes. And a paper that describes using grids (actually varying grids) is
> http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
>
> In the Spark GraphX In Action book that Robin East and I are writing, we
> implement a drastically simplified version of this in chapter 7, which
> should become available in the MEAP mid-September.
> http://www.manning.com/books/spark-graphx-in-action
>
> If you don't want to compute all N^2 similarities, you need to implement
> some kind of blocking first. For example, LSH (locality-sensitive
> hashing). A quick search gave this link to a Spark implementation:
> http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing
>
> On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa wrote:
>
>> Dear all,
>>
>> I'm trying to find an efficient way to build a k-NN graph for a large
>> dataset. Precisely, I have a large set of high-dimensional vectors (say
>> d >> 1) and I want to build a graph where those high-dimensional points
>> are the vertices and each one is linked to its k nearest neighbors
>> based on some kind of similarity defined on the vertex space.
>>
>> My problem is implementing an efficient algorithm to compute the weight
>> matrix of the graph. I need to compute N*N similarities, and the only
>> way I know is to use a cartesian operation followed by a map operation
>> on an RDD. But this is very slow when N is large. Is there a cleverer
>> way to do this for an arbitrary similarity function?
>>
>> Cheers,
>> Jao
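[If you want to try DIMSUM, it is exposed in MLlib as `RowMatrix.columnSimilarities`. A hedged PySpark sketch follows; note that DIMSUM computes similarities between *columns*, so your points must be laid out as matrix columns, and availability in the Python API depends on your Spark version. The tiny matrix and threshold are placeholders.]

```python
# Hedged sketch of MLlib's DIMSUM via columnSimilarities.
from pyspark.mllib.linalg.distributed import RowMatrix

# Each RDD element is one matrix row; the COLUMNS are the points compared.
mat = RowMatrix(sc.parallelize([[1.0, 2.0, 3.0],
                                [4.0, 5.0, 6.0],
                                [7.0, 8.0, 9.0]]))

# threshold > 0 turns on DIMSUM's sampling: cheaper, but pairs whose
# similarity falls below the threshold may be missed entirely.
sims = mat.columnSimilarities(threshold=0.1)

# Result is an upper-triangular CoordinateMatrix of (i, j, cosine sim).
for e in sims.entries.collect():
    print(e.i, e.j, e.value)
```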
Creating Spark DataFrame from large pandas DataFrame
Hi,

I'm new to Spark and am trying to create a Spark DataFrame from a pandas DataFrame with ~5 million rows, using Spark 1.4.1. When I type:

    df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the .where() is a hack I found on the Spark JIRA to avoid a problem with NaN values producing mixed column types), I get:

    TypeError: cannot create an RDD from type: <type 'list'>

Converting a smaller pandas DataFrame (~2000 rows) works fine. Has anyone had this issue?

This is already a workaround -- ideally I'd like to read the Spark DataFrame from a Hive table, but that's currently not an option for my setup.

I also tried reading the data into Spark from a CSV using spark-csv, and haven't been able to make that work as yet. I launch

    $ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the CSV I get:

    Py4JJavaError: An error occurred while calling o22.load.
    : java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat
    ...

Other options I can think of:

- Convert my CSV to JSON (use Pig?) and read it into Spark
- Read it in using a JDBC connection from Postgres

But I want to make sure I'm not misusing Spark or missing something obvious. Thanks!

Charlie
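[On the spark-csv NoClassDefFoundError: `--jars` ships only the jar you name, and spark-csv depends on commons-csv, which then isn't on the classpath. A likely fix (hedged, since the artifact's Scala suffix must match your Spark build) is to let Spark resolve transitive dependencies with `--packages`:]

```python
# Launch with --packages so commons-csv comes along as a transitive dep
# (pick _2.10 or _2.11 to match the Scala version of your Spark build):
#
#   $ pyspark --packages com.databricks:spark-csv_2.11:1.2.0
#
# Then the load that previously hit NoClassDefFoundError should work:
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("path/to/data.csv"))    # path is a placeholder
```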
Re: Spark 1.4.1 - Mac OSX Yosemite
Looks like Scala 2.11.6 and Java 1.7.0_79.

    ✔ ~ 09:17 $ scala
    Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79).
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala>

    ✔ ~ 09:26 $ echo $JAVA_HOME
    /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home

On Mon, Aug 17, 2015 at 11:11 PM, Alun Champion wrote:

> Yes, they both are set. Just recompiled and still no success -- a silent
> failure. Which versions of Java and Scala are you using?
>
> On 17 August 2015 at 19:59, Charlie Hack wrote:
>
>> I had success earlier today on OSX Yosemite 10.10.4 building Spark
>> 1.4.1 using these instructions:
>> http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
>> (using `$ sbt/sbt clean assembly`), with the additional step of
>> downloading the proper sbt-launch.jar (0.13.7) from
>> http://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/
>> and replacing the one in build/, as you noted.
>>
>> Have you set the SCALA_HOME and JAVA_HOME environment variables?
>>
>> On Mon, Aug 17, 2015 at 8:36 PM, Alun Champion wrote:
>>
>>> Has anyone experienced issues running Spark 1.4.1 on Mac OSX Yosemite?
>>> I've been running a standalone 1.3.1 fine, but it failed when trying
>>> to run 1.4.1 (I also tried 1.4.0).
>>>
>>> I've tried both the pre-built packages and compiling from source, with
>>> the same results. (I can successfully compile with both mvn and sbt,
>>> after fixing the sbt-launch jar, which was corrupt.) After
>>> downloading/building Spark and running ./bin/pyspark or
>>> ./bin/spark-shell, it silently exits with code 1. Creating a context
>>> in Python I get:
>>>
>>> Exception: Java gateway process exited before sending the driver its port number
>>>
>>> I couldn't find any specific resolutions on the web. I did add
>>> 'pyspark-shell' to PYSPARK_SUBMIT_ARGS, but to no effect.
>>>
>>> Anyone have any further ideas I can explore?
>>>
>>> Cheers
>>> -Alun.
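[One way to narrow down the silent exit / "Java gateway process exited" symptom is to check, from Python, the environment PySpark uses when launching the JVM. This is a quick hedged diagnostic, not a fix; which variables matter can vary by Spark version.]

```python
# Print the variables PySpark's launcher consults, then confirm the JVM
# actually starts. If `java -version` fails here, pyspark will fail too.
import os
import subprocess

for var in ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS"):
    print("%s=%s" % (var, os.environ.get(var)))

java = os.path.join(os.environ.get("JAVA_HOME", ""), "bin", "java")
subprocess.call([java, "-version"])   # should print the JDK version banner
```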