Re: Computing hamming distance over large data set

2016-02-12 Thread Charlie Hack
I ran across DIMSUM a while ago but never used it.

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Annoy is wonderful if you want to make queries.

If you want to do the "self similarity join" you might look at DIMSUM, or
preferably, if at all possible, see if there's some key on which you can join
candidate pairs and then use a similarity metric to filter out non-matches
(rough sketch below). Does that make sense? In general that's way more
efficient than computing n^2 similarities.
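A rough, untested PySpark sketch of what I mean (the toy data, column names,
and the 0.8 Jaccard threshold are all made up for illustration; run in a
pyspark shell where sqlContext is defined):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # toy records: (id, blocking key, token set). In practice block_key is any
    # cheap attribute that true matches are very likely to share.
    records = sqlContext.createDataFrame(
        [(1, "48104", ["acme", "corp"]),
         (2, "48104", ["acme", "co"]),
         (3, "10001", ["apex", "inc"])],
        ["id", "block_key", "tokens"])

    # toy similarity metric (Jaccard over tokens); swap in whatever fits your data
    jaccard = F.udf(
        lambda x, y: float(len(set(x) & set(y))) / max(len(set(x) | set(y)), 1),
        DoubleType())

    a, b = records.alias("a"), records.alias("b")
    candidates = (a.join(b, F.col("a.block_key") == F.col("b.block_key"))
                   .where(F.col("a.id") < F.col("b.id")))   # drop self/mirrored pairs

    # only candidate pairs that share a block are scored, not all n^2 pairs
    matches = candidates.where(jaccard(F.col("a.tokens"), F.col("b.tokens")) > 0.8)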

Hth

Charlie
On Fri, Feb 12, 2016 at 20:57 Maciej Szymkiewicz 
wrote:

> There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
>
>
> On 02/11/2016 10:12 PM, Brian Morton wrote:
>
> Karl,
>
> This is tremendously useful.  Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley  wrote:
>
>> Hi,
>>
>> It sounds like you're trying to solve the approximate nearest neighbor
>> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
>> approach isn't likely to help all that much, because the number of pairwise
>> comparisons grows quickly as the size of the dataset increases. I'd look at
>> ways to avoid computing the similarity between all pairs, like
>> locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
>> it's currently slated for the Spark 2.0.0 release, but AFAIK development on
>> it hasn't started yet.)
>>
>> There are a bunch of Python libraries that support various approaches to
>> the ANN problem (including LSH), though. It sounds like you need fast
>> lookups, so you might check out https://github.com/spotify/annoy. For
>> other alternatives, see this performance comparison of Python ANN libraries:
>> https://github.com/erikbern/ann-benchmarks.
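For instance, a minimal Annoy sketch (plain Python, no Spark; the dimensionality
and tree count are arbitrary, and the "hamming" metric only exists in newer
Annoy releases, older ones only offer angular/euclidean):

    import random
    from annoy import AnnoyIndex

    dim = 64                            # length of the binary fingerprints
    index = AnnoyIndex(dim, "hamming")
    for i in range(100000):
        index.add_item(i, [random.randint(0, 1) for _ in range(dim)])
    index.build(10)                     # 10 trees; more trees = better recall

    neighbors = index.get_nns_by_item(0, 20)   # 20 approximate NNs of item 0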
>>
>> Hope that helps,
>> Karl
>>
>> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15  wrote:
>>
>>> Hi everyone, new to this list and Spark, so I'm hoping someone can point me
>>> in the right direction.
>>>
>>> I'm trying to perform this same sort of task:
>>>
>>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>>
>>> and I'm running into the same problem - it doesn't scale.  Even on a very
>>> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
>>> find a match with 30 million+ rows.
>>>
>>> What I would like to do is to load this data set from MySQL into Spark,
>>> compute the Hamming distance using all available cores, and then select the
>>> rows within a maximum distance. I'm most familiar with Python, so would
>>> prefer to use that.
>>>
>>> I found an example of loading data from MySQL
>>>
>>>
>>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>>
>>> I found a related DataFrame commit and docs, but I'm not exactly sure how
>>> to put this all together.
>>>
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>>
>>>
>>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
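Putting those two links together, a rough, untested sketch of what this could
look like (the JDBC settings, table and column names, and the distance cutoff
are placeholders; assumes non-negative integer fingerprints and the MySQL JDBC
driver on the classpath):

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Hamming distance between two integer fingerprints = popcount of their XOR
    popcount = F.udf(lambda x: bin(x).count("1") if x is not None else None,
                     IntegerType())

    # placeholder JDBC settings
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://host:3306/mydb?user=me&password=secret",
        dbtable="fingerprints",
        driver="com.mysql.jdbc.Driver").load()

    query_hash = 1234567890           # the fingerprint we're matching against
    matches = (df
        .withColumn("dist", popcount(df["hash"].bitwiseXOR(F.lit(query_hash))))
        .filter(F.col("dist") <= 8))  # keep rows within Hamming distance 8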
>>>
>>> Could anyone please point me to a similar example I could follow as a Spark
>>> newb to try this out? Is this even worth attempting, or will it similarly
>>> fail performance-wise?
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
> --
> Maciej Szymkiewicz
>
>


Re: Cosine LSH Join

2015-09-23 Thread Charlie Hack
This is great! Pretty sure I have a use for it involving entity resolution of
text records.

How does this compare to the DIMSUM similarity join implementation in MLlib
performance wise, out of curiosity?

Thanks,

Charlie

On Wednesday, Sep 23, 2015 at 09:25, Nick Pentreath wrote:

Looks interesting - I've been trying out a few of the ANN / LSH packages on
spark-packages.org and elsewhere, e.g.
http://spark-packages.org/package/tdebatty/spark-knn-graphs and
https://github.com/marufaytekin/lsh-spark

How does this compare? Perhaps you could put it up on spark-packages to get
visibility?

On Wed, Sep 23, 2015 at 3:02 PM, Demir  wrote:
We've just open sourced an LSH implementation on Spark. We're using this
internally in order to find topK neighbors after a matrix factorization.

We hope that this might be of use for others:

https://github.com/soundcloud/cosine-lsh-join-spark

For those wondering: LSH is a technique to quickly find the most similar
neighbors in a high dimensional space. This is a problem faced whenever
objects are represented as vectors in a high dimensional space, e.g. words,
items, users...

cheers

özgür demir




--

View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cosine-LSH-Join-tp24785.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.


-

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

For additional commands, e-mail: user-h...@spark.apache.org

Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above, esp. dimensionality reduction and locality-sensitive
hashing / MinHash.


There's also an algorithm implemented in MLlib called DIMSUM which was 
developed at Twitter for this purpose. I've been meaning to try it and would be 
interested to hear about results you get. 





https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum
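If you do try it, the entry point is RowMatrix.columnSimilarities with a
threshold, which is the DIMSUM sampling. A rough sketch (assumes a Spark
release where RowMatrix is exposed in PySpark; in older releases the same call
is only available from Scala, and the toy matrix and threshold are arbitrary):

    from pyspark.mllib.linalg.distributed import RowMatrix

    # rows = observations, columns = the things you want similarities between
    mat = RowMatrix(sc.parallelize([[1.0, 0.0, 3.0],
                                    [4.0, 5.0, 0.0],
                                    [7.0, 0.0, 9.0]]))

    # threshold > 0 turns on the DIMSUM sampling: higher thresholds skip more
    # low-similarity pairs and do less work, at some cost in accuracy
    sims = mat.columnSimilarities(threshold=0.1)
    print(sims.entries.take(10))      # CoordinateMatrix entries: (i, j, cosine)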

Charlie
—
Sent from Mailbox




On Wednesday, Aug 26, 2015 at 09:57, Michael Malak michaelma...@yahoo.com.invalid wrote:


Yes. And a paper that describes using grids (actually varying grids) is 
http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CCVPR12-GraphConstruction.pdf
 In the Spark GraphX In Action book that Robin East and I are writing, we 
implement a drastically simplified version of this in chapter 7, which should 
become available in the MEAP mid-September. 
http://www.manning.com/books/spark-graphx-in-action



If you don't want to compute all N^2 similarities, you need to implement some
kind of blocking first. For example, LSH (locality-sensitive hashing). A quick
search gave this link to a Spark implementation:

http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashing
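For a concrete picture of the blocking idea, here is a rough, untested sketch
of random-hyperplane LSH on plain RDDs (the toy data, dimensionality, and
number of hyperplanes are arbitrary); exact comparisons are then only done
within each bucket rather than across all N^2 pairs:

    import random

    dim, n_planes = 10, 8
    random.seed(42)
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def signature(v):
        # one bit per hyperplane: which side of the plane the vector falls on
        return tuple(1 if sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0 else 0
                     for p in planes)

    # toy data: (id, dense vector as a plain list)
    points = sc.parallelize(
        [(i, [random.gauss(0, 1) for _ in range(dim)]) for i in range(1000)])

    # group points by LSH signature; similar vectors tend to share a bucket
    buckets = points.map(lambda kv: (signature(kv[1]), kv)).groupByKey()

    # exact pairwise comparison only inside each (small) bucket
    candidate_pairs = buckets.flatMap(
        lambda kv: [(a, b) for a in kv[1] for b in kv[1] if a[0] < b[0]])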

On Wed, Aug 26, 2015 at 7:35 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Dear all,


I'm trying to find an efficient way to build a k-NN graph for a large dataset.
Precisely, I have a large set of high dimensional vectors (say d >> 1) and
I want to build a graph where those high dimensional points are the vertices
and each one is linked to its k nearest neighbors based on some kind of
similarity defined on the vertex space.
My problem is to implement an efficient algorithm to compute the weight matrix
of the graph. I need to compute N*N similarities and the only way I know is
to use a cartesian operation followed by a map operation on the RDD. But this is
very slow when N is large. Is there a more clever way to do this for an
arbitrary similarity function?




Cheers,




Jao

Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Charlie Hack
Hi,

I'm new to spark and am trying to create a Spark df from a pandas df with
~5 million rows. Using Spark 1.4.1.

When I type:

df = sqlContext.createDataFrame(pandas_df.where(pd.notnull(pandas_df), None))

(the df.where is a hack I found on the Spark JIRA to avoid a problem with
NaN values making mixed column types)

I get:

TypeError: cannot create an RDD from type: <type 'list'>

Converting a smaller pandas dataframe (~2000 rows) works fine. Anyone had
this issue?


This is already a workaround; ideally I'd like to read the Spark dataframe
from a Hive table, but that is currently not an option for my setup.

I also tried reading the data into spark from a CSV using spark-csv.
Haven't been able to make this work as yet. I launch

$ pyspark --jars path/to/spark-csv_2.11-1.2.0.jar

and when I attempt to read the csv I get:

Py4JJavaError: An error occurred while calling o22.load. :
java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat ...
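In case it helps: that NoClassDefFoundError usually means commons-csv (a
dependency of spark-csv) isn't on the classpath, since --jars only ships the
one jar. Using --packages resolves transitive dependencies from Maven, and the
artifact should match your Spark build's Scala version (the pre-built 1.4.1
binaries are Scala 2.10). A sketch, with the version and path as examples only:

    # launch so that spark-csv *and* its dependencies are resolved:
    #   pyspark --packages com.databricks:spark-csv_2.10:1.2.0
    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("path/to/data.csv"))    # example path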

Other options I can think of:

- Convert my CSV to json (use Pig?) and read into Spark
- Read in using jdbc connect from postgres

But want to make sure I'm not misusing Spark or missing something obvious.

Thanks!

Charlie


Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-18 Thread Charlie Hack
Looks like Scala 2.11.6 and Java 1.7.0_79.

✔ ~
09:17 $ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_79).
Type in expressions to have them evaluated.
Type :help for more information.

scala>

✔ ~
09:26 $ echo $JAVA_HOME
/Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home



On Mon, Aug 17, 2015 at 11:11 PM, Alun Champion a...@achampion.net wrote:

 Yes, they both are set. Just recompiled and still no success, silent
 failure.
 Which versions of java and scala are you using?


 On 17 August 2015 at 19:59, Charlie Hack charles.t.h...@gmail.com wrote:

 I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1
 using these instructions:
 http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
 (using `$ sbt/sbt clean assembly`, with the additional step of downloading the
 proper sbt-launch.jar (0.13.7) from here
 http://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/
 and replacing the one that is in build/ as you noted). Have you set the
 SCALA_HOME and JAVA_HOME environment variables?

 On Mon, Aug 17, 2015 at 8:36 PM, Alun Champion a...@achampion.net
 wrote:

 Has anyone experienced issues running Spark 1.4.1 on a Mac OSX Yosemite?

 I've been running a standalone 1.3.1 fine but it failed when trying to
 run 1.4.1. (I also tried 1.4.0.)

 I've tried both the pre-built packages as well as compiling from source,
 both with the same results. (I can successfully compile with both mvn and
 sbt, after fixing the sbt.jar, which was corrupt.)
 After downloading/building spark and running ./bin/pyspark or
 ./bin/spark-shell it silently exits with a code 1.
 Creating a context in python I get: Exception: Java gateway process
 exited before sending the driver its port number

 I couldn't find any specific resolutions on the web.
 I did add 'pyspark-shell' to the PYSPARK_SUBMIT_ARGS but to no effect.

 Anyone have any further ideas I can explore?
 Cheers
-Alun.




 --
 # +17344761472





-- 
# +17344761472


Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-17 Thread Charlie Hack
I had success earlier today on OSX Yosemite 10.10.4 building Spark 1.4.1
using these instructions:
http://genomegeek.blogspot.com/2014/11/how-to-install-apache-spark-on-mac-os-x.html
(using `$ sbt/sbt clean assembly`, with the additional step of downloading the
proper sbt-launch.jar (0.13.7) from here
http://dl.bintray.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.7/
and replacing the one that is in build/ as you noted). Have you set the
SCALA_HOME and JAVA_HOME environment variables?

On Mon, Aug 17, 2015 at 8:36 PM, Alun Champion a...@achampion.net wrote:

 Has anyone experienced issues running Spark 1.4.1 on a Mac OSX Yosemite?

 I've been running a standalone 1.3.1 fine but it failed when trying to run
 1.4.1. (I also tried 1.4.0.)

 I've tried both the pre-built packages as well as compiling from source,
 both with the same results. (I can successfully compile with both mvn and
 sbt, after fixing the sbt.jar, which was corrupt.)
 After downloading/building spark and running ./bin/pyspark or
 ./bin/spark-shell it silently exits with a code 1.
 Creating a context in python I get: Exception: Java gateway process exited
 before sending the driver its port number

 I couldn't find any specific resolutions on the web.
 I did add 'pyspark-shell' to the PYSPARK_SUBMIT_ARGS but to no effect.

 Anyone have any further ideas I can explore?
 Cheers
-Alun.




-- 
# +17344761472