Сергей,

A simple implementation would be to create a DataFrame of CVs by issuing a Spark SQL query against your Postgres database, persist it in memory, and then map F over it at query time, returning the top N with RDD.top <https://spark.apache.org/docs/1.3.1/api/scala/org/apache/spark/rdd/RDD.html#top(num:Int)(implicitord:Ordering[T]):Array[T]>. However, this might not meet your latency needs, depending on how expensive your scoring function F is (I imagine it is something like computing the overlap or Jaccard similarity between the vacancy's IDs and the set of IDs for each CV). It might be worth trying.
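In case it helps, here is a rough sketch of that approach in Scala against the 1.3 API. The table name "cvs", its column layout (a numeric id plus a text column of comma-separated fact IDs), the JDBC URL, and the Jaccard-style scoring function are all placeholders I made up; substitute your actual schema and F:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TopCvs {

  // Stand-in for F: Jaccard similarity between the vacancy's IDs and a CV's IDs.
  def score(vacancyIds: Set[Long], cvIds: Set[Long]): Double = {
    val union = (vacancyIds union cvIds).size.toDouble
    if (union == 0) 0.0 else (vacancyIds intersect cvIds).size / union
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cv-scoring"))
    val sqlContext = new SQLContext(sc)

    // Pull the CVs out of Postgres once and keep a compact (cvId, Set[factId])
    // representation cached in memory across queries.
    val cvs = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/recruiting?user=...&password=...",
      "dbtable" -> "cvs"))
    val cvIdSets = cvs.map { row =>
      (row.getLong(0), row.getString(1).split(',').map(_.trim.toLong).toSet)
    }.cache()
    cvIdSets.count() // force materialization up front, not on the first query

    // At query time: map F over the cached RDD and take the top N by score.
    val vacancyIds = Set(101L, 202L, 303L) // hypothetical query
    val top10 = cvIdSets
      .map { case (cvId, ids) => (score(vacancyIds, ids), cvId) }
      .top(10) // tuple ordering sorts by score first

    top10.foreach(println)
  }
}

Depending on how the facts are stored, you could also keep the ID sets as sorted arrays or bitsets to make the per-CV scoring cheaper.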
For example, following a similar strategy on a cluster with ~100GB RAM and ~160 cores, I get a sorted list of the top 10,000 documents from a set of 50 million documents in less than ten seconds per query. In my case, the cost of scoring each query-document pair is dominated by computing ~50 dot products of 100-dimensional vectors.

Best,
Alex

On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин <cpro...@gmail.com> wrote:
> Hi, ankur!
> Thanks for your reply!
> CVs are just a bunch of IDs; each ID represents an object of some class
> (e.g. class=JOB, object=SW Developer). We have already processed the texts and
> extracted all the facts, so we don't need to do any text processing in Spark,
> just run the scoring function over many, many CVs and return the top 10 matches.
>
> Best regards, Сергей Мелехин.
>
> 2015-05-25 16:28 GMT+10:00 ankur chauhan <an...@malloc64.com>:
>
>> Hi,
>>
>> I am sure you can use Spark for this, but it seems like a problem that
>> should be delegated to a text-based indexing technology such as Elasticsearch
>> or something else based on Lucene to serve the requests. Spark can be used to
>> prepare the data that is fed to the indexing service.
>>
>> Using Spark directly seems like it would involve a lot of repeated
>> computation between requests, which could be avoided.
>>
>> There are a bunch of Spark-Elasticsearch bindings that can be used to
>> make the process easier.
>>
>> Again, Spark SQL can help you convert most of the logic directly to Spark
>> jobs, but I would suggest exploring text-indexing technologies too.
>>
>> -- ankur
>> ------------------------------
>> From: Сергей Мелехин <cpro...@gmail.com>
>> Sent: 5/24/2015 10:59 PM
>> To: user@spark.apache.org
>> Subject: Using Spark like a search engine
>>
>> Hi!
>> We are developing a scoring system for recruitment. A recruiter enters
>> vacancy requirements, and we score tens of thousands of CVs against those
>> requirements and return, e.g., the top 10 matches.
>> We do not use full-text search and sometimes don't even filter the input CVs
>> prior to scoring (some vacancies have no mandatory requirements that
>> could be used as an effective filter).
>>
>> So we have a scoring function F(CV, VACANCY) that is currently implemented
>> in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed once
>> for every CV in the database. The VACANCY part is fixed for one query but
>> changes between queries, and there is very little we can process in advance.
>>
>> We expect to have about 100,000,000 CVs next year and do not expect
>> our current implementation to offer the desired low-latency response (<1 s) on
>> 100M CVs. So we are looking for a horizontally scalable and fault-tolerant
>> in-memory solution.
>>
>> Will Spark be useful for our task? All the tutorials I could find describe
>> stream processing or ML applications. What Spark extensions/backends could
>> be useful?
>>
>> With best regards, Sergey Melekhin