Сергей,
A simple implementation would be to create a DataFrame of CVs by issuing a
Spark SQL query against your Postgres database, persist it in memory, and
then map F over it at query time and return the top N via RDD.top
<https://spark.apache.org/docs/1.3.1/api/scala/org/apache/spark/rdd/RDD.html#top(num:Int)(implicitord:Ordering[T]):Array[T]>
on the mapped data structure. However, this might not meet your latency
needs, depending on how expensive your scoring function F is (I imagine it's
something like computing the overlap or Jaccard similarity between the
vacancy IDs and the set of IDs for each CV). It might be worth trying.
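
For concreteness, here is a rough sketch of that approach in Scala (Spark
1.3-style API). The table and column names, the comma-separated encoding of
the fact IDs, and the Jaccard-style scoring are assumptions on my part, so
treat it as a starting point and adapt it to your schema:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch only: assumes a Postgres table "cvs" with a bigint "id" column and a
// text column "fact_ids" holding comma-separated fact IDs for each CV, and
// that the Postgres JDBC driver is on the executor classpath.
object CvScoringSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cv-scoring"))
    val sqlContext = new SQLContext(sc)

    // Pull the CVs out of Postgres once via the JDBC data source.
    val cvs = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/recruiting?user=...&password=...",
      "dbtable" -> "cvs"))

    // Materialize (cvId, Set[factId]) pairs and keep them in memory.
    val cvSets = cvs.rdd.map { row =>
      val id  = row.getLong(0)
      val ids = row.getString(1).split(',').filter(_.nonEmpty).map(_.toInt).toSet
      (id, ids)
    }.cache()
    cvSets.count() // force the cache to populate before the first query

    // At query time: score every CV against the vacancy's ID set, take the top N.
    val vacancyIds = Set(101, 202, 303) // placeholder vacancy requirements
    def jaccard(a: Set[Int], b: Set[Int]): Double =
      if (a.isEmpty || b.isEmpty) 0.0 else (a & b).size.toDouble / (a | b).size

    val top10 = cvSets
      .map { case (id, ids) => (jaccard(vacancyIds, ids), id) }
      .top(10) // tuple ordering sorts by score first, so this is the 10 best
    top10.foreach(println)
  }
}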

For example, following a similar strategy on a cluster with ~100GB RAM and
~160 cores, I get a sorted list of the top 10,000 documents from a set of 50
million documents in less than ten seconds for a query. In my case, the
cost of scoring each query-document pair is dominated by computing ~50 dot
products of 100-dimensional vectors.
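
For what it's worth, the per-document scoring in my case boils down to
something like the sketch below (illustrative only; the sum aggregation and
the split of the query into ~50 vectors are specific to my application):

// Illustrative only: per-document scoring dominated by ~50 dot products of
// 100-dimensional vectors, aggregated here with a plain sum.
def dot(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < a.length) { s += a(i) * b(i); i += 1 }
  s
}

def score(queryVecs: Array[Array[Double]], docVec: Array[Double]): Double =
  queryVecs.map(dot(_, docVec)).sum

// On a cached RDD[(docId, docVec)] the query itself is then just:
//   docs.map { case (id, v) => (score(queryVecs, v), id) }.top(10000)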

Best,
Alex

On Mon, May 25, 2015 at 2:59 AM, Сергей Мелехин <cpro...@gmail.com> wrote:

> Hi, ankur!
> Thanks for your reply!
> CVs are just a bunch of IDs, each ID representing some object of some
> class (e.g. class=JOB, object=SW Developer). We have already processed the
> texts and extracted all the facts, so we don't need to do any text
> processing in Spark, just run the scoring function on many, many CVs and
> return the top 10 matches.
>
> Best regards, Сергей Мелехин.
>
> 2015-05-25 16:28 GMT+10:00 ankur chauhan <an...@malloc64.com>:
>
>> Hi,
>>
>> I am sure you can use Spark for this, but it seems like a problem that
>> should be delegated to a text-based indexing technology like Elasticsearch
>> or something else based on Lucene to serve the requests. Spark can be used
>> to prepare the data that is fed to the indexing service.
>>
>> Using Spark directly, it seems there would be a lot of repeated
>> computation between requests that could be avoided.
>>
>> There are a bunch of Spark-Elasticsearch bindings that can be used to
>> make the process easier.
>>
>> Again, Spark SQL can help you convert most of the logic directly to
>> Spark jobs, but I would suggest exploring text indexing technologies too.
>>
>> -- ankur
>> ------------------------------
>> From: Сергей Мелехин <cpro...@gmail.com>
>> Sent: 5/24/2015 10:59 PM
>> To: user@spark.apache.org
>> Subject: Using Spark like a search engine
>>
>> Hi!
>> We are developing a scoring system for recruitment. A recruiter enters
>> vacancy requirements, and we score tens of thousands of CVs against these
>> requirements, returning e.g. the top 10 matches.
>> We do not use full-text search and sometimes don't even filter the input
>> CVs prior to scoring (some vacancies do not have mandatory requirements
>> that can be used effectively as a filter).
>>
>> So we have a scoring function F(CV, VACANCY) that is currently implemented
>> in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed
>> once for every CV in the database. The VACANCY part is fixed for one query
>> but changes between queries, and there's very little we can precompute.
>>
>> We expect to have about 100,000,000 CVs next year, and do not expect
>> our current implementation to offer the desired low-latency response (<1 s)
>> on 100M CVs. So we are looking for a horizontally scalable and
>> fault-tolerant in-memory solution.
>>
>> Will Spark be useful for our task? All the tutorials I could find describe
>> stream processing or ML applications. Which Spark extensions/backends could
>> be useful?
>>
>>
>> With best regards, Sergey Melekhin
>>
>
>
