Using the Hadoop version of a Mahout recommender will create some number of recs 
for all users as its output. Sean is, I think, talking about Myrrix, which uses 
matrix factorization to get much smaller models and so can calculate the recs at 
runtime for fairly large user sets.
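
For context, the reason a factorized model can afford to compute recommendations 
at request time is that scoring collapses to a dot product over small factor 
vectors. A toy Java sketch (nothing here is Myrrix's actual API):

/**
 * Toy illustration: once each user and item is reduced to a small factor
 * vector, a recommendation score is just a dot product, cheap enough to
 * compute on the fly for each request.
 */
public class FactorScorer {

  /** Score one item for one user as the dot product of their factor vectors. */
  public static double score(double[] userFactors, double[] itemFactors) {
    double sum = 0.0;
    for (int i = 0; i < userFactors.length; i++) {
      sum += userFactors[i] * itemFactors[i];
    }
    return sum;
  }
}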

However, if you are using Mahout and Hadoop, the question is how to store and 
look up recommendations in the quickest scalable way. You will have a user ID, 
and perhaps an item ID, as a key to the list of recommendations. The fastest 
thing to do is to keep a hashmap in memory, perhaps read in from HDFS. Remember 
that Mahout will output the recommendations with internal Mahout IDs, so you 
will have to replace these in the data with your actual user and item IDs.
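
A minimal sketch of that approach, assuming the usual RecommenderJob text output 
(userID<TAB>[itemID:score,...]); the mahoutToMyUser/mahoutToMyItem maps are 
placeholders for whatever ID translation you used on the way in:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecLoader {

  /**
   * Reads RecommenderJob text output from an HDFS directory into an in-memory
   * map keyed by your own user ID, translating Mahout's numeric IDs back to
   * your own user and item IDs as it goes.
   */
  public static Map<String, List<String>> loadRecs(String dir,
      Map<Long, String> mahoutToMyUser, Map<Long, String> mahoutToMyItem) throws Exception {
    Map<String, List<String>> recs = new HashMap<>();
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus part : fs.listStatus(new Path(dir))) {
      if (!part.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS and other non-data files
      }
      try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(part.getPath())))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] cols = line.split("\t");
          String myUser = mahoutToMyUser.get(Long.parseLong(cols[0]));
          List<String> items = new ArrayList<>();
          // strip the surrounding [ ] and translate each itemID:score pair
          for (String pair : cols[1].substring(1, cols[1].length() - 1).split(",")) {
            String[] idAndScore = pair.split(":");
            items.add(mahoutToMyItem.get(Long.parseLong(idAndScore[0])) + ":" + idAndScore[1]);
          }
          recs.put(myUser, items);
        }
      }
    }
    return recs;
  }
}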

I use a NoSQL DB, either MongoDB or Cassandra, but others are fine too, even 
MySQL if you can scale it to meet your needs. I end up with two tables: one has 
my user ID as the key and recommendations with my item IDs, either ordered or 
with strengths. The second table has my item ID as the key with a list of 
similar items (again sorted or with strengths). At runtime I may have both a 
user ID and an item ID as context, so I get a list from both tables and combine 
them.
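
A rough sketch of that two-table layout and the runtime merge, assuming a recent 
MongoDB Java driver; the collection and field names are just examples, and 
summing strengths is only one of many possible merge rules:

import static com.mongodb.client.model.Filters.eq;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.bson.Document;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

public class RecLookup {

  // user_recs:     { userId: ..., recs:    [{itemId: ..., strength: ...}, ...] }
  // similar_items: { itemId: ..., similar: [{itemId: ..., strength: ...}, ...] }
  private final MongoCollection<Document> userRecs;
  private final MongoCollection<Document> similarItems;

  public RecLookup(String uri) {
    MongoDatabase db = MongoClients.create(uri).getDatabase("recs");
    userRecs = db.getCollection("user_recs");
    similarItems = db.getCollection("similar_items");
  }

  /** Fetch both lists and merge them by summing strengths (one possible rule). */
  public Map<String, Double> recommend(String userId, String contextItemId) {
    Map<String, Double> merged = new HashMap<>();
    addAll(merged, userRecs.find(eq("userId", userId)).first(), "recs");
    addAll(merged, similarItems.find(eq("itemId", contextItemId)).first(), "similar");
    return merged; // sort by strength descending before serving
  }

  @SuppressWarnings("unchecked")
  private void addAll(Map<String, Double> merged, Document doc, String listField) {
    if (doc == null) {
      return; // no row stored for this key
    }
    for (Document entry : (List<Document>) doc.get(listField)) {
      merged.merge(entry.getString("itemId"), entry.getDouble("strength"), Double::sum);
    }
  }
}

A Cassandra or MySQL version would look much the same: two tables keyed by user 
ID and item ID, with the recommendation list stored alongside the key.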

I use a DB for many reasons and let it handle the caching, so I never need to 
worry about memory management. If you have scaled your DB properly, the lookups 
will effectively behave like an in-memory hashmap, with indexed keys for the 
IDs. The DB can be scaled as your user base grows, when needed, without 
affecting the rest of the calculation pipeline. Yes, there will be overhead due 
to network traffic in a cluster, but the flexibility is worth it for me. If high 
availability is important, you can spread your DB cluster over multiple data 
centers without affecting the API for serving recommendations. I set up the 
recommendation calculation to run continuously in the background, replacing 
values in the two tables as fast as I can. This lets you scale update speed (how 
many machines in the Mahout/Hadoop cluster) independently from lookup 
performance (how many machines in your DB cluster, and how much memory the DB 
machines have).
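
And a sketch of that background refresh, again assuming a recent MongoDB Java 
driver and the same collection layout as the lookup sketch above:

import static com.mongodb.client.model.Filters.eq;

import java.util.List;
import java.util.Map;

import org.bson.Document;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.ReplaceOptions;

public class RecRefresher {

  /**
   * Overwrites each user's recommendation list as fresh results come out of
   * the Mahout/Hadoop run. Upserting row by row keeps the serving collection
   * available for lookups while it is being refreshed.
   */
  public static void refresh(MongoCollection<Document> userRecs,
                             Map<String, List<Document>> freshRecsByUser) {
    ReplaceOptions upsert = new ReplaceOptions().upsert(true);
    for (Map.Entry<String, List<Document>> e : freshRecsByUser.entrySet()) {
      Document doc = new Document("userId", e.getKey()).append("recs", e.getValue());
      userRecs.replaceOne(eq("userId", e.getKey()), doc, upsert);
    }
  }
}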
 
On May 19, 2013, at 11:45 AM, Manuel Blechschmidt <manuel.blechschm...@gmx.de> 
wrote:

Hi Tevfik,
I am working with MySQL, but I would guess that HDFS, as Sean suggested, would 
be a good idea as well.

There is also a project called Sqoop which can be used to transfer data from 
relational databases to Hadoop.

http://sqoop.apache.org/

Scribe might also be an option for transferring a lot of data:
https://github.com/facebook/scribe#readme

I would suggest that you just start with the technology that you know best and 
then solve problems as soon as you run into them.

/Manuel

Am 19.05.2013 um 20:26 schrieb Sean Owen:

> I think everyone is agreeing that it is essential to only access
> information in memory at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but, the answer
> is the same: Hadoop mappers are just reading the input serially. There
> is no advantage to a relational database or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even best of these at
> allowing fast serial access to the data.
> 
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> <tevfik.ayte...@gmail.com> wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory then there will be no database access during
>> recommendation.
>> I thought that the original question was where to store the data and
>> how to give it to hadoop.
>> 
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>> <manuel.blechschm...@gmx.de> wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1,000 queries to the 
>>> database, depending on which recommender you use and the number of 
>>> preferences for the given user.
>>> 
>>> The problem is not whether you are using SQL, NoSQL, or any other query 
>>> language. The problem is the latency of the answers.
>>> 
>>> An average TCP round trip within the same data center takes about 500 µs. A 
>>> main memory reference takes 0.1 µs. This means the main memory of your Java 
>>> process can be accessed roughly 5,000 times faster than any other process, 
>>> such as a database connected via TCP/IP.
>>> 
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>> 
>>> Here you can see a screenshot that shows that database communication is by 
>>> far (99%) the slowest component of a recommender request:
>>> 
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>> 
>>> If you do not want to cache your data in your Java process, you can use a 
>>> completely in-memory database technology like SAP HANA 
>>> http://www.saphana.com/welcome or EXASOL http://www.exasol.com/
>>> 
>>> Nevertheless if you are using these you do not need Mahout anymore.
>>> 
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>> 
>>> Hope that helps
>>>   Manuel
>>> 
>>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>> 
>>>> I'm first saying that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. And in that case, it makes no
>>>> difference where the data is being read from, because it is read just
>>>> once, serially. A file is just as good as a fancy database. In fact
>>>> it's probably easier and faster.
>>>> 
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>> <tevfik.ayte...@gmail.com> wrote:
>>>>> Thanks Sean, but I could not get your answer. Can you please explain it 
>>>>> again?
>>>>> 
>>>>> 
>>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <sro...@gmail.com> wrote:
>>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>>> database directly. One operation results in thousands of queries. It's
>>>>>> going to read data into memory anyway and cache it there. So, whatever
>>>>>> is easiest for you. The simplest solution is a file.
>>>>>> 
>>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz
>>>>>> <ahmetyilmazefe...@yahoo.com> wrote:
>>>>>>> Hi,
>>>>>>> I would like to use Mahout to make recommendations on my web site. 
>>>>>>> Since the data is going to be big, hopefully, I plan to use Hadoop 
>>>>>>> implementations of the recommender algorithms.
>>>>>>> 
>>>>>>> I'm currently storing the data in MySQL. Should I continue with it, or 
>>>>>>> should I switch to a NoSQL database such as MongoDB, or something else?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Ahmet
>>> 
>>> --
>>> Manuel Blechschmidt
>>> M.Sc. IT Systems Engineering
>>> Dortustr. 57
>>> 14467 Potsdam
>>> Mobil: 0173/6322621
>>> Twitter: http://twitter.com/Manuel_B
>>> 

-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B

