(Oh, by the way, I realize the original question was about Hadoop. I
can't read carefully.)

No, HDFS is not good for anything like random access. For input,
that's OK, because you don't need random access, so HDFS is just fine.
For output, if you are going to then serve these precomputed results
at run-time, they need to be in a container suited to quick random
access. There, a NoSQL store like HBase does sound appropriate. With a
little work, you can create an output format that writes directly into
it.
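
For illustration, here is a minimal sketch of that idea using HBase's
TableOutputFormat (the table name and job wiring are my own, not
something settled in this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class RecsToHBase {
  public static Job configureJob() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "recommendations" is a hypothetical table name
    conf.set(TableOutputFormat.OUTPUT_TABLE, "recommendations");
    Job job = Job.getInstance(conf, "write-recs-to-hbase");
    // The reducer then emits (ImmutableBytesWritable rowKey, Put) pairs,
    // and TableOutputFormat writes each Put into the table directly.
    job.setOutputFormatClass(TableOutputFormat.class);
    return job;
  }
}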

The drawbacks to this approach -- computing results in Hadoop -- are
that the results are inevitably a bit stale, not real-time, and that
you have to compute results for everyone, even though very few of
those results will ever be used. Of course, serving is easy and fast.
There are hybrid solutions, which I can talk to you about offline,
that get a bit of the best of both worlds.


On Sun, May 19, 2013 at 11:37 AM, Ahmet Yılmaz
<ahmetyilmazefe...@yahoo.com> wrote:
> Hi Sean,
> If I understood you correctly, you are saying that I will not need MySQL. But
> if I store my data on HDFS, will I be able to make fast queries such as
> "Return all the ratings of a specific user",
> which will be needed for showing the past ratings of a user?
>
> Ahmet
>
>
> ________________________________
>  From: Sean Owen <sro...@gmail.com>
> To: Mahout User List <user@mahout.apache.org>
> Sent: Sunday, May 19, 2013 9:26 PM
> Subject: Re: Which database should I use with Mahout
>
>
> I think everyone is agreeing that it is essential to access only
> in-memory information at run-time, yes, whatever that info may be.
> I don't think the original question was about Hadoop, but the answer
> is the same: Hadoop mappers just read the input serially. There
> is no advantage to a relational or NoSQL database; they're
> just overkill. HDFS is sufficient, and probably even the best of these
> at allowing fast serial access to the data.
>
> On Sun, May 19, 2013 at 11:19 AM, Tevfik Aytekin
> <tevfik.ayte...@gmail.com> wrote:
>> Hi Manuel,
>> But if one uses matrix factorization and stores the user and item
>> factors in memory, then there will be no database access during
>> recommendation (see the sketch below).
>> I thought that the original question was where to store the data and
>> how to feed it to Hadoop.
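>>
>> For instance, a toy sketch (my own illustration, not Mahout code) of
>> why scoring needs no database once the factors are in memory:
>>
>> // Toy example: the predicted rating is just the dot product of the
>> // user's and item's factor vectors -- one array walk, no query.
>> static double predictRating(double[] userFactors, double[] itemFactors) {
>>   double score = 0.0;
>>   for (int k = 0; k < userFactors.length; k++) {
>>     score += userFactors[k] * itemFactors[k];
>>   }
>>   return score;
>> }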
>>
>> On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
>> <manuel.blechschm...@gmx.de> wrote:
>>> Hi Tevfik,
>>> one request to the recommender could become more than 1,000 queries to the
>>> database, depending on which recommender you use and on the number of
>>> preferences for the given user.
>>>
>>> The problem is not if you are using SQL, NoSQL, or any other query 
>>> language. The problem is the latency of the answers.
>>>
>>> An average TCP packet within the same data center takes about 500 µs; a
>>> main memory reference takes about 0.1 µs. This means that the main memory
>>> of your Java process can be accessed roughly 5,000 times faster than any
>>> other process, such as a database connected via TCP/IP.
>>>
>>> http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
>>>
>>> Here is a screenshot showing that database communication is by far
>>> (99%) the slowest component of a recommender request:
>>>
>>> https://source.apaxo.de/MahoutDatabaseLowPerformance.png
>>>
>>> If you do not want to cache your data in your Java process, you can use a
>>> completely in-memory database technology like SAP HANA
>>> (http://www.saphana.com/welcome) or EXASOL (http://www.exasol.com/).
>>>
>>> Nevertheless, if you are using these, you do not need Mahout anymore.
>>>
>>> An architecture of a Mahout system can be seen here:
>>> https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png
>>>
>>> Hope that helps
>>>     Manuel
>>>
>>> Am 19.05.2013 um 19:20 schrieb Sean Owen:
>>>
>>>> I'm first saying that you really don't want to use the database as a
>>>> data model directly. It is far too slow.
>>>> Instead, you want to use a data model implementation that reads all of
>>>> the data, once, serially, into memory. In that case, it makes no
>>>> difference where the data is read from, because it is read just
>>>> once, serially. A file is just as good as a fancy database; in fact,
>>>> it's probably easier and faster.
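>>>>
>>>> For example, a minimal sketch using Mahout's FileDataModel (the file
>>>> name here is invented; the format is the standard userID,itemID,rating
>>>> CSV):
>>>>
>>>> import java.io.File;
>>>> import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
>>>> import org.apache.mahout.cf.taste.model.DataModel;
>>>>
>>>> public class LoadRatings {
>>>>   public static void main(String[] args) throws Exception {
>>>>     // FileDataModel reads ratings.csv once, serially, and keeps all
>>>>     // preferences in memory; nothing touches the disk afterwards.
>>>>     DataModel model = new FileDataModel(new File("ratings.csv"));
>>>>     System.out.println(model.getNumUsers() + " users loaded");
>>>>   }
>>>> }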
>>>>
>>>> On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
>>>> <tevfik.ayte...@gmail.com> wrote:
>>>>> Thanks Sean, but I could not follow your answer. Can you please explain it
>>>>> again?
>>>>>
>>>>>
>>>>> On Sun, May 19, 2013 at 8:00 PM, Sean Owen <sro...@gmail.com> wrote:
>>>>>> It doesn't matter, in the sense that it is never going to be fast
>>>>>> enough for real-time at any reasonable scale if actually run off a
>>>>>> database directly. One operation results in thousands of queries. It's
>>>>>> going to read data into memory anyway and cache it there. So, whatever
>>>>>> is easiest for you. The simplest solution is a file.
>>>>>>
>>>>>> On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
>>>>>> <ahmetyilmazefe...@yahoo.com> wrote:
>>>>>>> Hi,
>>>>>>> I would like to use Mahout to make recommendations on my web site.
>>>>>>> Since the data is, hopefully, going to be big, I plan to use the Hadoop
>>>>>>> implementations of the recommender algorithms.
>>>>>>>
>>>>>>> I'm currently storing the data in MySQL. Should I continue with it, or
>>>>>>> should I switch to a NoSQL database such as MongoDB, or something else?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ahmet
>>>
>>> --
>>> Manuel Blechschmidt
>>> M.Sc. IT Systems Engineering
>>> Dortustr. 57
>>> 14467 Potsdam
>>> Mobil: 0173/6322621
>>> Twitter: http://twitter.com/Manuel_B
>>>
