I concur with everything that you state. In an ideal world, we would have a
framework that offers a well-implemented hybrid hash join [1], one that takes
advantage of all available memory and gracefully spills to disk once the
available memory is no longer enough, such as the one used by Stratosphere [2].
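
To sketch the idea for anyone who hasn't seen it (toy Java code with made-up
names, nothing like the actual Stratosphere implementation): the build side is
hash-partitioned, as many partitions as fit are kept in memory, and the
remaining partitions are spilled and joined in a second pass.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy hybrid hash join. The build side is hash-partitioned; partitions that
// fit into the "memory budget" stay in memory as hash tables, the rest are
// spilled (lists stand in for spill files here) and joined in a second pass.
public class ToyHybridHashJoin {

  private static final int NUM_PARTITIONS = 4;
  private static final int PARTITIONS_IN_MEMORY = 2; // pretend only these fit in RAM

  private static int partitionOf(int key) {
    return Math.floorMod(key, NUM_PARTITIONS);
  }

  public static void main(String[] args) {
    int[][] build = {{1, 10}, {2, 20}, {7, 70}, {8, 80}}; // (key, buildValue)
    int[][] probe = {{1, 100}, {7, 700}, {5, 500}};       // (key, probeValue)

    // Phase 1: partition the build side, keep the first partitions in memory,
    // spill the rest.
    Map<Integer, Map<Integer, Integer>> inMemory = new HashMap<>();
    Map<Integer, List<int[]>> spilledBuild = new HashMap<>();
    for (int[] rec : build) {
      int p = partitionOf(rec[0]);
      if (p < PARTITIONS_IN_MEMORY) {
        inMemory.computeIfAbsent(p, k -> new HashMap<>()).put(rec[0], rec[1]);
      } else {
        spilledBuild.computeIfAbsent(p, k -> new ArrayList<>()).add(rec);
      }
    }

    // Phase 2: probe. In-memory partitions are joined immediately, probe
    // records that hit a spilled partition are spilled as well.
    Map<Integer, List<int[]>> spilledProbe = new HashMap<>();
    for (int[] rec : probe) {
      int p = partitionOf(rec[0]);
      if (p < PARTITIONS_IN_MEMORY) {
        Integer buildValue = inMemory.getOrDefault(p, new HashMap<>()).get(rec[0]);
        if (buildValue != null) {
          System.out.println("match: key=" + rec[0] + " build=" + buildValue + " probe=" + rec[1]);
        }
      } else {
        spilledProbe.computeIfAbsent(p, k -> new ArrayList<>()).add(rec);
      }
    }

    // Phase 3: second pass over the spilled partitions, one at a time, so only
    // a single partition's build table has to fit in memory.
    for (Map.Entry<Integer, List<int[]>> e : spilledBuild.entrySet()) {
      Map<Integer, Integer> table = new HashMap<>();
      for (int[] rec : e.getValue()) {
        table.put(rec[0], rec[1]);
      }
      for (int[] rec : spilledProbe.getOrDefault(e.getKey(), new ArrayList<>())) {
        Integer buildValue = table.get(rec[0]);
        if (buildValue != null) {
          System.out.println("match: key=" + rec[0] + " build=" + buildValue + " probe=" + rec[1]);
        }
      }
    }
  }
}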

Best,
Sebastian

[1] Graefe, Bunker, Cooper: Hash Joins and Hash Teams in Microsoft SQL
Server
[2]
https://github.com/stratosphere-eu/stratosphere/blob/master/pact/pact-runtime/src/main/java/eu/stratosphere/pact/runtime/hash/MutableHashTable.java



On 20.03.2013 11:52, Sean Owen wrote:
> This highlights the 'cheating' going on here to make this run so fast, and
> I completely endorse it. The join proceeds by side-loading half of the join
> into memory. This creates a possible new bottleneck: the amount of memory
> allocated to the mapper (or reducer).
> 
> There are a number of things that could be done later to optimize this,
> though they might be somewhat painful to implement. One is using floats
> instead of doubles. Another is precomputing which items are needed by which
> mappers and then loading only that subset into memory.
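
To make Sean's second idea concrete, here is a toy sketch (plain Java; the
hard-coded sample data stands in for the precomputed ID set and the feature
file, which a small preprocessing job would produce in practice):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of "only load what this mapper needs", with floats instead of doubles.
public class FilteredSideLoadSketch {

  public static void main(String[] args) {
    // Pretend these are the only item IDs occurring in this mapper's split.
    Set<Integer> neededItemIDs = new HashSet<>(Arrays.asList(3, 7));

    // Pretend this is the full item-feature file (itemID -> features as doubles).
    Map<Integer, double[]> allItemFeatures = new HashMap<>();
    allItemFeatures.put(3, new double[] {0.1, 0.2, 0.3});
    allItemFeatures.put(5, new double[] {0.4, 0.5, 0.6});
    allItemFeatures.put(7, new double[] {0.7, 0.8, 0.9});

    // Load only the needed subset, downcasting to float to halve the footprint.
    Map<Integer, float[]> sideData = new HashMap<>();
    for (Map.Entry<Integer, double[]> e : allItemFeatures.entrySet()) {
      if (!neededItemIDs.contains(e.getKey())) {
        continue; // this mapper never probes this item
      }
      double[] d = e.getValue();
      float[] f = new float[d.length];
      for (int i = 0; i < d.length; i++) {
        f[i] = (float) d[i];
      }
      sideData.put(e.getKey(), f);
    }
    System.out.println("loaded " + sideData.size() + " of " + allItemFeatures.size() + " vectors");
  }
}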
> 
> In some of the development I have done, these and other tricks make these
> things run fast. The downside is just that it takes effort to optimize that
> far, and a lot of the optimization doesn't generalize to other algorithms.
> 
> I still quite buy the argument that there are going to be better frameworks
> for this, where you don't fight and work around the paradigm to get things
> done. But it's quite possible to make these things hum on the commodity
> distributed paradigm of today.
> 
> (PS: anyone at Hadoop Summit Europe today, let me know)
> 
> 
> On Wed, Mar 20, 2013 at 11:39 AM, Sebastian Schelter <s...@apache.org> wrote:
> 
>> Hi JU,
>>
>> the job creates an OpenIntObjectHashMap<Vector> holding the feature
>> vectors as DenseVectors. In one map job, it is filled with the
>> user-feature vectors, in the next one with the item-feature vectors.
>>
>> I used 4 gigabytes for a dataset with 1.8M users (using 20 features),
>> so I guess that 2-3 GB should be enough for your dataset.
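
To spell that out a bit, here is a minimal sketch of what such a setup()
method looks like. The "featurePath" configuration key is invented for the
example; the actual job wires the path in itself and reads
<IntWritable, VectorWritable> pairs from a SequenceFile.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

// Sketch of the side-loading step: setup() pulls one factor matrix into memory.
public class SideLoadingMapperSketch
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private final OpenIntObjectHashMap<Vector> features = new OpenIntObjectHashMap<Vector>();

  @Override
  protected void setup(Context ctx) throws IOException, InterruptedException {
    Configuration conf = ctx.getConfiguration();
    // "featurePath" is a made-up key for this sketch; the real job passes the
    // location of the user (or item) factor matrix itself.
    Path featurePath = new Path(conf.get("featurePath"));
    FileSystem fs = FileSystem.get(conf);

    IntWritable id = new IntWritable();
    VectorWritable vector = new VectorWritable();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, featurePath, conf);
    try {
      while (reader.next(id, vector)) {
        // copy into a DenseVector so the entry survives the writable being reused
        features.put(id.get(), new DenseVector(vector.get()));
      }
    } finally {
      reader.close();
    }
  }
}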
>>
>> I used these settings:
>>
>> mapred.job.reuse.jvm.num.tasks=-1
>> mapred.tasktracker.map.tasks.maximum=1
>> mapred.child.java.opts=-Xmx4096m
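
Put together, an invocation would look roughly like this (placeholder paths
and parameter values; double-check the option names against the trunk you
build):

bin/mahout parallelALS \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.child.java.opts=-Xmx4096m \
  --input /path/to/ratings \
  --output /path/to/factorization \
  --tempDir /path/to/tmp \
  --numFeatures 20 \
  --numIterations 10 \
  --lambda 0.065 \
  --alpha 40 \
  --implicitFeedback true \
  --numThreadsPerSolver 16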
>>
>> On 20.03.2013 10:01, Han JU wrote:
>>> Hi Sebastian,
>>>
>>> I've tried the svn trunk. Hadoop constantly complains about memory with
>>> "out of memory" errors.
>>> On the datanode there are 4 physical cores, and with hyper-threading it has
>>> 16 logical cores, so I set --numThreadsPerSolver to 16, which seems to
>>> cause a memory problem.
>>> How did you set your mapred.child.java.opts? Given that we allow only one
>>> mapper, should that be nearly the whole size of the system memory?
>>>
>>> Thanks!
>>>
>>>
>>> 2013/3/19 Sebastian Schelter <s...@apache.org>
>>>
>>>> Hi JU,
>>>>
>>>> We recently rewrote the factorization code; it should be much faster
>>>> now. You should use the current trunk, make Hadoop schedule only one
>>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>>>> number of cores that you want to use per machine (use all if you can).
>>>>
>>>> I got astonishing results running the code like this on a 26-machine
>>>> cluster on the Netflix dataset (100M data points) and the Yahoo Songs
>>>> dataset (700M data points).
>>>>
>>>> Let me know if you need more information.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 19.03.2013 15:31, Han JU wrote:
>>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>>> With a quick try on a small part of the data, it seems a larger alpha
>>>>> (~40) gets me a better result.
>>>>> Do you have an idea how long ParallelALS will take for the complete
>>>>> 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
>>>>> have at my disposal has 5 nodes and can factorize the MovieLens 10M
>>>>> dataset in about 13 minutes.
>>>>>
>>>>>
>>>>> 2013/3/18 Sebastian Schelter <s...@apache.org>
>>>>>
>>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>>> the authors introduce to measure the "confidence" in the observed
>>>>>> values:
>>>>>>
>>>>>> confidence = 1 + alpha * observed_value
>>>>>>
>>>>>> You can also change that formula in the code to something that you find
>>>>>> more fitting; the paper even suggests alternative variants.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>>> Thanks for the quick responses.
>>>>>>>
>>>>>>> Yes, it's that dataset. What I'm using is triplets of "user_id song_id
>>>>>>> play_times", for ~1M users. No audio features, just plain-text triples.
>>>>>>>
>>>>>>> It seems to me that the paper on "implicit feedback" matches this
>>>>>>> dataset well: no explicit ratings, but the number of times a song was
>>>>>>> listened to.
>>>>>>>
>>>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>>>> because the values in their R matrix are big.
>>>>>>>
>>>>>>>
>>>>>>> 2013/3/18 Sebastian Schelter <ssc.o...@googlemail.com>
>>>>>>>
>>>>>>>> JU,
>>>>>>>>
>>>>>>>> are you referring to this dataset?
>>>>>>>>
>>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>>
>>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>>> One word of caution is that there are at least two papers on ALS, and
>>>>>>>>> they define lambda differently. I think you are talking about
>>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>>
>>>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>>>> to be too high for most data sets. After running some tests on common
>>>>>>>>> data sets, alpha=1 looks much better. YMMV.
>>>>>>>>>
>>>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>>>> features, across a range to determine what's best.
>>>>>>>>>
>>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.fe...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm wondering whether someone has tried the ParallelALS job with
>>>>>>>>>> implicit feedback on the Million Song Dataset? Any pointers on alpha
>>>>>>>>>> and lambda?
>>>>>>>>>>
>>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>>>> their r values in the matrix are. They said r is based on the number
>>>>>>>>>> of time units that users have watched a show, so maybe it's big.
>>>>>>>>>>
>>>>>>>>>> Many thanks!
>>>>>>>>>> --
>>>>>>>>>> JU Han
>>>>>>>>>>
>>>>>>>>>> UTC   -  Université de Technologie de Compiègne
>>>>>>>>>> GI06 - Fouille de Données et Décisionnel
>>>>>>>>>>
>>>>>>>>>> +33 0619608888
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
