I concur with everything you state. In an ideal world, we would have a framework that offers a well-implemented hybrid hash join [1], one that takes advantage of all available memory and gracefully spills to disk once memory is no longer sufficient, such as the one used by Stratosphere [2].
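
To make the idea concrete, here is a rough sketch of what such a join looks like. This is my own simplified illustration, not the Stratosphere or SQL Server code; the class name, the fixed partition count and the row-count "memory budget" are made-up placeholders for real memory accounting:

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.BiConsumer;

/**
 * Sketch of a hybrid hash join: the build side is hash-partitioned,
 * partitions stay in memory while a (simplified) budget allows it and are
 * spilled to disk otherwise; spilled partitions are joined in a second pass.
 */
public class HybridHashJoinSketch {

  static final int NUM_PARTITIONS = 8;
  static final int IN_MEMORY_ROW_BUDGET = 100_000; // stand-in for real memory accounting

  public static void join(Iterable<String[]> build,   // rows as {key, payload}
                          Iterable<String[]> probe,
                          BiConsumer<String[], String[]> emit) throws IOException {
    // one hash table per partition while it still fits in memory, null once spilled
    @SuppressWarnings("unchecked")
    Map<String, List<String[]>>[] inMemory = new HashMap[NUM_PARTITIONS];
    Path[] buildSpill = new Path[NUM_PARTITIONS];
    Path[] probeSpill = new Path[NUM_PARTITIONS];
    for (int p = 0; p < NUM_PARTITIONS; p++) {
      inMemory[p] = new HashMap<>();
      buildSpill[p] = Files.createTempFile("build-" + p + "-", ".tmp");
      probeSpill[p] = Files.createTempFile("probe-" + p + "-", ".tmp");
    }

    // 1. build phase: fill per-partition hash tables, evict a partition once over budget
    int rowsInMemory = 0;
    for (String[] row : build) {
      int p = partition(row[0]);
      if (inMemory[p] == null) {            // partition already spilled
        appendRow(buildSpill[p], row);
        continue;
      }
      inMemory[p].computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
      if (++rowsInMemory > IN_MEMORY_ROW_BUDGET) {
        rowsInMemory -= spillPartition(inMemory, buildSpill, p);
      }
    }

    // 2. probe phase: stream the probe side once; rows for spilled partitions go to disk
    for (String[] row : probe) {
      int p = partition(row[0]);
      if (inMemory[p] != null) {
        for (String[] match : inMemory[p].getOrDefault(row[0], List.of())) {
          emit.accept(match, row);
        }
      } else {
        appendRow(probeSpill[p], row);
      }
    }

    // 3. second pass: join each spilled partition (assumed to fit in memory now;
    //    a real implementation would recurse or fall back to sort-merge)
    for (int p = 0; p < NUM_PARTITIONS; p++) {
      if (inMemory[p] != null) continue;
      Map<String, List<String[]>> table = new HashMap<>();
      for (String[] row : readRows(buildSpill[p])) {
        table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
      }
      for (String[] row : readRows(probeSpill[p])) {
        for (String[] match : table.getOrDefault(row[0], List.of())) {
          emit.accept(match, row);
        }
      }
    }
  }

  static int partition(String key) {
    return Math.floorMod(key.hashCode(), NUM_PARTITIONS);
  }

  // moves all rows of one partition to its spill file and frees the hash table
  static int spillPartition(Map<String, List<String[]>>[] inMemory, Path[] spill, int p)
      throws IOException {
    int evicted = 0;
    for (List<String[]> rows : inMemory[p].values()) {
      for (String[] row : rows) {
        appendRow(spill[p], row);
        evicted++;
      }
    }
    inMemory[p] = null; // mark partition as spilled
    return evicted;
  }

  static void appendRow(Path file, String[] row) throws IOException {
    Files.writeString(file, row[0] + "\t" + row[1] + "\n", StandardOpenOption.APPEND);
  }

  static List<String[]> readRows(Path file) throws IOException {
    List<String[]> rows = new ArrayList<>();
    for (String line : Files.readAllLines(file)) {
      rows.add(line.split("\t", 2));
    }
    return rows;
  }
}

The point is simply that the build side never has to fit in memory as a whole: the partitions that do fit are joined in one streaming pass, and only the rest is deferred to disk.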
Best,
Sebastian

[1] Graefe, Bunker, Cooper: Hash Joins and Hash Teams in Microsoft SQL Server
[2] https://github.com/stratosphere-eu/stratosphere/blob/master/pact/pact-runtime/src/main/java/eu/stratosphere/pact/runtime/hash/MutableHashTable.java

On 20.03.2013 11:52, Sean Owen wrote:
> This highlights the 'cheating' going on here to make this run so fast, and
> I completely endorse it. The join proceeds by side-loading half of the join
> into memory. This creates a new possible bottleneck: the amount of memory
> allocated to the mapper (/reducer).
>
> There are a number of things that could be done later to optimize this,
> though they might be kind of painful to implement. One is using floats
> instead of doubles. The other is precomputing which items are needed in
> which mappers and then only loading that subset into memory.
>
> In some of the development I have done, these and other tricks make these
> things run fast. The downside is just that it takes effort to optimize
> that far, and a lot of the optimization doesn't generalize to other
> algorithms.
>
> I still quite buy the argument that there are going to be better frameworks
> for this, where you don't have to fight and work around the paradigm to get
> things done. But it's quite possible to make these things hum on the
> commodity distributed paradigm of today.
>
> (PS: anyone at Hadoop Summit Europe today, let me know)
>
>
> On Wed, Mar 20, 2013 at 11:39 AM, Sebastian Schelter <s...@apache.org> wrote:
>
>> Hi JU,
>>
>> the job creates an OpenIntObjectHashMap<Vector> holding the feature
>> vectors as DenseVectors. In one map job, it is filled with the
>> user feature vectors, in the next with the item feature vectors.
>>
>> I used 4 gigabytes for a dataset with 1.8M users (using 20 features),
>> so I guess that 2-3 GB should be enough for your dataset.
>>
>> I used these settings:
>>
>> mapred.job.reuse.jvm.num.tasks=-1
>> mapred.tasktracker.map.tasks.maximum=1
>> mapred.child.java.opts=-Xmx4096m
>>
>> On 20.03.2013 10:01, Han JU wrote:
>>> Hi Sebastian,
>>>
>>> I've tried the svn trunk. Hadoop constantly complains about memory with
>>> "out of memory" errors.
>>> On the datanode there are 4 physical cores (16 logical cores with
>>> hyper-threading), so I set --numThreadsPerSolver to 16, and that seems
>>> to have a problem with memory.
>>> How did you set your mapred.child.java.opts? Given that we allow only
>>> one mapper, should that be nearly the whole system memory?
>>>
>>> Thanks!
>>>
>>>
>>> 2013/3/19 Sebastian Schelter <s...@apache.org>
>>>
>>>> Hi JU,
>>>>
>>>> We recently rewrote the factorization code, it should be much faster
>>>> now. You should use the current trunk, make Hadoop schedule only one
>>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>>> it reuse the JVMs and add the parameter --numThreadsPerSolver with the
>>>> number of cores that you want to use per machine (use all if you can).
>>>>
>>>> I got astonishing results running the code like this on a 26-machine
>>>> cluster on the Netflix dataset (100M datapoints) and the Yahoo Songs
>>>> dataset (700M datapoints).
>>>>
>>>> Let me know if you need more information.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> On 19.03.2013 15:31, Han JU wrote:
>>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>>> With a simple try on a small part of the data, it seems a larger alpha
>>>>> (~40) gets me a better result.
>>>>> Do you have an idea how long ParallelALS will take for the complete
>>>>> 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
>>>>> have at my disposal has 5 nodes and can factorize MovieLens 10M in
>>>>> about 13 minutes.
>>>>>
>>>>>
>>>>> 2013/3/18 Sebastian Schelter <s...@apache.org>
>>>>>
>>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>>> the authors introduce to measure the "confidence" in the observed
>>>>>> values:
>>>>>>
>>>>>> confidence = 1 + alpha * observed_value
>>>>>>
>>>>>> You can also change that formula in the code to something that you
>>>>>> find a better fit; the paper even suggests alternative variants.
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>>
>>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>>> Thanks for the quick responses.
>>>>>>>
>>>>>>> Yes, it's that dataset. What I'm using are triples of "user_id
>>>>>>> song_id play_times" for ~1M users. No audio features, just plain
>>>>>>> text triples.
>>>>>>>
>>>>>>> It seems to me that the "implicit feedback" paper matches this
>>>>>>> dataset well: no explicit ratings, but counts of how often a user
>>>>>>> listened to a song.
>>>>>>>
>>>>>>> Thank you Sean for the alpha value; I think they use big numbers
>>>>>>> because the values in their R matrix are big.
>>>>>>>
>>>>>>>
>>>>>>> 2013/3/18 Sebastian Schelter <ssc.o...@googlemail.com>
>>>>>>>
>>>>>>>> JU,
>>>>>>>>
>>>>>>>> are you referring to this dataset?
>>>>>>>>
>>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>>
>>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>>> One word of caution: there are at least two papers on ALS and they
>>>>>>>>> define lambda differently. I think you are talking about
>>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>>
>>>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>>>> to be too high for most data sets. After running some tests on
>>>>>>>>> common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>>
>>>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>>>> features, across a range to determine what's best.
>>>>>>>>>
>>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.fe...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm wondering whether someone has tried the ParallelALS implicit
>>>>>>>>>> feedback job on the Million Song Dataset? Any pointers on alpha
>>>>>>>>>> and lambda?
>>>>>>>>>>
>>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>>>> the r values in their matrix are. They say r is based on the time
>>>>>>>>>> units that users have watched a show, so maybe it's big.
>>>>>>>>>>
>>>>>>>>>> Many thanks!
>>>>>>>>>> --
>>>>>>>>>> *JU Han*
>>>>>>>>>>
>>>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>>>> *GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>>
>>>>>>>>>> +33 0619608888
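
To illustrate the side-loading that Sean and I describe above: one side of the join is read into an OpenIntObjectHashMap<Vector> of DenseVectors before the actual map phase starts. The snippet below is a cut-down sketch of that idea only, not the job from the trunk; the class name and the assumption that the factors sit in a SequenceFile of <IntWritable, VectorWritable> pairs are simplifications for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.math.map.OpenIntObjectHashMap;

/** Sketch: side-load one factor matrix entirely into memory for a map-side join. */
public class SideLoadedFactors {

  /** Reads id -> feature vector pairs from a SequenceFile into an in-memory map. */
  public static OpenIntObjectHashMap<Vector> load(Path factorsFile, Configuration conf)
      throws IOException {
    OpenIntObjectHashMap<Vector> factors = new OpenIntObjectHashMap<Vector>();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, factorsFile, conf);
    try {
      IntWritable id = new IntWritable();
      VectorWritable features = new VectorWritable();
      while (reader.next(id, features)) {
        // copy into a DenseVector so lookups during the map phase stay cheap
        factors.put(id.get(), new DenseVector(features.get()));
      }
    } finally {
      reader.close();
    }
    return factors;
  }
}

With the whole factor matrix held in memory like this, each mapper can join its input split against it directly, without any shuffle; the price is exactly the per-mapper memory discussed in this thread.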