I have several in-memory implementations almost ready to publish. These provide a straightforward implementation of the original SSVD algorithm from the Martinsson and Halko paper, a version that avoids the QR and LQ decompositions, and an out-of-core version that keeps only a moderately sized amount of data in memory at any time.
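For reference, the core of the algorithm from that paper fits in a few lines. The following is a minimal NumPy sketch of the basic randomized SVD (random projection, QR, small SVD) -- an illustration only, not the Mahout code or any of the implementations mentioned above:

```python
import numpy as np

def ssvd(a, k, p=10, seed=0):
    """Basic stochastic SVD sketch: sample the range of A with a random
    projection, orthonormalize the sample, then decompose the small
    projected matrix. k = target rank, p = oversampling."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    omega = rng.standard_normal((n, k + p))  # random test matrix Omega
    y = a @ omega                            # Y = A*Omega samples range(A)
    q, _ = np.linalg.qr(y)                   # orthonormal basis Q for Y
    b = q.T @ a                              # small (k+p) x n matrix B = Q'A
    u_b, s, vt = np.linalg.svd(b, full_matrices=False)
    return (q @ u_b)[:, :k], s[:k], vt[:k]   # truncate to rank k
```

In exact arithmetic this recovers a rank-k matrix exactly whenever k+p covers its rank; for noisier inputs the paper's power-iteration refinement improves accuracy.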
My hangup at this point is getting my Cholesky decomposition reliable for
rank-deficient inputs.

On Tue, Aug 16, 2011 at 1:57 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
> I have decided to do something similar: do the pipeline in memory and not
> invoke map-reduce for small datasets, which I think will handle the issue.
> Thanks again for clearing that up.
> Esh
>
> On Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote:
>
> > PPS Mahout also has an in-memory SVD solver migrated from Colt, which is
> > BTW what i am using in local tests to assert SSVD results. Although it
> > starts to feel slow pretty quickly and sometimes produces errors (i
> > think it starts feeling slow at 10k x 1k inputs)
> >
> > On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> >> also, with data as small as this, the stochastic noise ratio would be
> >> significant (as in the law of large numbers), so if you really think
> >> you might need to handle inputs that small, you'd better write a
> >> pipeline that detects this as a corner case and just runs an in-memory
> >> decomposition. In fact, i think dense matrices up to 100,000 rows can
> >> be quite comfortably computed in-memory (Ted knows much more on the
> >> practical limits of tools like R or even ones as simple as apache.math)
> >>
> >> -d
> >>
> >> On Tue, Aug 16, 2011 at 12:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >>
> >>> yep, that's what i figured. you have 293 rows or so, but distributed
> >>> between 7 files, so they are small and would generate several mappers,
> >>> and there are probably some among them with a small row count.
> >>>
> >>> See my other email. This method is for big data, big files. If you
> >>> want to automate handling of small files, you can probably do some
> >>> intermediate step with some heuristic that merges together all files,
> >>> say, shorter than 1Mb.
> >>>
> >>> -d
> >>>
> >>> On Tue, Aug 16, 2011 at 12:43 PM, Eshwaran Vijaya Kumar
> >>> <[email protected]> wrote:
> >>>
> >>>> The number of mappers is 7. DFS block size is 128 MB; the reason I
> >>>> think there are 7 mappers being used is that I am using a Pig script
> >>>> to generate the sequence file of Vectors, and that script generates
> >>>> 7 reducers. I am not setting minSplitSize, though.
> >>>>
> >>>> On Aug 16, 2011, at 12:15 PM, Dmitriy Lyubimov wrote:
> >>>>
> >>>>> Hm. This is not common at all.
> >>>>>
> >>>>> This error would surface if a map split can't accumulate at least
> >>>>> k+p rows.
> >>>>>
> >>>>> That's another requirement which is usually a non-issue -- any
> >>>>> precomputed split must contain at least k+p rows, which would
> >>>>> normally be violated only if the matrix is extra wide and dense, in
> >>>>> which case --minSplitSize must be used to avoid this.
> >>>>>
> >>>>> But in your case, the matrix is so small it must fit in one split.
> >>>>> Can you please verify how many mappers the job generates?
> >>>>>
> >>>>> if it's more than 1, then something fishy is going on with hadoop.
> >>>>> Otherwise, something is fishy with the input (it's either not 293
> >>>>> rows, or k+p is more than 293).
> >>>>>
> >>>>> -d
> >>>>>
> >>>>> On Tue, Aug 16, 2011 at 11:39 AM, Eshwaran Vijaya Kumar
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>>>
> >>>>>> On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
> >>>>>>
> >>>>>>> This is unusually small input. What's the block size? Use large
> >>>>>>> blocks (such as 30,000). Block size can't be less than k+p.
> >>>>>>>
> >>>>>>
> >>>>>> I did set blockSize to 30,000 (as recommended in the PDF that you
> >>>>>> wrote up). As far as input size, the reason to do that is that it
> >>>>>> is easier to test and verify the map-reduce pipeline with my
> >>>>>> in-memory implementation of the algorithm.
> >>>>>>
> >>>>>>> Can you please cut and paste the actual log of the qjob tasks
> >>>>>>> that failed? This is a front-end error, but the actual problem is
> >>>>>>> in the backend, ranging anywhere from hadoop problems to
> >>>>>>> algorithm problems.
> >>>>>>
> >>>>>> Sure. Refer http://esh.pastebin.mozilla.org/1302059
> >>>>>> Input is a DistributedRowMatrix 293 X 236, k = 4, p = 40,
> >>>>>> numReduceTasks = 1, blockHeight = 30,000. Reducing p to 20 ensures
> >>>>>> the job goes through...
> >>>>>>
> >>>>>> Thanks again
> >>>>>> Esh
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 16, 2011 9:44 AM, "Eshwaran Vijaya Kumar"
> >>>>>>> <[email protected]> wrote:
> >>>>>>>> Thanks again. I am using 0.5 right now. We will try to patch it
> >>>>>>>> up and see how it performs. In the meantime, I am having another
> >>>>>>>> (possibly user?) error: I have a 260 X 230 matrix. I set
> >>>>>>>> k+p = 40, and it fails with
> >>>>>>>>
> >>>>>>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
> >>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:349)
> >>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:262)
> >>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:91)
> >>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:131)
> >>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
> >>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>>>>>>>
> >>>>>>>> Suppose I set k+p to be much smaller, say around 20: then it
> >>>>>>>> works fine. Is it just that my dataset is of low rank, or is
> >>>>>>>> there something else going on here?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Esh
> >>>>>>>>
> >>>>>>>> On Aug 14, 2011, at 1:47 PM, Dmitriy Lyubimov wrote:
> >>>>>>>>
> >>>>>>>>> ... i need to let some time for review before pushing to ASF
> >>>>>>>>> repo )..
> >>>>>>>>>
> >>>>>>>>> On Sun, Aug 14, 2011 at 1:47 PM, Dmitriy Lyubimov
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>>> the patch is posted as MAHOUT-786.
> >>>>>>>>>>
> >>>>>>>>>> also, 0.6 trunk with the patch applied is here:
> >>>>>>>>>> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-786
> >>>>>>>>>>
> >>>>>>>>>> I will commit to the ASF repo tomorrow night (even though it
> >>>>>>>>>> is extremely simple, i need
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Dmitriy,
> >>>>>>>>>>> That sounds great. I eagerly await the patch.
> >>>>>>>>>>> Thanks
> >>>>>>>>>>> Esh
> >>>>>>>>>>> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Ok, i got u0 working.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The problem is of course that something called the BBt job
> >>>>>>>>>>>> has to be coerced to have 1 reducer (it's fine -- every
> >>>>>>>>>>>> mapper won't yield more than an upper-triangular matrix of
> >>>>>>>>>>>> k+p x k+p geometry, so even if you end up having thousands
> >>>>>>>>>>>> of them, the reducer would sum them up just fine).
> >>>>>>>>>>>>
> >>>>>>>>>>>> it worked before, apparently, because the configuration
> >>>>>>>>>>>> holds 1 reducer by default if not set explicitly; i am not
> >>>>>>>>>>>> quite sure if it's something in the hadoop mr client or a
> >>>>>>>>>>>> mahout change that now precludes it from working.
> >>>>>>>>>>>>
> >>>>>>>>>>>> anyway, i got a patch (really a one-liner), and an example
> >>>>>>>>>>>> equivalent to yours worked fine for me with 3 reducers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also, the tests also request 3 reducers, but the reason it
> >>>>>>>>>>>> works in tests and not in distributed mapred is that local
> >>>>>>>>>>>> mapred doesn't support multiple reducers. I investigated
> >>>>>>>>>>>> this issue before, and apparently there were a couple of
> >>>>>>>>>>>> patches floating around, but for some reason those changes
> >>>>>>>>>>>> did not take hold in cdh3u0.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I will publish the patch in a jira shortly and will commit
> >>>>>>>>>>>> it Sunday-ish.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks.
> >>>>>>>>>>>> -d
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar
> >>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> OK. So to add more info to this, I tried setting the
> >>>>>>>>>>>>> number of reducers to 1, and now I don't get that
> >>>>>>>>>>>>> particular error. The singular values and the left and
> >>>>>>>>>>>>> right singular vectors appear to be correct, though
> >>>>>>>>>>>>> (verified using Matlab).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> All,
> >>>>>>>>>>>>>> I am trying to test Stochastic SVD and am facing some
> >>>>>>>>>>>>>> errors where it would be great if someone could clarify
> >>>>>>>>>>>>>> what is going on.
> >>>>>>>>>>>>>> I am trying to feed the solver a DistributedRowMatrix
> >>>>>>>>>>>>>> with the exact same parameters that the test in
> >>>>>>>>>>>>>> LocalSSVDSolverSparseSequentialTest uses, i.e., generate
> >>>>>>>>>>>>>> a 1000 X 100 DRM with SequentialSparseVectors and then
> >>>>>>>>>>>>>> ask for blockHeight = 251, p (oversampling) = 60,
> >>>>>>>>>>>>>> k (rank) = 40. I get the following error:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Exception in thread "main" java.io.IOException: Unexpected
> >>>>>>>>>>>>>> overrun in upper triangular matrix files
> >>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
> >>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
> >>>>>>>>>>>>>> at com.mozilla.SSVDCli.run(SSVDCli.java:89)
> >>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >>>>>>>>>>>>>> at com.mozilla.SSVDCli.main(SSVDCli.java:129)
> >>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>>>>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also, I am using CDH3, with Mahout recompiled to work
> >>>>>>>>>>>>>> with the CDH3 jars.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>> Esh
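To summarize the constraints that came up in this thread (each precomputed map split must contain at least k+p rows, block height can't be less than k+p, and k+p can't exceed the smaller matrix dimension), a pre-flight check along these lines could catch the failure before submitting the job. This is a hypothetical Python sketch -- `check_ssvd_params` and its arguments are illustrative names, not anything in the Mahout API:

```python
def check_ssvd_params(n_rows, n_cols, k, p, rows_per_split, block_height=30000):
    """Sanity-check SSVD parameters against the constraints discussed in
    this thread before launching the distributed job."""
    kp = k + p
    if kp > min(n_rows, n_cols):
        raise ValueError("k+p cannot exceed the smaller matrix dimension")
    if block_height < kp:
        raise ValueError("block height can't be less than k+p")
    if rows_per_split < kp:
        # this is the QJob failure mode: a split with fewer than k+p rows
        raise ValueError("each map split must hold at least k+p rows; "
                         "raise --minSplitSize or merge small input files")
    return True
```

With the numbers from this thread (293 rows spread over 7 files, so roughly 42 rows per split), k = 4 with p = 40 fails the split check while p = 20 passes, matching the observed behavior.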
