PPS Mahout also has an in-memory SVD solver (migrated from Colt), which is, BTW, what I am using in local tests to assert SSVD results. It starts to feel slow pretty quickly, though, and sometimes produces errors (I think it starts feeling slow at around 10k x 1k inputs).
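
For illustration, here is a minimal sketch of that kind of cross-check. It assumes Mahout's in-memory org.apache.mahout.math.SingularValueDecomposition (the Colt-migrated solver); the helper name and tolerance are hypothetical, not Mahout code:

// Hedged sketch: assert SSVD singular values against Mahout's in-memory
// (Colt-migrated) SingularValueDecomposition on a small dense input.
// assertSingularValues and the epsilon tolerance are illustrative only.
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;

public class SvdCrossCheck {

  static void assertSingularValues(Matrix denseInput,
                                   double[] ssvdSingularValues,
                                   int k, double epsilon) {
    // reference decomposition computed entirely in memory
    double[] reference =
        new SingularValueDecomposition(denseInput).getSingularValues();
    for (int i = 0; i < k; i++) {
      if (Math.abs(reference[i] - ssvdSingularValues[i]) > epsilon) {
        throw new AssertionError("singular value " + i + " diverges");
      }
    }
  }

  public static void main(String[] args) {
    // tiny synthetic input, just to show the call shape
    Matrix a = new DenseMatrix(new double[][] {
        {1, 0, 0}, {0, 2, 0}, {0, 0, 3}, {1, 1, 1}});
    System.out.println(new SingularValueDecomposition(a)
        .getSingularValues()[0]);
  }
}
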
On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Also, with data as small as this, the stochastic noise ratio would be
> significant (as in the law of large numbers), so if you really think you
> might need to handle inputs that small, you had better write a pipeline
> that detects this as a corner case and just runs an in-memory
> decomposition. In fact, I think dense matrices of up to 100,000 rows can
> be quite comfortably computed in memory (Ted knows much more about the
> practical limits of tools like R, or of ones even as simple as
> apache.math).
>
> -d
>
>
> On Tue, Aug 16, 2011 at 12:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> Yep, that's what I figured. You have 293 rows or so, but distributed
>> between 7 files, so the files are small and would generate several
>> mappers, and there are probably some among them with a small row count.
>>
>> See my other email. This method is for big data, big files. If you want
>> to automate the handling of small files, you can probably add some
>> intermediate step with a heuristic that merges together all files
>> shorter than, say, 1 MB.
>>
>> -d
>>
>>
>> On Tue, Aug 16, 2011 at 12:43 PM, Eshwaran Vijaya Kumar <
>> [email protected]> wrote:
>>
>>> The number of mappers is 7. The DFS block size is 128 MB; the reason I
>>> think 7 mappers are being used is that I am using a Pig script to
>>> generate the sequence file of Vectors, and that script generates 7
>>> reducers. I am not setting minSplitSize, though.
>>>
>>> On Aug 16, 2011, at 12:15 PM, Dmitriy Lyubimov wrote:
>>>
>>> > Hm. This is not common at all.
>>> >
>>> > This error would surface if a map split can't accumulate at least
>>> > k+p rows.
>>> >
>>> > That's another requirement which is usually a non-issue -- any
>>> > precomputed split must contain at least k+p rows, which would
>>> > normally fail to hold only if the matrix is extra wide and dense, in
>>> > which case --minSplitSize must be used to avoid this.
>>> >
>>> > But in your case, the matrix is so small it must fit in one split.
>>> > Can you please verify how many mappers the job generates?
>>> >
>>> > If it's more than 1, then something fishy is going on with hadoop.
>>> > Otherwise, something is fishy with the input (it's either not 293
>>> > rows, or k+p is more than 293).
>>> >
>>> > -d
>>> >
>>> > On Tue, Aug 16, 2011 at 11:39 AM, Eshwaran Vijaya Kumar <
>>> > [email protected]> wrote:
>>> >
>>> >>
>>> >> On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
>>> >>
>>> >>> This is an unusually small input. What's the block size? Use large
>>> >>> blocks (such as 30,000). The block size can't be less than k+p.
>>> >>>
>>> >>
>>> >> I did set blockSize to 30,000 (as recommended in the PDF that you
>>> >> wrote up). As for the input size, the reason to use one this small
>>> >> is that it makes it easier to test and verify the map-reduce
>>> >> pipeline against my in-memory implementation of the algorithm.
>>> >>
>>> >>> Can you please cut and paste the actual log of the qjob tasks that
>>> >>> failed? This is a front-end error, but the actual problem is in
>>> >>> the backend, ranging anywhere from hadoop problems to algorithm
>>> >>> problems.
>>> >>
>>> >> Sure. Refer to http://esh.pastebin.mozilla.org/1302059
>>> >> The input is a DistributedRowMatrix, 293 x 236, with k = 4, p = 40,
>>> >> numReduceTasks = 1, blockHeight = 30,000. Reducing p to 20 ensures
>>> >> the job goes through...
>>> >>
>>> >> Thanks again
>>> >> Esh
>>> >>
>>> >>
>>> >>> On Aug 16, 2011 9:44 AM, "Eshwaran Vijaya Kumar" <
>>> >>> [email protected]> wrote:
>>> >>>> Thanks again. I am using 0.5 right now. We will try to patch it up
>>> >>>> and see how it performs. In the meantime, I am having another
>>> >>>> (possibly user?) error: I have a 260 x 230 matrix. I set k+p = 40,
>>> >>>> and it fails with:
>>> >>>>
>>> >>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
>>> >>>> at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:349)
>>> >>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:262)
>>> >>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:91)
>>> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> >>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> >>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:131)
>>> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> >>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> >>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>> >>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>> >>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>>> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> >>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> >>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>> >>>>
>>> >>>> Suppose I set k+p to be much smaller, say around 20: then it works
>>> >>>> fine. Is it just that my dataset is of low rank, or is there
>>> >>>> something else going on here?
>>> >>>>
>>> >>>> Thanks
>>> >>>> Esh
>>> >>>>
>>> >>>>
>>> >>>> On Aug 14, 2011, at 1:47 PM, Dmitriy Lyubimov wrote:
>>> >>>>
>>> >>>>> ... I need to allow some time for review before pushing to the
>>> >>>>> ASF repo)...
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sun, Aug 14, 2011 at 1:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> The patch is posted as MAHOUT-786.
>>> >>>>>>
>>> >>>>>> Also, the 0.6 trunk with the patch applied is here:
>>> >>>>>> https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-786
>>> >>>>>>
>>> >>>>>> I will commit to the ASF repo tomorrow night (even though it is
>>> >>>>>> extremely simple, I need ...
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar <
>>> >>>>>> [email protected]> wrote:
>>> >>>>>>
>>> >>>>>>> Dmitriy,
>>> >>>>>>> That sounds great. I eagerly await the patch.
>>> >>>>>>> Thanks
>>> >>>>>>> Esh
>>> >>>>>>> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
>>> >>>>>>>
>>> >>>>>>>> Ok, I got u0 working.
>>> >>>>>>>>
>>> >>>>>>>> The problem is, of course, that something called the BBt job
>>> >>>>>>>> has to be coerced to have 1 reducer (which is fine: no mapper
>>> >>>>>>>> will yield more than an upper-triangular matrix of k+p x k+p
>>> >>>>>>>> geometry, so even if you end up having thousands of them, the
>>> >>>>>>>> reducer would sum them up just fine).
>>> >>>>>>>>
>>> >>>>>>>> It apparently worked before because the configuration held 1
>>> >>>>>>>> reducer by default if not set explicitly; I am not quite sure
>>> >>>>>>>> whether it's something in the hadoop MR client or a mahout
>>> >>>>>>>> change that now precludes it from working.
>>> >>>>>>>>
>>> >>>>>>>> Anyway, I have a patch (really a one-liner), and an example
>>> >>>>>>>> equivalent to yours worked fine for me with 3 reducers.
>>> >>>>>>>>
>>> >>>>>>>> Also, the tests request 3 reducers as well, but the reason this
>>> >>>>>>>> works in tests and not in distributed mapred is that local
>>> >>>>>>>> mapred doesn't support multiple reducers. I investigated this
>>> >>>>>>>> issue before, and apparently there were a couple of patches
>>> >>>>>>>> floating around, but for some reason those changes did not take
>>> >>>>>>>> hold in cdh3u0.
>>> >>>>>>>>
>>> >>>>>>>> I will publish the patch in a jira shortly and will commit it
>>> >>>>>>>> Sunday-ish.
>>> >>>>>>>>
>>> >>>>>>>> Thanks.
>>> >>>>>>>> -d
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar <
>>> >>>>>>>> [email protected]> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> OK. To add more info to this: I tried setting the number of
>>> >>>>>>>>> reducers to 1, and now I don't get that particular error. The
>>> >>>>>>>>> singular values and the left and right singular vectors appear
>>> >>>>>>>>> to be correct (verified using Matlab).
>>> >>>>>>>>>
>>> >>>>>>>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>> All,
>>> >>>>>>>>>> I am trying to test the Stochastic SVD and am running into
>>> >>>>>>>>>> some errors where it would be great if someone could clarify
>>> >>>>>>>>>> what is going on. I am trying to feed the solver a
>>> >>>>>>>>>> DistributedRowMatrix with the exact same parameters that the
>>> >>>>>>>>>> test in LocalSSVDSolverSparseSequentialTest uses, i.e.,
>>> >>>>>>>>>> generate a 1000 x 100 DRM with SequentialSparseVectors and
>>> >>>>>>>>>> then ask for blockHeight = 251, p (oversampling) = 60,
>>> >>>>>>>>>> k (rank) = 40.
>>> >>>>>>>>>> I get the following error:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Exception in thread "main" java.io.IOException: Unexpected overrun in upper triangular matrix files
>>> >>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
>>> >>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
>>> >>>>>>>>>> at com.mozilla.SSVDCli.run(SSVDCli.java:89)
>>> >>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>> >>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>> >>>>>>>>>> at com.mozilla.SSVDCli.main(SSVDCli.java:129)
>>> >>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> >>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> >>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> >>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> >>>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>> >>>>>>>>>>
>>> >>>>>>>>>> Also, I am using CDH3 with Mahout recompiled to work with the
>>> >>>>>>>>>> CDH3 jars.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks
>>> >>>>>>>>>> Esh
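
For reference, below is a rough sketch of the small-input corner case Dmitriy suggests above: detect inputs small enough to fit in memory and run the in-memory decomposition instead of the MapReduce pipeline. The class, method, and threshold are hypothetical illustrations, not Mahout code:

// Hedged sketch of a small-input dispatch: the class, method, and
// IN_MEMORY_ROW_LIMIT threshold are illustrative, not part of Mahout.
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SingularValueDecomposition;

public class SvdDispatch {

  // per the thread's estimate, dense matrices up to ~100,000 rows can
  // be decomposed comfortably in memory (threshold is illustrative)
  private static final int IN_MEMORY_ROW_LIMIT = 100000;

  static void decompose(long numRows, Matrix inMemoryCopy) {
    if (numRows <= IN_MEMORY_ROW_LIMIT) {
      // small input: avoids both the at-least-k+p-rows-per-split
      // requirement and the stochastic noise on tiny matrices
      SingularValueDecomposition svd =
          new SingularValueDecomposition(inMemoryCopy);
      System.out.println("in-memory rank: " + svd.rank());
    } else {
      // big input: run the distributed SSVD pipeline instead
      // (org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver)
      throw new UnsupportedOperationException("run SSVD here");
    }
  }
}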
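
The one-liner fix discussed in the thread amounts to pinning the BBt job to a single reducer. A minimal sketch of that idea follows; it illustrates the configuration call only and is not the actual MAHOUT-786 patch:

// Hedged sketch: force one reducer on a BBt-style job. Illustrative
// only; see MAHOUT-786 for the real change.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class BBtJobSketch {

  static Job createBBtJob(Configuration conf) throws IOException {
    Job job = new Job(conf, "BBt");
    // every mapper emits at most one upper-triangular (k+p) x (k+p)
    // accumulator, so a single reducer can cheaply sum thousands of
    // them; relying on the implicit default of 1 reducer is what broke,
    // hence the explicit setting
    job.setNumReduceTasks(1);
    return job;
  }
}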
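
Finally, for anyone reproducing Esh's runs, a hedged sketch of driving SSVDSolver programmatically. The constructor shape follows the 0.6 sources but may differ by version, and the paths are placeholders:

// Hedged sketch: invoke the distributed SSVD solver with the k, p, and
// blockHeight values discussed above. Verify the constructor signature
// against your Mahout version; paths are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver;

public class RunSsvd {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int k = 40;              // decomposition rank
    int p = 60;              // oversampling
    int blockHeight = 30000; // Q-block rows; must be at least k + p
    int reduceTasks = 1;

    SSVDSolver solver = new SSVDSolver(conf,
        new Path[] {new Path("/tmp/drm-input")}, // placeholder input DRM
        new Path("/tmp/ssvd-out"),               // placeholder output dir
        blockHeight, k, p, reduceTasks);
    solver.run(); // U, V, and the singular values land under the output dir
  }
}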
