I have decided to do something similar: do the pipeline in memory and not invoke map-reduce for small datasets, which I think will handle the issue. Thanks again for clearing that up.

Esh
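For later readers of the archive: a minimal sketch of the size-based dispatch Esh describes, assuming a hypothetical SvdDispatch wrapper and SMALL_ROW_THRESHOLD cutoff. The in-memory branch uses org.apache.mahout.math.SingularValueDecomposition, the Colt-migrated solver Dmitriy mentions below; everything else is illustrative scaffolding, not the actual pipeline code.

    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class SvdDispatch {

      // Hypothetical cutoff. Per the thread, the in-memory solver starts
      // feeling slow around 10k x 1k inputs, and dense matrices up to
      // ~100,000 rows are said to be comfortable in memory.
      private static final int SMALL_ROW_THRESHOLD = 10000;

      public static void decompose(Matrix a, int k, int p) {
        if (a.numRows() < SMALL_ROW_THRESHOLD) {
          // Corner case: exact in-memory decomposition, no map-reduce.
          SingularValueDecomposition svd = new SingularValueDecomposition(a);
          Matrix u = svd.getU();
          Matrix v = svd.getV();
          double[] sigma = svd.getSingularValues();
          // ... keep only the leading k singular triplets ...
        } else {
          // Big input: submit the distributed job instead, via
          // org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.
        }
      }
    }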
On Aug 16, 2011, at 1:45 PM, Dmitriy Lyubimov wrote:

> PPS Mahout also has an in-memory SVD solver migrated from Colt, which is BTW what i am using in local tests to assert SSVD results. Although it starts to feel slow pretty quickly and sometimes produces errors (i think it starts feeling slow at 10k x 1k inputs).
>
> On Tue, Aug 16, 2011 at 12:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> also, with data as small as this, the stochastic noise ratio would be significant (as in the law of large numbers), so if you really think you might need to handle inputs that small, you'd better write a pipeline that detects this as a corner case and just runs an in-memory decomposition. In fact, i think dense matrices up to 100,000 rows can be quite comfortably computed in-memory (Ted knows much more about the practical limits of tools like R, or even ones as simple as apache.math).
>>
>> -d
>>
>> On Tue, Aug 16, 2011 at 12:46 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>>> yep, that's what i figured. you have 293 rows or so, but distributed between 7 files, so they are small and would generate several mappers, and there are probably some among them with a small row count.
>>>
>>> See my other email. This method is for big data, big files. If you want to automate handling of small files, you can probably do some intermediate step with some heuristic that merges together all files shorter than, say, 1Mb.
>>>
>>> -d
>>>
>>> On Tue, Aug 16, 2011 at 12:43 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
>>>
>>>> The number of mappers is 7. The DFS block size is 128 MB; the reason I think there are 7 mappers being used is that I am using a Pig script to generate the sequence file of Vectors, and that script generates 7 reducers. I am not setting minSplitSize though.
>>>>
>>>> On Aug 16, 2011, at 12:15 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Hm. This is not common at all.
>>>>>
>>>>> This error would surface if a map split can't accumulate at least k+p rows.
>>>>>
>>>>> That's another requirement which is usually a non-issue -- any precomputed split must contain at least k+p rows, which would normally fail only if the matrix is extra wide and dense, in which case --minSplitSize must be used to avoid this.
>>>>>
>>>>> But in your case, the matrix is so small it must fit in one split. Can you please verify how many mappers the job generates?
>>>>>
>>>>> if it's more than 1, then something is going fishy with hadoop. Otherwise, something is fishy with the input (it's either not 293 rows, or k+p is more than 293).
>>>>>
>>>>> -d
>>>>>
>>>>> On Tue, Aug 16, 2011 at 11:39 AM, Eshwaran Vijaya Kumar <[email protected]> wrote:
>>>>>
>>>>>> On Aug 16, 2011, at 10:35 AM, Dmitriy Lyubimov wrote:
>>>>>>
>>>>>>> This is an unusually small input. What's the block size? Use large blocks (such as 30,000). Block size can't be less than k+p.
>>>>>>
>>>>>> I did set blockSize to 30,000 (as recommended in the PDF that you wrote up). As far as input size goes, the reason for it is that it is easier to test and verify the map-reduce pipeline against my in-memory implementation of the algorithm.
>>>>>>
>>>>>>> Can you please cut and paste the actual log of the qjob tasks that failed? This is a front-end error, but the actual problem is in the backend, ranging anywhere from hadoop problems to algorithm problems.
>>>>>>
>>>>>> Sure. Refer to http://esh.pastebin.mozilla.org/1302059
>>>>>>
>>>>>> Input is a DistributedRowMatrix, 293 X 236, k = 4, p = 40, numReduceTasks = 1, blockHeight = 30,000. Reducing to p = 20 ensures the job goes through...
>>>>>>
>>>>>> Thanks again
>>>>>> Esh
>>>>>>
>>>>>>> On Aug 16, 2011 9:44 AM, "Eshwaran Vijaya Kumar" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks again. I am using 0.5 right now. We will try to patch it up and see how it performs. In the meantime, I am having another (possibly user?) error: I have a 260 X 230 matrix. I set k+p = 40, and it fails with
>>>>>>>>
>>>>>>>> Exception in thread "main" java.io.IOException: Q job unsuccessful.
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.QJob.run(QJob.java:349)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:262)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.run(SSVDCli.java:91)
>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDCli.main(SSVDCli.java:131)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>> at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>>>> at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>>>> at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>>>>>
>>>>>>>> Suppose I set k+p much lower, say around 20; then it works fine. Is it just that my dataset is of low rank, or is there something else going on here?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Esh
>>>>>>>>
>>>>>>>> On Aug 14, 2011, at 1:47 PM, Dmitriy Lyubimov wrote:
>>>>>>>>
>>>>>>>>> ... i need to let some time for review before pushing to ASF repo )..
>>>>>>>>>
>>>>>>>>> On Sun, Aug 14, 2011 at 1:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> patch is posted as MAHOUT-786.
>>>>>>>>>>
>>>>>>>>>> also, 0.6 trunk with the patch applied is here: https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-786
>>>>>>>>>>
>>>>>>>>>> I will commit to the ASF repo tomorrow night (even though it is extremely simple, i need
>>>>>>>>>>
>>>>>>>>>> On Sat, Aug 13, 2011 at 1:48 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Dmitriy,
>>>>>>>>>>> That sounds great. I eagerly await the patch.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Esh
>>>>>>>>>>>
>>>>>>>>>>> On Aug 13, 2011, at 1:37 PM, Dmitriy Lyubimov wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, i got u0 working.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem is of course that something called the BBt job has to be coerced to have 1 reducer (that's fine: every mapper won't yield more than an upper-triangular matrix of k+p x k+p geometry, so even if you end up having thousands of them, the reducer would sum them up just fine).
>>>>>>>>>>>>
>>>>>>>>>>>> it worked before apparently because the configuration held 1 reducer by default if not set explicitly; i am not quite sure whether it's something in the hadoop mr client or a mahout change that now precludes it from working.
>>>>>>>>>>>>
>>>>>>>>>>>> anyway, i got a patch (really a one-liner), and an example equivalent to yours worked fine for me with 3 reducers.
>>>>>>>>>>>>
>>>>>>>>>>>> The tests also request 3 reducers, but the reason it works in tests and not in distributed mapred is that local mapred doesn't support multiple reducers. I investigated this issue before, and apparently there were a couple of patches floating around, but for some reason those changes did not take hold in cdh3u0.
>>>>>>>>>>>>
>>>>>>>>>>>> I will publish the patch in a jira shortly and will commit it Sunday-ish.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> -d
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 5, 2011 at 7:06 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> OK. So to add more info to this, I tried setting the number of reducers to 1, and now I don't get that particular error. The singular values and the left and right singular vectors appear to be correct (verified using Matlab).
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 5, 2011, at 1:55 PM, Eshwaran Vijaya Kumar wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> All,
>>>>>>>>>>>>>> I am trying to test Stochastic SVD and am facing some errors where it would be great if someone could clarify what is going on. I am trying to feed the solver a DistributedRowMatrix with the exact same parameters that the test in LocalSSVDSolverSparseSequentialTest uses, i.e., generate a 1000 X 100 DRM with SequentialSparseVectors and then ask for blockHeight = 251, p (oversampling) = 60, k (rank) = 40.
>>>>>>>>>>>>>> I get the following error:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Exception in thread "main" java.io.IOException: Unexpected overrun in upper triangular matrix files
>>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.loadUpperTriangularMatrix(SSVDSolver.java:471)
>>>>>>>>>>>>>> at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:268)
>>>>>>>>>>>>>> at com.mozilla.SSVDCli.run(SSVDCli.java:89)
>>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>>>>>>>>>>>> at com.mozilla.SSVDCli.main(SSVDCli.java:129)
>>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>>> at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I am using CDH3 with Mahout recompiled to work with the CDH3 jars.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Esh
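Pulling the numbers of this thread together: the Q job requires every map split to supply at least k+p rows. With 293 rows spread across 7 files (roughly 41 rows per split), k + p = 4 + 40 = 44 starves at least one split, while k + p = 4 + 20 = 24 fits, which matches the failures reported above. A hedged pre-flight check along those lines (illustrative only, not Mahout code; the class and method names are made up):

    public class SplitSizeCheck {

      // Rough guard for Dmitriy's constraint: each precomputed split must
      // contain at least k+p rows. Assumes rows are spread evenly across
      // splits, which is only an estimate.
      public static void checkSplits(long totalRows, int numSplits, int k, int p) {
        long rowsPerSplit = totalRows / numSplits;
        if (rowsPerSplit < k + p) {
          throw new IllegalArgumentException(
              "Each map split needs at least k+p = " + (k + p)
                  + " rows, but splits average ~" + rowsPerSplit
                  + "; merge small input files (e.g. anything under 1Mb) or reduce p.");
        }
      }
    }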

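And a sketch of the interim workaround that made the "Unexpected overrun" error go away before MAHOUT-786 landed: pin the reduce-task count to 1 so a single reducer sums the per-mapper (k+p) x (k+p) upper-triangular partial products. The wrapper below is hypothetical scaffolding around a standard old-API Hadoop call, not Mahout's actual BBt job code.

    import org.apache.hadoop.mapred.JobConf;

    public class SingleReducerWorkaround {

      public static void pinReducers(JobConf conf) {
        // Hadoop used to default to 1 reducer when the count was unset,
        // which is why the job "worked before"; setting it explicitly
        // removes the reliance on that default.
        conf.setNumReduceTasks(1);
      }
    }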