I believe I have fixed the bug. Please allow me to also design a test to ensure this.
In short, the problem was that when doing the column swapping for efficient memory output, the algorithm would get fooled if a low-order column would take the place of an already-chosen high-order column and get chosen again. Here is the output I observe from Alejandro's script after the fix. I hope that the first values can be attributed to chance. Comparing scikit.learn version of OMP versus naive implementation n: 256, m: 30, s: 10 -> p_skit = 0.24 p_naive = 0.28 n: 256, m: 50, s: 10 -> p_skit = 0.92 p_naive = 0.97 n: 256, m: 70, s: 10 -> p_skit = 1.00 p_naive = 0.99 n: 256, m: 90, s: 10 -> p_skit = 1.00 p_naive = 0.99 n: 256, m: 110, s: 10 -> p_skit = 1.00 p_naive = 1.00 n: 256, m: 130, s: 10 -> p_skit = 1.00 p_naive = 0.99 n: 256, m: 150, s: 10 -> p_skit = 1.00 p_naive = 0.99 n: 256, m: 170, s: 10 -> p_skit = 1.00 p_naive = 0.99 n: 256, m: 190, s: 10 -> p_skit = 1.00 p_naive = 0.99 This was a very very subtle bug! Enormous thanks to Alejandro and Bob. If you would like to play with the fixed code for now, it is on the "omp_bug" branch on my github. I will merge to master after writing a test for exactly this case, in case of future refactoring. Best, Vlad On Wed, Oct 19, 2011 at 9:25 PM, Vlad Niculae <[email protected]> wrote: > Interesting. I've been staring at the code but the algorithm itself > shouldn't be losing precision. On the other hand, there are those > "stopping conditions" that I had taken from the C implementation of > the author of the Cholesky-OMP paper. If it's as you say, it could be > that when it fails, OMP stops one iteration early, but it should raise > a warning when doing so. Allow me to run the debugger on the gist. > > Vlad > > On Wed, Oct 19, 2011 at 6:54 PM, Alejandro Weinstein > <[email protected]> wrote: >> On Tue, Oct 18, 2011 at 7:49 PM, Alexandre Gramfort >> <[email protected]> wrote: >>> could you open an issue with a small test script with one X and y that >>> produce a different result using both implementations? >> >> Here it is: >> >> https://github.com/scikit-learn/scikit-learn/issues/403 >> >> I notice that when it fails, the support of the orthogonal_mp >> solutions is "almost" correct. For example, in the issue, the support >> is missing one index. It is like if orthogonal_mp is exiting the loop >> one iteration too soon. >> >> Alejandro. >> >> ------------------------------------------------------------------------------ >> All the data continuously generated in your IT infrastructure contains a >> definitive record of customers, application performance, security >> threats, fraudulent activity and more. Splunk takes this data and makes >> sense of it. Business sense. IT sense. Common sense. >> http://p.sf.net/sfu/splunk-d2d-oct >> _______________________________________________ >> Scikit-learn-general mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general >> > ------------------------------------------------------------------------------ The demand for IT networking professionals continues to grow, and the demand for specialized networking skills is growing even more rapidly. Take a complimentary Learning@Ciosco Self-Assessment and learn about Cisco certifications, training, and career opportunities. http://p.sf.net/sfu/cisco-dev2dev _______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
