Regarding sourceless firmware, as Brian noted, heuristics can only get us so far, but two other things are also true:
1. Heuristics can be improved.

2. Human reading itself uses (imperfect) heuristics (e.g., pattern seeking). Our reading of a source file will not come close to the reading of one who has been hacking on it for five years. Thus, we should grab any advantage we can in helping to spot patterns.

While I too agree that human reading is necessary, I also believe strongly in automated help. It is this *combination* of qualities that has convinced me of the value of tools such as Emacs. KFV Mode exploits Emacs, but still has much, much room for improvement, e.g., it could do a lot more syntax highlighting, such as on possible firmware data.

Moreover, the success of filtering unsolicited email (spam) has convinced me that the computer could help us do stricter checking for non-free elements such as sourceless firmware. I have already seen some success with *license* classification from a text classifying program, CRM114:

    http://crm114.sourceforge.net/wiki/doku.php

I installed my copy by running:

    sudo apt-get install crm114

While CRM114 is mainly used as a spam filter, it is also easy to adapt to any other kind of (text) classification. *At the very least*, it can be used as a second opinion on our checking. Thus, in KFV Mode, for example, right before results are uploaded, a last-minute check can notify you about any differences between your findings and those of CRM114. I am still devising an interface between KFV Mode and CRM114 and have some notes about CRM114 at the end of this email.

In the meantime, I will see how CRM114 does with sourceless firmware. Of course, we already have:

    http://svn.gnewsense.svnhopper.net/gnewsense/builder/trunk/firmware/find-firmware

However, CRM114 has surprised me so far with its successful prediction rate, so perhaps it is worth trying on the firmware problem.

Comments, code, criticisms, etc. are most welcome. If anyone has been itching to apply some AI to help out gNewSense, this may be promising.
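To make the "second opinion" idea concrete, here is a minimal Python sketch of the last-minute check: it takes two sets of labels, one from the human reader and one from the classifier, and lists the files where they differ. Everything in it is an assumption for illustration. The function name compare_findings, the file paths, and the dictionary layout are hypothetical and are not part of KFV Mode or CRM114, and the sketch does not call CRM114 itself.

```python
# Sketch only: compare_findings and the sample data below are hypothetical,
# not part of KFV Mode or CRM114.

def compare_findings(human, machine):
    """Return (path, human_label, machine_label) for each disagreement.

    Both arguments map a file path to a label. A file that is missing
    from one side is reported with None for that side. The result is
    sorted by path.
    """
    disagreements = []
    for path in sorted(set(human) | set(machine)):
        h = human.get(path)
        m = machine.get(path)
        if h != m:
            disagreements.append((path, h, m))
    return disagreements


if __name__ == "__main__":
    # Hypothetical findings from a reader and from a classifier.
    human = {
        "drivers/net/foo.c": "GPLv2",
        "drivers/net/bar_fw.h": "GPLv2",
    }
    machine = {
        "drivers/net/foo.c": "GPLv2",
        "drivers/net/bar_fw.h": "NONFREE",
    }
    for path, h, m in compare_findings(human, machine):
        print(f"{path}: you said {h}, classifier said {m}")
```

Only the disagreements are reported, so the reader's attention goes to the handful of files worth a second look rather than to the whole tree.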
** NOTES ON CRM114 **

1. At the very least, CRM114 is like a super grep, e.g., here is part of my license filter:

    match <fromend nocase> [ :x: ] (:y:) /[^\n]*[^[:alnum:]](?:\
    (?:GNU|GPL|MPL|licen|permi|rights|terms|distrib|released?.under|covered.by\
    |this.software|sell|subject.to|provisions)\
    (?:(?:[^\n]*[[:alpha:]][^\n]*)\n|\n){,4})/

This bit is a statement in my CRM114 filter file. As you can see, the most complicated part, as often happens, is the regular expression (the multiline part in between the slashes ['/']). It matches any line with a license-like term and up to four lines after it that are either empty or contain at least one letter. The idea is to capture some "interesting" text but to avoid noise like a bunch of asterisks. This example in itself, of course, is nothing special; indeed it is primitive. What *is* special is that CRM114 happens to be a great performer on text.

2. The exciting part of CRM114 is its learning power. We can train it with any of a number of methods and with different "classifiers". It turns out that many methods are based on training on errors, i.e., they do not train on "obviously" successful predictions. Thus, we could use a learning file to teach CRM114 and a classifying file to give us CRM114's guess about a source file. These files are quite simple, but this part from my classifier file shows the license class names I have so far:

    classify <hyperspace unique> \
    (GPLv2- GPLv2 GPL LGPLv2.1 LGPLv2 GFDLv1.2 GFDLv1.1 ModBSD FBSD \
    GPLv2%ModBSD GPLv2%FBSD GPLv2%MPLv1.1 GPL%FBSD X11 CPL1%ModBSD%GPLv2 \
    GPLv2+FBSD FREE NONFREE NONDIST) \
    (:stats:)

Those class names will be saved to distinct files of statistics, and I have used three abbreviations with them: "-" means "or later", "+" means "combined with", and "%" means "or". The "hyperspace" flag indicates my current choice for a classifier.

3. My feeling is that CRM114 could help alert us about files triggering "NONFREE" and "NONDIST".
Sources of NONFREE include: sourceless firmware, restrictive license terms, what else? Sources of NONDIST (free but non-distributable) include: files under a GPLv2-incompatible license, patent issues (e.g., mplayer).

4. One of the biggest ways to improve CRM114's performance is to limit the size of the texts on which it *learns*. So devising great filters would be very helpful. The idea is to cut down on the "noise" that interferes with quick learning and high confidence of guesses. Indeed, a great filter is likely to be critical in detecting sourceless firmware.

5. The biggest problem so far in my experience with CRM114 is the somewhat tedious process of training it, hence my work on an interface between it and KFV Mode. With that interface in place, it will be fun to put CRM114 to the test.

6. Since the results of KFV are downloadable, CRM114 could easily be trained on *everyone's* work. It would merely be a matter of updating one's computer with KFV results from time to time. CRM114 could then automatically be trained on the new results. Thus, CRM114 would become ever smarter.

Bake

P.S. A few months ago, a few of us discussed the Fossology license classifying program a little. While I did try that on the kernel, I did not find it to be particularly adaptable to our efforts. Certainly, it is more "heavyweight", much harder to install and manage, and much harder to customize than CRM114. Moreover, CRM114 is much more fun to play with! :)

_______________________________________________
gNewSense-users mailing list
http://lists.nongnu.org/mailman/listinfo/gnewsense-users
