[gNewSense-users] Detecting sourceless firmware vs. unsolicited email

Bake Timmons Tue, 22 Apr 2008 09:41:47 -0700

Regarding sourceless firmware, as Brian noted, heuristics can only get
us so far, but two other things are also true:


1. Heuristics can be improved.
2. Human reading itself uses (imperfect) heuristics (e.g., pattern
   seeking).  Our reading of a source file will not come close to the
   reading of one who has been hacking on it for five years.  Thus, we
   should grab any advantage we can in helping to spot patterns.

While I too agree that human reading is necessary, I also believe
strongly in automated help.  It is this *combination* of qualities that
has convinced me of the value of tools such as Emacs.  KFV Mode exploits
Emacs but still has a great deal of room for improvement; for example,
it could do much more syntax highlighting, such as on possible firmware
data.

Moreover, the success of filtering unsolicited email (spam) has
convinced me that the computer could help us do stricter checking for
non-free elements such as sourceless firmware.

I have already seen some success with *license* classification from a
text classifying program, CRM114:

http://crm114.sourceforge.net/wiki/doku.php

I installed my copy by running:

sudo apt-get install crm114

While CRM114 is mainly used as a spam filter, it is also easy to adapt it
to any other kind of (text) classification.  *At the very least*, it can
be used as a second opinion on our checking.  Thus, in KFV Mode, for
example, right before results are uploaded, a last-minute check can
notify you about whether there are any differences between your findings
and those of CRM114.  I am still devising an interface between KFV Mode
and CRM114 and have some notes about CRM114 at the end of this email.
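
The "second opinion" check described above could be sketched roughly as
follows.  This is only an illustration in Python, not KFV Mode's or
CRM114's actual interface: the function name, the file paths, and the
idea of holding both sets of findings in dictionaries are all
assumptions made for the example.

```python
def find_disagreements(human, machine):
    """Return {path: (human_label, machine_label)} where the two differ.

    Both arguments map a file path to a class name such as "GPLv2" or
    "NONFREE".  Files the classifier has no opinion on are skipped.
    """
    diffs = {}
    for path, human_label in human.items():
        machine_label = machine.get(path)
        if machine_label is not None and machine_label != human_label:
            diffs[path] = (human_label, machine_label)
    return diffs


# Hypothetical findings for two files (paths are made up).
human_findings = {"drivers/a.c": "GPLv2", "drivers/b.c": "GPLv2"}
machine_findings = {"drivers/a.c": "GPLv2", "drivers/b.c": "NONFREE"}

for path, (mine, guess) in find_disagreements(human_findings,
                                              machine_findings).items():
    print(f"{path}: you said {mine}, classifier said {guess}")
```

A disagreement here does not mean either side is wrong; it only marks a
file worth a second look before uploading.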

In the meantime, I will see how CRM114 does with sourceless firmware.
Of course, we already have:
http://svn.gnewsense.svnhopper.net/gnewsense/builder/trunk/firmware/find-firmware

However, CRM114 has surprised me so far with its successful prediction
rate, so perhaps it is worth trying on the firmware problem.  Comments,
code, criticisms, etc. are most welcome.  If anyone has been itching to
apply some AI to help out gNewSense, this may be promising.

** NOTES ON CRM114 **

1. At the very least, CRM114 is like a super grep, e.g., here is part of
my license filter:

        match <fromend nocase> [ :x: ] (:y:) /[^\n]*[^[:alnum:]](?:\
(?:GNU|GPL|MPL|licen|permi|rights|terms|distrib|released?.under|covered.by\
|this.software|sell|subject.to|provisions)\
(?:(?:[^\n]*[[:alpha:]][^\n]*)\n|\n){,4})/

This is a statement from my CRM114 filter file.  As is often the case,
the most complicated part is the regular expression (the multiline part
between the slashes ['/']).  It matches any line with a license-like
term, plus up to four lines after it that are either empty or contain at
least one letter.  The idea is to capture some "interesting" text while
avoiding noise such as a row of asterisks.  This
example in itself, of course, is nothing special -- indeed it is
primitive.  What *is* special is that CRM114 happens to be a great
performer on text.
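
For anyone who wants to experiment with the same filtering idea outside
CRM114, here is a rough line-by-line sketch in Python.  It is not a port
of the regular expression above (Python's `re` lacks the POSIX character
classes, so ASCII ranges stand in for them), and the function name and
the line-based approach are my own assumptions for illustration.

```python
import re

# Same keyword list as the filter above; the term must follow a
# non-alphanumeric character, as in the original expression.
TERM = re.compile(
    r"[^A-Za-z0-9](?:GNU|GPL|MPL|licen|permi|rights|terms|distrib"
    r"|released?.under|covered.by|this.software|sell|subject.to"
    r"|provisions)",
    re.IGNORECASE,
)
HAS_LETTER = re.compile(r"[A-Za-z]")


def license_snippets(text, max_following=4):
    """Collect each license-like line plus up to `max_following` lines
    after it that are empty or contain at least one letter."""
    lines = text.splitlines()
    snippets = []
    i = 0
    while i < len(lines):
        if not TERM.search(lines[i]):
            i += 1
            continue
        block = [lines[i]]
        j = i + 1
        while j < len(lines) and len(block) <= max_following:
            if lines[j] == "" or HAS_LETTER.search(lines[j]):
                block.append(lines[j])
                j += 1
            else:
                break  # noise such as a row of asterisks ends the snippet
        snippets.append("\n".join(block))
        i = j
    return snippets


sample = "/*\n * Released under the GNU GPL.\n * See COPYING.\n *****\n */\n"
for snippet in license_snippets(sample):
    print(snippet)
```

The point of keeping the snippets small is the same as in the CRM114
version: the classifier sees the license wording and little else.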

2. The exciting part of CRM114 is its learning power.  We can train it
using any of a number of methods and with different "classifiers".  It
turns out that many methods are based on training on errors, i.e., they
do not train on "obviously" successful predictions.  Thus, we could use
a learning file to teach CRM114 and a classifying file to give us
CRM114's guess about a source file.  These files are quite simple, but
this part from my classifier file shows the license class names I have
so far:

  classify <hyperspace unique> \
  (GPLv2- GPLv2 GPL LGPLv2.1 LGPLv2 GFDLv1.2 GFDLv1.1 ModBSD FBSD \
   GPLv2%ModBSD GPLv2%FBSD GPLv2%MPLv1.1 GPL%FBSD X11 CPL1%ModBSD%GPLv2 \
   GPLv2+FBSD FREE NONFREE NONDIST) \
   (:stats:)

Those class names will be saved to distinct files of statistics, and I
have used three abbreviations with them: "-" means "or later", "+" means
"combined with", and "%" means "or".

The "hyperspace" flag indicates my current choice for a classifier.
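
To make the three abbreviations concrete, here is a small Python sketch
that expands a class name into words.  The function name is
hypothetical, and two details are assumptions on my part: that "-" only
ever appears at the very end of a name, and that "+" binds more tightly
than "%" (none of the names above mixes the two, so the list does not
settle this).

```python
def describe_class(name):
    """Expand a class name such as 'GPLv2%ModBSD' into plain words."""
    or_later = name.endswith("-")      # trailing "-" means "or later"
    if or_later:
        name = name[:-1]
    alternatives = []
    for alternative in name.split("%"):            # "%" means "or"
        parts = alternative.split("+")             # "+" means "combined with"
        alternatives.append(" combined with ".join(parts))
    text = " or ".join(alternatives)
    if or_later:
        text += " or later"
    return text


for example in ("GPLv2-", "GPLv2+FBSD", "CPL1%ModBSD%GPLv2"):
    print(example, "=>", describe_class(example))
```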

3. My feeling is that CRM114 could help alert us about files triggering
"NONFREE" and "NONDIST".  Sources of NONFREE include: sourceless
firmware, restrictive license terms, what else?  Sources of NONDIST
(free but non-distributable) include: files under a GPLv2-incompatible
license, patent issues (e.g., mplayer).

4. One of the most effective ways to improve CRM114's performance is to try to
limit the size of texts on which it *learns*.  So devising great filters
would be very helpful.  The idea is to cut down on the "noise" that
interferes with quick learning and high confidence of guesses.  Indeed,
a great filter is likely to be critical in detecting sourceless
firmware.

5. The biggest problem so far in my experience with CRM114 is the
somewhat tedious process of training it, hence my work on an interface
between it and KFV Mode.  With that interface in place, it will be fun
to put CRM114 to the test.

6. Since KFV results are downloadable, CRM114 could easily be
trained on *everyone's* work.  It would merely be a matter of
updating one's computer with new KFV results from time to time.
CRM114 could then automatically be trained on the new results.
Thus, CRM114 would become ever smarter.
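
The train-on-errors idea from note 2, applied to a batch of labelled
results as in note 6, can be illustrated with a toy sketch in Python.
The word-counting classifier below is a deliberately crude stand-in and
has nothing to do with CRM114's "hyperspace" classifier; the class,
function, and sample texts are all invented for the example.

```python
from collections import Counter, defaultdict


class ToyClassifier:
    """Scores a text by how often its words were seen under each label."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def learn(self, label, text):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        best_label, best_score = None, 0
        for label, counter in self.counts.items():
            score = sum(counter[word] for word in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label  # None when no known word was found


def train_on_errors(classifier, samples):
    """Learn only from samples the classifier currently gets wrong.

    `samples` is an iterable of (text, label) pairs, e.g. taken from
    downloaded results.  Returns how many samples were learned.
    """
    learned = 0
    for text, label in samples:
        if classifier.classify(text) != label:
            classifier.learn(label, text)
            learned += 1
    return learned


samples = [
    ("released under the GNU GPL version 2", "GPLv2"),
    ("firmware blob no source provided", "NONFREE"),
    ("under the GNU GPL version 2 only", "GPLv2"),
]
classifier = ToyClassifier()
print(train_on_errors(classifier, samples), "samples needed training")
```

Here the third sample is already classified correctly after the first
one has been learned, so it is skipped, which is the whole point of
training on errors only.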

Bake

P.S.  A few months ago, a few of us discussed the Fossology license
classifying program a little.  While I did try that on the kernel, I did
not find it to be particularly adaptable to our efforts.  Certainly, it
is more "heavyweight", much harder to install and manage, and much
harder to customize than CRM114.  Moreover, CRM114 is much more fun to
play with! :)


