Thanks for capturing the TODO notes in our svn tree. This will help you/us ensure the list is completed.
Mary

-----Original Message-----
From: fossology-commits-boun...@fossology.org [mailto:fossology-commits-boun...@fossology.org] On Behalf Of adamrba...@users.sourceforge.net
Sent: Monday, March 22, 2010 11:02 PM
To: fossology-comm...@fossology.org
Subject: [FOSSology-commits] SF.net SVN: fossology:[2918] trunk/fossology/agents/copyright_analysis

Revision: 2918
          http://fossology.svn.sourceforge.net/fossology/?rev=2918&view=rev
Author:   adamrbates
Date:     2010-03-23 05:02:21 +0000 (Tue, 23 Mar 2010)

Log Message:
-----------
Added some comments to the copyright library functions. Also added a TODO file with changes and fixes that need to be made to the agent.

Modified Paths:
--------------
    trunk/fossology/agents/copyright_analysis/copyright_library.py

Added Paths:
-----------
    trunk/fossology/agents/copyright_analysis/TODO

Added: trunk/fossology/agents/copyright_analysis/TODO
===================================================================
--- trunk/fossology/agents/copyright_analysis/TODO                          (rev 0)
+++ trunk/fossology/agents/copyright_analysis/TODO      2010-03-23 05:02:21 UTC (rev 2918)
@@ -0,0 +1,47 @@
+In copyright.py v 2903:
+
+line 87:
+    print >> sys.stderr, 'ERROR: Something is broken. Could not connect to database.'
+1) "Something is broken." is pretty redundant.
+2) I'm pretty sure the actual postgres error message is more specific than "Could not connect to database". Why don't you just print that out instead of your own message? The libpq functions to get the error code and error message string are PQresStatus() and PQresultErrorMessage(). I notice that you write the error message string in some cases but not others.
+
+
+line 100:
+    print >> sys.stderr, 'You must specify a model file for all phases of the algorithm.'
+1) What do you mean "for all phases of the algorithm"? There is only one switch -m (--model) for specifying the model. So to reduce ambiguity (i.e., is there a different model file for each phase?), why not just say "You must specify a model file."
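[Editor's note: the point about forwarding the server's own error text can be sketched in Python. connect_to_db and connect_or_report below are hypothetical stand-ins, not the agent's real API; the real call sites go through libpq, where PQerrorMessage()/PQresultErrorMessage() supply the specific text.]

```python
import sys

def connect_to_db(conninfo):
    # Hypothetical stand-in for the agent's real database connect call.
    # We simulate a failure here so the pattern is runnable; in libpq
    # terms, e's text is what PQresultErrorMessage() would return.
    raise RuntimeError('could not connect to server: Connection refused')

def connect_or_report(conninfo):
    try:
        return connect_to_db(conninfo)
    except RuntimeError as e:
        # Forward the driver's specific message instead of a canned
        # 'Could not connect to database' string.
        print('ERROR: database connection failed: %s' % e, file=sys.stderr)
        return None
```

The reader then sees the actual reason (bad host, refused connection, auth failure) with no extra lookup.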
+
+lines 122-129:
+    if options.analyze_from_file:
+        files = [line.rstrip() for line in open(options.analyze_from_file).readlines()]
+        for file in files:
+            results = library.label_file(file,model)
+            print "%s :: " % (file)
+            if len(results) == 0:
+                print "No copyrights"
+            for i in range(len(results)):
+                print "\t[%d:%d]" % (results[i][0], results[i][1])
+1) Each of those prints is going to come out in the log file as "ERROR: ...". Did you mean for these to be printed with --verbose only? They are also going to be printed on separate lines in the scheduler log; is that what you wanted?
+
+You get the idea. Here is a summary of tips.
+1) Please don't print debug to the log unless it is requested with the verbose option.
+2) There are 5 options you should be using when printing to the log:
+   - "FATAL"   Technical and detailed errors.
+   - "ERROR"   Human readable errors.
+   - "WARNING" Human readable warning.
+   - "LOG"     Machine readable warning.
+   - "DEBUG"   Debugging message.
+3) Make sure your messages mean something to the intended reader.
+4) Make your messages as explicit as reasonable. For example, line 158:
+    print >> sys.stderr, 'ERROR: Something is broken. Could not open Repo.'
+   would be improved by including the path you are trying to open. This extra info will tell the reader if their config path is correct.
+   Another example is the dbconnect error I mentioned above.
+   Another example is line 189:
+    print >> sys.stderr, "ERROR: DB Access error,\n%s" % db.status()
+   Printing the status code is fine, but the error message (PQresultErrorMessage()) is better. Who wants to go look up the message from the code when you can just print the message?
+5) Starting on line 201 you have a couple of "except:" where you print a traceback. It would be easier to understand the log if you print out why you are doing the traceback before you print it. Or is it there and I'm just not seeing it?
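[Editor's note: tips 1 and 2 above can be captured by one small helper. This is a sketch only; log() is an illustrative name, not an existing FOSSology function, and it assumes the agent writes its log lines to stderr.]

```python
import sys

# The five agreed message levels from the review.
LEVELS = ('FATAL', 'ERROR', 'WARNING', 'LOG', 'DEBUG')

def log(level, message, verbose=False, out=None):
    """Hypothetical logging helper: prefix every line with its level and
    suppress DEBUG output unless --verbose was requested."""
    if out is None:
        out = sys.stderr
    if level not in LEVELS:
        raise ValueError('unknown log level: %r' % level)
    if level == 'DEBUG' and not verbose:
        return False   # suppressed: debug output was not requested
    print('%s: %s' % (level, message), file=out)
    return True
```

With this in place, per-file analysis chatter becomes log('DEBUG', ...) and only reaches the scheduler log when the user asks for it.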
+6) When you want to know if a table exists, don't (potentially) count every record like on line 220:
+    if db.access2('SELECT count(ct_pk) FROM copyright;') != 1:
+Quicker would be "select ct_pk from copyright limit 1".
+7) Semicolons aren't needed at the end of your sql stmt unless you are stringing a bunch of statements together. The semicolon is a statement separator, so it isn't needed if you are only sending a single stmt to the postmaster.
+8) This statement in 6) brings up a question about db.access2(). Does access2() distinguish between an invalid query (like a syntax error) and no records returned? It's important to give an error if the command fails. Is db.access2() a wrapper around the same function in libfossdb.c? If so, then the stmt in 6) is an error because DBaccess2 in libfossdb returns a PGresult * or a null pointer on serious errors.

Modified: trunk/fossology/agents/copyright_analysis/copyright_library.py
===================================================================
--- trunk/fossology/agents/copyright_analysis/copyright_library.py      2010-03-23 01:54:53 UTC (rev 2917)
+++ trunk/fossology/agents/copyright_analysis/copyright_library.py      2010-03-23 05:02:21 UTC (rev 2918)
@@ -19,12 +19,30 @@
 import math
 
 def findall(RE, text):
+    """
+    findall(Regex, text) -> list(['str_1', start_byte_1, end_byte_1], ...)
+
+    Uses the re.finditer method to locate all instances of the search
+    string and returns the string and its start and end bytes.
+
+    Regex is a compiled regular expression using the re.compile() method.
+    text is the body of text to search through.
+    """
     found = []
     for iter in RE.finditer(text):
         found.append([iter.group(), iter.start(), iter.end()])
     return found
 
 def findall_erase(RE, text):
+    """
+    findall_erase(Regex, text) -> string
+
+    Uses the re.finditer method to locate all instances of the search
+    string and replaces them with spaces. The modified text is returned.
+
+    Regex is a compiled regular expression using the re.compile() method.
+    text is the body of text to search through.
+    """
     new_text = list(text)
     found = []
     for iter in RE.finditer(text):
@@ -33,6 +51,7 @@
         new_text[l[1]:l[2]] = [' ' for i in range(l[2]-l[1])]
     return (found, ''.join(new_text))
 
+# standard Regular Expressions used to tokenize files.
 months = ['JAN','FEB','MAR','MAY','APR','JUL','JUN','AUG','OCT','SEP','NOV','DEC','January', 'February', 'March', 'April', 'June', 'July', 'August', 'September', 'SEPT', 'October', 'November', 'December',]
 RE_ANDOR = re.compile('and\/or',re.I)
 RE_COMMENT = re.compile('\s([\*\/\#\%...@]+)')
@@ -50,6 +69,25 @@
 RE_ANYTHING = re.compile('.', re.DOTALL)
 
 def parsetext(text):
+    """
+    parsetext(text) -> dict()
+
+    Tokenizes a body of text and returns a dictionary containing
+    a set of features located within the text.
+
+    'start':     the byte offset of '<s>' tags found in the text.
+    'end':       the byte offset of '</s>' tags found in the text.
+    'email':     the byte offset and text of emails found in the text.
+    'date':      the byte offset and text of date-like strings found.
+    'time':      the byte offset and text of time-like strings found.
+    'year':      the byte offset and text of year-like strings found.
+    'float':     the byte offset and text of floating point numbers found.
+    'copyright': the byte offset and text of copyright strings and symbols;
+                 includes 'opyright', '(c)', and the copyright characters.
+    'tokens':    the byte offset and text of white-space-split tokens;
+                 this also includes the characters located between
+                 alphanumeric characters.
+    """
     stuff = {}
 
     text = RE_ANDOR.sub('and or',text)
@@ -67,6 +105,10 @@
     (stuff['copyright'], text) = findall_erase(RE_COPYRIGHT, text)
     (stuff['tokens'], text) = findall_erase(RE_TOKEN, text)
 
+
+    # we replace the original information extracted from the text with
+    # placeholders so we can learn a generic trend in the structure of
+    # the documents.
     stuff['tokens'].extend([['XXXstartXXX', stuff['start'][i][1], stuff['start'][i][2]] for i in range(len(stuff['start']))])
     stuff['tokens'].extend([['XXXendXXX', stuff['end'][i][1], stuff['end'][i][2]] for i in range(len(stuff['end']))])
     stuff['tokens'].extend([['XXXemailXXX', stuff['email'][i][1], stuff['email'][i][2]] for i in range(len(stuff['email']))])
@@ -83,6 +125,20 @@
     return stuff
 
 def replace_placeholders(tokens,stuff):
+    """
+    replace_placeholders(tokens, parse_dictionary) -> list()
+
+    Given a set of tokens and a parse_dictionary, replace all the
+    placeholders in the list of tokens with their respective
+    counterparts in the parse_dictionary.
+
+    The parse_dictionary is the dictionary returned by the
+    parsetext function.
+
+    The list of tokens is a list of lists containing the token
+    as a string and the start and end bytes of the token, structured
+    like the tokens element in the parse_dictionary.
+    """
     t = tokens[:]
     n = len(t)
     for needle in ['XXXcopyrightXXX', 'XXXfloatXXX', 'XXXyearXXX', 'XXXtimeXXX',
@@ -95,6 +151,12 @@
     return t
 
 def token_sort(x,y):
+    """
+    token_sort(x,y) -> int
+
+    A sort helper function. Takes two lists, ['string', int, int], and returns
+    {-1: if x < y, 1: if x > y, 0: if x == y}.
+    """
     if (x[1] < y[1]):
         return -1
     if (x[1] > y[1]):
@@ -102,6 +164,22 @@
     return 0
 
 def tokens_to_BIO(tokens):
+    """
+    tokens_to_BIO(tokens) -> (tokens, labels)
+
+    Takes a token list and returns the BIO labels based on the position
+    of the start and end tags.
+
+    This is used to convert the training data, which is labeled using
+    '<s></s>' tags, into an NLP-standard labeling.
+
+    The returned tokens do not include start and end placeholders.
+    The labels list contains the BIO tag for each token.
+
+    B - beginning
+    I - inside
+    O - outside
+    """
     n = len(tokens)
     t = []
     l = []

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
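[Editor's note: points 6 and 8 in the TODO above, the cheap LIMIT 1 existence probe and the need to distinguish a failed statement from an empty result, can be sketched as below. sqlite3 stands in for Postgres purely so the sketch is runnable; with libfossdb, the 'missing' branch corresponds to DBaccess2 returning a null PGresult pointer.]

```python
import sqlite3

def table_probe(cur, table, col):
    """Run 'SELECT <col> FROM <table> LIMIT 1' and report a three-way
    result: 'missing' if the statement itself fails (e.g. no such table),
    'empty' if it succeeds with zero rows, 'ok' if a row comes back.

    table and col are trusted identifiers here (a sketch shortcut);
    real code should not interpolate untrusted strings into SQL.
    """
    try:
        cur.execute('SELECT %s FROM %s LIMIT 1' % (col, table))
    except sqlite3.OperationalError:
        return 'missing'
    return 'ok' if cur.fetchone() is not None else 'empty'
```

Unlike count(*), this touches at most one row, and unlike a bare result check it never confuses "query failed" with "no records".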
_______________________________________________
fossology-commits mailing list
fossology-comm...@fossology.org
http://fossology.org/mailman/listinfo/fossology-commits

_______________________________________________
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology
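[Editor's note: as a closing aside, the findall() helper documented in the copyright_library.py diff is small enough to reproduce standalone. This is a reconstruction of the function as shown in the diff, written to be runnable on its own.]

```python
import re

def findall(RE, text):
    # Same shape as the findall() in the diff: each hit is a list
    # [matched_string, start_byte, end_byte] taken from re.finditer.
    found = []
    for m in RE.finditer(text):
        found.append([m.group(), m.start(), m.end()])
    return found
```

For example, findall(re.compile(r'\b(?:19|20)\d\d\b'), some_text) returns each four-digit year with its byte offsets (that regex is illustrative, not the library's actual RE_YEAR).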