Thanks for capturing the TODO notes in our svn tree. This will help you/us ensure the list is completed.
Mary

-----Original Message-----
From: fossology-commits-boun...@fossology.org [mailto:fossology-commits-boun...@fossology.org] On Behalf Of adamrba...@users.sourceforge.net
Sent: Monday, March 22, 2010 11:02 PM
To: fossology-comm...@fossology.org
Subject: [FOSSology-commits] SF.net SVN: fossology:[2918] trunk/fossology/agents/copyright_analysis

Revision: 2918
          http://fossology.svn.sourceforge.net/fossology/?rev=2918&view=rev
Author:   adamrbates
Date:     2010-03-23 05:02:21 +0000 (Tue, 23 Mar 2010)

Log Message:
-----------
Added some comments to the copyright library functions. Also added a TODO file with changes and fixes that need to be made to the agent.

Modified Paths:
--------------
    trunk/fossology/agents/copyright_analysis/copyright_library.py

Added Paths:
-----------
    trunk/fossology/agents/copyright_analysis/TODO

Added: trunk/fossology/agents/copyright_analysis/TODO
===================================================================
--- trunk/fossology/agents/copyright_analysis/TODO                          (rev 0)
+++ trunk/fossology/agents/copyright_analysis/TODO      2010-03-23 05:02:21 UTC (rev 2918)
@@ -0,0 +1,47 @@
+In copyright.py v 2903:
+
+line 87:
+    print >> sys.stderr, 'ERROR: Something is broken. Could not connect to database.'
+1) "Something is broken." is pretty redundant.
+2) I'm pretty sure the actual postgres error message is more specific than "Could not connect to database". Why don't you just print that out instead of your own message? The libpq functions to get the error code and error message string are PQresStatus() and PQresultErrorMessage(). I notice that you write the error message string in some cases but not others.
+
+
+line 100:
+    print >> sys.stderr, 'You must specify a model file for all phases of the algorithm.'
+1) What do you mean "for all phases of the algorithm"? There is only one switch -m (--model) for specifying the model. So to reduce ambiguity (i.e., is there a different model file for each phase?), why not just say "You must specify a model file."
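[Editor's note: the point about forwarding the server's own error text can be sketched in Python. connect_to_db and connect_or_report below are hypothetical stand-ins, not the agent's real API; the real call sites go through libpq, where PQerrorMessage()/PQresultErrorMessage() supply the specific text.]

```python
import sys

def connect_to_db(conninfo):
    # Hypothetical stand-in for the agent's real database connect call.
    # We simulate a failure here so the pattern is runnable; in libpq
    # terms, e's text is what PQresultErrorMessage() would return.
    raise RuntimeError('could not connect to server: Connection refused')

def connect_or_report(conninfo):
    try:
        return connect_to_db(conninfo)
    except RuntimeError as e:
        # Forward the driver's specific message instead of a canned
        # 'Could not connect to database' string.
        print('ERROR: database connection failed: %s' % e, file=sys.stderr)
        return None
```

The reader then sees the actual reason (bad host, refused connection, auth failure) with no extra lookup.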
+
+lines 122-129:
+    if options.analyze_from_file:
+        files = [line.rstrip() for line in open(options.analyze_from_file).readlines()]
+        for file in files:
+            results = library.label_file(file,model)
+            print "%s :: " % (file)
+            if len(results) == 0:
+                print "No copyrights"
+            for i in range(len(results)):
+                print "\t[%d:%d]" % (results[i][0], results[i][1])
+1) Each of those prints is going to come out in the log file as "ERROR: ...". Did you mean for these to be printed with --verbose only? They are also going to be printed on separate lines in the scheduler log; is that what you wanted?
+
+You get the idea. Here is a summary of tips.
+1) Please don't print debug to the log unless it is requested with the verbose option.
+2) There are 5 options you should be using when printing to the log:
+   - "FATAL"   Technical and detailed errors.
+   - "ERROR"   Human readable errors.
+   - "WARNING" Human readable warning.
+   - "LOG"     Machine readable warning.
+   - "DEBUG"   Debugging message.
+3) Make sure your messages mean something to the intended reader.
+4) Make your messages as explicit as reasonable. For example, line 158:
+    print >> sys.stderr, 'ERROR: Something is broken. Could not open Repo.'
+   would be improved by including the path you are trying to open. This extra info will tell the reader if their config path is correct.
+   Another example is the dbconnect error I mentioned above.
+   Another example is line 189:
+    print >> sys.stderr, "ERROR: DB Access error,\n%s" % db.status()
+   Printing the status code is fine, but the error message (PQresultErrorMessage()) is better. Who wants to go look up the message from the code when you can just print the message?
+5) Starting on line 201 you have a couple of "except:" where you print a traceback. It would be easier to understand the log if you print out why you are doing the traceback before you print it. Or is it there and I'm just not seeing it?
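[Editor's note: tips 1 and 2 above can be captured by one small helper. This is a sketch only; log() is an illustrative name, not an existing FOSSology function, and it assumes the agent writes its log lines to stderr.]

```python
import sys

# The five agreed message levels from the review.
LEVELS = ('FATAL', 'ERROR', 'WARNING', 'LOG', 'DEBUG')

def log(level, message, verbose=False, out=None):
    """Hypothetical logging helper: prefix every line with its level and
    suppress DEBUG output unless --verbose was requested."""
    if out is None:
        out = sys.stderr
    if level not in LEVELS:
        raise ValueError('unknown log level: %r' % level)
    if level == 'DEBUG' and not verbose:
        return False   # suppressed: debug output was not requested
    print('%s: %s' % (level, message), file=out)
    return True
```

With this in place, per-file analysis chatter becomes log('DEBUG', ...) and only reaches the scheduler log when the user asks for it.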
+6) When you want to know if a table exists, don't (potentially) count every record like on line 220:
+    if db.access2('SELECT count(ct_pk) FROM copyright;') != 1:
+Quicker would be "select ct_pk from copyright limit 1".
+7) Semicolons aren't needed at the end of your sql stmt unless you are stringing a bunch of statements together. The semicolon is a statement separator, so it isn't needed if you are only sending a single stmt to the postmaster.
+8) This statement in 6) brings up a question about db.access2(). Does access2() distinguish between an invalid query (like a syntax error) and no records returned? It's important to give an error if the command fails. Is db.access2() a wrapper around the same function in libfossdb.c? If so, then the stmt in 6) is an error because DBaccess2 in libfossdb returns a PGresult * or a null pointer on serious errors.

Modified: trunk/fossology/agents/copyright_analysis/copyright_library.py
===================================================================
--- trunk/fossology/agents/copyright_analysis/copyright_library.py      2010-03-23 01:54:53 UTC (rev 2917)
+++ trunk/fossology/agents/copyright_analysis/copyright_library.py      2010-03-23 05:02:21 UTC (rev 2918)
@@ -19,12 +19,30 @@
 import math
 
 def findall(RE, text):
+    """
+    findall(Regex, text) -> list(['str_1', start_byte_1, end_byte_1], ...)
+
+    Uses the re.finditer method to locate all instances of the search
+    string and returns the string and its start and end bytes.
+
+    Regex is a compiled regular expression using the re.compile() method.
+    text is the body of text to search through.
+    """
     found = []
     for iter in RE.finditer(text):
         found.append([iter.group(), iter.start(), iter.end()])
     return found
 
 def findall_erase(RE, text):
+    """
+    findall_erase(Regex, text) -> string
+
+    Uses the re.finditer method to locate all instances of the search
+    string and replaces them with spaces. The modified text is returned.
+
+    Regex is a compiled regular expression using the re.compile() method.
+    text is the body of text to search through.
+    """
     new_text = list(text)
     found = []
     for iter in RE.finditer(text):
@@ -33,6 +51,7 @@
         new_text[l[1]:l[2]] = [' ' for i in range(l[2]-l[1])]
     return (found, ''.join(new_text))
 
+# standard Regular Expressions used to tokenize files.
 months = ['JAN','FEB','MAR','MAY','APR','JUL','JUN','AUG','OCT','SEP','NOV','DEC','January', 'February', 'March', 'April', 'June', 'July', 'August', 'September', 'SEPT', 'October', 'November', 'December',]
 RE_ANDOR = re.compile('and\/or',re.I)
 RE_COMMENT = re.compile('\s([\*\/\#\%...@]+)')
@@ -50,6 +69,25 @@
 RE_ANYTHING = re.compile('.', re.DOTALL)
 
 def parsetext(text):
+    """
+    parsetext(text) -> dict()
+
+    Tokenizes a body of text and returns a dictionary containing
+    a set of features located within the text.
+
+    'start':     the byte offset of '<s>' tags found in the text.
+    'end':       the byte offset of '</s>' tags found in the text.
+    'email':     the byte offset and text of emails found in the text.
+    'date':      the byte offset and text of date-like strings found.
+    'time':      the byte offset and text of time-like strings found.
+    'year':      the byte offset and text of year-like strings found.
+    'float':     the byte offset and text of floating point numbers found.
+    'copyright': the byte offset and text of copyright strings and symbols;
+                 includes 'opyright', '(c)', and the copyright characters.
+    'tokens':    the byte offset and text of white-space-split tokens;
+                 this also includes the characters located between
+                 alphanumeric characters.
+    """
     stuff = {}
 
     text = RE_ANDOR.sub('and or',text)
@@ -67,6 +105,10 @@
     (stuff['copyright'], text) = findall_erase(RE_COPYRIGHT, text)
     (stuff['tokens'], text) = findall_erase(RE_TOKEN, text)
 
+
+    # we replace the original information extracted from the text with
+    # placeholders so we can learn a generic trend in the structure of
+    # the documents.
     stuff['tokens'].extend([['XXXstartXXX', stuff['start'][i][1], stuff['start'][i][2]] for i in range(len(stuff['start']))])
     stuff['tokens'].extend([['XXXendXXX', stuff['end'][i][1], stuff['end'][i][2]] for i in range(len(stuff['end']))])
     stuff['tokens'].extend([['XXXemailXXX', stuff['email'][i][1], stuff['email'][i][2]] for i in range(len(stuff['email']))])
@@ -83,6 +125,20 @@
     return stuff
 
 def replace_placeholders(tokens,stuff):
+    """
+    replace_placeholders(tokens, parse_dictionary) -> list()
+
+    Given a set of tokens and a parse_dictionary, replace all the
+    placeholders in the list of tokens with their respective
+    counterparts in the parse_dictionary.
+
+    The parse_dictionary is the dictionary returned by the
+    parsetext function.
+
+    The list of tokens is a list of lists containing the token
+    as a string and the start and end bytes of the token, structured
+    like the tokens element in the parse_dictionary.
+    """
     t = tokens[:]
     n = len(t)
     for needle in ['XXXcopyrightXXX', 'XXXfloatXXX', 'XXXyearXXX', 'XXXtimeXXX',
@@ -95,6 +151,12 @@
     return t
 
 def token_sort(x,y):
+    """
+    token_sort(x,y) -> int
+
+    A sort helper function. Takes two lists, ['string', int, int], and returns
+    {-1: if x < y, 1: if x > y, 0: if x == y}.
+    """
     if (x[1] < y[1]):
         return -1
     if (x[1] > y[1]):
@@ -102,6 +164,22 @@
     return 0
 
 def tokens_to_BIO(tokens):
+    """
+    tokens_to_BIO(tokens) -> (tokens, labels)
+
+    Takes a token list and returns the BIO labels based on the position
+    of the start and end tags.
+
+    This is used to convert the training data, which is labeled using
+    '<s></s>' tags, into an NLP-standard labeling.
+
+    The returned tokens do not include start and end placeholders.
+    The labels list contains the BIO tag for each token.
+
+    B - beginning
+    I - inside
+    O - outside
+    """
     n = len(tokens)
     t = []
     l = []

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
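[Editor's note: points 6 and 8 in the TODO above, the cheap LIMIT 1 existence probe and the need to distinguish a failed statement from an empty result, can be sketched as below. sqlite3 stands in for Postgres purely so the sketch is runnable; with libfossdb, the 'missing' branch corresponds to DBaccess2 returning a null PGresult pointer.]

```python
import sqlite3

def table_probe(cur, table, col):
    """Run 'SELECT <col> FROM <table> LIMIT 1' and report a three-way
    result: 'missing' if the statement itself fails (e.g. no such table),
    'empty' if it succeeds with zero rows, 'ok' if a row comes back.

    table and col are trusted identifiers here (a sketch shortcut);
    real code should not interpolate untrusted strings into SQL.
    """
    try:
        cur.execute('SELECT %s FROM %s LIMIT 1' % (col, table))
    except sqlite3.OperationalError:
        return 'missing'
    return 'ok' if cur.fetchone() is not None else 'empty'
```

Unlike count(*), this touches at most one row, and unlike a bare result check it never confuses "query failed" with "no records".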
_______________________________________________
fossology-commits mailing list
fossology-comm...@fossology.org
http://fossology.org/mailman/listinfo/fossology-commits

_______________________________________________
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology
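[Editor's note: as a closing aside, the findall() helper documented in the copyright_library.py diff is small enough to reproduce standalone. This is a reconstruction of the function as shown in the diff, written to be runnable on its own.]

```python
import re

def findall(RE, text):
    # Same shape as the findall() in the diff: each hit is a list
    # [matched_string, start_byte, end_byte] taken from re.finditer.
    found = []
    for m in RE.finditer(text):
        found.append([m.group(), m.start(), m.end()])
    return found
```

For example, findall(re.compile(r'\b(?:19|20)\d\d\b'), some_text) returns each four-digit year with its byte offsets (that regex is illustrative, not the library's actual RE_YEAR).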