Re: Need Help with Programming Science Project
On 24/01/2014 8:05 PM, theguy wrote:
> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.

This sounds like exactly the sort of thing NLTK was made for.

Here's an example of using it for this requirement:
http://www.aicbt.com/authorship-attribution/

-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Need Help with Programming Science Project
On Saturday, January 25, 2014 8:12:20 PM UTC+5:30, Dennis Lee Bieber wrote:
> Heck, at the very least turn all those _99 variables into single
> lists. The posted code looks like something from 1968 K&K BASIC.

Yes, that's correct. My suggestion of data files is a second step. The first step is just converting to internal (Python) data structures. [And not 1968 BASIC scalars!]
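A minimal sketch of that first step, assuming the sample sentences are kept with their spaces so the word counts do not have to be typed in by hand. The names (sentences, words_per_sentence and so on) are illustrative and not taken from the posted program, and only three short sentences are shown:

```python
# Three short sample sentences, with spaces kept so words can be counted.
sentences = [
    "Where to now I asked Loor",
    "Later",
    "Alder accepted that",
]

# One list replaces the numbered variables; counts are derived, not hand-typed.
words_per_sentence = [len(s.split()) for s in sentences]
letters_per_sentence = [sum(len(w) for w in s.split()) for s in sentences]

total_words = sum(words_per_sentence)
total_letters = sum(letters_per_sentence)

# float() keeps the division exact under Python 2 as well as Python 3.
avg_words_per_sentence = float(total_words) / len(sentences)
avg_letters_per_sentence = float(total_letters) / len(sentences)
avg_letters_per_word = float(total_letters) / total_words

print(words_per_sentence, letters_per_sentence, avg_letters_per_word)
```

Adding a twenty-first sentence then means appending one string to the list; nothing else has to change.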
Re: Need Help with Programming Science Project
On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:
> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point

OK, well, first of all take a step back and look at the problem.

You have n exemplars, each from a known author. You analyse each exemplar and determine some statistics for it.

You then take your unknown sample and determine the same statistics for it.

Finally, you compare each exemplar's stats with the sample's stats to try to find a best match.

So, perhaps you want a dictionary of { author: statistics }, and a function to analyse a piece of text, which might call other functions to get e.g. avg words / sentence, avg letters / sentence, avg word length, and the sd in each, and the short word ratio (words <= 3 chars vs words >= 4 chars) and some other statistics.

Given the statistics for each exemplar, you might store these in your dictionary as a tuple.

This isn't Python, it's a description of an algorithm; it just looks a bit Pythonic:

    # tuple of weightings applied to different stats
    stat_weightings = ( 1.0, 1.3, 0.85, .. )

    def get_some_stat( t ):
        # calculate some numerical statistic on a block of text
        # return it

    def analyse( f ):
        text = read_file( f )
        return ( get_some_stat( text ), .. )

    exemplar_data = {}
    for exemplar_file in exemplar_files:
        # author is the known author of exemplar_file
        exemplar_data[ author ] = analyse( exemplar_file )

    sample_data = analyse( sample_file )

    # score for a piece of work is sum of ( abs diff of stat * weighting )
    # for all the stats, lower score = closer match
    scores = {}
    for author in exemplar_data:
        tmp = 0
        for i in range( len( exemplar_data[ author ] ) ):
            tmp = tmp + abs( exemplar_data[ author ][ i ] - sample_data[ i ] ) * stat_weightings[ i ]
        scores[ author ] = tmp

    # collect the author(s) with the lowest score
    x = min( scores.values() )
    names = []
    for author in scores:
        if scores[ author ] == x:
            names.append( author )

    print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and the magic coefficients to use in the stat_weightings.

-- 
Denis McMahon, denismfmcma...@gmail.com
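The outline above can be turned into something runnable along the following lines. This is only a sketch under stated assumptions: the three statistics, the weightings in STAT_WEIGHTINGS, the regular expressions used to split sentences and words, and the two placeholder exemplar texts are all made up for illustration, and the short-word figure is computed as a fraction of all words rather than as a ratio of short to long words. It is written for Python 3.

```python
import re

# Hypothetical weightings, one per statistic returned by analyse().
STAT_WEIGHTINGS = (1.0, 1.3, 0.85)

def analyse(text):
    """Return (avg words per sentence, avg word length, short-word fraction)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_words_per_sentence = float(len(words)) / len(sentences)
    avg_word_length = float(sum(len(w) for w in words)) / len(words)
    short_word_fraction = float(sum(1 for w in words if len(w) <= 3)) / len(words)
    return (avg_words_per_sentence, avg_word_length, short_word_fraction)

def score(exemplar_stats, sample_stats, weightings=STAT_WEIGHTINGS):
    """Weighted sum of absolute differences; lower means a closer match."""
    return sum(abs(e - s) * w
               for e, s, w in zip(exemplar_stats, sample_stats, weightings))

def best_matches(exemplar_texts, sample_text):
    """Return (sorted list of best-matching authors, dict of all scores)."""
    sample_stats = analyse(sample_text)
    scores = {author: score(analyse(text), sample_stats)
              for author, text in exemplar_texts.items()}
    lowest = min(scores.values())
    return sorted(a for a in scores if scores[a] == lowest), scores

# Made-up placeholder texts, not real book excerpts.
exemplars = {
    "Author A": "I ran. I hid. It was dark. We ate.",
    "Author B": "Considerable deliberation preceded every extraordinarily "
                "elaborate pronouncement.",
}
names, scores = best_matches(exemplars, "He sat. She ran off. It was a bad day.")
print("the best matching author(s) is/are:", names)
```

Returning the full scores dictionary alongside the winners makes it easy to print every author's score and see how close the runners-up were.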
Re: Need Help with Programming Science Project
theguy wrote:
> If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD
> properly, then all my problems would be solved.

Have you tried getting it to print out the values it's getting for the scores, and comparing them with what you calculate by hand?

-- 
Greg
Re: Need Help with Programming Science Project
theguy wrote:
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding.

It's obvious what's happening here: all the other authors have heavily borrowed from Suzanne Collins. You've created a plagiarism detector! :-)

-- 
Greg
Re: Need Help with Programming Science Project
kvxde...@gmail.com Wrote in message:
> Alright. I have the code here. Now, I just want to note that the code was not
> designed to work "quickly" or be very well-written. It was rushed, as I only
> had a few days to finish the work, and by the time I wrote the program, I
> hadn't worked with Python (which I never had TOO much experience with
> anyways) for a while. (About a year, maybe?) It was a bit foolish to take up
> the project, but here's the code anyways:
.
>
> LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
> avgLPW_Match = min(LPW_Comparisons)
>
> if avgLPW_Match == avgLPW_DJ_EXAMPLE:
>     DJMachalePossibility = (DJMachalePossibility+1)
>
> if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
>     SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
>
> if avgLPW_Match == avgLPW_RICH_EXAMPLE:
>     RichardPeckPossibility = (RichardPeckPossibility+1)
>
> AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]
>
> #The author with the most points on them would be considered the program's guess.
> Match = max(AUTHOR_SCOREBOARD)
>
> print AUTHOR_SCOREBOARD
>
> if Match == DJMachalePossibility:
>     print "The author should be D.J. Machale."
>
> if Match == SuzanneCollinsPossibility:
>     print "The author should be Suzanne Collins."
>
> if Match == RichardPeckPossibility:
>     print "The author should be Richard Peck."
>
> --
> Hopefully, there won't be any copyright issues. Like someone said, this
> should be fair use. The problem I'm having is that it always gives Suzanne
> Collins, no matter what example is put in. I'm really sorry that the code
> isn't very clean. Like I said, it was rushed and I have little experience.
> I'm just desperate for help as it's a bit too late to change projects, so I
> have to stick with this. Also, if it's of any importance, I have to be able
> to remove or add any of the "average letters per word/average letters per
> sentence/average words per sentence things" to test the program at different
> levels of strictness. I would GREATLY appreciate any help with this. Thank
> you!
>

1. When you calculate averages, you should be using floating point divide.
       avg = float (a) / b

2. When you subtract two values, you need an abs, because otherwise min () will home in on the negative values.

3. Realize that having Match agree with more than one is not that unlikely.

4. If you want to vary what you call strictness, you're really going to need to learn about functions.

-- 
DaveA
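Points 1 to 3 can be illustrated with a short sketch. The helper names average and closest, the author labels and the letters-per-word figures are hypothetical and are not taken from the posted program:

```python
def average(total, count):
    # float() forces true division even under Python 2, where 9 / 2 == 4.
    return float(total) / count

def closest(sample_value, exemplar_values):
    """Return every name whose value is nearest to sample_value."""
    # abs() matters: without it, min() would favour the most negative
    # difference rather than the smallest distance.
    distances = {name: abs(value - sample_value)
                 for name, value in exemplar_values.items()}
    smallest = min(distances.values())
    return sorted(name for name in distances if distances[name] == smallest)

# Hypothetical average letters-per-word figures, for illustration only.
exemplars = {"Author A": 4.1, "Author B": 4.6, "Author C": 5.3}

print(average(9, 2))
print(closest(4.5, exemplars))
# A tie is possible, so the result is a list rather than a single name.
print(closest(5, {"x": 4, "y": 6}))
```

Because closest is a function, each statistic can be switched on or off by simply calling it or not, which is one way to approach the "strictness" requirement in point 4.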
Re: Need Help with Programming Science Project
On Friday, January 24, 2014 7:06:55 PM UTC-8, Rustom Mody wrote:
> On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> > Alright. I have the code here. Now, I just want to note that the code was
> > not designed to work "quickly" or be very well-written. It was rushed, as I
> > only had a few days to finish the work, and by the time I wrote the
> > program, I hadn't worked with Python (which I never had TOO much experience
> > with anyways) for a while. (About a year, maybe?) It was a bit foolish to
> > take up the project, but here's the code anyways:
>
> E!
>
> If you (or anyone with basic python experience) rewrites that code, it will
> become 1/50th the size and all that you call 'code' will reside in data files.
>
> That can mean one of json, xml, yml, ini, pickle, csv etc
>
> If you need further help in understanding/choosing, post back

I know. I'm kind of ashamed of the code, but it does the job I need it to up to a certain point, where it for some reason continually gives me Suzanne Collins as the author. It always gives three points to her name in the AUTHOR_SCOREBOARD list. The code, though, is REALLY bad. I'm trying to simply get it to do the things needed for the program. If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems would be solved. Luckily, I'm not being graded on the elegance or conciseness of my code. Thank you for the constructive criticism, though I am really seeking help with my little problem involving that dang scoreboard. Thank you.
Re: Need Help with Programming Science Project
On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> Alright. I have the code here. Now, I just want to note that the code was not
> designed to work "quickly" or be very well-written. It was rushed, as I only
> had a few days to finish the work, and by the time I wrote the program, I
> hadn't worked with Python (which I never had TOO much experience with
> anyways) for a while. (About a year, maybe?) It was a bit foolish to take up
> the project, but here's the code anyways:

E!

If you (or anyone with basic python experience) rewrites that code, it will become 1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, csv etc

If you need further help in understanding/choosing, post back
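As a rough sketch of the data-file idea using JSON: the layout below (author name mapped to a list of sentences), the file name mentioned in the comment and the placeholder sentences are all assumptions for illustration. The JSON is held in a string here so the example is self-contained; with a real file the same structure would come from json.load on an open file object.

```python
import json

# Hypothetical contents of a data file such as "samples.json".
SAMPLES_JSON = """
{
  "Author A": ["First placeholder sentence.", "Second one."],
  "Author B": ["Another placeholder sentence for a different author."]
}
"""

# Parse the text into a dict of {author: [sentence, ...]}.
samples = json.loads(SAMPLES_JSON)

def avg_words_per_sentence(sentences):
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(sum(len(s.split()) for s in sentences)) / len(sentences)

# The same few lines of code now serve every author in the data.
stats = {author: avg_words_per_sentence(sentences)
         for author, sentences in samples.items()}

for author in sorted(stats):
    print(author, stats[author])
```

Adding a fourth author then only touches the data, not the program.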
Re: Need Help with Programming Science Project
Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

#D.J. Machale - Pendragon
#Pendragon: Book Six - The Rivers of Zadaa
#Page 98

#The sample sentences for this author. I put each sentence into a separate variable because I knew no other way to divide the sentence. I also removed spaces so they wouldn't be counted.
djmachale_1 = 'WheretonowIaskedLoor'
djmachale_2 = 'ToaplacewherewewillnotbedisturbedbyBatuorRokadorsheanswered'
djmachale_3 = 'WelefttheroomfollowingLoorthroughthetwistingtunnelthatIhadwalkedthroughseveraltimesbeforeonvisitingtoZadaa'
djmachale_4 = 'Shortlyweleftthesmallertunneltoenterthehugecavernthatonceheldanundergroundriver'
djmachale_5 = 'WhenSpaderandIwerefirstheretherewasafour-storywaterfallononesideoftheimmensecavernthatfedadeepragingriver'
djmachale_6 = 'Nowtherewasonlyadribbleofwaterthatfellfromarockymouthintoapathetictrickleofastreamatthebottomofthemostlydryriverbed'
djmachale_7 = 'WhathappenedhereAlderasked'
djmachale_8 = 'ThereisalottotellLooranswered'
djmachale_9 = 'Later'
djmachale_10 = 'Alderacceptedthat'
djmachale_11 = 'Hewasaneasyguy'
djmachale_12 = 'Loorledustotheopeningthatwasoncehiddenbehindthewaterfallbutwasnowinplainsight'
djmachale_13 = 'Weclimbedafewstonestairssteppedthroughtheportalandenteredaroomthatheldthewater-controldeviceIhavedescribedtoyoubefore'
djmachale_14 = 'Toremindyouguysthisthinglookedlikeoneofthosegiantpipe-organsthatyouseeinchurch'
djmachale_15 = 'Butthesepipesranhorizontallydisappearingintotherockwalloneithersideoftheroom'
djmachale_16 = 'Therewasaplatforminfrontofitthatheldanamazingarrayofswitchesandvalves'
djmachale_17 = 'WhenIfirstcameheretherewasaRokadorengineeronthatplatformfeverishlyworkingthecontrolslikeanexpert'
djmachale_18 = 'Ihadnoideawhatthedevicedidotherthanknowingithadsomethingtodowithcontrollingtheflowofwaterfromtherivers'
djmachale_19 = 'Theguyhadmapsanddiagramsthathereferredtowhilehequicklymadeadjustmentsandtoggledswitches'
djmachale_20 = 'Nowtheplatformwasempty'

#djmwords contains the amount of words in each sentence
#djmwords_total is the total word count between all the samples
djmwords = [6, 15, 22, 17, 26, 29, 5, 8, 1, 3, 5, 19, 25, 18, 16, 17, 20, 25, 18, 5]
djmwords_total = sum(djmwords)
avgWORDS_per_SENTENCE_DJMACHALE = (djmwords_total/20)

#Each variable becomes the total number of letters in each sentence
djmachale_1 = len(djmachale_1)
djmachale_2 = len(djmachale_2)
djmachale_3 = len(djmachale_3)
djmachale_4 = len(djmachale_4)
djmachale_5 = len(djmachale_5)
djmachale_6 = len(djmachale_6)
djmachale_7 = len(djmachale_7)
djmachale_8 = len(djmachale_8)
djmachale_9 = len(djmachale_9)
djmachale_10 = len(djmachale_10)
djmachale_11 = len(djmachale_11)
djmachale_12 = len(djmachale_12)
djmachale_13 = len(djmachale_13)
djmachale_14 = len(djmachale_14)
djmachale_15 = len(djmachale_15)
djmachale_16 = len(djmachale_16)
djmachale_17 = len(djmachale_17)
djmachale_18 = len(djmachale_18)
djmachale_19 = len(djmachale_19)
djmachale_20 = len(djmachale_20)

#DJMACHALE_TOTAL is the total letter count between all the samples
DJ_Machale = [djmachale_1, djmachale_2, djmachale_3, djmachale_4, djmachale_5,
              djmachale_6, djmachale_7, djmachale_8, djmachale_9, djmachale_10,
              djmachale_11, djmachale_12, djmachale_13, djmachale_14, djmachale_15,
              djmachale_16, djmachale_17, djmachale_18, djmachale_19, djmachale_20]
DJMACHALE_TOTAL = (djmachale_1+djmachale_2+djmachale_3+djmachale_4+djmachale_5
                   +djmachale_6+djmachale_7+djmachale_8+djmachale_9+djmachale_10
                   +djmachale_11+djmachale_12+djmachale_13+djmachale_14+djmachale_15
                   +djmachale_16+djmachale_17+djmachale_18+djmachale_19+djmachale_20)
avgLETTERS_per_SENTENCE_DJMACHALE = (DJMACHALE_TOTAL/20)
avgLETTERS_per_WORD_DJMACHALE = (DJMACHALE_TOTAL/djmwords_total)

#--

#Suzanne Collins - The Hunger Games
#The Hunger Games
#Page 103

suzannecollins_1 = 'AsIstridetowardtheelevatorIflingmybowtoonesideandmyquivertotheother'
suzannecollins_2 = 'IbrushpastthegapingAvoxeswhoguardtheelevatorsandhitthenumbertwelvebuttonwithmyfist'
suzannecollins_3 = 'ThedoorsslidetogetherandIzipupward'
suzannecollins_4 = 'Iactuallymakeitbacktomyfloorbeforethetearsstartrunningdownmycheeks'
suzannecollins_5 = 'IcanheartheotherscallingmefromthesittingroombutIflydownthehallintomyroomboltthedoorandflingmyselfontomybed'
suzannecollins_6 = 'ThenIreallybegintosob'
suzannecollins_7 = 'NowIvedoneit'
suzannecollins_8 = 'NowIveruinedeverything'
suzannecollins_9 = 'IfIdevenstoodaghostofachanceitvanishedwhenIsentt
Re: Need Help with Programming Science Project
In article , Ben Finney wrote:
> bob gailer writes:
>
> > On 1/24/2014 5:05 AM, theguy wrote:
> > > I would post the code, but I don't know if it's fine to put it here,
> > > as it contains pieces from books. I do believe that would go against
> > > copyright laws.
>
> > AFAIK copyright laws apply to reproducing something for profit.
>
> That's a common misconception that has never been true.
> <http://www.faqs.org/faqs/law/copyright/myths/part1/>
>
> Copyright is a legal monopoly in a work, reserving a large set of
> actions to the copyright holders. Without license from the copyright
> holders, or an exemption under the law, you cannot legally perform those
> actions.

[The rest of this post is based on my "I am not a lawyer" understanding of the law. Also, this is based on US copyright law; things may be different elsewhere, and I haven't the foggiest idea what law applies to an international forum such as this]

On the other hand (where Ben Finney's post is the first hand), there is the Fair Use Doctrine (FUD), which grants certain exemptions. The US Copyright Office has a page (http://www.copyright.gov/fls/fl102.html) about this.

As a real-life example, I believe I can safely invoke the FUD to quote the leading paragraphs from today's New York Times and New York Post articles about the same event and give their Flesch-Kincaid Reading Ease and Grade Level scores, if I was comparing the writing style of the two newspapers:

-- 
NY Times:

The crime gripped the public's imagination, for both its magnitude and its moxie: In the predawn hours of Dec. 11, 1978, a group of masked gunmen seized about $6 million in cash and jewels from a cargo building at Kennedy International Airport.

Reading Ease Score: 56.6
Grade Level: 10.6
-- 
NY Post:

On Dec. 11, 1978, armed mobsters stole $5 million in cash and nearly $1 million in jewels from a Lufthansa airlines vault at JFK Airport, in what would be for decades the biggest-ever heist on US soil.

Reading Ease Score: 76.2
Grade Level: 7.3
-- 

The scores above were computed by http://www.readability-score.com/

In my opinion, this meets all of the requirements of the FUD. I'm quoting short passages, and using them to critique the writing styles of the two papers.

In the OP's case, he's analyzing published works as input to a text analysis algorithm. In my personal opinion, posting samples of those texts, for the purpose of discussing how his algorithm works, would be well within the bounds of Fair Use.
Re: Need Help with Programming Science Project
On 1/24/2014 7:34 PM, Chris Angelico wrote:
> On Sat, Jan 25, 2014 at 10:38 AM, bob gailer wrote:
> > On 1/24/2014 5:05 AM, theguy wrote:
> > > I have a science project that involves designing a program which can
> > > examine a bit of text with the author's name given, then figure out who
> > > the author is if another piece of example text without the name is
> > > given. I so far have three different authors in the program and have
> > > already put in the example text but for some reason, the program always
> > > leans toward one specific author, Suzanne Collins, no matter what insane
> > > number I try to put in or how much I tinker with the coding. I would
> > > post the code, but I don't know if it's fine to put it here, as it
> > > contains pieces from books. I do believe that would go against
> > > copyright laws.
> >
> > AFAIK copyright laws apply to reproducing something for profit. I doubt
> > that posting it here will matter.
>
> Incorrect; posting not-for-profit can still be a violation of
> copyright. But as Peter said, the text itself isn't critical. Post
> with placeholder text, as he suggested, and we can look at the code.

In the US, short quotations are allowed for 'fair use'.

-- 
Terry Jan Reedy
Re: Need Help with Programming Science Project
bob gailer writes:

> On 1/24/2014 5:05 AM, theguy wrote:
> > I would post the code, but I don't know if it's fine to put it here,
> > as it contains pieces from books. I do believe that would go against
> > copyright laws.

> AFAIK copyright laws apply to reproducing something for profit.

That's a common misconception that has never been true.
<http://www.faqs.org/faqs/law/copyright/myths/part1/>

Copyright is a legal monopoly in a work, reserving a large set of actions to the copyright holders. Without license from the copyright holders, or an exemption under the law, you cannot legally perform those actions.

Paying money may sometimes help one acquire a license to perform some reserved actions (though frequently the license is severely restricted, and frequently the license you need isn't available for any price).

But neither “I'm not seeking a profit” nor “I didn't get any money for it” is grounds for a copyright exemption under any jurisdiction I've ever heard of.

-- 
 \     “People are very open-minded about new things, as long as |
  `\    they're exactly like the old ones.” —Charles F. Kettering |
_o__)                                                              |
Ben Finney
Re: Need Help with Programming Science Project
On Sat, Jan 25, 2014 at 10:38 AM, bob gailer wrote:
> On 1/24/2014 5:05 AM, theguy wrote:
>>
>> I have a science project that involves designing a program which can
>> examine a bit of text with the author's name given, then figure out who the
>> author is if another piece of example text without the name is given. I so
>> far have three different authors in the program and have already put in the
>> example text but for some reason, the program always leans toward one
>> specific author, Suzanne Collins, no matter what insane number I try to put
>> in or how much I tinker with the coding. I would post the code, but I don't
>> know if it's fine to put it here, as it contains pieces from books. I do
>> believe that would go against copyright laws.
>
> AFAIK copyright laws apply to reproducing something for profit. I doubt that
> posting it here will matter.

Incorrect; posting not-for-profit can still be a violation of copyright. But as Peter said, the text itself isn't critical. Post with placeholder text, as he suggested, and we can look at the code.

ChrisA
Re: Need Help with Programming Science Project
On 1/24/2014 5:05 AM, theguy wrote:
> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws.

AFAIK copyright laws apply to reproducing something for profit. I doubt that posting it here will matter.

In any case do post your code; you could trim the fat out of the text if you need to.
Re: Need Help with Programming Science Project
theguy wrote:

> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly
> making it easy since I'm not putting up any code, but I'm kind of desperate
> for help here, as I can't seem to find anybody or anything else helpful
> in any way. Thank you.

If I were to speculate what your program might look like:

    text_samples = {
        "Suzanne Collins": "... some text by collins ...",
        "J. K. Rowling": "... some text by rowling ...",
        #...
    }

    unknown = "... sample text by unknown author ..."

    def calc_match(text1, text2):
        import random
        return random.random()

    guessed_author = None
    guessed_match = None
    for author, text in text_samples.items():
        match = calc_match(unknown, text)
        print(author, match)
        if guessed_author is None or match > guessed_match:
            guessed_author = author
            guessed_match = match

    print("The author is", guessed_author)

The important part of this script is not the text samples or the loop that determines the best match; it's the algorithm used to determine how well two texts match. In the above example that algorithm is encapsulated in the calc_match() function, and it's really bad: it gives you random numbers between 0 and 1.

For us to help you, it should be sufficient to post the analog of this function in your code, together with a description in plain English of how it is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples with expired copyright from Project Gutenberg.

Make sure that the resulting post is as short as possible -- long text samples don't make the post clearer than short ones.
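For comparison, here is a sketch of what a non-random calc_match() could look like. It is not the original poster's algorithm: comparing texts by average word length alone, the 1 / (1 + difference) scaling, the helper avg_word_length and the tiny placeholder texts are all assumptions chosen to keep the example short. As in the script above, a higher return value means a better match.

```python
def avg_word_length(text):
    words = text.split()
    # float() keeps the division exact under Python 2 as well as Python 3.
    return float(sum(len(w) for w in words)) / len(words)

def calc_match(text1, text2):
    """Return a similarity in (0, 1]; 1.0 means identical average word length."""
    difference = abs(avg_word_length(text1) - avg_word_length(text2))
    return 1.0 / (1.0 + difference)

# Made-up placeholder texts, not real book excerpts.
text_samples = {
    "Author A": "a bb cc d",             # average word length 1.5
    "Author B": "lengthy wording here",  # average word length 6.0
}
unknown = "it is so"                     # average word length 2.0

# Pick the author whose sample scores highest against the unknown text.
guessed_author = max(text_samples,
                     key=lambda author: calc_match(unknown, text_samples[author]))
print("The author is", guessed_author)
```

A single statistic like this is far too weak to tell real authors apart, but because everything is behind calc_match(), further statistics can be added without touching the loop that picks the winner.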