Re: Need Help with Programming Science Project

2014-01-27 Thread alex23

On 24/01/2014 8:05 PM, theguy wrote:

> I have a science project that involves designing a program which can examine a 
> bit of text with the author's name given, then figure out who the author is if 
> another piece of example text without the name is given.


This sounds like exactly the sort of thing NLTK was made for. Here's an 
example of using it for this requirement:


http://www.aicbt.com/authorship-attribution/
--
https://mail.python.org/mailman/listinfo/python-list


Re: Need Help with Programming Science Project

2014-01-25 Thread Rustom Mody
On Saturday, January 25, 2014 8:12:20 PM UTC+5:30, Dennis Lee Bieber wrote:
> 
>   Heck, at the very least turn all those _99 variables into single
> lists. The posted code looks like something from 1968 K&K BASIC.

Yes, that's correct.

My suggestion of data-files is a second step.

A first step is just converting to using internal (python) data structures.
[And not 1968 BASIC scalars!]
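
As a rough illustration of that first step (a sketch only, not the original poster's code; the author labels and sentences below are made up for the example), the numbered variables collapse into one list per author, and the per-sentence counts fall out of a comprehension:

```python
# Hypothetical sample data: one list per author instead of djmachale_1 ... djmachale_20.
sentences = {
    "Author A": ["Where to now", "It was a quiet morning in the valley"],
    "Author B": ["Now I have done it", "Then I really begin to sob"],
}

def letters_per_sentence(sentence_list):
    """Return the letter count of each sentence, ignoring spaces."""
    return [len(s.replace(" ", "")) for s in sentence_list]

def words_per_sentence(sentence_list):
    """Return the word count of each sentence."""
    return [len(s.split()) for s in sentence_list]

for author, sample in sorted(sentences.items()):
    letters = letters_per_sentence(sample)
    words = words_per_sentence(sample)
    # float() keeps the average exact on Python 2 as well as Python 3.
    print(author, float(sum(letters)) / sum(words))
```

Because the sentences keep their spaces, the word counts no longer have to be typed in by hand.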


Re: Need Help with Programming Science Project

2014-01-25 Thread Denis McMahon
On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:

> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point

OK, well first of all take a step back and look at the problem.

You have n exemplars, each from a known author.

You analyse each exemplar, and determine some statistics for it.

You then take your unknown sample, determine the same statistics for the 
unknown sample.

Finally, you compare each exemplar's stats with the sample's stats to try 
and find a best match.

So, perhaps you want a dictionary of { author: statistics }, and a 
function to analyse a piece of text, which might call other functions to 
get e.g. avg words / sentence, avg letters / sentence, avg word length, and 
the standard deviation of each, and the short word ratio (words <= 3 chars 
vs words >= 4 chars) and some other statistics.

Given the statistics for each exemplar, you might store these in your 
dictionary as a tuple.

This isn't Python, it's a description of an algorithm; it just looks a 
bit Pythonic:

# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, .. )

def get_some_stat( t ):
    # calculate some numerical statistic on a block of text
    # return it

def analyse( f ):
    text = read_file( f )
    return ( get_some_stat( text ), .. )

exemplar_data = {}

for exemplar_file in exemplar_files:
    exemplar_data[author] = analyse( exemplar_file )

sample_data = analyse( sample_file )

scores = {}

x = 0

# score for a piece of work is sum of ( abs diff of stat * weighting )
# for all the stats, lower score = closer match
for author in exemplar_data:
    # start each author's score from zero
    tmp = 0
    for i in range( len( exemplar_data[ author ] ) ):
        tmp = tmp + abs( exemplar_data[ author ][ i ] -
                         sample_data[ i ] ) * stat_weightings[ i ]
    scores[ author ] = tmp
    # remember the highest score as the starting point for the search below
    if tmp > x:
        x = tmp

names = []

for author in scores:
    if scores[ author ] < x:
        x = scores[ author ]
        names = [ author ]
    elif scores[ author ] == x:
        names.append( author )

print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and the 
magic coefficients to use in the stat_weightings.
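
For readers who want something they can run, here is one possible sketch of the approach described above. It is an illustration under stated assumptions, not Denis's code: the statistics chosen, the naive sentence splitting, the equal weightings, and the placeholder texts are all assumptions, and the short-word figure is computed as a fraction of all words rather than as a short/long ratio to avoid dividing by zero.

```python
import math
import re

def analyse(text):
    """Return a tuple of simple style statistics for a block of text."""
    # Naive sentence split on ., ! and ?; real prose needs something smarter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sentence = [len(s.split()) for s in sentences]
    words = re.findall(r"[A-Za-z']+", text)
    word_lengths = [len(w) for w in words]

    def mean(values):
        return sum(values) / float(len(values))

    def std_dev(values):
        m = mean(values)
        return math.sqrt(sum((v - m) ** 2 for v in values) / float(len(values)))

    # Fraction of words that are three letters or shorter.
    short_fraction = sum(1 for n in word_lengths if n <= 3) / float(len(words))
    return (mean(words_per_sentence), std_dev(words_per_sentence),
            mean(word_lengths), std_dev(word_lengths), short_fraction)

def score(exemplar_stats, sample_stats, weightings):
    """Weighted sum of absolute differences; lower means a closer match."""
    return sum(abs(e - s) * w
               for e, s, w in zip(exemplar_stats, sample_stats, weightings))

def best_matches(exemplar_texts, sample_text, weightings):
    """Return the sorted list of authors whose exemplar scores lowest."""
    sample_stats = analyse(sample_text)
    scores = {author: score(analyse(text), sample_stats, weightings)
              for author, text in exemplar_texts.items()}
    best = min(scores.values())
    return sorted(a for a, s in scores.items() if s == best)

# Placeholder texts and weightings, chosen only to make the sketch runnable.
exemplars = {
    "Author A": "I ran. I hid. It was bad. We sat in the sun.",
    "Author B": "Considerable deliberation preceded every extraordinarily "
                "complicated decision. Nevertheless, circumstances intervened.",
}
weights = (1.0, 1.0, 1.0, 1.0, 1.0)
print(best_matches(exemplars, "He ran off. I sat. It was hot out.", weights))
```

Returning a list keeps ties visible instead of silently picking one author.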

-- 
Denis McMahon, denismfmcma...@gmail.com


Re: Need Help with Programming Science Project

2014-01-24 Thread Gregory Ewing

theguy wrote:

> If I could get it to actually
> calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems
> would be solved.


Have you tried getting it to print out the values
it's getting for the scores, and comparing them
with what you calculate by hand?
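
A minimal sketch of that kind of check, using made-up numbers and hypothetical names rather than the values from the posted program: print every intermediate value next to its label before picking a winner, so a wrong or duplicated entry is visible at a glance.

```python
# Hypothetical differences between the unknown sample and each author's average.
differences = {
    "D.J. Machale": abs(4.1 - 4.6),
    "Suzanne Collins": abs(4.1 - 3.9),
    "Richard Peck": abs(4.1 - 5.0),
}

# Show each value before it is used, so it can be checked against a hand calculation.
for author in sorted(differences):
    print("%-16s %.3f" % (author, differences[author]))

closest = min(differences, key=differences.get)
print("closest:", closest)
```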

--
Greg


Re: Need Help with Programming Science Project

2014-01-24 Thread Gregory Ewing

theguy wrote:

> I so far have
> three different authors in the program and have already put in the example
> text but for some reason, the program always leans toward one specific
> author, Suzanne Collins, no matter what insane number I try to put in or how
> much I tinker with the coding.


It's obvious what's happening here: all the other
authors have heavily borrowed from Suzanne Collins.
You've created a plagiarism detector! :-)

--
Greg


Re: Need Help with Programming Science Project

2014-01-24 Thread Dave Angel
 kvxde...@gmail.com Wrote in message:
> Alright. I have the code here. Now, I just want to note that the code was not 
> designed to work "quickly" or be very well-written. It was rushed, as I only 
> had a few days to finish the work, and by the time I wrote the program, I 
> hadn't worked with Python (which I never had TOO much experience with 
> anyways) for a while. (About a year, maybe?) It was a bit foolish to take up 
> the project, but here's the code anyways:
.
>
> 
> LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, 
> avgLPW_SUZC_EXAMPLE]
> avgLPW_Match = min(LPW_Comparisons)
> 
> if avgLPW_Match == avgLPW_DJ_EXAMPLE:
>     DJMachalePossibility = (DJMachalePossibility+1)
> 
> if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
>     SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
> 
> if avgLPW_Match == avgLPW_RICH_EXAMPLE:
>     RichardPeckPossibility = (RichardPeckPossibility+1)
> 
> AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, 
> RichardPeckPossibility]
> 
> #The author with the most points on them would be considered the program's guess.
> Match = max(AUTHOR_SCOREBOARD)
> 
> print AUTHOR_SCOREBOARD
> 
> if Match == DJMachalePossibility:
>     print "The author should be D.J. Machale."
> 
> if Match == SuzanneCollinsPossibility:
>     print "The author should be Suzanne Collins."
> 
> if Match == RichardPeckPossibility:
>     print "The author should be Richard Peck."
> 
> 
> --
> Hopefully, there won't be any copyright issues. Like someone said, this 
> should be fair use. The problem I'm having is that it always gives Suzanne 
> Collins, no matter what example is put in. I'm really sorry that the code 
> isn't very clean. Like I said, it was rushed and I have little experience. 
> I'm just desperate for help as it's a bit too late to change projects, so I 
> have to stick with this. Also, if it's of any importance, I have to be able 
> to remove or add any of the "average letters per word/average letters per 
> sentence/average words per sentence things" to test the program at different 
> levels of strictness. I would GREATLY appreciate any help with this. Thank 
> you!
> 

1. When you calculate averages, you should be using floating-point
   division:
   avg = float(a) / b

2. When you subtract two values, you need an abs(), because otherwise
   min() will home in on the negative values.

3. Realize that having Match agree with more than one author is not
   that unlikely.

4. If you want to vary what you call strictness, you're really going
   to need to learn about functions.
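
The four points above can be sketched together as follows. This is an illustration only: the function names, the statistic names, and all the numbers are assumptions, not taken from the posted program. Separately, the quoted LPW_Comparisons list names avgLPW_SUZC_EXAMPLE twice and never includes avgLPW_RICH_EXAMPLE, which may be worth checking in the full program.

```python
def average(total, count):
    """Point 1: force floating-point division (matters on Python 2)."""
    return float(total) / count

def closest_authors(sample_value, author_values):
    """Points 2 and 3: compare by absolute difference and report every tie."""
    gaps = {name: abs(sample_value - value)
            for name, value in author_values.items()}
    smallest = min(gaps.values())
    return sorted(name for name, gap in gaps.items() if gap == smallest)

def guess(sample_stats, author_stats, enabled):
    """Point 4: 'strictness' becomes a list of statistic names to use."""
    points = {name: 0 for name in author_stats}
    for stat in enabled:
        values = {name: stats[stat] for name, stats in author_stats.items()}
        for name in closest_authors(sample_stats[stat], values):
            points[name] += 1
    return points

# Made-up numbers purely for illustration.
authors = {
    "D.J. Machale": {"letters_per_word": 4.4, "words_per_sentence": 15.0},
    "Suzanne Collins": {"letters_per_word": 4.0, "words_per_sentence": 11.5},
    "Richard Peck": {"letters_per_word": 4.2, "words_per_sentence": 9.0},
}
sample = {"letters_per_word": average(435, 100), "words_per_sentence": 14.0}
print(guess(sample, authors, ["letters_per_word"]))
print(guess(sample, authors, ["letters_per_word", "words_per_sentence"]))
```

Adding or removing a statistic is then a change to the list passed as `enabled`, not a change to the code.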


-- 
DaveA



Re: Need Help with Programming Science Project

2014-01-24 Thread theguy
On Friday, January 24, 2014 7:06:55 PM UTC-8, Rustom Mody wrote:
> On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> 
> > Alright. I have the code here. Now, I just want to note that the code was 
> > not designed to work "quickly" or be very well-written. It was rushed, as I 
> > only had a few days to finish the work, and by the time I wrote the 
> > program, I hadn't worked with Python (which I never had TOO much experience 
> > with anyways) for a while. (About a year, maybe?) It was a bit foolish to 
> > take up the project, but here's the code anyways:
> 
> E!
> 
> If you (or anyone with basic python experience) rewrite that code, it will 
> become 1/50th the size and all that you call 'code' will reside in data files.
> 
> That can mean one of json, xml, yml, ini, pickle, csv, etc.
> 
> If you need further help in understanding/choosing, post back

I know. I'm kind of ashamed of the code, but it does the job I need it to up to 
a certain point, where it for some reason continually gives me Suzanne Collins 
as the author. It always gives three points to her name in the 
AUTHOR_SCOREBOARD list. The code, though, is REALLY bad. I'm trying to simply 
get it to do the things needed for the program. If I could get it to actually 
calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems 
would be solved. Luckily, I'm not being graded on the elegance or conciseness 
of my code. Thank you for the constructive criticism, though I am really 
seeking help with my little problem involving that dang scoreboard. Thank you.


Re: Need Help with Programming Science Project

2014-01-24 Thread Rustom Mody
On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> Alright. I have the code here. Now, I just want to note that the code was not 
> designed to work "quickly" or be very well-written. It was rushed, as I only 
> had a few days to finish the work, and by the time I wrote the program, I 
> hadn't worked with Python (which I never had TOO much experience with 
> anyways) for a while. (About a year, maybe?) It was a bit foolish to take up 
> the project, but here's the code anyways:



E!

If you (or anyone with basic python experience) rewrite that code, it will 
become 1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, csv, etc.

If you need further help in understanding/choosing, post back
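
As one hedged example of what "data files" could look like, here is a small sketch using JSON from the standard library. The file name, the layout (author name mapped to a list of sentences), and the sample sentences are all assumptions made for illustration.

```python
import json
import os
import tempfile

# Hypothetical file layout: author name -> list of sample sentences.
samples = {
    "Author A": ["First sample sentence.", "Second sample sentence."],
    "Author B": ["Another author's sentence."],
}

# Write the data file once (normally it would be edited by hand)...
path = os.path.join(tempfile.mkdtemp(), "samples.json")
with open(path, "w") as handle:
    json.dump(samples, handle, indent=2)

# ...and the program itself only has to load it.
with open(path) as handle:
    loaded = json.load(handle)

for author, sentence_list in sorted(loaded.items()):
    print(author, len(sentence_list))
```

Adding a new author then means adding an entry to the file, with no new variables in the program.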


Re: Need Help with Programming Science Project

2014-01-24 Thread kvxdelta
Alright. I have the code here. Now, I just want to note that the code was not 
designed to work "quickly" or be very well-written. It was rushed, as I only 
had a few days to finish the work, and by the time I wrote the program, I 
hadn't worked with Python (which I never had TOO much experience with anyways) 
for a while. (About a year, maybe?) It was a bit foolish to take up the 
project, but here's the code anyways:

#D.J. Machale - Pendragon
#Pendragon: Book Six - The Rivers of Zadaa
#Page 98
#The sample sentences for this author. I put each sentence into a separate 
#variable because I knew no other way to divide the sentence. I also removed 
#spaces so they wouldn't be counted.
djmachale_1 = 'WheretonowIaskedLoor'
djmachale_2 = 'ToaplacewherewewillnotbedisturbedbyBatuorRokadorsheanswered'
djmachale_3 = 'WelefttheroomfollowingLoorthroughthetwistingtunnelthatIhadwalkedthroughseveraltimesbeforeonvisitingtoZadaa'
djmachale_4 = 'Shortlyweleftthesmallertunneltoenterthehugecavernthatonceheldanundergroundriver'
djmachale_5 = 'WhenSpaderandIwerefirstheretherewasafour-storywaterfallononesideoftheimmensecavernthatfedadeepragingriver'
djmachale_6 = 'Nowtherewasonlyadribbleofwaterthatfellfromarockymouthintoapathetictrickleofastreamatthebottomofthemostlydryriverbed'
djmachale_7 = 'WhathappenedhereAlderasked'
djmachale_8 = 'ThereisalottotellLooranswered'
djmachale_9 = 'Later'
djmachale_10 = 'Alderacceptedthat'
djmachale_11 = 'Hewasaneasyguy'
djmachale_12 = 'Loorledustotheopeningthatwasoncehiddenbehindthewaterfallbutwasnowinplainsight'
djmachale_13 = 'Weclimbedafewstonestairssteppedthroughtheportalandenteredaroomthatheldthewater-controldeviceIhavedescribedtoyoubefore'
djmachale_14 = 'Toremindyouguysthisthinglookedlikeoneofthosegiantpipe-organsthatyouseeinchurch'
djmachale_15 = 'Butthesepipesranhorizontallydisappearingintotherockwalloneithersideoftheroom'
djmachale_16 = 'Therewasaplatforminfrontofitthatheldanamazingarrayofswitchesandvalves'
djmachale_17 = 'WhenIfirstcameheretherewasaRokadorengineeronthatplatformfeverishlyworkingthecontrolslikeanexpert'
djmachale_18 = 'Ihadnoideawhatthedevicedidotherthanknowingithadsomethingtodowithcontrollingtheflowofwaterfromtherivers'
djmachale_19 = 'Theguyhadmapsanddiagramsthathereferredtowhilehequicklymadeadjustmentsandtoggledswitches'
djmachale_20 = 'Nowtheplatformwasempty'

#djmwords contains the amount of words in each sentence
#djmwords_total is the total word count between all the samples
djmwords = [6, 15, 22, 17, 26, 29, 5, 8, 1, 3, 5, 19, 25, 18, 16, 17, 20, 25, 
18, 5]
djmwords_total = sum(djmwords)
avgWORDS_per_SENTENCE_DJMACHALE = (djmwords_total/20)

#Each variable becomes the total number of letters in each sentence
djmachale_1 = len(djmachale_1)
djmachale_2 = len(djmachale_2)
djmachale_3 = len(djmachale_3)
djmachale_4 = len(djmachale_4)
djmachale_5 = len(djmachale_5)
djmachale_6 = len(djmachale_6)
djmachale_7 = len(djmachale_7)
djmachale_8 = len(djmachale_8)
djmachale_9 = len(djmachale_9)
djmachale_10 = len(djmachale_10)
djmachale_11 = len(djmachale_11)
djmachale_12 = len(djmachale_12)
djmachale_13 = len(djmachale_13)
djmachale_14 = len(djmachale_14)
djmachale_15 = len(djmachale_15)
djmachale_16 = len(djmachale_16)
djmachale_17 = len(djmachale_17)
djmachale_18 = len(djmachale_18)
djmachale_19 = len(djmachale_19)
djmachale_20 = len(djmachale_20)

#DJMACHALE_TOTAL is the total letter count between all the samples
DJ_Machale = [djmachale_1, djmachale_2, djmachale_3, djmachale_4, djmachale_5, 
djmachale_6, djmachale_7, djmachale_8, djmachale_9, djmachale_10, djmachale_11, 
djmachale_12, djmachale_13, djmachale_14, djmachale_15, djmachale_16, 
djmachale_17, djmachale_18, djmachale_19, djmachale_20]
DJMACHALE_TOTAL = sum(DJ_Machale)
avgLETTERS_per_SENTENCE_DJMACHALE = (DJMACHALE_TOTAL/20)

avgLETTERS_per_WORD_DJMACHALE = (DJMACHALE_TOTAL/djmwords_total)

#--
#Suzanne Collins - The Hunger Games
#The Hunger Games
#Page 103
suzannecollins_1 = 'AsIstridetowardtheelevatorIflingmybowtoonesideandmyquivertotheother'
suzannecollins_2 = 'IbrushpastthegapingAvoxeswhoguardtheelevatorsandhitthenumbertwelvebuttonwithmyfist'
suzannecollins_3 = 'ThedoorsslidetogetherandIzipupward'
suzannecollins_4 = 'Iactuallymakeitbacktomyfloorbeforethetearsstartrunningdownmycheeks'
suzannecollins_5 = 'IcanheartheotherscallingmefromthesittingroombutIflydownthehallintomyroomboltthedoorandflingmyselfontomybed'
suzannecollins_6 = 'ThenIreallybegintosob'
suzannecollins_7 = 'NowIvedoneit'
suzannecollins_8 = 'NowIveruinedeverything'
suzannecollins_9 = 
'IfIdevenstoodaghostofachanceitvanishedwhenIsentt

Re: Need Help with Programming Science Project

2014-01-24 Thread Roy Smith
Ben Finney wrote:

> bob gailer writes:
> 
> > On 1/24/2014 5:05 AM, theguy wrote:
> > > I would post the code, but I don't know if it's fine to put it here,
> > > as it contains pieces from books. I do believe that would go against
> > > copyright laws.
> 
> > AFAIK copyright laws apply to reproducing something for profit.
> 
> That's a common misconception that has never been true.
> 
> http://www.faqs.org/faqs/law/copyright/myths/part1/
> 
> Copyright is a legal monopoly in a work, reserving a large set of
> actions to the copyright holders. Without license from the copyright
> holders, or an exemption under the law, you cannot legally perform those
> actions.

[The rest of this post is based on my "I am not a lawyer" understanding 
of the law.  Also, this is based on US copyright law; things may be 
different elsewhere, and I haven't the foggiest idea what law applies to 
an international forum such as this]

On the other hand (where Ben Finney's post is the first hand), there is 
the Fair Use Doctrine (FUD), which grants certain exemptions.  The US 
Copyright Office has a page (http://www.copyright.gov/fls/fl102.html) 
about this.

As a real-life example, I believe I can safely invoke the FUD to quote 
the leading paragraphs from today's New York Times and New York Post 
articles about the same event and give their Flesch-Kincaid Reading Ease 
and Grade Level scores, if I were comparing the writing style of the two 
newspapers:

--

NY Times:

The crime gripped the public’s imagination, for both its magnitude and 
its moxie: In the predawn hours of Dec. 11, 1978, a group of masked 
gunmen seized about $6 million in cash and jewels from a cargo building 
at Kennedy International Airport.

Reading Ease Score: 56.6
Grade Level: 10.6

--

NY Post:

On Dec. 11, 1978, armed mobsters stole $5 million in cash and nearly $1 
million in jewels from a Lufthansa airlines vault at JFK Airport, in 
what would be for decades the biggest-ever heist on US soil.

Reading Ease Score: 76.2
Grade Level: 7.3

--

The scores above were computed by http://www.readability-score.com/
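
For anyone curious how such scores are built, the two standard Flesch formulas can be sketched as below. The functions take word, sentence, and syllable counts as inputs; how those counts are obtained (syllable counting in particular) is left out, so this sketch is not expected to reproduce the web site's numbers for the passages above. The counts in the example call are made up.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return (206.835
            - 1.015 * (float(words) / sentences)
            - 84.6 * (float(syllables) / words))

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return (0.39 * (float(words) / sentences)
            + 11.8 * (float(syllables) / words)
            - 15.59)

# Made-up counts, not taken from either newspaper passage.
print(round(flesch_reading_ease(100, 5, 150), 1))
print(round(flesch_kincaid_grade(100, 5, 150), 1))
```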

In my opinion, this meets all of the requirements of the FUD.  I'm 
quoting short passages, and using them to critique the writing styles of 
the two papers.

In the OP's case, he's analyzing published works as input to a text 
analysis algorithm.  In my personal opinion, posting samples of those 
texts, for the purpose of discussing how his algorithm works, would be 
well within the bounds of Fair Use.


Re: Need Help with Programming Science Project

2014-01-24 Thread Terry Reedy

On 1/24/2014 7:34 PM, Chris Angelico wrote:
> On Sat, Jan 25, 2014 at 10:38 AM, bob gailer wrote:
>> On 1/24/2014 5:05 AM, theguy wrote:
>>>
>>> I have a science project that involves designing a program which can
>>> examine a bit of text with the author's name given, then figure out who the
>>> author is if another piece of example text without the name is given. I so
>>> far have three different authors in the program and have already put in the
>>> example text but for some reason, the program always leans toward one
>>> specific author, Suzanne Collins, no matter what insane number I try to put
>>> in or how much I tinker with the coding. I would post the code, but I don't
>>> know if it's fine to put it here, as it contains pieces from books. I do
>>> believe that would go against copyright laws.
>>
>> AFAIK copyright laws apply to reproducing something for profit. I doubt that
>> posting it here will matter.
>
> Incorrect; posting not-for-profit can still be a violation of
> copyright. But as Peter said, the text itself isn't critical. Post
> with placeholder text, as he suggested, and we can look at the code.


In the US, short quotations are allowed for 'fair use'.

--
Terry Jan Reedy



Re: Need Help with Programming Science Project

2014-01-24 Thread Ben Finney
bob gailer  writes:

> On 1/24/2014 5:05 AM, theguy wrote:
> > I would post the code, but I don't know if it's fine to put it here,
> > as it contains pieces from books. I do believe that would go against
> > copyright laws.

> AFAIK copyright laws apply to reproducing something for profit.

That's a common misconception that has never been true.

http://www.faqs.org/faqs/law/copyright/myths/part1/

Copyright is a legal monopoly in a work, reserving a large set of
actions to the copyright holders. Without license from the copyright
holders, or an exemption under the law, you cannot legally perform those
actions.

Paying money may sometimes help one acquire a license to perform some
reserved actions (though frequently the license is severely restricted,
and frequently the license you need isn't available for any price).

But neither “I'm not seeking a profit” nor “I didn't get any money for it”
is grounds for a copyright exemption under any jurisdiction I've ever
heard of.

-- 
 \   “People are very open-minded about new things, as long as |
  `\ they're exactly like the old ones.” —Charles F. Kettering |
_o__)  |
Ben Finney



Re: Need Help with Programming Science Project

2014-01-24 Thread Chris Angelico
On Sat, Jan 25, 2014 at 10:38 AM, bob gailer  wrote:
> On 1/24/2014 5:05 AM, theguy wrote:
>>
>> I have a science project that involves designing a program which can
>> examine a bit of text with the author's name given, then figure out who the
>> author is if another piece of example text without the name is given. I so
>> far have three different authors in the program and have already put in the
>> example text but for some reason, the program always leans toward one
>> specific author, Suzanne Collins, no matter what insane number I try to put
>> in or how much I tinker with the coding. I would post the code, but I don't
>> know if it's fine to put it here, as it contains pieces from books. I do
>> believe that would go against copyright laws.
>
> AFAIK copyright laws apply to reproducing something for profit. I doubt that
> posting it here will matter.

Incorrect; posting not-for-profit can still be a violation of
copyright. But as Peter said, the text itself isn't critical. Post
with placeholder text, as he suggested, and we can look at the code.

ChrisA


Re: Need Help with Programming Science Project

2014-01-24 Thread bob gailer

On 1/24/2014 5:05 AM, theguy wrote:

> I have a science project that involves designing a program which can examine a 
> bit of text with the author's name given, then figure out who the author is if 
> another piece of example text without the name is given. I so far have three 
> different authors in the program and have already put in the example text but 
> for some reason, the program always leans toward one specific author, Suzanne 
> Collins, no matter what insane number I try to put in or how much I tinker with 
> the coding. I would post the code, but I don't know if it's fine to put it 
> here, as it contains pieces from books. I do believe that would go against 
> copyright laws.

AFAIK copyright laws apply to reproducing something for profit. I doubt 
that posting it here will matter.


In any case, do post your code; you could trim the fat out of the text if 
you need to.



Re: Need Help with Programming Science Project

2014-01-24 Thread Peter Otten
theguy wrote:

> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly
> making it easy since I'm not putting up any code, but I'm kind of desperate
> for help here, as I can't seem to find anybody or anything else helpful
> in any way. Thank you.

If I were to speculate what your program might look like:

text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}

unknown = "... sample text by unknown author ..."

def calc_match(text1, text2):
    import random
    return random.random()

guessed_author = None
guessed_match = None

for author, text in text_samples.items():
    match = calc_match(unknown, text)
    print(author, match)
    if guessed_author is None or match > guessed_match:
        guessed_author = author
        guessed_match = match

print("The author is", guessed_author)

The important part of this script is not the text samples or the loop to 
determine the best match -- it's the algorithm used to determine how well 
two texts match. 
In the above example that algorithm is encapsulated in the calc_match() 
function and it's really bad: it gives you random numbers between 0 and 1.
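
To make the contrast concrete, here is one deliberately simple, non-random stand-in for calc_match(). It is only a sketch: comparing average word length is an assumed statistic chosen for brevity, and the helper average_word_length() is hypothetical. It keeps the convention of the loop above, where a higher return value means a better match.

```python
import re

def average_word_length(text):
    """Mean number of letters per word; 0.0 for text with no words."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / float(len(words))

def calc_match(text1, text2):
    """Return a similarity in (0, 1]; 1.0 means identical average word length."""
    gap = abs(average_word_length(text1) - average_word_length(text2))
    return 1.0 / (1.0 + gap)

print(calc_match("a bb ccc", "dd ee ff"))
```

A single statistic like this will not separate real authors reliably; it only shows the shape of the function being asked for.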

For us to help you it should be sufficient if you post the analog of this 
function in your code together with a description in plain English of how it 
is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples with 
expired copyright from Project Gutenberg.

Make sure that the resulting post is as short as possible -- long text 
samples don't make the post clearer than short ones.
