Re: help make it faster please

2005-11-13 Thread Sybren Stuvel
Bengt Richter enlightened us with: I meant somestring.split() just like that -- without a splitter argument. My suspicion remains ;-) Mine too ;-) Sybren -- The problem with the world is stupidity. Not saying there should be a capital punishment for stupidity, but why don't we just take the

Re: help make it faster please

2005-11-13 Thread Ron Adam
Fredrik Lundh wrote: Lonnie Princehouse wrote: [a-z0-9_] means match a single character from the set {a through z, 0 through 9, underscore}. \w should be a bit faster; it's equivalent to [a-zA-Z0-9_] (unless you specify otherwise using the locale or unicode flags), but is handled more

Re: help make it faster please

2005-11-13 Thread Fredrik Lundh
Ron Adam wrote: The \w does make a small difference, but not as much as I expected. that's probably because your benchmark has a lot of dubious overhead: word_finder = re.compile('[EMAIL PROTECTED]', re.I) no need to force case-insensitive search here; \w looks for both lower- and uppercase
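The point about the redundant flag can be checked with a small sketch. The pattern in the post is obscured by the archive, so a plain `\w+` is assumed here, and the sample text is made up:

```python
import re

# Assumed pattern: the one in the post is not recoverable from the archive.
case_forced = re.compile(r'\w+', re.I)
case_default = re.compile(r'\w+')

text = "Mixed CASE words_and_123"

# \w already covers upper- and lowercase letters, so re.I changes nothing here.
forced_result = case_forced.findall(text)
default_result = case_default.findall(text)
print(forced_result == default_result)  # True
print(default_result)                   # ['Mixed', 'CASE', 'words_and_123']
```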

Re: help make it faster please

2005-11-13 Thread Ron Adam
Fredrik Lundh wrote: Ron Adam wrote: The \w does make a small difference, but not as much as I expected. that's probably because your benchmark has a lot of dubious overhead: I think it does what the OP described, but that may not be what he really needs. Although the test to find

Re: help make it faster please

2005-11-12 Thread Sybren Stuvel
Bengt Richter enlightened us with: I suspect it's not possible to get '' in the list from somestring.split() Time to adjust your suspicions: ';abc;'.split(';') ['', 'abc', ''] countDict[w] += 1 else: countDict[w] = 1 does that beat
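The two behaviours under discussion can be checked directly. A minimal sketch in Python 3 syntax; the sample strings other than ';abc;' are made up for illustration:

```python
# split() with an explicit separator keeps empty strings at the edges.
with_sep = ';abc;'.split(';')

# split() with no argument splits on runs of whitespace and ignores
# leading/trailing whitespace, so it does not produce empty strings.
no_sep = '  abc   def  '.split()
only_space = '   '.split()

print(with_sep)    # ['', 'abc', '']
print(no_sep)      # ['abc', 'def']
print(only_space)  # []
```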

Re: help make it faster please

2005-11-12 Thread bearophileHUGS
Thank you Bengt Richter and Sybren Stuvel for your comments. My little procedure can be improved in many ways; it was just a quickly written first version (but it can be enough for basic usage). Bengt Richter: good way to prepare for split Maybe there is a better way, that is putting in

Re: help make it faster please

2005-11-12 Thread Bengt Richter
On Sat, 12 Nov 2005 10:46:53 +0100, Sybren Stuvel [EMAIL PROTECTED] wrote: Bengt Richter enlightened us with: I suspect it's not possible to get '' in the list from somestring.split() Time to adjust your suspicions: ';abc;'.split(';') ['', 'abc', ''] I know about that one ;-) I meant

Re: help make it faster please

2005-11-11 Thread Sion Arrowsmith
[EMAIL PROTECTED] wrote: Oh sorry, the indentation was messed up here... the wordlist = countDict.keys() wordlist.sort() should be outside the word loop now def create_words(lines): cnt = 0 spl_set = '[,;{}_?!():-[\.=+*\t\n\r]+' for content in lines: words=content.split()

Re: help make it faster please

2005-11-11 Thread Bengt Richter
On 10 Nov 2005 10:43:04 -0800, [EMAIL PROTECTED] wrote: This can be faster, it avoids doing the same things more times: from string import maketrans, ascii_lowercase, ascii_uppercase def create_words(afile): stripper = '[,;{}_?!():[]\.=+-*\t\n\r^%0123456789/ mapper = maketrans(stripper

help make it faster please

2005-11-10 Thread pkilambi
I wrote this function which does the following: after reading lines from a file, it splits them and finds the word occurrences through a hash table... for some reason this is quite slow. Can someone help me make it faster? f = open(filename) lines = f.readlines() def create_words(lines): cnt = 0

Re: help make it faster please

2005-11-10 Thread [EMAIL PROTECTED]
Why reload wordlist and sort it after each word is processed? It seems that it can be done after the for loop. [EMAIL PROTECTED] wrote: I wrote this function which does the following: after reading lines from a file, it splits them and finds the word occurrences through a hash table... for some reason

Re: help make it faster please

2005-11-10 Thread pkilambi
Oh sorry, the indentation was messed up here... the wordlist = countDict.keys() wordlist.sort() should be outside the word loop now def create_words(lines): cnt = 0 spl_set = '[,;{}_?!():-[\.=+*\t\n\r]+' for content in lines: words=content.split() countDict={}

Re: help make it faster please

2005-11-10 Thread [EMAIL PROTECTED]
I don't know your intent, so I have no idea what it is for. However, you are doing: wordlist=countDict.keys() wordlist.sort() for every word processed, yet you don't use the content of x in any way during the loop. Even if you need one fresh snapshot of countDict after each word, I don't see the need
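A minimal sketch of moving the sort out of the loop, in Python 3 syntax. The function name `count_words` and the sample input are made up for illustration, not taken from the original post:

```python
def count_words(lines):
    """Count words over all lines, then build the sorted word list once."""
    count_dict = {}
    for content in lines:
        for w in content.split():
            count_dict[w] = count_dict.get(w, 0) + 1
    # Sorting here, after the loops, runs once instead of once per word.
    wordlist = sorted(count_dict)
    return count_dict, wordlist


counts, words = count_words(["b a b", "c a"])
print(words)        # ['a', 'b', 'c']
print(counts['b'])  # 2
```

If a sorted snapshot really is needed after every word, re-sorting the full key list each time remains the expensive part.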

Re: help make it faster please

2005-11-10 Thread Lonnie Princehouse
You're making a new countDict for each line read from the file... is that what you meant to do? Or are you trying to count word occurrences across the whole file? -- In general, any time string manipulation is going slowly, ask yourself, "Can I use the re module for this?" # disclaimer: untested
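The code from this post is cut off in the archive, so the following is only a sketch of the general idea: one dictionary for the whole text, with the re module doing the splitting. The name `word_finder` and the character class are borrowed from elsewhere in the thread; lowercasing the words and the sample text are assumptions made for illustration:

```python
import re

# Word pattern: letters, digits, and underscore, matched case-insensitively.
word_finder = re.compile(r'[a-z0-9_]+', re.I)


def count_words_in_text(text):
    """Count word occurrences across the whole text with a single dict."""
    counts = {}
    for word in word_finder.findall(text):
        word = word.lower()  # assumption: treat 'The' and 'the' as one word
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "The cat; the hat!\nA cat_2 (the end)."
result = count_words_in_text(sample)
print(result['the'])  # 3
```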

Re: help make it faster please

2005-11-10 Thread bearophileHUGS
This can be faster, it avoids doing the same things more times: from string import maketrans, ascii_lowercase, ascii_uppercase def create_words(afile): stripper = '[,;{}_?!():[]\.=+-*\t\n\r^%0123456789/ mapper = maketrans(stripper + ascii_uppercase, *len(stripper)
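A rough Python 3 sketch of the translation-table idea. The code in the post targets Python 2 (`string.maketrans`) and is cut off, so the punctuation set, the space-padded second argument, and the function signature below are all assumptions rather than the poster's exact code:

```python
from string import ascii_lowercase, ascii_uppercase

# Hypothetical punctuation set; the one in the post is truncated.
stripper = ',;{}_?!():[].=+-*\t\n\r'

# Map each stripped character to a space and each uppercase letter to lowercase,
# so one translate() call does both jobs.
mapper = str.maketrans(stripper + ascii_uppercase,
                       ' ' * len(stripper) + ascii_lowercase)


def create_words(text):
    """Count words after a single translate() pass over the text."""
    counts = {}
    for w in text.translate(mapper).split():
        counts[w] = counts.get(w, 0) + 1
    return counts


word_counts = create_words("Hello, hello; World!")
print(word_counts)  # {'hello': 2, 'world': 1}
```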

Re: help make it faster please

2005-11-10 Thread Larry Bates
[EMAIL PROTECTED] wrote: I wrote this function which does the following: after reading lines from a file, it splits them and finds the word occurrences through a hash table... for some reason this is quite slow. Can someone help me make it faster? f = open(filename) lines = f.readlines() def

Re: help make it faster please

2005-11-10 Thread pkilambi
Actually, I create a separate wordlist for each so-called line. By "line" I mean what will be a paragraph in the future... so I will have to recreate the wordlist for each loop -- http://mail.python.org/mailman/listinfo/python-list

Re: help make it faster please

2005-11-10 Thread pkilambi
OK, this sounds much better. Could you tell me what to do if I want to keep characters like @ in words? I would like to treat @ as part of a word.

Re: help make it faster please

2005-11-10 Thread Lonnie Princehouse
The word_finder regular expression defines what will be considered a word. [a-z0-9_] means match a single character from the set {a through z, 0 through 9, underscore}. The + means match as many as you can, minimum of one. To match @ as well, add it to the set of characters to match:
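As a sketch of what adding @ to the set could look like (the exact pattern from the post is cut off in the archive; the `re.I` flag, the variable names, and the sample text are assumptions):

```python
import re

plain = re.compile(r'[a-z0-9_]+', re.I)
with_at = re.compile(r'[a-z0-9_@]+', re.I)  # '@' added to the character set

text = "mail bob@example now"
plain_words = plain.findall(text)
at_words = with_at.findall(text)
print(plain_words)  # ['mail', 'bob', 'example', 'now']
print(at_words)     # ['mail', 'bob@example', 'now']
```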

Re: help make it faster please

2005-11-10 Thread Fredrik Lundh
Lonnie Princehouse wrote: [a-z0-9_] means match a single character from the set {a through z, 0 through 9, underscore}. \w should be a bit faster; it's equivalent to [a-zA-Z0-9_] (unless you specify otherwise using the locale or unicode flags), but is handled more efficiently by the RE engine.
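The equivalence can be checked with a short sketch. One caveat for current Python: in Python 3, `\w` in a str pattern is Unicode-aware by default, so `re.ASCII` is passed here to get the `[a-zA-Z0-9_]` behaviour described above. The sample text is made up, and no timing is claimed:

```python
import re

explicit = re.compile(r'[a-zA-Z0-9_]+')
# re.ASCII restricts \w to [a-zA-Z0-9_]; without it, Python 3 str patterns
# also match non-ASCII letters and digits.
shorthand = re.compile(r'\w+', re.ASCII)

text = "Foo_bar baz42, QUX! done"
explicit_words = explicit.findall(text)
shorthand_words = shorthand.findall(text)
print(explicit_words == shorthand_words)  # True
print(shorthand_words)                    # ['Foo_bar', 'baz42', 'QUX', 'done']
```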