Bengt Richter enlightened us with:
I meant somestring.split() just like that -- without a splitter
argument. My suspicion remains ;-)
Mine too ;-)
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
Fredrik Lundh wrote:
Lonnie Princehouse wrote:
[a-z0-9_] means match a single character from the set {a through z,
0 through 9, underscore}.
\w should be a bit faster; it's equivalent to [a-zA-Z0-9_] (unless you
specify otherwise using the locale or unicode flags), but is handled more
efficiently by the RE engine.
Ron Adam wrote:
The \w does make a small difference, but not as much as I expected.
that's probably because your benchmark has a lot of dubious overhead:
word_finder = re.compile('[a-z0-9_@]+', re.I)
no need to force case-insensitive search here; \w looks for both lower-
and uppercase
Fredrik Lundh wrote:
Ron Adam wrote:
The \w does make a small difference, but not as much as I expected.
that's probably because your benchmark has a lot of dubious overhead:
I think it does what the OP described, but that may not be what he
really needs.
Although the test to find
Bengt Richter enlightened us with:
I suspect it's not possible to get '' in the list from
somestring.split()
Time to adjust your suspicions:
';abc;'.split(';')
['', 'abc', '']
if w in countDict:
    countDict[w] += 1
else:
    countDict[w] = 1
does that beat
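For reference, the if/else counting step above can also be written with
dict.get; a small self-contained sketch (count_words is an illustrative
name, not from the thread):

```python
def count_words(words):
    # build a word -> occurrence-count mapping in one pass
    countDict = {}
    for w in words:
        countDict[w] = countDict.get(w, 0) + 1
    return countDict

print(count_words(['spam', 'eggs', 'spam']))  # {'spam': 2, 'eggs': 1}
```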
Thank you Bengt Richter and Sybren Stuvel for your comments; my little
procedure can be improved in many ways, it was just a quickly written
first version (but it can be enough for basic usage).
Bengt Richter:
good way to prepare for split
Maybe there is a better way, that is putting in
On Sat, 12 Nov 2005 10:46:53 +0100, Sybren Stuvel [EMAIL PROTECTED] wrote:
Bengt Richter enlightened us with:
I suspect it's not possible to get '' in the list from
somestring.split()
Time to adjust your suspicions:
';abc;'.split(';')
['', 'abc', '']
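A quick check of the distinction being made here: split with an explicit
separator can produce empty strings, while the no-argument form splits on
runs of whitespace and never does:

```python
# explicit separator: empty fields around the delimiters are kept
print(';abc;'.split(';'))        # ['', 'abc', '']

# no argument: leading/trailing whitespace is dropped and runs of
# whitespace collapse, so '' can never appear in the result
print('  abc   def  '.split())   # ['abc', 'def']
```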
I know about that one ;-)
I meant somestring.split() just like that -- without a splitter argument.
[EMAIL PROTECTED] wrote:
Oh sorry, the indentation was messed up here... the
wordlist = countDict.keys()
wordlist.sort()
should be outside the word loop now
def create_words(lines):
    cnt = 0
    spl_set = '[,;{}_?!():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
On 10 Nov 2005 10:43:04 -0800, [EMAIL PROTECTED] wrote:
This can be faster, it avoids doing the same things more times:
from string import maketrans, ascii_lowercase, ascii_uppercase
def create_words(afile):
    stripper = '[,;{}_?!():[]\.=+-*\t\n\r^%0123456789/
    mapper = maketrans(stripper
I wrote this function which does the following:
after reading lines from a file. It splits and finds the word occurrences
through a hash table... for some reason this is quite slow. Can someone
help me make it faster?
f = open(filename)
lines = f.readlines()
def create_words(lines):
    cnt = 0
why reload wordlist and sort it after each word processing ? seems that
it can be done after the for loop.
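Sketched, the suggestion is to sort once after the counting loop finishes
rather than re-sorting inside it (the function wrapper and names are added
for illustration):

```python
def sorted_word_counts(words):
    countDict = {}
    for w in words:
        countDict[w] = countDict.get(w, 0) + 1
    # sort the keys once, after the loop, instead of once per word
    wordlist = sorted(countDict)
    return wordlist, countDict

print(sorted_word_counts(['b', 'a', 'b']))  # (['a', 'b'], {'b': 2, 'a': 1})
```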
[EMAIL PROTECTED] wrote:
I wrote this function which does the following:
after reading lines from a file. It splits and finds the word occurrences
through a hash table... for some reason
Oh sorry, the indentation was messed up here... the
wordlist = countDict.keys()
wordlist.sort()
should be outside the word loop now
def create_words(lines):
    cnt = 0
    spl_set = '[,;{}_?!():-[\.=+*\t\n\r]+'
    for content in lines:
        words = content.split()
        countDict = {}
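A runnable sketch of what the fragment above appears to be aiming at,
assuming counts should accumulate across all lines (so countDict is created
once, outside the loop):

```python
def create_words(lines):
    countDict = {}
    for content in lines:
        # whitespace split; punctuation handling is left out of this sketch
        for w in content.split():
            countDict[w] = countDict.get(w, 0) + 1
    wordlist = sorted(countDict)  # sort the words once, at the end
    return wordlist, countDict

print(create_words(['spam eggs', 'spam']))
# (['eggs', 'spam'], {'spam': 2, 'eggs': 1})
```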
I don't know your intent so have no idea what it is for. However, you are
doing:

wordlist = countDict.keys()
wordlist.sort()

for every word processed, yet you don't use the content of wordlist in any
way during the loop. Even if you need one fresh snapshot of countDict after
each word, I don't see the need
You're making a new countDict for each line read from the file... is
that what you meant to do? Or are you trying to count word occurrences
across the whole file?
--
In general, any time string manipulation is going slowly, ask yourself,
"Can I use the re module for this?"
# disclaimer: untested
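In that spirit, a sketch of a regex-based counter (equally untested against
the OP's data; the pattern and names are illustrative choices):

```python
import re

word_finder = re.compile(r'[a-z0-9_]+', re.IGNORECASE)

def count_words(text):
    countDict = {}
    for w in word_finder.findall(text):
        w = w.lower()  # fold case so 'Spam' and 'spam' count together
        countDict[w] = countDict.get(w, 0) + 1
    return countDict

print(count_words('Spam, spam; EGGS!'))  # {'spam': 2, 'eggs': 1}
```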
This can be faster, it avoids doing the same things more times:
from string import maketrans, ascii_lowercase, ascii_uppercase
def create_words(afile):
    stripper = '[,;{}_?!():[]\.=+-*\t\n\r^%0123456789/
    mapper = maketrans(stripper + ascii_uppercase,
                       ' '*len(stripper) + ascii_lowercase)
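The same translate-and-split idea in modern Python 3, where maketrans lives
on str; the stripper string below is an abbreviated stand-in for the
truncated original:

```python
import string

# characters to turn into spaces before splitting (illustrative subset)
stripper = ',;{}_?!():[]\\.=+-*\t\n\r^%0123456789/'
# one table maps punctuation/digits to spaces and uppercase to lowercase
mapper = str.maketrans(stripper + string.ascii_uppercase,
                       ' ' * len(stripper) + string.ascii_lowercase)

print('Hello, World! 123'.translate(mapper).split())  # ['hello', 'world']
```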
[EMAIL PROTECTED] wrote:
I wrote this function which does the following:
after reading lines from a file. It splits and finds the word occurrences
through a hash table... for some reason this is quite slow. Can someone
help me make it faster?
f = open(filename)
lines = f.readlines()
def
Actually I create a separate wordlist for each so-called line. Here a line
would be a paragraph in the future... so I will have to recreate the
wordlist for each loop.
--
http://mail.python.org/mailman/listinfo/python-list
OK, this sounds much better. Could you tell me what to do if I want to
leave characters like @ in words? I would like to consider it as part of
the word.
The word_finder regular expression defines what will be considered a
word.
[a-z0-9_] means match a single character from the set {a through z,
0 through 9, underscore}.
The + means match as many as you can, minimum of one
To match @ as well, add it to the set of characters to match:
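For instance (the sample text is invented for illustration):

```python
import re

# '@' added to the character class so it stays inside matched words
word_finder = re.compile(r'[a-z0-9_@]+', re.IGNORECASE)

print(word_finder.findall('mail me @home or @work'))
# ['mail', 'me', '@home', 'or', '@work']
```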
Lonnie Princehouse wrote:
[a-z0-9_] means match a single character from the set {a through z,
0 through 9, underscore}.
\w should be a bit faster; it's equivalent to [a-zA-Z0-9_] (unless you
specify otherwise using the locale or unicode flags), but is handled more
efficiently by the RE engine.
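A small check of both points: \w needs no re.I, and on ASCII text it
matches the same words as the explicit class:

```python
import re

text = 'Foo_99 bar BAZ!'
# \w with no flags already matches upper- and lowercase letters
print(re.findall(r'\w+', text))            # ['Foo_99', 'bar', 'BAZ']
# same result as the explicit character class on ASCII text
print(re.findall(r'[a-zA-Z0-9_]+', text))  # ['Foo_99', 'bar', 'BAZ']
```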