Re: How to find all the same words in a text?

2007-02-11 Thread Samuel Karl Peterson
Johny [EMAIL PROTECTED] on 10 Feb 2007 05:29:23 -0800 didst step
forth and proclaim thus:

> I need to find all the same words in a text.
> What would be the best idea to do that?

I make no claims of this being the best approach:


def findOccurances(a_string, word):
    """
    Given a string and a word, returns a double:
    [0] = count [1] = list of indexes where word occurs
    """
    import re
    count = 0
    indexes = []
    start = 0 # offset for successive passes
    pattern = re.compile(r'\b%s\b' % word, re.I)

    while True:
        match = pattern.search(a_string)
        if not match: break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)


Seems to work for me.  No guarantees.
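
One caveat with the pattern above: the word is interpolated into the regular expression as-is, so a search term containing regex metacharacters (a '.', for instance) would be treated as a pattern rather than as literal text. A minimal sketch of a possible variant, assuming re.escape is acceptable here; the name find_occurrences is only illustrative and is not part of the code posted above:

```python
import re

def find_occurrences(a_string, word):
    """Return (count, indexes) for whole-word, case-insensitive matches."""
    # re.escape makes any metacharacters in `word` match literally
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)
    # finditer yields one match object per non-overlapping match
    indexes = [match.start() for match in pattern.finditer(a_string)]
    return (len(indexes), indexes)

print(find_occurrences('45  324 45324', '324'))
print(find_occurrences('3.24 3x24 3.24', '3.24'))
```

Without the escaping step, the second call would also report '3x24', because an unescaped '.' matches any character.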

-- 
Sam Peterson
skpeterson At nospam ucdavis.edu
if programmers were paid to remove code instead of adding it,
software would be much better -- unknown
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to find all the same words in a text?

2007-02-11 Thread Neil Cerutti
On 2007-02-10, Johny [EMAIL PROTECTED] wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let suppose I want to find a number 324 in the text
>
> '45  324 45324'
>
> there is only one occurrence of 324 word but string.find() finds 2
> occurrences ( in 45324 too)
>
> Must I use regex?
> Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.
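
The three steps above can be sketched roughly as follows. This assumes, purely for illustration, that a "word" is any run of non-whitespace characters; the helper names words_with_offsets and find_word are hypothetical and not from the thread:

```python
import re

def words_with_offsets(text):
    """Steps 1 and 2: treat a word as a run of non-whitespace characters."""
    return [(m.group(), m.start()) for m in re.finditer(r'\S+', text)]

def find_word(text, target):
    """Step 3: compare whole words and return their character offsets."""
    return [offset for word, offset in words_with_offsets(text)
            if word == target]

print(words_with_offsets('45  324 45324'))
print(find_word('45  324 45324', '324'))
```

A different answer to "what is a word?" (for example, ignoring punctuation) only changes the pattern in the first helper.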

-- 
Neil Cerutti


Re: How to find all the same words in a text?

2007-02-11 Thread Maël Benjamin Mettler
In order to find all the words in a text, you need to tokenize it first.
The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:
http://nltk.sourceforge.net/lite/doc/en/words.html
A word of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.
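
To illustrate the tokenize-then-count idea and why tokenizing is less trivial than it looks, here is a small sketch using only the standard library rather than NLTK. Both tokenizers are simplifying assumptions, and the sample sentence is made up:

```python
import re

text = 'The cat sat. The cat ran.'

# Naive tokenization: split on whitespace; punctuation stays attached
naive_tokens = text.split()
# Slightly less naive (still an assumption): runs of word characters
regex_tokens = re.findall(r'\w+', text)

print(naive_tokens.count('cat'))
print(naive_tokens.count('sat'))   # 'sat.' is not the same token as 'sat'
print(regex_tokens.count('sat'))
```

The whitespace tokenizer misses 'sat' because the full stop is glued to it, which is the kind of complication mentioned above.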

Enjoy,

Maël

Neil Cerutti wrote:
> On 2007-02-10, Johny [EMAIL PROTECTED] wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let suppose I want to find a number 324 in the text
> >
> > '45  324 45324'
> >
> > there is only one occurrence of 324 word but string.find() finds 2
> > occurrences ( in 45324 too)
> >
> > Must I use regex?
> > Thanks for help
>
> The first thing to do is to answer the question: What is a word?
>
> The second thing to do is to design some code that can find
> words in strings.
>
> The last thing to do is to search those actual words for the word
> you're looking for.



Re: How to find all the same words in a text?

2007-02-11 Thread attn . steven . kuo
On Feb 11, 5:13 am, Samuel Karl Peterson
[EMAIL PROTECTED] wrote:
> Johny [EMAIL PROTECTED] on 10 Feb 2007 05:29:23 -0800 didst step
> forth and proclaim thus:
>
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
>
> I make no claims of this being the best approach:
>
> def findOccurances(a_string, word):
>     """
>     Given a string and a word, returns a double:
>     [0] = count [1] = list of indexes where word occurs
>     """
>     import re
>     count = 0
>     indexes = []
>     start = 0 # offset for successive passes
>     pattern = re.compile(r'\b%s\b' % word, re.I)
>
>     while True:
>         match = pattern.search(a_string)
>         if not match: break
>         count += 1
>         indexes.append(match.start() + start)
>         start += match.end()
>         a_string = a_string[match.end():]
>
>     return (count, indexes)
>
> Seems to work for me.  No guarantees.




More concisely:

import re

target_string = '45  324 45324'  # sample text from the original question
pattern = re.compile(r'\b324\b')
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices: %s" % indices)
print("Count: %d" % len(indices))

--
Cheers,
Steven



Re: How to find all the same words in a text?

2007-02-11 Thread Samuel Karl Peterson
[EMAIL PROTECTED] on 11 Feb 2007 08:16:11 -0800 didst step
forth and proclaim thus:

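> More concisely:
>
> import re
>
> target_string = '45  324 45324'  # sample text from the original question
> pattern = re.compile(r'\b324\b')
> indices = [match.start() for match in pattern.finditer(target_string)]
> print("Indices: %s" % indices)
> print("Count: %d" % len(indices))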
 

Thank you, this is educational.  I didn't realize that finditer
returned match objects instead of tuples.
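
For reference, a short sketch of what those match objects offer, using the sample text from the original question (the variable names are only illustrative):

```python
import re

pattern = re.compile(r'\b324\b')
target_string = '45  324 45324'  # sample text from the original question

for match in pattern.finditer(target_string):
    # each item is a match object, so positions are available directly
    print(match.group(), match.start(), match.end(), match.span())

# findall, by contrast, returns only the matched strings
strings = pattern.findall(target_string)
spans = [m.span() for m in pattern.finditer(target_string)]
```

So finditer is the one to reach for when positions matter, while findall is enough for counting.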

> Cheers,
> Steven

-- 
Sam Peterson
skpeterson At nospam ucdavis.edu
if programmers were paid to remove code instead of adding it,
software would be much better -- unknown


How to find all the same words in a text?

2007-02-10 Thread Johny
I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find but it does not work properly for words.
Let's suppose I want to find the number 324 in the text

'45  324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (in 45324 too).

Must I use a regex?
Thanks for your help
L.



Re: How to find all the same words in a text?

2007-02-10 Thread Marco Giusti
On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let suppose I want to find a number 324 in the text
>
> '45  324 45324'
>
> there is only one occurrence of 324 word but string.find() finds 2
> occurrences ( in 45324 too)

>>> '45  324 45324'.split().count('324')
1


ciao
marco

-- 
reply to `python -c print '[EMAIL PROTECTED]'[::-1]`



Re: How to find all the same words in a text?

2007-02-10 Thread Johny
On Feb 10, 2:42 pm, Marco Giusti [EMAIL PROTECTED] wrote:
> On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let suppose I want to find a number 324 in the text
> >
> > '45  324 45324'
> >
> > there is only one occurrence of 324 word but string.find() finds 2
> > occurrences ( in 45324 too)
>
> >>> '45  324 45324'.split().count('324')
> 1
>
> ciao

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?
Thanks.
L



Re: How to find all the same words in a text?

2007-02-10 Thread ZeD
Johny wrote:

> > > Let suppose I want to find a number 324 in the text
> > >
> > > '45  324 45324'
> > >
> > > there is only one occurrence of 324 word but string.find() finds 2
> > > occurrences ( in 45324 too)
> >
> > >>> '45  324 45324'.split().count('324')
> > 1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> [i for i, e in enumerate('45  324 45324'.split()) if e == '324']
[1]
 

-- 
Under construction


Re: How to find all the same words in a text?

2007-02-10 Thread Marco Giusti
On Sat, Feb 10, 2007 at 06:00:05AM -0800, Johny wrote:
> On Feb 10, 2:42 pm, Marco Giusti [EMAIL PROTECTED] wrote:
> > On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > > I need to find all the same words in a text.
> > > What would be the best idea to do that?
> > > I used string.find but it does not work properly for the words.
> > > Let suppose I want to find a number 324 in the text
> > >
> > > '45  324 45324'
> > >
> > > there is only one occurrence of 324 word but string.find() finds 2
> > > occurrences ( in 45324 too)
> >
> > >>> '45  324 45324'.split().count('324')
> > 1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each occurrence of the word. Is that possible?

>>> li = '45  324 45324'.split()
>>> li.index('324')
1
 

Play with count and index, and take a look at the help for both.
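
As a hint of how the two can be combined: list.index only reports the first hit, but it accepts an optional start argument, so it can be called repeatedly. A minimal sketch, with all_indexes as a made-up helper name and a slightly extended sample string so that the word occurs twice:

```python
def all_indexes(items, value):
    """Collect every list index at which value occurs, using list.index."""
    positions = []
    start = 0
    # items.count(value) is exactly how many times index() will succeed
    for _ in range(items.count(value)):
        position = items.index(value, start)
        positions.append(position)
        start = position + 1  # resume the search just past the last hit
    return positions

li = '45  324 45 324'.split()
print(all_indexes(li, '324'))
```

Note that these are positions in the list of words, not character offsets in the original string.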

ciao
marco

-- 
reply to `python -c print '[EMAIL PROTECTED]'[::-1]`



Re: How to find all the same words in a text?

2007-02-10 Thread Thorsten Kampe
* Johny (10 Feb 2007 05:29:23 -0800)
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let suppose I want to find a number 324 in the text
>
> '45  324 45324'
>
> there is only one occurrence of 324 word but string.find() finds 2
> occurrences ( in 45324 too)
>
> Must I use regex?

There are two approaches: one is the "solve once and forget" approach,
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus of this
approach is that you will have a /lot/ of problems that can be solved
with either one of these utils or a combination of them.

>>> a = '45  324 45324'
>>> quotient_set(part(a, [' ', '  '], 'sep'), ident)
{'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of words that are identical).
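
The part and quotient_set utilities used above are not shown in this thread, so the following is only a guess at what such helpers might look like, not the actual implementation. The signatures are assumptions (the third 'sep' argument is omitted, and a plain lambda stands in for ident):

```python
import re
from collections import defaultdict

def part(text, separators):
    """Hypothetical: split text on any separator, dropping empty pieces."""
    # try longer separators first so that '  ' wins over ' '
    ordered = sorted(separators, key=len, reverse=True)
    pattern = '|'.join(re.escape(sep) for sep in ordered)
    return [piece for piece in re.split(pattern, text) if piece]

def quotient_set(items, key):
    """Hypothetical: group items into equivalence classes by key(item)."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

a = '45  324 45324'
print(quotient_set(part(a, [' ', '  ']), lambda word: word))

# the changed problem: newline-separated, grouped by first character
b = 'apple\navocado\nbanana'
print(quotient_set(part(b, ['\n']), lambda word: word[0]))
```

Only the separator list and the key function change between the two calls, which is the flexibility being argued for.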


Thorsten