Re: How to find all the same words in a text?
Johny [EMAIL PROTECTED] on 10 Feb 2007 05:29:23 -0800 didst step forth
and proclaim thus:

> I need to find all the same words in a text.
> What would be the best idea to do that?

I make no claims of this being the best approach:

    def findOccurances(a_string, word):
        """
        Given a string and a word, returns a double:
        [0] = count [1] = list of indexes where word occurs
        """
        import re
        count = 0
        indexes = []
        start = 0 # offset for successive passes
        pattern = re.compile(r'\b%s\b' % word, re.I)
        while True:
            match = pattern.search(a_string)
            if not match: break
            count += 1
            indexes.append(match.start() + start)
            start += match.end()
            a_string = a_string[match.end():]
        return (count, indexes)

Seems to work for me. No guarantees.

--
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown
--
http://mail.python.org/mailman/listinfo/python-list
Re: How to find all the same words in a text?
On 2007-02-10, Johny [EMAIL PROTECTED] wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds
> 2 occurrences (in 45324 too).
>
> Must I use regex?
> Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.

--
Neil Cerutti
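Those three steps can be sketched roughly as follows, in Python 3. This assumes, purely for illustration, that a "word" is a maximal run of characters matched by a regular expression. The names find_words and find_word_positions and both example patterns are hypothetical, not from the thread.

```python
import re

def find_words(text, word_pattern=r"\w+"):
    """Step 2: return (offset, word) pairs for every word in text.

    Step 1, the definition of a word, is the word_pattern argument.
    """
    return [(m.start(), m.group()) for m in re.finditer(word_pattern, text)]

def find_word_positions(text, target, word_pattern=r"\w+"):
    """Step 3: character offsets of the words equal to target."""
    return [pos for pos, word in find_words(text, word_pattern) if word == target]

text = "45 324 45324 324-7"
# Words as runs of letters, digits and underscores: "324-7" holds the word 324.
print(find_word_positions(text, "324"))           # [3, 13]
# Words as runs of non-whitespace: "324-7" is one word, so it does not count.
print(find_word_positions(text, "324", r"\S+"))   # [3]
```

Which answer is right depends entirely on the definition chosen in step 1, which is why that question comes first.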
Re: How to find all the same words in a text?
In order to find all the words in a text, you need to tokenize it
first. The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:

http://nltk.sourceforge.net/lite/doc/en/words.html

A word of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.

Enjoy,

Maël

Neil Cerutti wrote:
> On 2007-02-10, Johny [EMAIL PROTECTED] wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let's suppose I want to find the number 324 in the text
> >
> > '45 324 45324'
> >
> > There is only one occurrence of the word 324, but string.find() finds
> > 2 occurrences (in 45324 too).
> >
> > Must I use regex?
> > Thanks for help
>
> The first thing to do is to answer the question: What is a word?
>
> The second thing to do is to design some code that can find
> words in strings.
>
> The last thing to do is to search those actual words for the word
> you're looking for.
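A minimal Python 3 sketch of tokenize-then-count, using a plain regular expression as a stand-in for a real tokenizer such as the NLTK one linked above. The tokenize helper and its pattern are assumptions made for illustration. The example also hints at why tokenization gets tricky once punctuation is involved.

```python
import re

def tokenize(text):
    """Hypothetical stand-in tokenizer: runs of letters, digits, underscores."""
    return re.findall(r"\w+", text)

text = "I paid 324, then 324 more, not 45324."

# Whitespace splitting leaves punctuation glued to the tokens.
print(text.split().count("324"))     # 1, because "324," is a different token
# Tokenizing first, then calling count on the token list.
print(tokenize(text).count("324"))   # 2
```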
Re: How to find all the same words in a text?
On Feb 11, 5:13 am, Samuel Karl Peterson [EMAIL PROTECTED] wrote:
> Johny [EMAIL PROTECTED] on 10 Feb 2007 05:29:23 -0800 didst step forth
> and proclaim thus:
>
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
>
> I make no claims of this being the best approach:
>
>     def findOccurances(a_string, word):
>         """
>         Given a string and a word, returns a double:
>         [0] = count [1] = list of indexes where word occurs
>         """
>         import re
>         count = 0
>         indexes = []
>         start = 0 # offset for successive passes
>         pattern = re.compile(r'\b%s\b' % word, re.I)
>         while True:
>             match = pattern.search(a_string)
>             if not match: break
>             count += 1
>             indexes.append(match.start() + start)
>             start += match.end()
>             a_string = a_string[match.end():]
>         return (count, indexes)
>
> Seems to work for me. No guarantees.

More concisely:

    import re

    pattern = re.compile(r'\b324\b')
    indices = [ match.start() for match in pattern.finditer(target_string) ]
    print "Indices", indices
    print "Count: ", len(indices)

--
Cheers,
Steven
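The thread's code uses Python 2 print statements. A rough Python 3 equivalent of the finditer approach, wrapped in a hypothetical helper named find_word, might look like this. The re.escape call is an addition not present in the posts above; it keeps regex metacharacters in the search word from being interpreted as pattern syntax.

```python
import re

def find_word(text, word, ignore_case=False):
    """Return the start offsets of whole-word occurrences of word in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b matches at a boundary between a word character and a non-word one,
    # so this assumes word itself starts and ends with a word character.
    pattern = re.compile(r"\b%s\b" % re.escape(word), flags)
    return [match.start() for match in pattern.finditer(text)]

indices = find_word("45 324 45324", "324")
print("Indices:", indices)      # Indices: [3]
print("Count:", len(indices))   # Count: 1
```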
Re: How to find all the same words in a text?
[EMAIL PROTECTED] on 11 Feb 2007 08:16:11 -0800 didst step forth and
proclaim thus:

> More concisely:
>
>     import re
>
>     pattern = re.compile(r'\b324\b')
>     indices = [ match.start() for match in pattern.finditer(target_string) ]
>     print "Indices", indices
>     print "Count: ", len(indices)

Thank you, this is educational. I didn't realize that finditer
returned match objects instead of tuples.

> Cheers,
> Steven

--
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown
How to find all the same words in a text?
I need to find all the same words in a text.
What would be the best idea to do that?
I used string.find but it does not work properly for the words.
Let's suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds
2 occurrences (in 45324 too).

Must I use regex?
Thanks for help
L.
Re: How to find all the same words in a text?
On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds
> 2 occurrences (in 45324 too).

    >>> '45 324 45324'.split().count('324')
    1

ciao
marco

--
reply to `python -c print '[EMAIL PROTECTED]'[::-1]`
Re: How to find all the same words in a text?
On Feb 10, 2:42 pm, Marco Giusti [EMAIL PROTECTED] wrote:
> On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > I need to find all the same words in a text.
> > What would be the best idea to do that?
> > I used string.find but it does not work properly for the words.
> > Let's suppose I want to find the number 324 in the text
> >
> > '45 324 45324'
> >
> > There is only one occurrence of the word 324, but string.find() finds
> > 2 occurrences (in 45324 too).
>
>     >>> '45 324 45324'.split().count('324')
>     1
>
> ciao

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each word's occurrence. Is that possible?
Thanks.
L
Re: How to find all the same words in a text?
Johny wrote:
> > > Let's suppose I want to find the number 324 in the text
> > >
> > > '45 324 45324'
> > >
> > > There is only one occurrence of the word 324, but string.find() finds
> > > 2 occurrences (in 45324 too).
> >
> >     >>> '45 324 45324'.split().count('324')
> >     1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each word's occurrence. Is that possible?

    >>> [i for i, e in enumerate('45 324 45324'.split()) if e=='324']
    [1]

--
Under construction
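The list comprehension above yields word indexes (324 is the second word, so index 1), not character offsets. If the position in the original string is what is needed, one possible sketch, in Python 3 and still treating words as whitespace-separated, is the following. The helper name word_offsets is made up for this example.

```python
import re

def word_offsets(text, word):
    """Return (word_index, char_offset) for each whitespace-separated token equal to word."""
    hits = []
    # For ordinary text, \S+ matches the same tokens that str.split()
    # with no arguments produces, but also records where each one starts.
    for index, match in enumerate(re.finditer(r"\S+", text)):
        if match.group() == word:
            hits.append((index, match.start()))
    return hits

print(word_offsets("45 324 45324 324", "324"))   # [(1, 3), (3, 13)]
```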
Re: How to find all the same words in a text?
On Sat, Feb 10, 2007 at 06:00:05AM -0800, Johny wrote:
> On Feb 10, 2:42 pm, Marco Giusti [EMAIL PROTECTED] wrote:
> > On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
> > > I need to find all the same words in a text.
> > > What would be the best idea to do that?
> > > I used string.find but it does not work properly for the words.
> > > Let's suppose I want to find the number 324 in the text
> > >
> > > '45 324 45324'
> > >
> > > There is only one occurrence of the word 324, but string.find() finds
> > > 2 occurrences (in 45324 too).
> >
> >     >>> '45 324 45324'.split().count('324')
> >     1
> >
> > ciao
>
> Marco,
> Thank you for your help.
> It works perfectly, but I forgot to say that I also need to find the
> position of each word's occurrence. Is that possible?

    >>> li = '45 324 45324'.split()
    >>> li.index('324')
    1

Play with count and index, and take a look at the help for both.

ciao
marco

--
reply to `python -c print '[EMAIL PROTECTED]'[::-1]`
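Note that list.index returns only the first match. A hedged Python 3 sketch of collecting every index by calling index repeatedly with a start argument (the helper name all_indexes is hypothetical):

```python
def all_indexes(items, target):
    """Return every index at which target occurs in the list items."""
    found = []
    start = 0
    while True:
        try:
            # list.index(value, start) searches from position start onward
            # and raises ValueError when there is no further match.
            position = items.index(target, start)
        except ValueError:
            return found
        found.append(position)
        start = position + 1

li = "45 324 45324 324".split()
print(li.count("324"))          # 2
print(all_indexes(li, "324"))   # [1, 3]
```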
Re: How to find all the same words in a text?
* Johny (10 Feb 2007 05:29:23 -0800)
> I need to find all the same words in a text.
> What would be the best idea to do that?
> I used string.find but it does not work properly for the words.
> Let's suppose I want to find the number 324 in the text
>
> '45 324 45324'
>
> There is only one occurrence of the word 324, but string.find() finds
> 2 occurrences (in 45324 too).
>
> Must I use regex?

There are two approaches: one is the "solve once and forget" approach,
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus of
this approach is that you will have a /lot/ of problems that can be
solved with either one of these utils or a combination of them.

    >>> a = '45 324 45324'
    >>> quotient_set(part(a, [' ', ' '], 'sep'), ident)
    {'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of being identical, as the criterion).

Thorsten
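The part, quotient_set and ident utilities are not shown in the post, so their actual implementations are unknown. As an illustration only, here is a guess at simplified versions in Python 3: split_on is a hypothetical stand-in for part(..., 'sep'), and the quotient_set below simply groups items under the value of a key function.

```python
from collections import defaultdict

def split_on(text, separators):
    """Hypothetical stand-in for part(): split text on any separator character."""
    pieces, current = [], []
    for char in text:
        if char in separators:
            if current:  # skip empty pieces between adjacent separators
                pieces.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        pieces.append("".join(current))
    return pieces

def quotient_set(items, key):
    """Group items into equivalence classes: items with equal key(item) share a list."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

def ident(x):
    """Identity function: two words are equivalent only if they are equal."""
    return x

print(quotient_set(split_on("45 324 45324 324", [" "]), ident))
# {'45': ['45'], '324': ['324', '324'], '45324': ['45324']}

# The changed problem: newline-separated words, grouped by first character.
print(quotient_set(split_on("apple\navocado\nbanana", ["\n"]), lambda w: w[0]))
# {'a': ['apple', 'avocado'], 'b': ['banana']}
```

The length of each list gives the count for that word, and swapping the separator list or the key function covers the variations described above without touching the rest of the code.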