Re: Regular Expressions: large amount of or's

2005-03-23 Thread Daniel Yoo
: Done.  'startpos' and other bug fixes are in Release 0.7:

: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ahocorasick-0.7.tar.gz

Ok, I stopped working on the Aho-Corasick module for a while, so I've
just bumped the version number to 0.8 and posted it up on PyPI.

I did add some preliminary code to use graphviz to emit DOT files, but
it's very untested code.  I also added an undocumented api for
inspecting the states and their transitions.
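For anyone curious what emitting DOT from states and transitions involves, here is a minimal sketch. It assumes the transitions are available as a plain dictionary of the form {state: {char: next_state}}; that layout and the name to_dot are made up for illustration and are not the module's undocumented api.

```python
def to_dot(transitions):
    """Render {state: {char: next_state}} as Graphviz DOT text.

    The dictionary layout is an assumption for this sketch, not the
    ahocorasick module's own representation.
    """
    lines = ["digraph automaton {"]
    for state in sorted(transitions):
        for ch, nxt in sorted(transitions[state].items()):
            # One labelled edge per transition.
            lines.append('    %d -> %d [label="%s"];' % (state, nxt, ch))
    lines.append("}")
    return "\n".join(lines)
```

Labels containing a double quote or backslash would need escaping before this output is fed to graphviz.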

I hope that the original poster finds it useful, even though it's
probably a bit late.


Hope this helps!
-- 
http://mail.python.org/mailman/listinfo/python-list


PyPI errors?

2005-03-20 Thread Daniel Yoo

Does anyone know why PyPI doesn't like my PKG-INFO file?  Here's
what I have:

##
mumak:~/work/aho/src/python/dist/ahocorasick-0.8 dyoo$ cat PKG-INFO 
Metadata-Version: 1.0
Name: ahocorasick
Version: 0.8
Summary: Aho-Corasick automaton implementation
Home-page: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
Author: Danny Yoo
Author-email: [EMAIL PROTECTED]
License: GPL
Download-URL: http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ahocorasick-0.8.tar.gz
Description: UNKNOWN
Platform: any
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Topic :: Text Editors :: Text Processing
##


Here's the error message I'm getting when I do a submit_form from the
PyPI web interface:

##
Internal Server Error

Traceback (most recent call last):
  File "/usr/local/pypi/lib/pypi/webui.py", line 115, in run
    self.inner_run()
  File "/usr/local/pypi/lib/pypi/webui.py", line 408, in inner_run
    getattr(self, action)()
  File "/usr/local/pypi/lib/pypi/webui.py", line 1148, in submit_pkg_info
    self.validate_metadata(data)
  File "/usr/local/pypi/lib/pypi/webui.py", line 1284, in validate_metadata
    map(versionpredicate.check_provision, data['provides'])
KeyError: provides
##


Thanks for any help!


Re: Regular Expressions: large amount of or's

2005-03-14 Thread Daniel Yoo
Scott David Daniels <[EMAIL PROTECTED]> wrote:

: I have a (very high speed) modified Aho-Corasick machine that I sell.
: The calling model that I found works well is:

:  def chases(self, sourcestream, ...):
:   '''A generator taking a generator of source blocks,
:   yielding (matches, position) pairs where position is an
:   offset within the "current" block.
:   '''

: You might consider taking a look at providing that form.


Hi Scott,

No problem, I'll be happy to do this.

I need some clarification on the calling model though.  Would this be
an accurate test case?

##
def testChasesInterface(self):
    self.tree.add("python")
    self.tree.add("is")
    self.tree.make()
    # Keep the blocks in a tuple so the expected values can refer to them.
    sourceBlocks = ("python programming is fun",
                    "how much is that python in the window")
    sourceStream = iter(sourceBlocks)
    self.assertEqual([
            (sourceBlocks[0], (0, 6)),
            (sourceBlocks[0], (19, 21)),
            (sourceBlocks[1], (9, 11)),
            (sourceBlocks[1], (17, 23)),
        ],
        list(self.tree.chases(sourceStream)))
##

Here, I'm assuming that chases() takes in a 'sourceStream', which is
an iterator of text blocks, and that the return value is itself an
iterator.
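To make that assumed calling model concrete, here is a rough sketch of a generator with the same shape. It uses the standard re module as a stand-in for the automaton, so it says nothing about how the ahocorasick module or Scott's machine work internally; the free function chases and its keywords parameter are made up for illustration.

```python
import re


def chases(keywords, source_stream):
    """Yield (block, (start, end)) for each keyword hit in each block.

    Stand-in built on re; only the calling shape is the point here.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    for block in source_stream:
        for match in pattern.finditer(block):
            # Offsets are relative to the current block.
            yield block, match.span()
```

Note that this sketch never finds a keyword that straddles two blocks; a real streaming matcher would carry its state across block boundaries.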


Best of wishes!


Re: Regular Expressions: large amount of or's

2005-03-14 Thread Daniel Yoo
Daniel Yoo <[EMAIL PROTECTED]> wrote:
: John Machin <[EMAIL PROTECTED]> wrote:


: : tree.search("I went to alpha beta the other day to pick up some spam")

: : could use a startpos (default=0) argument for efficiently restarting
: : the search after finding the first match

: Ok, that's easy to fix.  I'll do that tonight.

Done.  'startpos' and other bug fixes are in Release 0.7:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/ahocorasick-0.7.tar.gz

But I think I'd better hold off adding the ahocorasick package to PyPI
until it stabilizes for longer than a day... *grin*


Re: Regular Expressions: large amount of or's

2005-03-13 Thread Daniel Yoo
John Machin <[EMAIL PROTECTED]> wrote:


: tree.search("I went to alpha beta the other day to pick up some spam")

: could use a startpos (default=0) argument for efficiently restarting
: the search after finding the first match

Ok, that's easy to fix.  I'll do that tonight.
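As an illustration of the restart idea only, compiled patterns in the standard re module already take a start position in search(), so the loop John describes looks roughly like this. This is an analogy using re, not the ahocorasick interface; iter_matches is a made-up name.

```python
import re


def iter_matches(pattern, text):
    """Yield (start, end) spans, restarting the search after each match."""
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            return
        yield match.span()
        # Resume just past this match; step by one after an empty match
        # so the loop always makes progress.
        pos = match.end() if match.end() > match.start() else match.end() + 1
```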


Re: Regular Expressions: large amount of or's

2005-03-13 Thread Daniel Yoo
: Otherwise, you may want to look at a specialized data structure for
: doing multiple keyword matching; I had an older module that wrapped
: around a suffix tree:

:http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

: It looks like other folks, thankfully, have written other
: implementations of suffix trees:

:http://cs.haifa.ac.il/~shlomo/suffix_tree/

: Another approach is something called the Aho-Corasick algorithm:

:http://portal.acm.org/citation.cfm?doid=360825.360855

: though I haven't been able to find a nice Python module for this yet.


Followup on this: I haven't been able to find one, so I took someone
else's implementation and adapted it.  *grin*

Here you go:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

This provides an 'ahocorasick' Python C extension module for doing
matching on a set of keywords.  I'll start writing out the package
announcements tomorrow.


I hope this helps!


Re: Regular Expressions: large amount of or's

2005-03-01 Thread Daniel Yoo
Kent Johnson <[EMAIL PROTECTED]> wrote:

:> Given a string, I want to find all occurrences of
:> certain predefined words in that string. Problem is, the list of
:> words that should be detected can be in the order of thousands.
:> 
:> With the re module, this can be solved something like this:
:> 
:> import re
:> 
:> r = re.compile("word1|word2|word3|...|wordN")
:> r.findall(some_string)

The internal data structure that encodes that set of keywords is
probably humongous.  An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords".  This works if your keywords are
single words:

###
keywords = set([word1, word2, ...])
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###

Would this approach work for you?



Otherwise, you may want to look at a specialized data structure for
doing multiple keyword matching; I had an older module that wrapped
around a suffix tree:

http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other
implementations of suffix trees:

http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.
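For a feel of what the algorithm does, here is a small pure-Python sketch, assuming plain strings and nonempty keywords. It follows the textbook construction (a trie plus failure links) and is not taken from any existing module; the names build_automaton and find_all are made up for illustration.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto, fail, and output tables for a set of keywords."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    output = [[]]    # output[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Phase 2: breadth-first pass that fills in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # A state also reports keywords that end at its fallback state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Return (start, end, keyword) for every occurrence in text."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, i + 1, word))
    return matches
```

The text is scanned once, no matter how many keywords there are, which is the appeal over a giant alternation.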


Best of wishes to you!


Re: Google Technology

2005-03-01 Thread Daniel Yoo
[EMAIL PROTECTED] wrote:
: I am just wondering which technologies google is using for gmail and
: Google Groups???

Hello Vijay,

You may want to look at:

http://adaptivepath.com/publications/essays/archives/000385.php

which appears to collect a lot of introductory material about the
client-side Javascript techniques that those applications use.

The LivePage component of Nevow is a Python implementation that does a
lot of the heavy lifting for these kinds of applications:

http://nevow.com/

Best of wishes!


Re: Flushing print()

2005-02-24 Thread Daniel Yoo
gf gf <[EMAIL PROTECTED]> wrote:
: Is there any way to make Python's print() flush
: automatically, similar to...mmm...that other
: language's $|=1 ?


Hello gf gf,

Yes; you can use the '-u' command line option to Python, which will
turn off stdout/stderr buffering.


: If not, how can I flush it manually?  sys.stdout.flush() didn't
: seem to work.

Hmm, that's odd.  sys.stdout.flush() should do it.  How are you
testing that stdout isn't flushing as you expect?
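For reference, here is a minimal sketch of the manual approach: write, then flush the same stream right away. The helper name report is made up for illustration. The other route is to run the script unbuffered, as in "python -u script.py".

```python
import sys


def report(message, stream=sys.stdout):
    """Write one line and flush immediately so it shows up right away."""
    stream.write(message + "\n")
    stream.flush()


report("starting long computation...")
```

If output still seems delayed after a flush, the buffering may be happening further down the line, for example in a pipe or in the program reading the output.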


Best of wishes to you!


Re: Multidimensional arrays - howto?

2005-02-14 Thread Daniel Yoo
[EMAIL PROTECTED] wrote:
: Hello all,

: I am trying to convert some C code into python. Since i am new to
: python, i would like to know how to deal with multidimensional arrays?


Here you go:


http://python.org/doc/faq/programming.html#how-do-i-create-a-multidimensional-list


Also, if your table is relatively sparse, you might even be able to
use a dictionary, because a 2-d array can be considered as a mapping
between (index1, index2) keys and its values.
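A small sketch of the dictionary approach follows, assuming that missing cells should read as 0.0; the names table and get_cell are made up for illustration.

```python
# Sparse 2-d table: only the cells that were actually set take up space.
table = {}
table[(0, 3)] = 1.5
table[(2, 1)] = 4.0


def get_cell(table, row, col, default=0.0):
    """Look up a cell, falling back to a default for unset positions."""
    return table.get((row, col), default)
```

Here get_cell(table, 0, 3) gives back the stored 1.5, while any position that was never assigned gives the default.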


Re: [NewBie] Tut0r Thing

2005-02-14 Thread Daniel Yoo

: if you're talking about the "tutor at python.org" mailing list, it's a
: mailing list that you send mail to and get mails from, as explained on the
: tutor mailing list page:

:http://mail.python.org/mailman/listinfo/tutor


Hello,

Also, from the odd spelling of the subject line, I suspect that the
original poster thinks that we're some kind of cracking group.


I hope not, but just to make sure it's clear: when we in the Python
community talk about "hacking", we usually mean it in the constructive
sense: we like building software systems and helping people learn how
to program.

ESR has a good summary here:

http://www.catb.org/~esr/faqs/hacker-howto.html#what_is

I always feel silly about bringing this up, but it has to be said,
just to avoid any misunderstanding.


Best of wishes to you.


webbrowser._iscommand(): is there a public version?

2005-02-14 Thread Daniel Yoo
Hi everyone,


I was curious to know: does the functionality of webbrowser._iscommand()
live anywhere else in the Standard Library?

webbrowser._iscommand() is a helper function that searches through
PATH, and seems useful enough that I was surprised that it didn't live
in a more public place like os.path.
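To show the kind of helper I mean, here is a rough sketch of a PATH search written from scratch. It is not the code in webbrowser, and the name is_command is made up for illustration.

```python
import os


def is_command(name):
    """Return True if an executable file called name is found on PATH."""
    path = os.environ.get("PATH", os.defpath)
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Must be a regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return True
    return False
```

This ignores platform details such as executable file extensions on Windows.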

Thanks!