Re: Best practice for operations on streams of text

2009-05-17 Thread Beni Cherniavsky
On May 8, 12:07 am, MRAB goo...@mrabarnett.plus.com wrote:
 def compound_filter(token_stream):
     stream = lowercase_token(token_stream)
     stream = remove_boring(stream)
     stream = remove_dupes(stream)
     for t in stream:
         yield t

The last loop is superfluous.  You can just do::

def compound_filter(token_stream):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    return stream

which is simpler and slightly more efficient.  This works because, from
the caller's perspective, a generator function is just a function that
returns an iterator.  It doesn't matter whether it implements the
iterator itself by containing ``yield`` statements, or shamelessly
passes on an iterator implemented elsewhere.
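
For example, here is a small usage sketch (made-up tokens, assuming
MRAB's filter functions and some ``boring`` set are already defined)::

    tokens = ["The", "Dog", "barked", "at", "the", "DOG"]
    # The caller iterates either way; it cannot tell whether
    # compound_filter yields values itself or returns a chained iterator.
    for t in compound_filter(tokens):
        print(t)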


Best practice for operations on streams of text

2009-05-07 Thread James
Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the input,
filter out boring words, and filter out duplicates, chained together.

I originally had code like this:
for t0 in token_stream:
    for t1 in lowercase_token(t0):
        for t2 in remove_boring(t1):
            for t3 in remove_dupes(t2):
                yield t3

Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?

Thanks,
James


Re: Best practice for operations on streams of text

2009-05-07 Thread J Kenneth King
James rent.lupin.r...@gmail.com writes:

 Hello all,
 I'm working on some NLP code - what I'm doing is passing a large
 number of tokens through a number of filtering / processing steps.

 The filters take a token as input, and may or may not yield a token as
 a result. For example, I might have filters which lowercase the input,
 filter out boring words, and filter out duplicates, chained together.

 I originally had code like this:
 for t0 in token_stream:
     for t1 in lowercase_token(t0):
         for t2 in remove_boring(t1):
             for t3 in remove_dupes(t2):
                 yield t3

 Apart from being ugly as sin, I only get one token out as
 StopIteration is raised before the whole token stream is consumed.

 Any suggestions on an elegant way to chain together a bunch of
 generators, with processing steps in between?

 Thanks,
 James

Co-routines, my friends. Google will help you greatly in discovering
this processing wonder.
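
As a rough illustration (a minimal push-style sketch in the spirit of
Beazley's coroutine pipelines; the filter names below are made up, and
for this problem the pull-style generators shown elsewhere in the
thread are usually the simpler fit):

def coroutine(func):
    # Decorator: advance a new coroutine to its first yield so it is
    # ready to receive values via send().
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start

@coroutine
def lowercase(target):
    while True:
        token = (yield)
        target.send(token.lower())

@coroutine
def printer():
    while True:
        token = (yield)
        print(token)

pipeline = lowercase(printer())
for token in ["Hello", "WORLD"]:
    pipeline.send(token)   # prints "hello", then "world"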


Re: Best practice for operations on streams of text

2009-05-07 Thread Gary Herron

James wrote:

Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the input,
filter out boring words, and filter out duplicates, chained together.

I originally had code like this:
for t0 in token_stream:
    for t1 in lowercase_token(t0):
        for t2 in remove_boring(t1):
            for t3 in remove_dupes(t2):
                yield t3

Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?

Thanks,
James


David Beazley has a very interesting talk on using generators for
building and linking together individual stream filters.  It's very
cool and surprisingly eye-opening.


See "Generator Tricks for Systems Programmers" at
http://www.dabeaz.com/generators/


Gary Herron




Re: Best practice for operations on streams of text

2009-05-07 Thread MRAB

James wrote:

Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the input,
filter out boring words, and filter out duplicates, chained together.

I originally had code like this:
for t0 in token_stream:
    for t1 in lowercase_token(t0):
        for t2 in remove_boring(t1):
            for t3 in remove_dupes(t2):
                yield t3

Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?


What you should be doing is letting the filters accept an iterator and
yield values on demand:

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream):
    for t in stream:
        if t not in boring:
            yield t

def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            yield t
            seen.add(t)

def compound_filter(token_stream):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    for t in stream:
        yield t
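
For example (a quick usage sketch with a made-up 'boring' set and
token list):

boring = set(["the", "a", "an"])
tokens = ["The", "Cat", "sat", "on", "the", "CAT"]
print(list(compound_filter(tokens)))
# -> ['cat', 'sat', 'on']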


Re: Best practice for operations on streams of text

2009-05-07 Thread Terry Reedy

MRAB wrote:

James wrote:

Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might have filters which lowercase the input,
filter out boring words, and filter out duplicates, chained together.

I originally had code like this:
for t0 in token_stream:
    for t1 in lowercase_token(t0):
        for t2 in remove_boring(t1):
            for t3 in remove_dupes(t2):
                yield t3


For that to work at all, the three functions would have to turn each
token into an iterable of 0 or 1 tokens, so the inner 'loops' would
execute 0 or 1 times.  Better to return a token or None, and replace
the three inner 'loops' either with three conditional statements (ugly
too) or, less efficiently (due to the lack of short-circuiting), with:


t = remove_dupes(remove_boring(lowercase_token(t0)))
if t is not None: yield t
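
Spelled out with the conditional statements, that per-token style would
look roughly like this (a sketch only; the 'boring' set and the
module-level 'seen' set are assumptions made for illustration):

boring = set(["the", "a", "an"])   # assumed stop-word set
seen = set()                       # module-level only for this sketch

def lowercase_token(t):
    return t.lower()

def remove_boring(t):
    return None if t in boring else t

def remove_dupes(t):
    if t in seen:
        return None
    seen.add(t)
    return t

def filtered(token_stream):
    for t0 in token_stream:
        t = lowercase_token(t0)
        if t is not None:
            t = remove_boring(t)
        if t is not None:
            t = remove_dupes(t)
        if t is not None:
            yield t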


Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.


That puzzles me.  Your actual code must be slightly different from the
above, or the functions different from what I imagine them to be.  But
never mind, because



Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?


MRAB's suggestion is the way to go.  You automatically get
short-circuiting because each generator only gets what is passed on,
and resuming a generator is much faster than re-calling a function.



What you should be doing is letting the filters accept an iterator and
yield values on demand:

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream):
    for t in stream:
        if t not in boring:
            yield t

def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            yield t
            seen.add(t)

def compound_filter(token_stream):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    for t in stream:
        yield t


I also recommend the Beazley reference Herron gave.

tjr
