Re: Best practice for operations on streams of text
On May 8, 12:07 am, MRAB <goo...@mrabarnett.plus.com> wrote:
> def compound_filter(token_stream):
>     stream = lowercase_token(token_stream)
>     stream = remove_boring(stream)
>     stream = remove_dupes(stream)
>     for t in stream:
>         yield t

The last loop is superfluous. You can just do::

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        return stream

which is simpler and slightly more efficient. This works because, from the
caller's perspective, a generator is just a function that returns an
iterator. It doesn't matter whether it implements the iterator itself by
containing ``yield`` statements, or shamelessly passes on an iterator
implemented elsewhere.
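From the caller's side the two versions really are indistinguishable. A
minimal sketch of that equivalence (the gen_squares/passthrough_squares
functions here are made-up illustrations, not code from the thread):

    def gen_squares(nums):
        # A "real" generator: it contains yield.
        for n in nums:
            yield n * n

    def passthrough_squares(nums):
        # A plain function that returns an iterator built elsewhere.
        return (n * n for n in nums)

    # The caller cannot tell (and need not care) which style was used.
    print(list(gen_squares([1, 2, 3])))          # [1, 4, 9]
    print(list(passthrough_squares([1, 2, 3])))  # [1, 4, 9]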
Best practice for operations on streams of text
Hello all,

I'm working on some NLP code - what I'm doing is passing a large number
of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as a
result. For example, I might have filters which lowercase the input,
filter out boring words and filter out duplicates, chained together.

I originally had code like this:

    for t0 in token_stream:
        for t1 in lowercase_token(t0):
            for t2 in remove_boring(t1):
                for t3 in remove_dupes(t2):
                    yield t3

Apart from being ugly as sin, I only get one token out, as StopIteration
is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?

Thanks,
James
Re: Best practice for operations on streams of text
James <rent.lupin.r...@gmail.com> writes:
> Hello all,
>
> I'm working on some NLP code - what I'm doing is passing a large number
> of tokens through a number of filtering / processing steps.
>
> The filters take a token as input, and may or may not yield a token as
> a result. For example, I might have filters which lowercase the input,
> filter out boring words and filter out duplicates, chained together.
>
> I originally had code like this:
>
>     for t0 in token_stream:
>         for t1 in lowercase_token(t0):
>             for t2 in remove_boring(t1):
>                 for t3 in remove_dupes(t2):
>                     yield t3
>
> Apart from being ugly as sin, I only get one token out, as
> StopIteration is raised before the whole token stream is consumed.
>
> Any suggestions on an elegant way to chain together a bunch of
> generators, with processing steps in between?
>
> Thanks,
> James

Co-routines, my friends. Google will help you greatly in discovering this
processing wonder.
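For readers who don't want to dig through search results, here is a
minimal sketch of the push-style coroutine pipeline this post alludes to
(the filter names, the priming decorator, and the boring set are all
assumptions for illustration):

    def coroutine(func):
        # Advance a generator to its first yield so it can
        # immediately accept .send() calls.
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            next(cr)
            return cr
        return start

    @coroutine
    def lowercase_token(target):
        while True:
            tok = (yield)
            target.send(tok.lower())

    @coroutine
    def remove_boring(target, boring):
        while True:
            tok = (yield)
            if tok not in boring:
                target.send(tok)

    @coroutine
    def printer():
        while True:
            print((yield))

    # Tokens are pushed in at the head and flow towards printer().
    pipe = lowercase_token(remove_boring(printer(), {"the", "a"}))
    for tok in ["The", "Quick", "Brown", "Fox"]:
        pipe.send(tok)      # prints: quick, brown, fox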
Re: Best practice for operations on streams of text
James wrote:
> Hello all,
>
> I'm working on some NLP code - what I'm doing is passing a large number
> of tokens through a number of filtering / processing steps.
>
> The filters take a token as input, and may or may not yield a token as
> a result. For example, I might have filters which lowercase the input,
> filter out boring words and filter out duplicates, chained together.
>
> I originally had code like this:
>
>     for t0 in token_stream:
>         for t1 in lowercase_token(t0):
>             for t2 in remove_boring(t1):
>                 for t3 in remove_dupes(t2):
>                     yield t3
>
> Apart from being ugly as sin, I only get one token out, as
> StopIteration is raised before the whole token stream is consumed.
>
> Any suggestions on an elegant way to chain together a bunch of
> generators, with processing steps in between?
>
> Thanks,
> James

David Beazley has a very interesting talk on using generators for
building and linking together individual stream filters. It's very cool
and surprisingly eye-opening.

See "Generator Tricks for Systems Programmers" at
http://www.dabeaz.com/generators/

Gary Herron
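A tiny, self-contained example in the spirit of that talk, where each
stage is a generator expression that lazily consumes the previous one
(the sample log data is made up):

    import io

    log = io.StringIO("GET /a 200\nGET /b 404\nGET /c 200\n")

    lines  = (line.rstrip() for line in log)        # strip newlines
    fields = (line.split() for line in lines)       # tokenize each line
    bad    = (f for f in fields if f[2] != "200")   # keep non-200 hits

    for f in bad:
        print(f[1])      # prints: /b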
Re: Best practice for operations on streams of text
James wrote:
> Hello all,
>
> I'm working on some NLP code - what I'm doing is passing a large number
> of tokens through a number of filtering / processing steps.
>
> The filters take a token as input, and may or may not yield a token as
> a result. For example, I might have filters which lowercase the input,
> filter out boring words and filter out duplicates, chained together.
>
> I originally had code like this:
>
>     for t0 in token_stream:
>         for t1 in lowercase_token(t0):
>             for t2 in remove_boring(t1):
>                 for t3 in remove_dupes(t2):
>                     yield t3
>
> Apart from being ugly as sin, I only get one token out, as
> StopIteration is raised before the whole token stream is consumed.
>
> Any suggestions on an elegant way to chain together a bunch of
> generators, with processing steps in between?

What you should be doing is letting the filters accept an iterator and
yield values on demand:

    def lowercase_token(stream):
        for t in stream:
            yield t.lower()

    def remove_boring(stream):
        for t in stream:
            if t not in boring:
                yield t

    def remove_dupes(stream):
        seen = set()
        for t in stream:
            if t not in seen:
                yield t
                seen.add(t)

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        for t in stream:
            yield t
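A usage sketch for the pipeline above, assuming the definitions from this
post plus a boring set, which the post leaves undefined:

    boring = {"the", "a"}     # assumed: the post doesn't define `boring`

    tokens = ["The", "Cat", "saw", "a", "cat"]
    for t in compound_filter(tokens):
        print(t)
    # prints: cat, saw  (lowercased, boring words gone, duplicates gone)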
Re: Best practice for operations on streams of text
MRAB wrote:
> James wrote:
>> Hello all,
>>
>> I'm working on some NLP code - what I'm doing is passing a large
>> number of tokens through a number of filtering / processing steps.
>>
>> The filters take a token as input, and may or may not yield a token as
>> a result. For example, I might have filters which lowercase the input,
>> filter out boring words and filter out duplicates, chained together.
>>
>> I originally had code like this:
>>
>>     for t0 in token_stream:
>>         for t1 in lowercase_token(t0):
>>             for t2 in remove_boring(t1):
>>                 for t3 in remove_dupes(t2):
>>                     yield t3

For that to work at all, the three functions would have to turn each
token into an iterable of 0 or 1 tokens. Hence the inner 'loops' would
execute 0 or 1 times. Better to return a token or None, and replace the
three inner 'loops' with three conditional statements (ugly too) or, less
efficiently (due to the lack of short-circuiting),

    t = remove_dupes(remove_boring(lowercase_token(t0)))
    if t is not None:
        yield t

>> Apart from being ugly as sin, I only get one token out, as
>> StopIteration is raised before the whole token stream is consumed.

That puzzles me. Your actual code must be slightly different from the
above and what I imagine the functions to be. But never mind, because

>> Any suggestions on an elegant way to chain together a bunch of
>> generators, with processing steps in between?

MRAB's suggestion is the way to go. You automatically get
short-circuiting because each generator only gets what is passed on. And
resuming a generator is much faster than re-calling a function.

> What you should be doing is letting the filters accept an iterator and
> yield values on demand:
>
>     def lowercase_token(stream):
>         for t in stream:
>             yield t.lower()
>
>     def remove_boring(stream):
>         for t in stream:
>             if t not in boring:
>                 yield t
>
>     def remove_dupes(stream):
>         seen = set()
>         for t in stream:
>             if t not in seen:
>                 yield t
>                 seen.add(t)
>
>     def compound_filter(token_stream):
>         stream = lowercase_token(token_stream)
>         stream = remove_boring(stream)
>         stream = remove_dupes(stream)
>         for t in stream:
>             yield t

I also recommend the Beazley reference Herron gave.

tjr
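The pattern the thread converges on also generalizes into a single
helper. A sketch, where the pipeline name and the throwaway lambda stages
are assumptions rather than anything from the thread:

    def pipeline(source, *stages):
        # Thread `source` through each stream-in/stream-out stage in order.
        for stage in stages:
            source = stage(source)
        return source

    # With MRAB-style filters, compound_filter(ts) becomes:
    #     pipeline(ts, lowercase_token, remove_boring, remove_dupes)
    doubled_odds = pipeline(
        range(10),
        lambda s: (x for x in s if x % 2),   # keep odd numbers
        lambda s: (2 * x for x in s),        # double them
    )
    print(list(doubled_odds))                # [2, 6, 10, 14, 18]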