Re: Syncing up iterators with gaps

2016-09-28 Thread Steve D'Aprano
On Thu, 29 Sep 2016 11:38 am, Tim Chase wrote:

> This seems to discard the data's origin (data1/data2/data3) which is
> how I determine whether to use process_a(), process_b(), or
> process_c() in my original example where N iterators were returned,
> one for each input iterator.

So add another stage to your generator pipeline, one which adds a unique ID
to the output of each generator so you know where it came from.

Hint: the ID doesn't have to be an ID *number*. It can be the process_a,
process_b, process_c ... function itself. Then instead of doing:

for key, (id, stuff) in groupby(merge(data1, data2, data3), keyfunc):
    for x in stuff:
        if id == 1:
            process_a(key, *x)
        elif id == 2:
            process_b(key, *x)
        elif ...



or even:

DISPATCH = {1: process_a, 2: process_b, ...}

for key, (id, stuff) in groupby(merge(data1, data2, data3), keyfunc):
    for x in stuff:
        DISPATCH[id](key, *x)


you can do:


for key, (process, stuff) in groupby(merge(data1, data2, data3), keyfunc):
    for x in stuff:
        process(key, *x)
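
For instance, a runnable sketch of that tagging stage, assuming merge()
is heapq.merge with a key function (Python 3.5+) and the data1/data2/data3
lists and process_* callables from elsewhere in this thread:

from heapq import merge  # the key= argument needs Python 3.5+
from itertools import groupby
from operator import itemgetter

def tag(stream, process):
    # Re-emit each (key, value, ...) row as (key, process, values) so
    # the row's origin survives the merge.
    for key, *values in stream:
        yield key, process, tuple(values)

tagged = merge(tag(data1, process_a),
               tag(data2, process_b),
               tag(data3, process_c),
               key=itemgetter(0))

for key, group in groupby(tagged, key=itemgetter(0)):
    for _, process, values in group:
        process(key, *values)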





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Tim Chase
On 2016-09-29 10:20, Steve D'Aprano wrote:
> On Thu, 29 Sep 2016 05:10 am, Tim Chase wrote:
> >   data1 = [ # key, data1
> >     (1, "one A"),
> >     (1, "one B"),
> >     (2, "two"),
> >     (5, "five"),
> >     ]
> 
> So data1 has keys 1, 1, 2, 5.
> Likewise data2 has keys 1, 2, 3, 3, 3, 4 and data3 has keys 2, 4, 5.

Correct

> (data3 also has *two* values, not one, which is an additional
> complication.)

As commented towards the end, the source is a set of CSV files, so each
row is a list where a particular (identifiable) item is the key.
Assume that one can use something like get_key(row) to return the key,
which in the above could be implemented as

  get_key = lambda row: row[0]

and for my csv.DictReader data, would be something like

  get_key = lambda row: row["Account Number"]

> > And I'd like to do something like
> > 
> >   for common_key, d1, d2, d3 in magic_happens_here(data1, data2,
> > data3):
> 
> What's common_key? In particular, given that data1, data2 and data3
> have the first key each of 1, 1 and 2 respectively, how do you get:
> 
> > So in the above data, the outer FOR loop would
> > happen 5 times with common_key being [1, 2, 3, 4, 5]
> 
> I'm confused. Is common_key a *constant* [1, 2, 3, 4, 5] or are you
> saying that it iterates over 1, 2, 3, 4, 5?

Your later interpretation is correct: it iterates over each unique key
once, in order.  So if you

  data1.append((17, "seventeen"))

the outer loop would iterate over [1,2,3,4,5,17]

(so not constant, to hopefully answer that part of your question)

The actual keys are account numbers, so they're ASCII-sorted strings
of the form "1234567-8901", ascending in order through the files.
But for equality/less-than/greater-than comparisons, they work
effectively as integers in my example.

> If the latter, it sounds like you want something like a cross between
> itertools.groupby and the "merge" stage of mergesort.

That's a pretty good description at some level.  I looked into
groupby() but was having trouble getting it to do what I wanted.

> Note that I have modified data3 so instead of three columns, (key
> value value), it has two (key value) and value is a 2-tuple.

I'm cool with that.  Since they're CSV rows, you can imagine the
source data then as a generator something like

  data1 = ( (get_key(row), row) for row in my_csv_iter1 )

to get the data to look like your example input data.
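
Fleshed out, such a generator might look like the following sketch (the
keyed_rows() helper and its file handling are hypothetical, not my
actual code):

import csv

def keyed_rows(path, get_key=lambda row: row[0]):
    # Stream (key, row) pairs from one sorted CSV file without
    # loading the whole file into memory.
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield get_key(row), tuple(row)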

> So first you want an iterator that does an N-way merge:
> 
> merged = [(1, "one A"), (1, "one B"), (1, "uno"), 
>   (2, "two"), (2, "dos"), (2, ("ii", "extra alpha")), 
>   (3, "tres x"), (3, "tres y"), (3, "tres z"),
>   (4, "cuatro"), (4, ("iv", "extra beta")),
>   (5, "five"), (5, ("v", "extra gamma")),
>   ]

This seems to discard the data's origin (data1/data2/data3) which is
how I determine whether to use process_a(), process_b(), or
process_c() in my original example where N iterators were returned,
one for each input iterator.  So the desired output would be akin to
(converting everything to tuples as you suggest below)

  [
   (1, [("one A",), ("one B",)], [("uno",)], []),
   (2, [("two",)], [("dos",)], [("ii", "extra alpha")]),
   (3, [], [("tres x",), ("tres y",), ("tres z",)], []),
   (4, [], [("cuatro",)], [("iv", "extra beta")]),
   (5, [("five",)], [], [("v", "extra gamma")]),
   ]

only instead of N list()s, having N generators that are smart enough
to yield the corresponding data.

> You might find it easier to have *all* the iterators yield (key,
> tuple) pairs, where data1 and data2 yield a 1-tuple and data3
> yields a 2-tuple.

Right.  Sorry my example obscured that shoulda-obviously-been-used
simplification.

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Steve D'Aprano
On Thu, 29 Sep 2016 05:10 am, Tim Chase wrote:

> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
> 
>   data1 = [ # key, data1
> (1, "one A"),
> (1, "one B"),
> (2, "two"),
> (5, "five"),
> ]

So data1 has keys 1, 1, 2, 5.

Likewise data2 has keys 1, 2, 3, 3, 3, 4 and data3 has keys 2, 4, 5.

(data3 also has *two* values, not one, which is an additional complication.)

> And I'd like to do something like
> 
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):

What's common_key? In particular, given that data1, data2 and data3 have
first keys of 1, 1 and 2 respectively, how do you get:

> So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5]

I'm confused. Is common_key a *constant* [1, 2, 3, 4, 5] or are you saying
that it iterates over 1, 2, 3, 4, 5?


If the latter, it sounds like you want something like a cross between
itertools.groupby and the "merge" stage of mergesort. You have at least
three sorted(?) iterators representing the CSV files; let's say they
iterate over

data1 = [(1, "one A"), (1, "one B"), (2, "two"), (5, "five")]

data2 = [(1, "uno"), (2, "dos"), (3, "tres x"), (3, "tres y"), 
 (3, "tres z"), (4, "cuatro")]

data3 = [ (2, ("ii", "extra alpha")), (4, ("iv", "extra beta")),
  (5, ("v", "extra gamma"))]


Note that I have modified data3 so instead of three columns, (key value
value), it has two (key value) and value is a 2-tuple.

So first you want an iterator that does an N-way merge:

merged = [(1, "one A"), (1, "one B"), (1, "uno"), 
  (2, "two"), (2, "dos"), (2, ("ii", "extra alpha")), 
  (3, "tres x"), (3, "tres y"), (3, "tres z"),
  (4, "cuatro"), (4, ("iv", "extra beta")),
  (5, "five"), (5, ("v", "extra gamma")),
  ]

and then you can simply call itertools.groupby to group by the common keys:

1: ["one A", "one B", "uno"]
2: ["two", "dos", ("ii", "extra alpha")]
3: ["tres x", "tres y", "tres z"]
4: ["cuatro", ("iv", "extra beta")]
5: ["five", ("v", "extra gamma")]

and then you can extract the separate columns from each value.

You might find it easier to have *all* the iterators yield (key, tuple)
pairs, where data1 and data2 yield a 1-tuple and data3 yields a 2-tuple.


If you look on ActiveState, I'm pretty sure you will find a recipe from
Raymond Hettinger for a merge sort or heap sort or something along those
lines, which you can probably adapt for an arbitrary number of inputs.
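
As it happens, the standard library already ships an N-way merge:
heapq.merge. A minimal sketch of the merge-then-group pipeline, assuming
every input yields sorted (key, tuple) pairs as suggested above (the
key= argument needs Python 3.5+):

from heapq import merge
from itertools import groupby
from operator import itemgetter

data1 = [(1, ("one A",)), (1, ("one B",)), (2, ("two",)), (5, ("five",))]
data2 = [(1, ("uno",)), (2, ("dos",)), (3, ("tres x",)),
         (3, ("tres y",)), (3, ("tres z",)), (4, ("cuatro",))]
data3 = [(2, ("ii", "extra alpha")), (4, ("iv", "extra beta")),
         (5, ("v", "extra gamma"))]

merged = merge(data1, data2, data3, key=itemgetter(0))
for key, group in groupby(merged, key=itemgetter(0)):
    # Each group holds every (key, value) pair that shares this key.
    print(key, [value for _, value in group])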




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Peter Otten
Tim Chase wrote:

> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
> 
>   data1 = [ # key, data1
>     (1, "one A"),
>     (1, "one B"),
>     (2, "two"),
>     (5, "five"),
>     ]
> 
>   data2 = [ # key, data1
>     (1, "uno"),
>     (2, "dos"),
>     (3, "tres x"),
>     (3, "tres y"),
>     (3, "tres z"),
>     (4, "cuatro"),
>     ]
> 
>   data3 = [ # key, data1, data2
>     (2, "ii", "extra alpha"),
>     (4, "iv", "extra beta"),
>     (5, "v", "extra gamma"),
>     ]
> 
> And I'd like to do something like
> 
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>     for row in d1:
>       process_a(common_key, row)
>     for row in d2:
>       process_b(common_key, row)
>     for row in d3:
>       process_c(common_key, row)
> 
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same).  So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.
> 
> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit.  So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
> 
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.
> 
> Thanks for any hints.
> 
> -tkc

A bit messy; you might try replacing the groups list with a dict:

$ cat merge.py  
from itertools import groupby
from operator import itemgetter

first = itemgetter(0)
rest = itemgetter(slice(1, None))


def magic_happens_here(*columns):
    grouped = [groupby(column, key=first) for column in columns]
    missing = object()  # sentinel key for an exhausted column

    def getnext(g):
        # Advance one groupby iterator, decrementing the live count
        # when it runs dry.
        nonlocal n
        try:
            k, g = next(g)
        except StopIteration:
            n -= 1
            return (missing, None)
        return k, g

    n = len(grouped)  # number of columns not yet exhausted
    groups = [getnext(g) for g in grouped]
    while n:
        minkey = min(k for k, g in groups if k is not missing)
        yield (minkey,) + tuple(
            map(rest, g) if k == minkey else ()
            for k, g in groups)
        for i, (k, g) in enumerate(groups):
            if k == minkey:
                groups[i] = getnext(grouped[i])


if __name__ == "__main__":
    data1 = [  # key, data1
        (1, "one A"),
        (1, "one B"),
        (2, "two"),
        (5, "five"),
    ]

    data2 = [  # key, data1
        (1, "uno"),
        (2, "dos"),
        (3, "tres x"),
        (3, "tres y"),
        (3, "tres z"),
        (4, "cuatro"),
    ]

    data3 = [  # key, data1, data2
        (2, "ii", "extra alpha"),
        (4, "iv", "extra beta"),
        (5, "v", "extra gamma"),
    ]

    for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
        print(common_key)
        for d in d1, d2, d3:
            print("", list(d))
$ python3 merge.py 
1
 [('one A',), ('one B',)]
 [('uno',)]
 []
2
 [('two',)]
 [('dos',)]
 [('ii', 'extra alpha')]
3
 []
 [('tres x',), ('tres y',), ('tres z',)]
 []
4
 []
 [('cuatro',)]
 [('iv', 'extra beta')]
5
 [('five',)]
 []
 [('v', 'extra gamma')]


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Chris Kaynor
Here is a slight variation of Chris A's code that does not require
more than a single look-ahead per generator. It may be better
depending on the exact data passed in.

Chris A's version will store all of the items for each output that
have a matching key, which, depending on the expected data, could use
quite a bit of memory. This version yields a list of generators, which
then allows for never having more than a single lookahead per list.
The generators returned must be consumed immediately or they will be
emptied - I put in a safety loop that consumes them before continuing
processing.

My version is likely better if your processing does not require
storing (most of) the items and you expect there to be a large number
of common keys in each iterator. If you expect only a couple of items
per shared key per list, Chris A's version will probably perform
better for slightly more memory usage, as well as being somewhat safer
and simpler.

def magic_happens_here(*iters):
    def gen(j):
        while nexts[j][0] == common_key:
            yield nexts[j]
            nexts[j] = next(iters[j], (None,))
    iters = [iter(it) for it in iters]
    nexts = [next(it, (None,)) for it in iters]
    while "moar stuff":
        try: common_key = min(row[0] for row in nexts if row[0])
        except ValueError: break # No moar stuff
        outputs = [common_key]
        for i in range(len(nexts)): # code smell, sorry
            outputs.append(gen(i))
        yield outputs
        # The following three lines confirm that the generators provided
        #  were consumed. This allows not exhausting the yielded generators.
        #  If this is not included, and the iterator is not consumed, it can
        #  result in an infinite loop.
        for output in outputs[1:]:
            for item in output:
                pass
Chris


On Wed, Sep 28, 2016 at 12:48 PM, Chris Angelico  wrote:
> On Thu, Sep 29, 2016 at 5:10 AM, Tim Chase
>  wrote:
>> And I'd like to do something like
>>
>>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>>     for row in d1:
>>       process_a(common_key, row)
>>     for row in d2:
>>       process_b(common_key, row)
>>     for row in d3:
>>       process_c(common_key, row)
>
> Assuming that the keys are totally ordered and the data sets are
> sorted, this should work:
>
> def magic_happens_here(*iters):
>     iters = [iter(it) for it in iters]
>     nexts = [next(it, (None,)) for it in iters]
>     while "moar stuff":
>         try: common_key = min(row[0] for row in nexts if row[0])
>         except ValueError: break # No moar stuff
>         outputs = [common_key]
>         for i in range(len(nexts)): # code smell, sorry
>             output = []
>             while nexts[i][0] == common_key:
>                 output.append(nexts[i])
>                 nexts[i] = next(iters[i], (None,))
>             outputs.append(output)
>         yield outputs
>
> Basically, it takes the lowest available key, then takes everything of
> that key and yields it as a unit.
>
> Code not tested. Use at your own risk.
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Chris Angelico
On Thu, Sep 29, 2016 at 5:10 AM, Tim Chase
 wrote:
> And I'd like to do something like
>
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>     for row in d1:
>       process_a(common_key, row)
>     for row in d2:
>       process_b(common_key, row)
>     for row in d3:
>       process_c(common_key, row)

Assuming that the keys are totally ordered and the data sets are
sorted, this should work:

def magic_happens_here(*iters):
    iters = [iter(it) for it in iters]
    nexts = [next(it, (None,)) for it in iters]
    while "moar stuff":
        try: common_key = min(row[0] for row in nexts if row[0])
        except ValueError: break # No moar stuff
        outputs = [common_key]
        for i in range(len(nexts)): # code smell, sorry
            output = []
            while nexts[i][0] == common_key:
                output.append(nexts[i])
                nexts[i] = next(iters[i], (None,))
            outputs.append(output)
        yield outputs

Basically, it takes the lowest available key, then takes everything of
that key and yields it as a unit.
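
Traced by hand on the sample data (untested, like the code), the first
yield would be:

[1, [(1, 'one A'), (1, 'one B')], [(1, 'uno')], []]

Note the rows keep their keys, so the processing loops would need to
skip or strip column 0.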

Code not tested. Use at your own risk.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Syncing up iterators with gaps

2016-09-28 Thread Terry Reedy

On 9/28/2016 3:10 PM, Tim Chase wrote:

> I've got several iterators sharing a common key in the same order and
> would like to iterate over them in parallel, operating on all items
> with the same key.  I've simplified the data a bit here, but it would
> be something like
> 
>   data1 = [ # key, data1
>     (1, "one A"),
>     (1, "one B"),
>     (2, "two"),
>     (5, "five"),
>     ]
> 
>   data2 = [ # key, data1
>     (1, "uno"),
>     (2, "dos"),
>     (3, "tres x"),
>     (3, "tres y"),
>     (3, "tres z"),
>     (4, "cuatro"),
>     ]
> 
>   data3 = [ # key, data1, data2
>     (2, "ii", "extra alpha"),
>     (4, "iv", "extra beta"),
>     (5, "v", "extra gamma"),
>     ]
> 
> And I'd like to do something like
> 
>   for common_key, d1, d2, d3 in magic_happens_here(data1, data2, data3):
>     for row in d1:
>       process_a(common_key, row)
>     for row in d2:
>       process_b(common_key, row)
>     for row in d3:
>       process_c(common_key, row)
> 
> which would yield the common_key, along with enough of each of those
> iterators (note that gaps can happen, but the sortable order should
> remain the same).  So in the above data, the outer FOR loop would
> happen 5 times with common_key being [1, 2, 3, 4, 5], and each of
> [d1, d2, d3] being an iterator that deals with just that data.


You just need d1, d2, d3 to be iterables, such as lists.  Write a magic
generator that opens the three files and reads one line of each (with
next()).  Then, in a while True loop, find the minimum key and make 3
lists (up to 2 of them possibly empty) of the items in each file with
that key.  This will require up to 3 inner loops.  The read-ahead makes
this slightly messy.  If any list is not empty, yield the key and the 3
lists.  Otherwise break the outer loop.
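
A minimal sketch of that approach (the merge_by_key() name is
illustrative; get_key() is the hypothetical key extractor Tim described,
and the readers can be csv.reader objects over the sorted files):

def merge_by_key(*readers, get_key=lambda row: row[0]):
    DONE = object()  # sentinel: this reader is exhausted
    heads = [next(r, DONE) for r in readers]  # one-row read-ahead each
    while any(h is not DONE for h in heads):
        key = min(get_key(h) for h in heads if h is not DONE)
        groups = []
        for i, reader in enumerate(readers):
            group = []
            while heads[i] is not DONE and get_key(heads[i]) == key:
                group.append(heads[i])
                heads[i] = next(reader, DONE)
            groups.append(group)
        yield (key, *groups)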



> My original method was hauling everything into memory and making
> multiple passes filtering on the data. However, the actual sources
> are CSV-files, some of which are hundreds of megs in size, and my
> system was taking a bit of a hit.  So I was hoping for a way to do
> this with each iterator making only one complete pass through each
> source (since they're sorted by common key).
> 
> It's somewhat similar to the *nix "join" command, only dealing with
> N files.


It is also somewhat similar to a 3-way mergesort.

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list