Thanks for linking to these. I looked at many of them in my own research, but for some reason didn't think to write down the links. I'll respond to each one separately.
Throughout, I'm going to use my proposed ``grouped`` builtin to demonstrate possible revisions. Note that I am *not* suggesting a replacement for defaultdict. The proposal is to make a common task easier and more reliable. It does not satisfy all uses of defaultdict.

https://github.com/selik/peps/blob/master/pep-9999.rst

On Fri, Jul 13, 2018 at 6:17 AM Nicolas Rolin <nicolas.ro...@tiime.fr> wrote:

> I noticed recently that *all* examples for collection.defaultdict
> (https://docs.python.org/3.7/library/collections.html#collections.defaultdict)
> are cases of grouping (for an int, a list and a set) from an iterator
> with a key, value output.
>
> I wondered how common those constructions were, and what are defaultdict
> used for else. So I took a little dive into a few libs to see it (std
> lib, pypy, pandas, tensorflow, ..), and I saw essentially :
>
> A) basic cases of "grouping" with a simple for loop and a
> default_dict[key].append(value). I saw many kind of default factory
> utilized, with list, int, set, dict, and even defaultdict(list). ex :
> https://frama.link/UtNqvpvb,

    facesizes = collections.defaultdict(int)
    for face in cubofaces:
        facesizes[len(face)] += 1

This is more specifically a Counter. It might look better as:

    facesizes = collections.Counter(len(face) for face in cubofaces)

> https://frama.link/o3Hb3-4U,

    accum = defaultdict(list)
    garbageitems = []
    for item in root:
        filename = findfile(opts.fileroot, item.attrib['classname'])
        accum[filename].append(float(item.attrib['time']))
        if filename is None:
            garbageitems.append(item)

This might be more clear if separated into two parts.

    def keyfunc(item):
        return findfile(opts.fileroot, item.attrib['classname'])

    groups = grouped(root, keyfunc)
    groups = {k: [float(v.attrib['time']) for v in g]
              for k, g in groups.items()}
    garbage = groups.pop(None, [])

Interestingly, this code later makes a distinction between ``garbage`` and ``garbageitems`` even though they appear to be the same.
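In case it helps to keep the semantics concrete while reading the revisions below, here is a toy sketch of the dict-of-lists behavior I have in mind for ``grouped``. The draft PEP linked above is authoritative; this is only an illustration, not the exact signature.

```python
def grouped(iterable, key):
    """Toy sketch of the proposed ``grouped``: collect elements of
    ``iterable`` into a plain dict of lists, keyed by ``key(element)``."""
    groups = {}
    for element in iterable:
        groups.setdefault(key(element), []).append(element)
    return groups

# Example: group words by their first letter.
words = ['apple', 'avocado', 'banana', 'cherry', 'cantaloupe']
by_letter = grouped(words, key=lambda w: w[0])
# {'a': ['apple', 'avocado'], 'b': ['banana'], 'c': ['cherry', 'cantaloupe']}
```

Note the result is a regular dict, so missing keys raise KeyError afterward, which is exactly the behavior several of the examples below work to restore by converting their defaultdict at the end.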
The programmer might not have made that error (?) if using the ``grouped`` function, because the act of grouping becomes single-purpose.

> https://frama.link/dw92yJ1q

In this case, the grouping is being performed by Pandas, not by the defaultdict.

    expected = defaultdict(dict)
    for n1, gp1 in data.groupby('A'):
        for n2, gp2 in gp1.groupby('B'):
            expected[n1][n2] = op(gp2.loc[:, ['C', 'D']])

However, this is still relevant to our discussion, because the defaultdict had to be converted to a regular dict afterward, presumably to ensure that missing keys cause KeyErrors.

    expected = dict((k, DataFrame(v))
                    for k, v in compat.iteritems(expected))

If you felt like one-lining this, it might look like

    expected = {n1: DataFrame({n2: op(gp2.loc[:, ['C', 'D']])
                               for n2, gp2 in gp1.groupby('B')})
                for n1, gp1 in data.groupby('A')}

I'm not saying that's more clear, but the fact that it's possible emphasizes that the grouping is performed by Pandas. Grouping cannot be one-lined if restricted to the Python standard library.

> https://frama.link/1Gqoa7WM

    results = defaultdict(dict)
    for i, k1 in enumerate(arg1.columns):
        for j, k2 in enumerate(arg2.columns):
            if j < i and arg2 is arg1:
                # Symmetric case
                results[i][j] = results[j][i]
            else:
                results[i][j] = f(*_prep_binary(arg1.iloc[:, i],
                                                arg2.iloc[:, j]))

The nested loop looks like a cross-product. It might be clearer using itertools, but I wouldn't use ``grouped`` here.

    cross = itertools.product(enumerate(arg1.columns),
                              enumerate(arg2.columns))

> https://frama.link/bWswbHsU,

    self.mapping = collections.defaultdict(set)
    for op in (op for op in graph.get_operations()):
        if op.name.startswith(common.SKIPPED_PREFIXES):
            continue
        for op_input in op.inputs:
            self.mapping[op_input].add(op)

This is a case of a single element being added to multiple groups, which is your section B, below. The loop and filter could be better. It looks like someone intended to convert if/continue to a comprehension, but stopped partway through the revision.
    ops = (op for op in graph.get_operations()
           if not op.name.startswith(common.SKIPPED_PREFIXES))

I'm not sure how much I like it, but here's a revision to use ``grouped``:

    inputs = ((op_input, op) for op in ops for op_input in op.inputs)
    groups = grouped(inputs, key=itemgetter(0))
    groups = {k: set(v for _, v in g) for k, g in groups.items()}

> https://frama.link/SZh2q8pS

    handler_results = collections.defaultdict(list)
    for handler in per_handler_ops.keys():
        for op in per_handler_ops[handler]:
            handler_results[handler].append(op_results[op])
    return handler_results

Here it looks like the ops are already grouped by handler, and this loop is constructing a parallel dict of lists holding the results instead of the ops themselves.

    return {handler: [op_results[op] for op in group]
            for handler, group in per_handler_ops.items()}

> B) cases of grouping, but where the for loop used was alimenting more
> than one "grouper". pretty annoying if we want to group something. ex:
> https://frama.link/Db-Ny49a,

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        head_nodes = collections.defaultdict(list)
        every = []
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads = _get_head_types(fixer.pattern)
                except _EveryNode:
                    every.append(fixer)
                else:
                    for node_type in heads:
                        head_nodes[node_type].append(fixer)
            else:
                if fixer._accept_type is not None:
                    head_nodes[fixer._accept_type].append(fixer)
                else:
                    every.append(fixer)
        for node_type in chain(pygram.python_grammar.symbol2number.values(),
                               pygram.python_grammar.tokens):
            head_nodes[node_type].extend(every)
        return dict(head_nodes)

Again this ends with a conversion from defaultdict to basic dict: ``return dict(head_nodes)``. This is somewhat common in usage of defaultdict, because after the creation of groups, the later usage wants missing keys to raise KeyError, not insert a new, empty group.
The association of a single element to multiple groups makes this a little awkward again.

    def _get_headnode_dict(fixer_list):
        """ Accepts a list of fixers and returns a dictionary
            of head node type --> fixer list.  """
        nodetypes = set(chain(pygram.python_grammar.symbol2number.values(),
                              pygram.python_grammar.tokens))
        heads = {}
        for fixer in fixer_list:
            if fixer.pattern:
                try:
                    heads[fixer] = _get_head_types(fixer.pattern)
                except _EveryNode:
                    heads[fixer] = nodetypes
            elif fixer._accept_type is not None:
                heads[fixer] = [fixer._accept_type]
            else:
                heads[fixer] = nodetypes
        pairs = ((head, fixer)
                 for fixer, types in heads.items()
                 for head in types)
        groups = grouped(pairs, key=itemgetter(0))
        return {head: [fixer for _, fixer in g]
                for head, g in groups.items()}

> https://frama.link/bZakUR33,

    directories = defaultdict(set)
    cache_directories = defaultdict(set)
    groups = defaultdict(list)
    for source, target, group, disk_id, condition in files:
        target = PureWindowsPath(target)
        groups[group].append((source, target, disk_id, condition))
        if target.suffix.lower() in {".py", ".pyw"}:
            cache_directories[group].add(target.parent)
        for dirname in target.parents:
            parent = make_id(dirname.parent)
            if parent and parent != '.':
                directories[parent].add(dirname.name)

This one has 3 different groupings created simultaneously. We could separate the 3 tasks, but they all rely on a converted ``target`` path, so it'd really end up being 4 loops. At that point, perhaps it's best to do just 1 loop as written.

> https://frama.link/MwJFqh5o

    paramdicts = {}
    testers = collections.defaultdict(list)
    for name, attr in cls.__dict__.items():
        if name.endswith('_params'):
            if not hasattr(attr, 'keys'):
                d = {}
                for x in attr:
                    if not hasattr(x, '__iter__'):
                        x = (x,)
                    n = '_'.join(str(v) for v in x).replace(' ', '_')
                    d[n] = x
                attr = d
            paramdicts[name[:-7] + '_as_'] = attr
        if '_as_' in name:
            testers[name.split('_as_')[0] + '_as_'].append(name)

This one isn't a multiple groups issue.
We can separate the tasks of paramdicts and testers.

    as_names = (name for name in cls.__dict__ if '_as_' in name)
    testers = grouped(as_names,
                      key=lambda name: name.split('_as_')[0] + '_as_')

The paramdicts isn't a dict of lists, so it's not really appropriate for ``grouped``.

> C) classes attributes initialization (grouping is done by repeatably
> calling a function, so any grouping constructor will be useless here).
> ex : https://frama.link/GoGWuQwR,

Instead of ``defaultdict(lambda: 0)`` it seems better to use ``Counter()``. It's called "invocation_counts" after all.

> https://frama.link/BugcS8wU

This seems like one of the ideal uses for defaultdict. It even has bits of code made to handle the possibly accidental insert of new keys by popping them off afterwards.

> D) Sometimes you just want to defautdict inside a defauldict inside a
> dict and just have fun : https://frama.link/asBNLr1g,

I only have a tenuous grasp of git, so the fact that this file isn't in the master branch is stretching my mind. But again, this doesn't feel like a grouping task to me. It probably could be rewritten as such, but not trivially.

> https://frama.link/8j7gzfA5

It's hard to tell at a glance if this should truly be a dict of dicts of sets or a dict of sets where the keys are 2-tuples. I'll assume the former is correct.

On a side note, it looks like they're doing some silly things to fit into the 80 characters per line restriction, like looping over keys and looking up values inside the loop instead of using ``.items()``. I understand that the character limit is strict, but...
https://github.com/tensorflow/tensorflow/blob/b202db076ecdc8a0fc80a49252d4a8dd0c8015b3/tensorflow/python/debug/lib/debug_data.py#L1371

> From what I saw, the most useful would be to add method to a defaultdict
> to fill it from an iterable, and using a grouping method adapted to the
> default_factor (so __add__ for list, int and str, add for set, update
> for dict and proably __add__ for anything else)
>
> A sample code would be :
>
>     from collections import defaultdict
>
>     class groupingdict(defaultdict):
>         def group_by_iterator(self, iterator):
>             empty_element = self.default_factory()
>             if hasattr(empty_element, "__add__"):
>                 for key, element in iterator:
>                     self[key] += element
>             elif hasattr(empty_element, "update"):
>                 for key, element in iterator:
>                     self[key].update(element)
>             elif hasattr(empty_element, "add"):
>                 for key, element in iterator:
>                     self[key].add(element)
>             else:
>                 raise TypeError('default factory does not support iteration')
>             return self
>
> So that for example :
>
>     >>> groupingdict(dict).group_by_iterator(
>     ...     (grouping_key, a_dict) for grouping_key, a_dict in [
>     ...         (1, {'a': 'c'}),
>     ...         (1, {'b': 'f'}),
>     ...         (1, {'a': 'e'}),
>     ...         (2, {'a': 'e'})
>     ...     ]
>     ... )
>
> returns
>
>     groupingdict(dict, {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}})
>
> My implementation is garbage and There should be 2 method, one returning
> the object and one modifing it, but I think it gives more leeway than
> just a function returning a dict

Let's take a look at these two examples:

    grouped(contacts, key=itemgetter('city'))
    grouped(os.listdir('.'),
            key=lambda filepath: os.path.splitext(filepath)[1])

How would those be rewritten using your proposal?
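By the way, your sample does run as posted. Here it is in runnable form with your example, verbatim apart from formatting, for anyone following along:

```python
from collections import defaultdict

# Nicolas's sketch, unchanged apart from formatting.
class groupingdict(defaultdict):
    def group_by_iterator(self, iterator):
        empty_element = self.default_factory()
        if hasattr(empty_element, "__add__"):
            for key, element in iterator:
                self[key] += element
        elif hasattr(empty_element, "update"):
            for key, element in iterator:
                self[key].update(element)
        elif hasattr(empty_element, "add"):
            for key, element in iterator:
                self[key].add(element)
        else:
            raise TypeError('default factory does not support iteration')
        return self

result = groupingdict(dict).group_by_iterator([
    (1, {'a': 'c'}),
    (1, {'b': 'f'}),
    (1, {'a': 'e'}),
    (2, {'a': 'e'}),
])
# result == {1: {'a': 'e', 'b': 'f'}, 2: {'a': 'e'}}
```

One wrinkle worth noting: because ``update`` is checked before ``add``, a set default factory takes the ``update`` branch, so each element is treated as an iterable of values rather than as a single value to add, despite the "add for set" intent described above.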
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/