[Python-ideas] Re: giving set.add a return value

Chris Angelico Sat, 27 Jun 2020 01:51:39 -0700

On Sat, Jun 27, 2020 at 4:54 PM Steven D'Aprano <st...@pearwood.info> wrote:
>
> On Fri, Jun 26, 2020 at 10:10:22PM +1000, Chris Angelico wrote:
> > On Fri, Jun 26, 2020 at 7:58 PM Steven D'Aprano <st...@pearwood.info> wrote:
> > > Most importantly, it matches the way people think about the task:
> > >
> > >     # Task: look for duplicates
> > >     if element in seen:
> > >         # it's a duplicate
> > >         ...
> > >     else:
> > >         # never seen before, so remember it
> > >         ...
> >
> > Do you do this:
> >
> >     if file exists:
> >         # read the file
> >     else:
> >         # create the file
>
> No, because I have learned that this is both unreliable and dangerous.
> But that's how I *think* about the problem, at least initially. I
> expect that if you were prototyping the code on a whiteboard, you
> probably would too.
>
> Checking existence of an element in a set is not comparable to file
> access. File access has a lot more that can go wrong than just whether
> or not the file exists. There are permission errors and transient errors
> that bulletproof code needs to deal with. And file access is always
> concurrent, since the OS is concurrent. (Unless you have the luxury
> of working on a single user, single process OS.)


I grew up with concurrency being a perfectly normal thing. To me,
EVERYTHING is concurrent. (Back then, it was single-core CPUs, so
technically most machine code instructions could be assumed to be
atomic, but at any higher level, you could be preempted at any
moment.)

> So long as you know you are dealing with hashable objects, `set.add`
> should always succeed.
>
>
> > Or would you try/except around an open call? The TOCTOU issue may be
> > less obvious with sets, but it's still a reasonable thing to concern
> > yourself with.
>
> Unless you are in the [Quadrant of Hell](https://imgur.com/a/jjA52vI)
> you don't even need to think about this for sets. There are no Time
> Of Check to Time Of Use issues worth thinking about in the given
> example, and frankly Chris I doubt that you surround ever set access
> with locks in single-threaded code because TOCTOU is "stronger" (more
> important?) than writing clear and simple code. And if you do, you're
> probably wondering why Python is so slow *wink*

And that is a fundamental difference between our philosophies. You
consider threads to be terrifyingly scary, and shared mutable data to
be "hell". I consider this to be the normal state of things, and that
whenever you care, you take care. It actually doesn't come up all that
often.

> But I'll grant you that I wasn't thinking about threaded code. The
> example given was single-threaded, but maybe a better example would
> involve threading. But that supports my point: thinking concurrently
> doesn't come naturally.

Nor does thinking algebraically, nor thinking like a database, nor
thinking in terms of diamond inheritance and an MRO. These things have
to be learned. Is it a problem to have to learn things?

> Is this check and insert version of `add` threadsafe? If not, then you
> still need manual locking and this proposal doesn't get you much; if
> anything it's an attractive nuisance because it looks atomic but isn't:
>
>     seen = {}
>     assert seen.add(None) is False
>
> This could fail in threaded code if the underlying `add` method
> is not thread-safe:
>
> - your thread calls `add`, which checks that None is not an element;
> - between the check and the actual insertion, another thread adds None;
> - and your thread redundently adds it a second time and returns False.
>
> I don't know about Python sets, but most Java collections are not
> thread safe unless explicitly documented as such, so I wouldn't be the
> least bit surprised if this could happen.
>
> So `add` would have to be atomic at the C level. If it is already
> atomic, great, that's one objection neutralised. But if not, then will
> making it atomic have a performance impact?

Question: If it ISN'T atomic at the C level, what happens? I can
imagine a few possibilities, including utter memory corruption,
leaking of objects, insertion of two identical elements into a set,
and many other things that would fundamentally violate Python's own
integrity. What would happen that would be considered acceptable, but
wouldn't guarantee this? I would expect that set.add() is guaranteed
to not add a duplicate, and that right there basically guarantees
everything that I need. It doesn't have to be completely atomic but it
does have to maintain the set invariants.

CPython guarantees this level of safety using the GIL. Other Pythons
may do it in other ways. It hardly matters, because any way will
suffice.

> I'm not very interested in slowing down every single `set.add` just so I
> have the opportunity to write less understandable code when I don't need
> to. But if `set.add` is already atomic and hence thread safe, and the
> other issues with this proposal are addressed, then I don't object to
> this proposal. (I reserve the right to change my mind :-)
>
> Just don't tell me that this:
>
>     was_already_there = seen.add(element)
>     if was_already_there:
>         ...
>
> is a more natural way of thinking about the problem of checking for
> duplicates than:
>
>     if element in seen:
>         ...
>

For a more fair comparison, remember that you need to add the element.
So your options are:

if set.ensure(element):
    # it wasn't already there

and

if element not in set:
    set.add(element)
    # it wasn't already there

which is basically a wash.

> > > This idiom works with sets, it works with lists, it works with trees, it
> > > works with anything that supports membership testing and inserting into
> > > a collection. It is the natural way to think about the task.
> >
> > So would "ensure this is here, and let me know if you created it".
>
> Are you proposing to add this to lists, deques, dicts, Mappings,
> Sequences and all other collections? I think you'll need a PEP :-)

No, I'm not. I'm only talking about this as a concept in things that
reject duplicates. So at best, it would be sets and dicts. Why on
earth would you assume I would want this operation on a list? Are you
THAT desperate for an argument?

> > > But either way, you also have to decide whether the `add` (or the new
> > > method) should *unconditionally* insert the element, or only do so if it
> > > wasn't present. This makes a big difference:
> > >
> > >     seen = {2}
> > >     already_there = seen.add(2.0)
> > >
> > >
> > > At this point, is `seen` the set {2} or {2.0}? Justify why one answer is
> > > the best answer.
> > >
> >
> > It should do the same thing that happens already: keep the existing
> > one.
>
> I actually thought that the existing behaviour was the opposite, and
> for once I failed to check it before posting. Oops!
>
> But that's not really important. What is important is the question of
> which of these we wish to match:
>
>     # 1
>     if element in seen:
>         print("duplicate element")
>     seen.add(element)
>
>     # 2
>     if element in seen:
>         print("duplicate element")
>     else:
>         seen.add(element)
>
> They have different semantics, and I can imagine cases where both would
> be useful. Although I suppose option 1 is moot if sets don't over-write
> the existing element.
>

They don't have different semantics, so the distinction is irrelevant.
We can match them both.

>>> seen = {1}
>>> seen.add(1.0)
>>> seen.add(True)
>>> seen
{1}

The only change would be for the add call to signal whether or not it
changed the set. The state of the set afterwards would not change.

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6ESQLWGTU6WIHAWZ77TEVS2AKZF32EHS/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: giving set.add a return value

Reply via email to