On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Andrew Barnert wrote:
set(b).intersection(a) or set(a) & set(b) or sb =
set(b) then [x for x in a if x in sb] and you’re done. They can easily
understand why it works. If they want to know why it’s faster, you can easily
explain it,
and they’ve learned something widely useful.
This isn't technically correct. It's not faster. It all depends on the use case, which, when it
contradicts your expectations, you just deride as "artificial micro-benchmarks". Python
isn't just used as a toy scripting language. 50 million elements in a collection is not even large
by a long shot at least from where I sit. You can make a case that it's not a good language for
that type of problem, say HPC clusters. Or you can tell people to go copy some C code "like
people have been doing for decades". That you ask if that is a proper response to your users
is very concerning to me.
Richard,
I can identify with the need to intersect large sets, and I am often
faced with poor choices when it comes to dealing with them. I love Python,
and I would love to use it for my data munging, but the data is too big
to fit in memory and Python is too slow. I find it disappointing to
convert an elegant expression declaring *what* I want, like "a-b", into
code that declares *how* to do it, like "set(b).intersection(a)". This
disappointment is magnified by the fact that someone, somewhere, has
already written code to do set subtraction faster, and I cannot use it;
my choices are quick, ugly code, or a time-consuming search for an
elegant solution** on the internet.
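For reference, the three idioms quoted at the top of this thread all
compute the same intersection; a minimal sketch (the names a and b are
just illustrative toy data):

```python
# Toy data; a and b stand in for the large collections under discussion.
a = list(range(0, 100, 2))   # even numbers
b = list(range(0, 100, 3))   # multiples of 3

# Operator form: build two sets and intersect them.
common = set(a) & set(b)

# Equivalent method form; only b needs to become a set.
common2 = set(b).intersection(a)

# Hash-lookup form; preserves the order of a.
sb = set(b)
common3 = [x for x in a if x in sb]

assert common == common2 == set(common3)
```

All three are "how" code, which is exactly the complaint: the elegant
"what" expression is just a & b.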
Maybe we are looking for a different type of solution. You seem to be
asking for the elegance of Python with the expression optimization (aka
query planning) of a database. A database may be able to deliver the
speeds you require; it can pack data tighter; it has access to more
processors; it may already leverage SIMD; it can optimize the operation
according to how big the data is. I am suggesting a Python container
implementation, like sortedcontainers but using a database (maybe
Sqlite), may be a solution.
Of course, there is the problem of moving data in and out of the
database, but maybe that can be amortized over other operations, and
made relatively insignificant. There is a delay when translating a
__sub__() call into SQL, but maybe that is relatively small compared to
the size of the work we are requesting.
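A minimal sketch of what I mean (the SqliteSet name, the shared
in-memory connection, and the one-column schema are all my own
assumptions here, not an existing package):

```python
import itertools
import sqlite3

# Hypothetical set-like container backed by Sqlite: __sub__() is
# translated into a single SQL EXCEPT, so the database's query planner
# decides *how* to compute the difference.

_conn = sqlite3.connect(":memory:")
_fresh = itertools.count()

class SqliteSet:
    def __init__(self, items=()):
        # Each instance owns one table in the shared in-memory db.
        self.table = "s%d" % next(_fresh)
        _conn.execute("CREATE TABLE %s (v PRIMARY KEY)" % self.table)
        _conn.executemany(
            "INSERT OR IGNORE INTO %s VALUES (?)" % self.table,
            ((x,) for x in items))

    def __sub__(self, other):
        # Set subtraction as one SQL statement; Sqlite picks the plan.
        rows = _conn.execute(
            "SELECT v FROM %s EXCEPT SELECT v FROM %s"
            % (self.table, other.table))
        return {v for (v,) in rows}

a = SqliteSet([1, 2, 3, 4])
b = SqliteSet([3, 4, 5])
assert a - b == {1, 2}
```

A real implementation would spill to disk and return another SqliteSet
instead of a plain set, but even this toy shows where the __sub__() to
SQL translation would happen.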
Could you link to the repo where these tests are run? I seem to have
lost the link in the long chain of emails on this list. I am
considering adding a Sqlite example, if only to prove to myself that it
is the slowest option of all.
Thank you.
** The sortedcontainers library mentioned earlier is an interesting
demonstration of how fast Python can get when you are aware of L1 cache
effects.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/2TBY7XJAWWMPWJ7NUBMRYLA7KIS5HOP6/
Code of Conduct: http://python.org/psf/codeofconduct/