On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Andrew Barnert wrote:
set(b).intersection(a), or set(a) & set(b), or sb = set(b) then
[x for x in a if x in sb], and you’re done.  They can easily
understand why it works.  If they want to know why it’s faster, you
can easily explain it, and they’ve learned something widely useful.
This isn't technically correct.  It's not always faster; it depends on the use case, and when a use case contradicts your expectations you just deride it as "artificial micro benchmarks".  Python isn't just used as a toy scripting language.  50 million elements in a collection is not large by a long shot, at least from where I sit.  You can make a case that Python is not a good language for that type of problem, say HPC clusters.  Or you can tell people to go copy some C code "like people have been doing for decades".  That you ask if that is a proper response to your users is very concerning to me.


Richard,

I can identify with the need to intersect large sets; I am often faced with poor choices when it comes to dealing with them.  I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow.  I find it disappointing to convert an elegant expression declaring *what* I want, like "a - b", into code that declares *how* to do it, like "set(a).difference(b)".  This disappointment is magnified by the fact that someone, somewhere, has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search of the internet for an elegant solution**.
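
For concreteness, here is the kind of rewrite I mean.  This is only a minimal sketch; the sizes and the timing harness are mine, for illustration, not measurements from anyone's tests:

    import random
    import timeit

    # Two large overlapping collections (sizes chosen only for illustration).
    a = random.sample(range(10_000_000), 1_000_000)
    b = random.sample(range(10_000_000), 1_000_000)

    # Declarative: say *what* you want.
    sa, sb = set(a), set(b)
    print(timeit.timeit(lambda: sa - sb, number=10))

    # Imperative: spell out *how* to get it.
    print(timeit.timeit(lambda: [x for x in a if x not in sb], number=10))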

Maybe we are looking for a different type of solution.  You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database.  A database may be able to deliver the speeds you require: it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is.  I am suggesting that a Python container implementation, like sortedcontainers but backed by a database (maybe Sqlite), may be a solution.
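
To make that concrete: Sqlite can already express set subtraction directly, so the engine, not Python, does the planning.  A minimal sketch, with invented table and column names:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (v INTEGER PRIMARY KEY)")
    con.execute("CREATE TABLE b (v INTEGER PRIMARY KEY)")
    con.executemany("INSERT INTO a VALUES (?)", ((i,) for i in range(0, 1000, 2)))
    con.executemany("INSERT INTO b VALUES (?)", ((i,) for i in range(0, 1000, 3)))

    # "a - b", planned and executed entirely inside the database
    rows = con.execute("SELECT v FROM a EXCEPT SELECT v FROM b").fetchall()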

Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations and made relatively insignificant.  There is the delay of translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting.
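
Here is a rough sketch of what such a container might look like.  SqliteSet is a hypothetical name, it handles integers only, and a real implementation would need escaping, typing, and lifetime management:

    import sqlite3
    from itertools import count

    _names = count()

    class SqliteSet:
        """Set-like container of integers, backed by one Sqlite table."""

        def __init__(self, items=(), con=None):
            self.con = con or sqlite3.connect(":memory:")
            self.table = "t%d" % next(_names)
            self.con.execute("CREATE TABLE %s (v INTEGER PRIMARY KEY)" % self.table)
            self.con.executemany("INSERT OR IGNORE INTO %s VALUES (?)" % self.table,
                                 ((i,) for i in items))

        def __sub__(self, other):
            # Translate "a - b" into SQL; both sets must share a connection.
            result = SqliteSet(con=self.con)
            self.con.execute("INSERT INTO %s SELECT v FROM %s EXCEPT SELECT v FROM %s"
                             % (result.table, self.table, other.table))
            return result

        def __iter__(self):
            return (v for (v,) in self.con.execute("SELECT v FROM %s" % self.table))

    a = SqliteSet(range(0, 100, 2))
    b = SqliteSet(range(0, 100, 3), con=a.con)   # share one database
    print(sorted(a - b))

Once the data is in the tables, every subsequent operation stays inside the engine, which is where the amortization would have to come from.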

Could you link to the repo where these tests are run?  I seem to have lost the link in the long chain of emails on this list.  I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all.

Thank you.


** The sortedcontainers library mentioned above is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.


