On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Andrew Barnert wrote:
set(b).intersection(a), or set(a) & set(b), or sb = set(b) then
[x for x in a if x in sb], and you’re done.  They can easily
understand why it works.  If they want to know why it’s faster, you
can easily explain it, and they’ve learned something widely useful.
This isn't technically correct.  It's not always faster; it depends on the use case, and when a use case contradicts your expectations you just deride it as "artificial micro benchmarks".  Python isn't just used as a toy scripting language.  50 million elements in a collection is not large by a long shot, at least from where I sit.  You can make a case that Python is not a good language for that type of problem, say HPC clusters.  Or you can tell people to go copy some C code "like people have been doing for decades".  That you ask if that is a proper response to your users is very concerning to me.


Richard,

I can identify with the need to intersect large sets; I am often faced with poor choices when it comes to dealing with them.  I love Python, and I would love to use it for my data munging, but the data is too big to fit in memory and Python is too slow.  I find it disappointing to convert an elegant expression declaring *what* I want, like "a - b", into code that declares *how* to do it, like "set(a).difference(b)".  This disappointment is magnified by the fact that someone, somewhere, has already written code to do set subtraction faster, and I cannot use it; my choices are quick ugly code, or a time-consuming search of the internet for an elegant solution**.
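
For concreteness, here is the kind of rewrite I mean.  This is only a minimal sketch; the sizes and the timing harness are mine, for illustration, not measurements from anyone's tests:

    import random
    import timeit

    # Two large overlapping collections (sizes chosen only for illustration).
    a = random.sample(range(10_000_000), 1_000_000)
    b = random.sample(range(10_000_000), 1_000_000)

    # Declarative: say *what* you want.
    sa, sb = set(a), set(b)
    print(timeit.timeit(lambda: sa - sb, number=10))

    # Imperative: spell out *how* to get it.
    print(timeit.timeit(lambda: [x for x in a if x not in sb], number=10))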

Maybe we are looking for a different type of solution.  You seem to be asking for the elegance of Python with the expression optimization (aka query planning) of a database.  A database may be able to deliver the speeds you require: it can pack data tighter; it has access to more processors; it may already leverage SIMD; it can optimize the operation according to how big the data is.  I am suggesting that a Python container implementation, like sortedcontainers but backed by a database (maybe Sqlite), may be a solution.
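
To make that concrete: Sqlite can already express set subtraction directly, so the engine, not Python, does the planning.  A minimal sketch, with invented table and column names:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (v INTEGER PRIMARY KEY)")
    con.execute("CREATE TABLE b (v INTEGER PRIMARY KEY)")
    con.executemany("INSERT INTO a VALUES (?)", ((i,) for i in range(0, 1000, 2)))
    con.executemany("INSERT INTO b VALUES (?)", ((i,) for i in range(0, 1000, 3)))

    # "a - b", planned and executed entirely inside the database
    rows = con.execute("SELECT v FROM a EXCEPT SELECT v FROM b").fetchall()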

Of course, there is the problem of moving data in and out of the database, but maybe that can be amortized over other operations and made relatively insignificant.  There is the delay of translating a __sub__() call into SQL, but maybe that is relatively small compared to the size of the work we are requesting.
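
Here is a rough sketch of what such a container might look like.  SqliteSet is a hypothetical name, it handles integers only, and a real implementation would need escaping, typing, and lifetime management:

    import sqlite3
    from itertools import count

    _names = count()

    class SqliteSet:
        """Set-like container of integers, backed by one Sqlite table."""

        def __init__(self, items=(), con=None):
            self.con = con or sqlite3.connect(":memory:")
            self.table = "t%d" % next(_names)
            self.con.execute("CREATE TABLE %s (v INTEGER PRIMARY KEY)" % self.table)
            self.con.executemany("INSERT OR IGNORE INTO %s VALUES (?)" % self.table,
                                 ((i,) for i in items))

        def __sub__(self, other):
            # Translate "a - b" into SQL; both sets must share a connection.
            result = SqliteSet(con=self.con)
            self.con.execute("INSERT INTO %s SELECT v FROM %s EXCEPT SELECT v FROM %s"
                             % (result.table, self.table, other.table))
            return result

        def __iter__(self):
            return (v for (v,) in self.con.execute("SELECT v FROM %s" % self.table))

    a = SqliteSet(range(0, 100, 2))
    b = SqliteSet(range(0, 100, 3), con=a.con)   # share one database
    print(sorted(a - b))

Once the data is in the tables, every subsequent operation stays inside the engine, which is where the amortization would have to come from.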

Could you link to the repo where these tests are run?  I seem to have lost the link in the long chain of emails on this list.  I am considering adding a Sqlite example, if only to prove to myself that it is the slowest option of all.

Thank you.


** The sortedcontainers library mentioned above is an interesting demonstration of how fast Python can get when you are aware of L1 cache effects.


