Steve Jorgensen wrote:
> What if, instead, there was an OrderedBag class that acts like a
> list and supports set operations.
> The advantage of this is that any hidden indexing that is used to optimize the
> operations could be efficiently created in the object that results from the
> set operati
On Oct 13, 2019, at 19:15, Steve Jorgensen wrote:
>
> What if, instead, there was an `OrderedBag` class that acts like a `list` and
> supports set operations.
>
> The advantage of this is that any hidden indexing that is used to optimize
> the operations could be efficiently created in the obj
What if, instead, there was an `OrderedBag` class that acts like a `list` and
supports set operations?
The advantage of this is that any hidden indexing that is used to optimize the
operations could be efficiently created in the object that results from the set
operation during its construction.
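For concreteness, here is one minimal sketch of what such a class could look like (the name `OrderedBag`, the methods shown, and the use of a plain `set` as the hidden index are illustrative assumptions, not a worked-out design):

```
class OrderedBag:
    """List-like container: keeps insertion order, allows duplicates,
    and carries a hidden hash index to speed up set-style operations."""
    def __init__(self, items=()):
        self._items = list(items)
        self._index = set(self._items)   # built once, during construction

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def __contains__(self, value):
        return value in self._index      # O(1) membership via the hidden index

    def _other_index(self, other):
        return other._index if isinstance(other, OrderedBag) else set(other)

    def __and__(self, other):
        # Intersection: keep items of self that also appear in other,
        # preserving self's order and duplicates.
        idx = self._other_index(other)
        return OrderedBag(x for x in self._items if x in idx)

    def __sub__(self, other):
        # Difference: items of self that do not appear in other.
        idx = self._other_index(other)
        return OrderedBag(x for x in self._items if x not in idx)
```

For example, `list(OrderedBag(["a", "b", "b", "c"]) & ["b", "c", "d"])` would give `["b", "b", "c"]`, with the result's own index built as part of constructing it.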
Kyle Lahnakoski wrote:
> On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
> > Andrew Barnert wrote:
> > set(b).intersection(a) or set(a) & set(b) or
> > sb =
> > set(b) then [x for x in a if x in sb] and you’re done. They can easily
> > understand why it works. If they want to know why it’s fas
On 2019-09-20 9:52 p.m., Richard Higginbotham wrote:
Andrew Barnert wrote:
set(b).intersection(a) or set(a) & set(b) or sb =
set(b) then [x for x in a if x in sb] and you’re done. They can easily
understand why it works. If they want to know why it’s faster, you can easily
explain it,
and they
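Spelled out, the approaches being quoted there look like this (a small sketch with made-up data, not from the thread):

```
a = ["f1.txt", "f2.txt", "f3.txt", "f2.txt"]
b = ["f2.txt", "f3.txt", "f9.txt"]

common1 = set(b).intersection(a)     # set result; order and duplicates are lost
common2 = set(a) & set(b)            # same idea, operator form

sb = set(b)                          # build the hash lookup once...
common3 = [x for x in a if x in sb]  # ...then filter; keeps a's order and duplicates
```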
Hey Everybody,
I do want to thank you for this souvenir. I'm having it blown up and mounted
on the wall. I'm going to leave some space for the benchmarks that compare my
module and his with the standard set implementation.
"I haven’t tested performance at all. I expect it to be slower (when
Andrew Barnert wrote:
> On Sep 23, 2019, at 15:32, Richard Higginbotham higgi...@gmail.com wrote:
> > Considering
> > your use case however, I wonder if you would not be better off
> > going with the iterator approach (as Andrew has been hinting already for
> > some time in his posts).
> > Richard M.
On Sep 23, 2019, at 15:32, Richard Higginbotham wrote:
>> Considering your use case however, I wonder if you would not be better off
>> going with the iterator approach (as Andrew has been hinting already for
>> some time in his posts).
>> Richard M.
> They will most likely have good performance on
Richard Musil wrote:
> On Mon, Sep 23, 2019 at 12:21 AM Richard Higginbotham higgi...@gmail.com
> wrote:
> > I really appreciate the time and thought you have put
> > into it (and others
> > here as well), and it's been educational / fun for me too. One of the
> > concerns I have about using timeit
Andrew Barnert wrote:
> On Sep 22, 2019, at 15:28, Tim Peters tim.pet...@gmail.com wrote:
> > That's not by accident - the inspiration for
> > CPython's sort's basic
> > "galloping" approach was taken from this paper, which wasn't about
> > sorting at all:
> > "Adaptive Set Intersections, Unions, a
Tim Peters wrote:
> That's not by accident - the inspiration for CPython's sort's basic
> "galloping" approach was taken from this paper, which wasn't about
> sorting at all:
> "Adaptive Set Intersections, Unions, and Differences" (2000)
> Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
>
>
On Mon, Sep 23, 2019 at 12:21 AM Richard Higginbotham
wrote:
> I really appreciate the time and thought you have put into it (and others
> here as well), and it's been educational / fun for me too. One of the
> concerns I have about using timeit is that it puts a lot of focus on the
> exact statem
On Sep 22, 2019, at 15:28, Tim Peters wrote:
>
>
> That's not by accident - the inspiration for CPython's sort's basic
> "galloping" approach was taken from this paper, which wasn't about
> sorting at all:
>
>"Adaptive Set Intersections, Unions, and Differences" (2000)
>Erik D. Demaine,
Note that CPython's sort is in the business of merging sorted
(sub)lists. Any mergesort is. But CPython's adapts to take
advantage, when possible, of "lumpy" distributions. For example, if
you sort
list(range(100, 200)) + list(range(0, 100))
it goes very fast (O(N)). Because
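A quick way to see the effect Tim describes (sizes scaled up so the difference is visible; exact timings are machine-dependent and not from the thread):

```
import random
import timeit

# Two already-sorted runs glued together, as in the example above (scaled up).
lumpy = list(range(500_000, 1_000_000)) + list(range(0, 500_000))
shuffled = lumpy[:]
random.shuffle(shuffled)

# sorted() only has to recognize and merge the two runs in the lumpy case,
# which is roughly O(N); the shuffled copy pays the full O(N log N).
print(timeit.timeit("sorted(lumpy)", globals=globals(), number=10))
print(timeit.timeit("sorted(shuffled)", globals=globals(), number=10))
```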
Richard Musil wrote:
> On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham higgi...@gmail.com
> wrote:
> I wrote all this to show that without an insight it might be sometimes
> difficult or impossible to do it right (I was caught myself in several
> pitfalls on my original benchmarks I post
Steven D'Aprano wrote:
> On Fri, Sep 20, 2019 at 11:44:17PM -, Richard Higginbotham wrote:
> > Let me expand on the reasons for my post here. It
> > seems like
> > motivation has turned into a bit of a bike-shed.
> > I don't think that anyone is questioning your motivation. (Although one
> o
Let me go back to the top here.
On Sep 18, 2019, at 12:29, Richard Higginbotham wrote:
>
> I have frequently come across cases where I would like to compare items in
> one list in another similar to relational algebra.
I’ve put together an itertools-style implementation at
https://github.com/
On Sep 22, 2019, at 12:50, Richard Musil wrote:
>
>> On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert wrote:
>
>> One case I think you didn’t test is when the strings are generated in
>> already-sorted order.
> For the sake of completeness I did some benchmarks also with already sorted
> stri
On Sun, Sep 22, 2019 at 8:25 PM Andrew Barnert wrote:
> One case I think you didn’t test is when the strings are generated in
> already-sorted order. In that case, as opposed to the case where you
> generate in random order and then sort, I think the PyUnicode objects and
> the actual character a
On Sep 21, 2019, at 04:35, Richard Musil wrote:
>
> I wrote all this to show that without an insight it might be sometimes
> difficult or impossible to do it right (I was caught myself in several
> pitfalls on my original benchmarks I posted here) and also because it was
> actually a fun to le
On Sat, Sep 21, 2019 at 1:44 AM Richard Higginbotham
wrote:
> It's written in Python. It used to require containers with 100k or more
> elements to see a speed improvement over some other unnamed methods. These
> days it's closer to 50 million with the only other contender being set (hash
> based
On Fri, Sep 20, 2019 at 11:44:17PM -, Richard Higginbotham wrote:
> Let me expand on the reasons for my post here. It seems like
> motivation has turned into a bit of a bike-shed.
I don't think that anyone is questioning your motivation. (Although one
of your recent replies to Andrew seems
[Tim]
>> Something I haven't seen mentioned here, but I may have missed it:
>> when timing algorithms with sets and dicts, people often end up merely
>> measuring the bad effects of hardware data cache misses when the
>> containers get large, and especially in contrived benchmarks.
>>
>> In those t
Andrew Barnert wrote:
> On Sep 20, 2019, at 14:51, Richard Higginbotham higgi...@gmail.com wrote:
> > Andrew Barnert wrote:
> > What’s the goal here?
> > I'm not concerned about micro op, I'm concerned with macro op on large data
> > sets.
> > Right now if someone comes to you with that simple use
Let me expand on the reasons for my post here. It seems like motivation has
turned into a bit of a bike-shed. I have an algorithm I wanted to see if the
python devs would be interested in. It fulfills my use case as is. I've found
it useful for different problem areas and I haven't seen anythi
On Sep 20, 2019, at 14:51, Richard Higginbotham wrote:
>
> Andrew Barnert wrote:
>> What’s the goal here?
>
> I'm not concerned about micro op, I'm concerned with macro op on large data
> sets. Right now if someone comes to you with that simple use case of two
> large lists you have to tell t
On Sat, Sep 21, 2019 at 7:54 AM Richard Higginbotham wrote:
> I'm not concerned about micro op, I'm concerned with macro op on large data
> sets. Right now if someone comes to you with that simple use case of two
> large lists you have to tell them to convert to sets and then back again.
> Forg
Andrew Barnert wrote:
> On Sep 20, 2019, at 09:21, Richard Higginbotham higgi...@gmail.com wrote:
> > Richard Musil wrote:
> > On Fri,
> > Sep 20, 2019 at 1:52 AM Andrew Barnert abarn...@yahoo.com wrote:
> > On Sep 19, 2019, at 15:18, Richard Musil risa20...@gmail.com wrote:
> > After some thought
Chris Angelico wrote:
> On Fri, Sep 20, 2019 at 6:31 AM Richard Higginbotham higgi...@gmail.com wrote:
> >
> > That's exactly what I expected. It's the opposite
> > though. The time to check one list for all the elements
> > Please quote some text to give context. WHAT is exactly what you expected
BTW, this thread seems a good place to mention the under-appreciated
SortedContainers package from Grant Jenks:
http://www.grantjenks.com/docs/sortedcontainers/
This supplies sorted containers (lists, dicts, sets) coded in pure
Python, which generally run at least as fast as C extensions
impl
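A quick usage sketch, assuming the package is installed (`pip install sortedcontainers`):

```
from sortedcontainers import SortedList, SortedSet

sl = SortedList([5, 1, 3])
sl.add(2)                  # stays sorted: SortedList([1, 2, 3, 5])
print(sl[0], sl[-1])       # indexed access into the sorted order

# SortedSet supports the usual set operators and iterates in sorted order.
a = SortedSet([1, 2, 3, 6])
b = SortedSet([1, 4, 5, 6, 7])
print(a & b)               # SortedSet([1, 6])
print(a - b)               # SortedSet([2, 3])
```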
On Sep 20, 2019, at 09:21, Richard Higginbotham wrote:
>
> Richard Musil wrote:
>>> On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarn...@yahoo.com wrote:
>>> On Sep 19, 2019, at 15:18, Richard Musil risa20...@gmail.com wrote:
>
> After some thought I don't think the time comparisons are u
On Sep 19, 2019, at 20:25, Tim Peters wrote:
>
> Something I haven't seen mentioned here, but I may have missed it:
> when timing algorithms with sets and dicts, people often end up merely
> measuring the bad effects of hardware data cache misses when the
> containers get large, and especially in
Richard Musil wrote:
> On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert abarn...@yahoo.com wrote:
> > On Sep 19, 2019, at 15:18, Richard Musil risa20...@gmail.com wrote:
After some thought I don't think the time comparisons are useful unless we
include the set init time. If we already have two
On Sep 20, 2019, at 03:18, Richard Musil wrote:
>
> This is an interesting point, which is difficult to deduce without an
> implementation insight. I would for example assume that there is a "list
> construct" which holds just the references to the actual objects (strings in
> our example here
On Sep 20, 2019, at 02:41, Richard Musil wrote:
>
> I just added another implementation to the test (which I believe you had on
> mind):
> ```
> def relcomp_set_list(a, b):
>     bset = set(b)
>     return [ia for ia in a if ia not in bset]
> ```
Or just `set(b).intersection(a)` and maybe checking th
Tim Peters wrote:
> RE [Richard Higginbotham higgi...@gmail.com]
..
> Traversing via a set or dict instead reads the objects' data in
> pseudo-random memory order instead, which can give a very high rate of
> cache misses. In addition, the internal hash table probes also occur
> in pseudo-rand
Andrew Barnert wrote:
> On Sep 19, 2019, at 19:18, Richard Higginbotham higgi...@gmail.com wrote:
> > It's not really constant though.
> > It’s really hard to have a discussion when all of your posts are all
> > replies, but
> you don’t give us any clue what you’re replying to. There are multiple
You know, if anyone wanted a good example of why Computer Science
courses should carry on teaching about sorting algorithms despite them
being already implemented for you these days, this thread is it. Very
nicely explained, folks!
--
Rhodri James *-* Kynesim Ltd
>
> On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote:
> [...]
> > Something I haven't seen mentioned here, but I may have missed it:
> > when timing algorithms with sets and dicts, people often end up merely
> > measuring the bad effects of hardware data cache misses when the
> > containe
On Fri, Sep 20, 2019 at 1:52 AM Andrew Barnert wrote:
> On Sep 19, 2019, at 15:18, Richard Musil wrote:
> >
> > Ok, I misread the original code, the lists were not sorted in the
> previous results (and they should be). So with the correction to have them
> sorted,
>
> I think to be fair, you wa
On Thu, Sep 19, 2019 at 10:25:20PM -0500, Tim Peters wrote:
[...]
> Something I haven't seen mentioned here, but I may have missed it:
> when timing algorithms with sets and dicts, people often end up merely
> measuring the bad effects of hardware data cache misses when the
> containers get large,
[Richard Higginbotham ]
> If you start including the time it takes to convert lists to sets it's even
> more pronounced. Hash values can collide, and the bigger the data set, the
> more likely it is to happen.
That's not so. Python's hash tables dynamically resize so that the
load factor never ex
On Sep 19, 2019, at 19:18, Richard Higginbotham wrote:
>
> It's not really constant though.
It’s really hard to have a discussion when all of your posts are all replies,
but you don’t give us any clue what you’re replying to. There are multiple
responses from multiple people since your last em
It's not really constant though. Look at all of the test results here. Richard
Musil in the following post shows one value of A=1000,B=1 with 0.034 ms and
the next point A=1,B=5. If the hash were constant time and A increased
by 10x then the time should be 0.34ms but it's not. It's 0.54
I think there is a problem here using timeit for this type of problem; look at
the large amount of time used by the set init time. Noticing this, I changed my
sorts to in-place and got better results. I assume there is some copying of
locals or other overhead, but I can't explain the results.
htt
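One way to separate the set construction cost from the lookup cost when benchmarking (a sketch with arbitrary sizes, not the data used in the thread):

```
import timeit

a = list(range(100_000))
b = list(range(50_000, 150_000))

# Everything measured, including building the set from b:
with_init = timeit.timeit("sb = set(b); [x for x in a if x in sb]",
                          globals={"a": a, "b": b}, number=20)

# Set construction moved into setup, so only the filtering is timed:
filter_only = timeit.timeit("[x for x in a if x in sb]",
                            setup="sb = set(b)",
                            globals={"a": a, "b": b}, number=20)

print(with_init, filter_only)
```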
On Sep 19, 2019, at 15:18, Richard Musil wrote:
>
> Ok, I misread the original code, the lists were not sorted in the previous
> results (and they should be). So with the correction to have them sorted,
I think to be fair, you want to show it _both_ ways, just as you’re showing
sets both with
On Sep 19, 2019, at 13:18, Richard Higginbotham wrote:
>>> For example are the file names in A in B and if so return a new list with
>>> those items. Long story short, I wrote some functions to do that. They are
>>> quite simple and fast (due to timsort in no small part).
>>> Even the plain pyt
It might be interesting to note that Numpy provides various set routines
operating on their arrays (and hence lists as well, by conversion):
https://docs.scipy.org/doc/numpy/reference/routines.set.html
For
[intersection](https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.htm
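For reference, the corresponding NumPy calls (they accept plain lists and return sorted, deduplicated arrays):

```
import numpy as np

a = ["f1.txt", "f2.txt", "f3.txt"]
b = ["f2.txt", "f3.txt", "f9.txt"]

print(np.intersect1d(a, b))   # elements present in both
print(np.setdiff1d(a, b))     # elements in a but not in b
print(np.union1d(a, b))       # all elements from either
```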
On Thu, Sep 19, 2019 at 11:58 PM Richard Musil wrote:
> I did not verify the results (I hope they are correct), but the timing
> suggest that the set operation is always faster. When however we add the
> set setup to the mix, the build up time, which is O(n + m) becomes more
> significant than ac
On Thu, Sep 19, 2019 at 10:18 PM Richard Higginbotham
wrote:
> I'm sorry I haven't used the mail list before and didn't send my comments
> to the list. Please see my response to Chris for some example times. I'm
> reluctant to use anything less than about a ms for timing purposes. There's
> too m
I meant a near-linear time complexity. n log n is very, very close to n even
when n is large. It's not constant; if it were, it could be dropped entirely.
It's pretty close though.
On Fri, Sep 20, 2019 at 6:31 AM Richard Higginbotham wrote:
>
> That's exactly what I expected. It's the opposite though. The time to check
> one list for all the elements in another list was far longer. Granted this
> was a Linux box using ext2. The other helpful element is that the filesystem
That's exactly what I expected. It's the opposite though. The time to check one
list for all the elements in another list was far longer. Granted this was a
Linux box using ext2. The other helpful element is that the filesystem returned
the files in the proper order. One of the test runs I ment
On Fri, Sep 20, 2019 at 6:19 AM Richard Higginbotham wrote:
> > c = [val for val in a if val in filterset]
> This is the crux I think. The complexity is the number of elements in
> filterset. Since filterset has to be checked for every value in a it's O(len(a)
> * len(b) * filter search time). You w
>> For example are the file names in A in B and if so return a new list with
>> those items. Long story short, I wrote some functions to do that. They are
>> quite simple and fast (due to timsort in no small part).
>> Even the plain python code is faster than the built in set functions (afaik).
>
On Sep 19, 2019, at 12:09, Richard Higginbotham wrote:
>
> It might / probably? even be adapted to be used in the built in set
> operations.
How?
I’m not sure if you get how sets work, and maybe you think they’re based on a
red-black tree or skiplist or some other logarithmic collection, like
> On Sep 19, 2019, at 12:07, Richard Higginbotham wrote:
>
> It's faster than the built-in set type.
Not according to timeit. If you want to dispute that, you need a proper
benchmark, not just repeating the claims from your inaccurate one.
I’m not sure what your relationtype is supposed to do (
On Fri, Sep 20, 2019 at 5:11 AM Richard Higginbotham wrote:
> The original use case was detecting new files in various directories. Each
> directory was expected to have millions of 4KB files and maybe at most 100k
> of new files. This was in around 2002 and it was a huge time savings for me.
>
On Fri, Sep 20, 2019 at 5:08 AM Richard Higginbotham wrote:
>
> It's faster than the built-in set type. Uniqueness of values isn't a
> requirement. The values need to be comparable so that they can be sorted.
> This allows a linear time complexity.
>
Note that the linear time complexity assumes
The value of the order is actually the critical part. If they are both ordered,
I can complete all comparisons in one pass over both lists at the same time.
Let's think about "in A and not B":
a = [1,2,3,6]
b = [1,4,5,6,7]
start at a[0] and b[0],
0 -- a[i] == b[j] so increment both indexes.
1 -- a[i]
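A runnable version of that single-pass walk (a sketch following the steps above; not the thread's actual implementation):

```
def in_a_not_b(a, b):
    # Items of sorted list a that do not appear in sorted list b,
    # found in one simultaneous pass: O(len(a) + len(b)) comparisons.
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1          # match: skip it, advance both indexes
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1          # a[i] cannot match anything later in b
        else:
            j += 1          # b[j] is behind, catch it up
    result.extend(a[i:])    # whatever remains in a has no match in b
    return result

print(in_a_not_b([1, 2, 3, 6], [1, 4, 5, 6, 7]))   # -> [2, 3]
```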
It's faster than the built-in set type. Uniqueness of values isn't a
requirement. The values need to be comparable so that they can be sorted. This
allows a linear time complexity. That's basically the only constraint on the
values. Here are some speed test results and pseudocode.
def naive(a,b):
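The pseudocode is cut off by the archive; purely for context (and not necessarily what the author wrote), a typical naive quadratic version of the same comparison would be:

```
def naive(a, b):
    # Quadratic baseline: for every item of a, scan the whole of b for a match.
    return [x for x in a if x in b]
```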
On Sep 18, 2019, at 12:29, Richard Higginbotham wrote:
>
> I'm not sure if there is any interest by others but I have frequently come
> across cases where I would like to compare items in one list in another
> similar to relational algebra.
For anyone who doesn’t want to read this whole reply:
On Thu, Sep 19, 2019 at 10:32 PM Steven D'Aprano wrote:
>
> On Thu, Sep 19, 2019 at 03:26:27PM +1000, Chris Angelico wrote:
> > On Thu, Sep 19, 2019 at 3:08 PM Richard Higginbotham
> > wrote:
> > >
> > > I'm not sure if there is any interest by others but I have frequently
> > > come across cas
On Thu, Sep 19, 2019 at 03:26:27PM +1000, Chris Angelico wrote:
> On Thu, Sep 19, 2019 at 3:08 PM Richard Higginbotham
> wrote:
> >
> > I'm not sure if there is any interest by others but I have frequently come
> > across cases where I would like to compare items in one list in another
> > simi
On Thu, Sep 19, 2019 at 3:08 PM Richard Higginbotham wrote:
>
> I'm not sure if there is any interest by others but I have frequently come
> across cases where I would like to compare items in one list in another
> similar to relational algebra. For example are the file names in A in B and
> if