Re: An interesting data structure with search time O(sqrt n)

2015-12-03 Thread Andrei Alexandrescu via Digitalmars-d

On 12/02/2015 06:26 AM, Timon Gehr wrote:

Assume the initial array is sorted from largest to smallest. This will
be the worst case for this construction algorithm, as each element will
be sifted all the way down to leaf level of the last heap.

[...]

Ω(n^(3/2)).


Thanks. My math-fu is not good enough to properly follow the argument, 
but that bound sounds high; one intuition is that there shouldn't be 
more work done than in sorting - fewer swaps, fewer comparisons, less 
resulting structure.


For example, heapifying the array 1, 2, 3, 4, 5, 6, 7, 8, 9 in 1+3+5 max 
heaps results in the array 9, 8, 6, 7, 5, 4, 3, 2, 1 which is "less 
sorted" than the fully sorted array. (I know, that doesn't prove anything.)


At this point I need to either get to the bottom of the math or put 
together an experimental rig that counts comparisons and swaps.
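
A minimal sketch of such a rig in D (entirely my own; all helper names are 
hypothetical): it tallies comparisons and swaps while heapifying the 1+3+5 
subarrays of the reverse-sorted worst case in place. Extending siftDown to 
cross heap boundaries, as in the DAG construction, is straightforward.

import std.algorithm : swap;
import std.stdio : writefln;

size_t comparisons, swaps; // global tallies for the experiment

bool lessCounted(T)(T a, T b) { ++comparisons; return a < b; }
void swapCounted(T)(ref T a, ref T b) { ++swaps; swap(a, b); }

// Sift h[i] down within the max heap h, counting every comparison and swap.
void siftDown(T)(T[] h, size_t i) {
    for (;;) {
        auto c = 2*i + 1;
        if (c >= h.length) break;
        if (c + 1 < h.length && lessCounted(h[c], h[c + 1])) ++c;
        if (!lessCounted(h[i], h[c])) break;
        swapCounted(h[i], h[c]);
        i = c;
    }
}

void main() {
    auto a = [9, 8, 7, 6, 5, 4, 3, 2, 1];  // the conjectured worst case
    auto spans = [[0, 1], [1, 4], [4, 9]]; // heaps of sizes 1, 3, 5
    foreach (s; spans) {
        auto h = a[s[0] .. s[1]];
        foreach_reverse (i; 0 .. h.length) siftDown(h, i);
    }
    writefln("comparisons=%s swaps=%s", comparisons, swaps);
}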


(BTW I figured one reason for which these DAG heaps work at all is that 
there's never a need to sift an element up between heaps; that would be 
inefficient because the root of a subheap has multiple parents. Sifting 
down is efficient because each node has at most two children.)



Andrei



Re: An interesting data structure with search time O(sqrt n)

2015-12-03 Thread Timon Gehr via Digitalmars-d

On 12/03/2015 11:44 PM, Andrei Alexandrescu wrote:

On 12/02/2015 06:26 AM, Timon Gehr wrote:

Assume the initial array is sorted from largest to smallest. This will
be the worst case for this construction algorithm, as each element will
be sifted all the way down to leaf level of the last heap.

[...]

Ω(n^(3/2)).


Thanks. My math-fu is not good enough to properly follow the argument,


There was actually an (asymptotically inconsequential) error in the 
initial expression. Fixed version:


Ignoring the work performed in the heaps each element starts in (which 
is obviously at most O(n)), we can lower bound the performed work; for 
the heap of size j, we sift j elements down all the following heaps of 
sizes i = j+2, j+4, …, m:



∑ⱼ[j∈{1,3,…,m}]·j·∑ᵢ[i∈{j+2,j+4,…,m}]·log(i)
=
∑ᵢ∑ⱼ[j∈{1,3,…,m}]·[i∈{j+2,j+4,…,m}]·j·log(i)
=
∑ᵢ∑ⱼ∑ₖ∑ₗ[i=2·k+1]·[j=2·l+1]·[1≤j≤m]·[j+2≤i≤m]·j·log(i)
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤j≤m]·[j+2≤i]·j
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤j≤i-2]·j
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ⱼ∑ₗ[j=2·l+1]·[1≤2·l+1≤i-2]·(2·l+1)
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ₗ[1≤2·l+1≤i-2]·(2·l+1)
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·∑ₗ[0≤l≤(i-3)/2]·(2·l+1)
=
∑ᵢ∑ₖ[i=2·k+1]·[3≤i≤m]·log(i)·((i-1)/2)²
=
∑ₖ[3≤2·k+1≤m]·log(2·k+1)·k²
=
∑ₖ[1≤k≤(m-1)/2]·log(2·k+1)·k²
≥
Ω(m³)
=
Ω(n^(3/2)).



Less formally:

The heap of size 3 is sifted through by 1=1² element from the first heap.
The heap of size 5 is sifted through by 1+3=2² elements from the first 
and second heaps.
The heap of size 7 is sifted through by 1+3+5=3² elements from the first 
three heaps.

...
The heap of size 2·k+1 is sifted through by k² elements.
...
The heap of size m is sifted through by ((m-1)/2)² elements.

This gives the last sum in the above derivation:

∑ₖ[1≤k≤(m-1)/2]·log(2·k+1)·k²

Here k ranges from 1 to (m-1)/2, the factor log(2·k+1) is (at least) the 
cost of sifting through the heap of size 2·k+1, and the factor k² is the 
number of elements sifting through that heap.


The sum can be lower bounded by
1²+2²+3²+…+((m-1)/2)²
=
(m+1)·(m-1)·(m-3)/24+(m+1)·(m-1)/8
=
(m³-m)/24
≥
Ω(m³).
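
(Sanity check, m=5: 1²+2² = 5 and (5³-5)/24 = 120/24 = 5.)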



but that bound sounds high; one intuition is that there shouldn't be
more work done than in sorting - fewer swaps, fewer comparisons, less
resulting structure.
...


That intuition is spot on, but the sorting algorithm that the 
construction algorithm imitates most closely is a version of insertion sort.



For example, heapifying the array 1, 2, 3, 4, 5, 6, 7, 8, 9 in 1+3+5 max
heaps results in the array 9, 8, 6, 7, 5, 4, 3, 2, 1 which is "less
sorted" than the fully sorted array. (I know, that doesn't prove anything.)
...


It provides motivation to try and find another construction algorithm 
that runs asymptotically faster than n log n, or to prove that n log n 
is asymptotically optimal.



At this point I need to either get to the bottom of the math or


The basic argument is not too involved: if each element needs to be 
sifted all the way down to the last heap, then for each of the ~√n 
heaps, you need to sift all of its ~√n elements down ~√n other heaps, 
which requires more than ≈n^(3/2) operations. The above derivation makes 
this rigorous.
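
Spelled out as a crude count, ignoring log factors: ~√n heaps, each 
contributing ~√n elements, each of which is sifted down through ~√n 
later heaps, gives √n·√n·√n = n^(3/2) whole-heap traversals as a lower 
bound.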



put together an experimental rig that counts comparisons and swaps.

(BTW I figured one reason for which these DAG heaps work at all is that
there's never a need to sift an element up between heaps; that would be
inefficient because the root of a subheap has multiple parents. Sifting
down is efficient because each node has at most two children.)
...


Good point. I had declared victory on the "insert" front too early.


Re: An interesting data structure with search time O(sqrt n)

2015-12-02 Thread Timon Gehr via Digitalmars-d

On 12/02/2015 12:26 PM, Timon Gehr wrote:

...
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²



It should actually be (⌊i/2⌋-1)² here, but this does not change the 
asymptotics.


Re: An interesting data structure with search time O(sqrt n)

2015-12-02 Thread Andrei Alexandrescu via Digitalmars-d

On 12/02/2015 06:26 AM, Timon Gehr wrote:

The linear search can have way better locality, so it is likely that
actually the binary search dominates for some data sets.


Galloping search ftw (good locality, log complexity). -- Andrei


Re: An interesting data structure with search time O(sqrt n)

2015-12-02 Thread Timon Gehr via Digitalmars-d

On 12/02/2015 03:29 PM, Timon Gehr wrote:

On 12/02/2015 12:26 PM, Timon Gehr wrote:

...
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²



It should actually be (⌊i/2⌋-1)² here, but this does not change the
asymptotics.


Oops. No, it was actually right. Sorry for the noise. :-)


Re: An interesting data structure with search time O(sqrt n)

2015-12-02 Thread Chris Wright via Digitalmars-d
On Wed, 02 Dec 2015 05:58:24 +, Navin wrote:

> Nice to see this interesting post and learn.
> 
>   I have a few questions.
> 
> 1) Is this an offline data structure, since you don't know what future
> elements will be? I.e., later elements from n to 2n can break or
> reshape your heaps in the worst case. Or is it a dynamic data
> structure?

You can usually change an offline data structure into an amortized online 
one. (I think.) You can usually change an amortized online algorithm into 
a deamortized one with a series of ugly hacks and large increases in 
constants.

I think you can make insertions work and not be too terrible with an 
amortized algorithm. Something like:

To insert a value X into the square heap, find the largest heap with a 
head equal to X. If there is none, find the smallest heap with a head 
greater than X. If there is none, choose the final heap. If that heap is 
not full, insert X into it. If it is full, trigger a rebalance.

'Rebalance' is an operation we use to ensure that each intermediate heap 
is exactly half full (rounding down) and the final heap is no more than 
half full. We start at a given heap. We calculate its surplus -- the 
number of elements it has in excess of half its capacity. We scan to the 
right (increasing heap sizes) until we find a heap with at least enough 
spare capacity to hold the surpluses of all the prior heaps we've 
scanned. If we reach the end of the square, we create a new heap. This is 
the target heap.

Now we repeatedly bubble up values from lower heaps to higher ones until 
no heap between the one we tried to insert into and the target heap has 
any surplus.
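
A rough sketch of that target-finding scan in D, assuming the odd heap 
sizes 1, 3, 5, ... (so heap k holds 2*k+1 slots) and a hypothetical 
fill[] array tracking how many slots of each heap are used; names and 
layout are mine, not part of the proposal above:

// Scan right from the overfull heap, accumulating surplus (elements in
// excess of half capacity), until some heap's spare capacity absorbs it.
size_t findRebalanceTarget(const size_t[] fill, size_t start) {
    immutable startCap = 2*start + 1;
    size_t surplus = fill[start] > startCap/2 ? fill[start] - startCap/2 : 0;
    foreach (k; start + 1 .. fill.length) {
        immutable cap = 2*k + 1;
        if (cap - fill[k] >= surplus) return k; // enough spare room here
        if (fill[k] > cap/2) surplus += fill[k] - cap/2;
    }
    return fill.length; // reached the end of the square: create a new heap
}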

This costs us in the worst case something like O(n * log(n) * sqrt(n)). I 
haven't done the full analysis. But after doing a large rebalance, you 
shouldn't need to do a large rebalance for quite a while, so the average 
case, even on adversarial data, should be not so horrible.

Deletions are simple: find the element and remove it from its heap. This 
can, however, reduce a heap to less than half filled. (This means I can 
insert and delete elements in an adversarial pattern to force all 
elements into one heap.) A similar rebalance algorithm is needed, but 
it's harder on this side because we're using max heaps everywhere and 
need to fill lower heaps.

> 2) Searching in min or max heaps is bad, isn't it?

Linear time in the worst case. If you're looking for something in a max 
heap that happens to be the smallest element, it can be in any leaf node, 
which gives you half the tree to look through. And the average isn't much 
better.

However, if we can guarantee that an element is in one heap inside this 
square heap, it costs us O(sqrt n) to search that heap since that heap 
has no more than sqrt(n) elements.

> Let's say we choose
> max heaps. Now we have the root as 10^9 in the second-to-last heap, i.e.
> around n^2 elements. Its children are 4*10^8 and 5*10^8. If I'm
> searching for, say, 4.5*10^8 my job is easy, but if I'm searching for
> 1000, I have to search both subtrees, and it goes linear, around n^2
> in the worst case. Did I overlook anything?
> 
> Instead of heaps, wouldn't a single sorted array, or breaking them into
> a series of sorted arrays (skip-list kind of stuff), be fine if we just
> want an offline data structure?

A skiplist is great. Problem is, it's random access to memory. That's bad 
enough at the best of times, and it's utter garbage if you're storing 
data on spinning disks. Even if you store it as a sorted list rather than 
a linked list, which means you never insert anything ever, it's 
O(log(n/B)) seeks and reads to find an element. If you store it as a 
linked list, 
it's O(log n) reads.

A square heap with an insertion algorithm as I describe gives you O(1) 
seeks and O(sqrt n/B) reads.

(Here, B stands for block size, which in this case is the number of data 
elements you will naturally get for free or almost for free after reading 
one byte. When I was doing data structure research, the rule of thumb was 
that a block is roughly 1MB for spinning disks. Divide that by element 
size to get B.

We track seeks separately because they're expensive. 9ms is a reasonable 
seek time. Comparatively, reading data sequentially is extremely cheap, 
especially if you can inform the disk scheduler that you are going to 
perform sequential reads. Seeks are expensive no matter what, even if you 
can inform the scheduler in advance -- and the skiplist doesn't let you 
do that.

Even if you switch to SSDs, SSDs tend to be ~4x faster on sequential 
reads than random reads.)

So if you have 64-byte data structures, for instance, and you've got a 
billion on disk, you have sixteen seeks for a skiplist on average, which 
costs you 144ms. (You can drop that a bit by keeping the top of the 
skiplist in memory.)
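
(For the record, the sixteen comes out of the rule of thumb above: B = 
2^20/64 = 16384 elements per block, log2(10^9/16384) ≈ 16 levels, hence 
16 seeks × 9ms = 144ms.)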

The square heap, on the other hand? You've got a total of ~1500 heaps, so 
you can store their offsets and heads in 

Re: An interesting data structure with search time O(sqrt n)

2015-12-02 Thread Timon Gehr via Digitalmars-d

On 12/01/2015 11:48 PM, Andrei Alexandrescu wrote:

On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:

On 11/30/15 9:47 PM, Timon Gehr wrote:

On 12/01/2015 03:33 AM, Timon Gehr wrote:

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:

So now consider my square heaps. We have O(n) build time (just a bunch
of heapifications) and O(sqrt n) search.


How do you build in O(n)? (The initial array is assumed to be completely
unordered, afaict.)


(I meant to say: There aren't any assumptions on the initial ordering of
the array elements.)


That's quite challenging. (My O(n) estimate was off the cuff and
possibly wrong.) Creating the structure entails simultaneously solving
the selection problem (find the k smallest elements) for several values
of k. I'll post here if I come up with something. -- Andrei


OK, I think I have an answer to this in the form of an efficient algorithm.

First off: sizes 1+3+5+7+... seem a great choice, I'll use that for the
initial implementation (thanks Titus!).

Second: the whole max heap is a red herring - min heap is just as good,
and in fact better. When doing the search just overshoot by one then go
back one heap to the left and do the final linear search in there.

So the structure we're looking at is an array of adjacent min-heaps of
sizes 1, 3, 5, etc. The heaps are ordered (the maximum of heap k is less
than or equal to the minimum of heap k+1). Question is how do we build
such an array of heaps in place starting from an unstructured array of
size n.

One simple approach is to just sort the array in O(n log n). This
satisfies all properties - all adjacent subsequences are obviously
ordered, and any subsequence has the min heap property. As an
engineering approach we may as well stop here - sorting is a widely
studied and well implemented algorithm. However, we hope to get away
with less work because we don't quite need full sorting.

Here's the intuition: the collection of heaps can be seen as one large
heap that has a DAG structure (as opposed to a tree). In the DAG, the
root of heap k+1 is the child of all leaves of heap k (see
http://imgur.com/I366GYS which shows the DAG for the 1, 3, 5, and 7 heaps).

Clearly getting this structure to respect the heap property is all
that's needed for everything to work - so we simply apply the classic
heapify algorithm to it. It seems it can be applied almost unchanged -
starting from the end, sift each element down the DAG.

This looks efficient and minimal; I doubt there's any redundant work.
However, getting bounds for complexity of this will be tricky. Classic
heapify is tricky, too - it seems to have complexity O(n log n) but in
fact has complexity O(n) - see nice discussion at
http://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity.
When applying heapify to the DAG, there are more restrictions and the
paths are longer, so a sliver more than O(n) is expected.

Anyway, this looks ready for a preliminary implementation and some more
serious calculations.



Assume the initial array is sorted from largest to smallest. This will 
be the worst case for this construction algorithm, as each element will 
be sifted all the way down to leaf level of the last heap.


For simplicity, let's assume that n = m² for m odd, such that the sizes 
of the heaps are 1,3,…,m. To sift some element down one entire heap of 
size i takes ≥ log₂(i) steps.


Ignoring the work performed in the heaps each element starts in (which 
is obviously at most O(n) in total), we can lower bound the performed 
work; for the heap of size j, we sift j elements down all the following 
heaps of sizes i = j+1, j+2, …, m:


∑ⱼ[j∈{1,3,…,m}]·j·∑ᵢ[j+1≤i≤m]·log(i)
=
∑ᵢ∑ⱼ[j∈{1,3,…,m}]·[j+1≤i≤m]·j·log(i)
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ[j∈{1,3,…,m}]·[j+1≤i]·j
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤j≤m]·[j≤i-1]·j
=
∑ᵢ[2≤i≤m]·log(i)·∑ⱼ∑ₖ[j=2·k+1]·[1≤2·k+1≤i-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[1≤2·k+1≤i-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·∑ₖ[0≤k≤⌊i/2⌋-1]·(2·k+1)
=
∑ᵢ[2≤i≤m]·log(i)·⌊i/2⌋²
≥
Ω(m³)
=
Ω(n^(3/2)).

I.e. sorting is asymptotically faster, and probably also in practice. 
Moreover, the construction algorithm would be slower even if sifting 
down one entire heap ran in constant time.


However, for your adapted data structure with min-heaps, implementing 
insert and remove in O(√n log n) is now easy using your "sifting in a 
DAG" approach.


One way to prove a useful lower bound on the time it must take to build 
might be to observe that building recursively can be used for sorting.



One more interesting thing: the heap heads are sorted, so when
searching, the heap corresponding to the searched item can be found
using binary search. That makes that part of the search essentially
negligible - the lion's share will be the linear search on the last
mile. In turn, that suggests that more, smaller heaps would be a
better choice. (At an extreme, if we have an array of heaps each
proportional to log(n), then we get search time O(log n) even though the
array is not entirely sorted.)

Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Andrei Alexandrescu via Digitalmars-d

On 12/01/2015 12:13 AM, Andrei Alexandrescu wrote:

On 11/30/15 9:47 PM, Timon Gehr wrote:

On 12/01/2015 03:33 AM, Timon Gehr wrote:

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:

So now consider my square heaps. We have O(n) build time (just a bunch
of heapifications) and O(sqrt n) search.


How do you build in O(n)? (The initial array is assumed to be completely
unordered, afaict.)


(I meant to say: There aren't any assumptions on the initial ordering of
the array elements.)


That's quite challenging. (My O(n) estimate was off the cuff and
possibly wrong.) Creating the structure entails simultaneously solving
the selection problem (find the k smallest elements) for several values
of k. I'll post here if I come up with something. -- Andrei


OK, I think I have an answer to this in the form of an efficient algorithm.

First off: sizes 1+3+5+7+... seem a great choice, I'll use that for the 
initial implementation (thanks Titus!).


Second: the whole max heap is a red herring - min heap is just as good, 
and in fact better. When doing the search just overshoot by one then go 
back one heap to the left and do the final linear search in there.


So the structure we're looking at is an array of adjacent min-heaps of 
sizes 1, 3, 5, etc. The heaps are ordered (the maximum of heap k is less 
than or equal to the minimum of heap k+1). Question is how do we build 
such an array of heaps in place starting from an unstructured array of 
size n.


One simple approach is to just sort the array in O(n log n). This 
satisfies all properties - all adjacent subsequences are obviously 
ordered, and any subsequence has the min heap property. As an 
engineering approach we may as well stop here - sorting is a widely 
studied and well implemented algorithm. However, we hope to get away 
with less work because we don't quite need full sorting.


Here's the intuition: the collection of heaps can be seen as one large 
heap that has a DAG structure (as opposed to a tree). In the DAG, the 
root of heap k+1 is the child of all leaves of heap k (see 
http://imgur.com/I366GYS which shows the DAG for the 1, 3, 5, and 7 heaps).


Clearly getting this structure to respect the heap property is all 
that's needed for everything to work - so we simply apply the classic 
heapify algorithm to it. It seems it can be applied almost unchanged - 
starting from the end, sift each element down the DAG.
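
To make the sifting concrete, here is a hedged D sketch of "sift down 
the DAG" for the min-heap variant with the odd sizes 1, 3, 5, ... (the 
layout and all names are my assumptions, not code from this thread): 
heap k occupies slots k² .. (k+1)², and the root of heap k+1 acts as the 
sole child of every leaf of heap k.

import std.algorithm : swap;

void siftDownDag(T)(T[] a, size_t k, size_t l) {
    for (;;) {
        immutable base = k*k, heapSize = 2*k + 1;
        auto c = 2*l + 1;                        // left child within heap k
        if (c < heapSize) {                      // internal node: usual sift
            if (c + 1 < heapSize && a[base + c + 1] < a[base + c]) ++c;
            if (!(a[base + c] < a[base + l])) return;
            swap(a[base + l], a[base + c]);
            l = c;
        } else {                                 // leaf: child is next root
            immutable next = (k + 1)*(k + 1);
            if (next >= a.length || !(a[next] < a[base + l])) return;
            swap(a[base + l], a[next]);
            ++k; l = 0;
        }
    }
}

// Heapify the DAG: starting from the end, sift each element down.
void buildSquareHeap(T)(T[] a) {
    import std.math : sqrt;
    foreach_reverse (g; 0 .. a.length) {
        auto k = cast(size_t) sqrt(cast(real) g); // k = floor(sqrt(g))
        while (k*k > g) --k;                      // guard against rounding
        while ((k + 1)*(k + 1) <= g) ++k;
        siftDownDag(a, k, g - k*k);
    }
}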


This looks efficient and minimal; I doubt there's any redundant work. 
However, getting bounds for complexity of this will be tricky. Classic 
heapify is tricky, too - it seems to have complexity O(n log n) but in 
fact has complexity O(n) - see nice discussion at 
http://stackoverflow.com/questions/9755721/how-can-building-a-heap-be-on-time-complexity. 
When applying heapify to the DAG, there are more restrictions and the 
paths are longer, so a sliver more than O(n) is expected.


Anyway, this looks ready for a preliminary implementation and some more 
serious calculations.


One more interesting thing: the heap heads are sorted, so when 
searching, the heap corresponding to the searched item can be found 
using binary search. That makes that part of the search essentially 
negligible - the lion's share will be the linear search on the last 
mile. In turn, that suggests that more, smaller heaps would be a better 
choice. (At an extreme, if we have an array of heaps each proportional 
to log(n), then we get search time O(log n) even though the array is 
not entirely sorted.)
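
A sketch of that search under the same hypothetical layout (heap k at 
slots k² .. (k+1)², so heads sit at indices 0, 1, 4, 9, ..., with 
numHeaps the number of heaps actually present); duplicates straddling a 
heap boundary would need extra care:

bool contains(T)(const(T)[] a, size_t numHeaps, T x) {
    size_t lo = 0, hi = numHeaps;        // find first head greater than x
    while (lo < hi) {
        immutable mid = (lo + hi) / 2;
        if (a[mid*mid] <= x) lo = mid + 1;
        else hi = mid;
    }
    if (lo == 0) return false;           // x precedes the global minimum
    immutable k = lo - 1;                // last heap whose head (min) <= x
    immutable end = (k + 1)*(k + 1) < a.length ? (k + 1)*(k + 1) : a.length;
    foreach (i; k*k .. end)              // linear search, O(sqrt n) elements
        if (a[i] == x) return true;
    return false;
}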



Andrei



Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Andrei Alexandrescu via Digitalmars-d

On 12/01/2015 04:50 AM, Emil Kirschner wrote:

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:

Okasaki's book is a continued inspiration of data structures and
algorithms.
Start with a simple array of data. Then mentally decompose that array
into a concatenation of smaller arrays: first has size 1, second has
size 4, third has size 9, then 16, 25, 36, ... Generally the size of
these imaginary subarrays grows quadratically. And they're adjacent to
each other. The last array may be incomplete.

Please share any thoughts!


Andrei


Interesting, but isn't that basically a skip list with uneven sub-lists?
Insert could be quite complex unless the entire data structure is backed
by a contiguous array or a contiguous linked list.


Doesn't look like a skip list to me. -- Andrei


Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread rsw0x via Digitalmars-d
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu 
wrote:
Okasaki's book is a continued inspiration of data structures 
and algorithms.


[...]


Sort of reminds me of a modified Hashed Array Tree — not keen on 
the name, I think "Bucket Array" would have been better.

https://en.wikipedia.org/wiki/Hashed_array_tree


Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Navin via Digitalmars-d
On Tuesday, 1 December 2015 at 22:48:56 UTC, Andrei Alexandrescu 
wrote:

[...]


Nice to see this interesting post and learn.

 I have a few questions.

1) Is this an offline data structure, since you don't know what 
future elements will be? I.e., later elements from n to 2n can break 
or reshape your heaps in the worst case. Or is it a dynamic data 
structure?


2) Searching in min or max heaps is bad, isn't it? Let's say we 
choose max heaps. Now we have the root as 10^9 in the second-to-last 
heap, i.e. around n^2 elements. Its children are 4*10^8 and 5*10^8. 
If I'm searching for, say, 4.5*10^8 my job is easy, but if I'm 
searching for 1000, I have to search both subtrees, and it goes 
linear, around n^2 in the worst case. Did I overlook anything?


Instead of heaps, wouldn't a single sorted array, or breaking them 
into a series of sorted arrays (skip-list kind of stuff), be fine if 
we just want an offline data structure?


Or is this some domain-specific data structure where you only want 
max/min in some sequence?







Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Emil Kirschner via Digitalmars-d
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu 
wrote:
Okasaki's book is a continued inspiration of data structures 
and algorithms.
Start with a simple array of data. Then mentally decompose that 
array into a concatenation of smaller arrays: first has size 1, 
second has size 4, third has size 9, then 16, 25, 36, ... 
Generally the size of these imaginary subarrays grows 
quadratically. And they're adjacent to each other. The last 
array may be incomplete.


Please share any thoughts!


Andrei


Interesting, but isn't that basically a skip list with uneven 
sub-lists? Insert could be quite complex unless the entire data 
structure is backed by a contiguous array or a contiguous linked list.


Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Marko Mikulicic via Digitalmars-d
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu 
wrote:
Okasaki's book is a continued inspiration of data structures 
and algorithms.


[...]



Hi,

You might find relevant ideas in 
https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf and 
http://www.sciencedirect.com/science/article/pii/002280900379.


Cheers,
Marko



Re: An interesting data structure with search time O(sqrt n)

2015-12-01 Thread Andrei Alexandrescu via Digitalmars-d

On 12/1/15 5:19 AM, Marko Mikulicic wrote:

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:

Okasaki's book is a continued inspiration of data structures and
algorithms.

[...]



Hi,

You might find relevant ideas in
https://cs.uwaterloo.ca/research/tr/1979/CS-79-31.pdf and
http://www.sciencedirect.com/science/article/pii/002280900379.


Noted, thanks! -- Andrei




Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread H. S. Teoh via Digitalmars-d
On Mon, Nov 30, 2015 at 03:57:24PM -0500, Andrei Alexandrescu via Digitalmars-d 
wrote:
> On 11/30/15 3:20 PM, H. S. Teoh via Digitalmars-d wrote:
> >On Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via 
> >Digitalmars-d wrote:
> >>Okasaki's book is a continued inspiration of data structures and
> >>algorithms.
> >>
> >>I was thinking along the following lines: typical collections are
> >>searchable in linear time. Then we have highly structured
> >>collections that feature logarithmic search time. But there seems to
> >>be nothing in between. So I was thinking, what would be a data
> >>structure that allows O(sqrt n) search times?
> >>
> >>After a little tinkering, here's what I came up with.
> >
> >Interesting indeed.
> >
> >It leaves me wondering, though, what's the point of having an O(sqrt
> >n) collection? What are the advantages?  Why would I use this
> >structure instead of, say, a traditional array heap with O(log n)
> >search time?
> 
> (Heaps offer only linear search time. You may take advantage of the
> heap structure to skip portions of the array, but on average and in
> the worst case the search is still O(n). So I assume you meant "sorted
> array or one of the classic search trees".)

Right, I wrote that without thinking. :-P


> The motivation starts with a desire to use arrays as the fundamental
> layout.  There have been many trends in computing as of late, among
> which: cache hierarchies are here to stay and contiguous layout is
> preferable.
> 
> The short of it is, arrays are king. No two ways about it - following
> pointers is a losing strategy when there's an alternative. A variety
> of new data structures (Clojure's arrays, heaps with tombstones) avoid
> classic pointer-based data structures in favor of adding structure on
> top of arrays.

I'm not so sure about that. At the micro level, yes, following pointers
for a tree / graph / etc. that could easily fit in a small/medium-sized
array is a losing strategy, due to cache misses, hardware prefetchers,
etc. When you're dealing with larger amounts of data, though, things
become less clear-cut. If your array size is on the order of MB's or
GB's, pointers start looking a lot more attractive. IMO in the long run
what will win is data structures that use tree or pointer based
implementations in the large scale, but built on cache-friendly
array-based structures below a certain level of granularity.

I agree with you, though, that array-based structures of intermediate
big-O complexities are very interesting. If you can come up with an
array structure that has the same complexity for all common operations,
that could be useful as a quick-fix drop in solution in cases where top
performance isn't required, but you'd like something better than O(n)
array scanning.


T

-- 
Philosophy: how to make a career out of daydreaming.


An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

Okasaki's book is a continued inspiration of data structures and algorithms.

I was thinking along the following lines: typical collections are 
searchable in linear time. Then we have highly structured collections 
that feature logarithmic search time. But there seems to be nothing in 
between. So I was thinking, what would be a data structure that allows 
O(sqrt n) search times?


After a little tinkering, here's what I came up with.

Start with a simple array of data. Then mentally decompose that array 
into a concatenation of smaller arrays: first has size 1, second has 
size 4, third has size 9, then 16, 25, 36, ... Generally the size of 
these imaginary subarrays grows quadratically. And they're adjacent to 
each other. The last array may be incomplete.


Example: we decompose an array of size 35 into: an array of size 1 
followed by arrays of sizes 4, 9, 16, 5 (last being an incomplete 
fragment of an array of size 25).


Now each of these arrays we organize as a max heap. Moreover, we arrange 
data such that the maximums of these heaps are in INCREASING order. That 
means the smallest element of the entire (initial) array is at the first 
position, then followed by the next 4 smallest organized in a max heap, 
followed by the next 9 smallest organized in a max heap, ... etc. There 
are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.


That's the layout! Now, to search we look at one heap at a time. 
Whenever the maximum element (first element in the subarray) is smaller 
than the searched value, we skip over that entire heap and go to the 
next one. In the worst case we'll skip O(sqrt n) heaps. When the max 
value in a heap is at least the searched element, we have found the 
right heap, and we run a linear search among its O(sqrt n) elements.
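
A sketch of this scan in D, under the assumption that block sizes are 
1, 4, 9, ... and the last, possibly partial heap simply takes whatever 
elements remain (code and names are mine, for illustration):

bool search(T)(const(T)[] a, T x) {
    size_t start = 0, j = 1;           // heap j occupies j*j slots
    while (start < a.length) {
        immutable end = start + j*j < a.length ? start + j*j : a.length;
        if (!(a[start] < x)) {         // head (the max) >= x: x is here or nowhere
            foreach (i; start .. end)
                if (a[i] == x) return true;
            return false;
        }
        start = end;                   // whole heap < x: skip it
        ++j;
    }
    return false;
}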


The structure is nice for sorting and in particular parallel sorting 
because each subarray can be sorted independently - there's no migration 
into or out of each subarray. Inside each subarray, of course heapsort 
would be a good choice because it takes advantage of the existing max 
heap.


I haven't worked out the math for insertion and deletion yet, but they 
seem to also be around O(sqrt n) or O(log(n) * sqrt(n)). Just wanted to 
share with you and discuss what seems to be an interesting data structure.


Please share any thoughts!


Andrei


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread H. S. Teoh via Digitalmars-d
On Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via Digitalmars-d 
wrote:
> Okasaki's book is a continued inspiration of data structures and
> algorithms.
> 
> I was thinking along the following lines: typical collections are
> searchable in linear time. Then we have highly structured collections
> that feature logarithmic search time. But there seems to be nothing in
> between. So I was thinking, what would be a data structure that allows
> O(sqrt n) search times?
> 
> After a little tinkering, here's what I came up with.

Interesting indeed.

It leaves me wondering, though, what's the point of having an O(sqrt n)
collection? What are the advantages?  Why would I use this structure
instead of, say, a traditional array heap with O(log n) search time?


T

-- 
Why are you blatanly misspelling "blatant"? -- Branden Robinson


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Dmitry Olshansky via Digitalmars-d

On 30-Nov-2015 23:13, Andrei Alexandrescu wrote:

Okasaki's book is a continued inspiration of data structures and
algorithms.

I was thinking along the following lines: typical collections are
searchable in linear time. Then we have highly structured collections
that feature logarithmic search time. But there seems to be nothing in
between. So I was thinking, what would be a data structure that allows
O(sqrt n) search times?

After a little tinkering, here's what I came up with.

Start with a simple array of data. Then mentally decompose that array
into a concatenation of smaller arrays: first has size 1, second has
size 4, third has size 9, then 16, 25, 36, ... Generally the size of
these imaginary subarrays grows quadratically. And they're adjacent to
each other. The last array may be incomplete.

Example: we decompose an array of size 35 into: an array of size 1
followed by arrays of sizes 4, 9, 16, 5 (last being an incomplete
fragment of an array of size 25).

Now each of these arrays we organize as a max heap. Moreover, we arrange
data such that the maximums of these heaps are in INCREASING order. That
means the smallest element of the entire (initial) array is at the first
position, then followed by the next 4 smallest organized in a max heap,
followed by the next 9 smallest organized in a max heap, ... etc. There
are a total of O(sqrt n) heaps, and each has O(sqrt n) elements.

That's the layout! Now, to search we look at one heap at a time.
Whenever the maximum element (first element in the subarray) is smaller
than the searched value, we skip over that entire heap and go to the
next one. In the worst case we'll skip O(sqrt n) heaps. When the max
value in a heap is at least the searched element, we have found the
right heap, and we run a linear search among its O(sqrt n) elements.


Reminds me of the van Emde Boas layout, which is however fractal in 
nature: sqrt(N) pieces each having sqrt(N) elements are subdivided again into 
sqrt(sqrt(N)) pieces and so on.


Not sure if you have seen, but see also cache-oblivious data-structures:
http://erikdemaine.org/papers/BRICS2002/paper.pdf


--
Dmitry Olshansky


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

On 11/30/15 3:29 PM, Dmitry Olshansky wrote:

Reminds me of the van Emde Boas layout, which is however fractal in nature:
sqrt(N) pieces each having sqrt(N) elements are subdivided again into
sqrt(sqrt(N)) pieces and so on.

Not sure if you have seen, but see also cache-oblivious data-structures:
http://erikdemaine.org/papers/BRICS2002/paper.pdf


Thanks, I'll look these up! -- Andrei


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

On 11/30/15 3:20 PM, H. S. Teoh via Digitalmars-d wrote:

On Mon, Nov 30, 2015 at 03:13:11PM -0500, Andrei Alexandrescu via Digitalmars-d 
wrote:

Okasaki's book is a continued inspiration of data structures and
algorithms.

I was thinking along the following lines: typical collections are
searchable in linear time. Then we have highly structured collections
that feature logarithmic search time. But there seems to be nothing in
between. So I was thinking, what would be a data structure that allows
O(sqrt n) search times?

After a little tinkering, here's what I came up with.


Interesting indeed.

It leaves me wondering, though, what's the point of having an O(sqrt n)
collection? What are the advantages?  Why would I use this structure
instead of, say, a traditional array heap with O(log n) search time?


(Heaps offer only linear search time. You may take advantage of the heap 
structure to skip portions of the array, but on average and in the worst 
case the search is still O(n). So I assume you meant "sorted array or 
one of the classic search trees".)


The motivation starts with a desire to use arrays as the fundamental 
layout. There have been many trends in computing as of late, among 
which: cache hierarchies are here to stay and contiguous layout is 
preferable.


The short of it is, arrays are king. No two ways about it - following 
pointers is a losing strategy when there's an alternative. A variety of 
new data structures (Clojure's arrays, heaps with tombstones) avoid 
classic pointer-based data structures in favor of adding structure on 
top of arrays.


So now if we consider the question "how do we organize an array for 
good search performance?", we have a spectrum. Generally we also care 
about insertion and removal.


At one end of the spectrum there's doing nothing at all - that means 
O(1) build (nothing to do), O(n) search, O(1) insert (just append it), 
O(n) removal. Not very nice.


At the other end, the absolute best structuring on top of an array for 
search is sorting. With sorting you get great O(log n) search 
performance. But the others are not nice - O(n log n) build, O(n) add, 
O(n) remove.


So now consider my square heaps. We have O(n) build time (just a bunch 
of heapifications) and O(sqrt n) search. Then (again I haven't worked 
out the math yet) let's assume insertion and removal are both O(sqrt n). 
Then you get something less structured than full sorting, but also just 
good enough to offer the same complexity for each of search, insert, and 
delete. That would be pretty neat.
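
Summarizing that spectrum in one table (question marks on the entries 
that are still conjectures):

              build       search     insert      remove
unsorted      O(1)        O(n)       O(1)        O(n)
square heaps  O(n)?       O(sqrt n)  O(sqrt n)?  O(sqrt n)?
sorted        O(n log n)  O(log n)   O(n)        O(n)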



Andrei



Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Sriram Srinivasan via Digitalmars-d
The key search phrase is "cache oblivious data structures". One 
of the cache-friendly layouts is the van Emde Boas tree.


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Timon Gehr via Digitalmars-d

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:

So now consider my square heaps. We have O(n) build time (just a bunch
of heapifications) and O(sqrt n) search.


How do you build in O(n)? (The initial array is assumed to be completely 
unordered, afaict.)


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Timon Gehr via Digitalmars-d

On 12/01/2015 03:33 AM, Timon Gehr wrote:

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:

So now consider my square heaps. We have O(n) build time (just a bunch
of heapifications) and O(sqrt n) search.


How do you build in O(n)? (The initial array is assumed to be completely
unordered, afaict.)


(I meant to say: There aren't any assumptions on the initial ordering of 
the array elements.)


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Torin via Digitalmars-d
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu 
wrote:
Now each of these arrays we organize as a max heap. Moreover, 
we arrange data such that the maximums of these heaps are in 
INCREASING order. That means the smallest element of the entire 
(initial) array is at the first position, then followed by the 
next 4 smallest organized in a max heap, followed by the next 9 
smallest organized in a max heap, ... etc. There are a total of 
O(sqrt n) heaps, and each has O(sqrt n) elements.


Isn't the size of each heap a little larger than O(sqrt n)? The 
total number of elements you can hold in k heaps is equal to the 
kth square pyramidal number, which is of size O(k^3), so the 
largest heap is of size k^2 = O(n^(2/3))
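
Spelled out: k heaps of sizes 1², 2², …, k² hold k(k+1)(2k+1)/6 ≈ k³/3 
elements in total, so n elements fill k ≈ (3n)^(1/3) heaps, and the 
largest heap holds k² ≈ (3n)^(2/3) elements.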


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Timon Gehr via Digitalmars-d

On 11/30/2015 09:13 PM, Andrei Alexandrescu wrote:

I haven't worked out the math for insertion and deletion yet, but they
seem to also be around O(sqrt n) or O(log(n) * sqrt(n)).


(Assuming the bucket sizes actually grow linearly.) It seems to me that 
for insertion O(√n log n) is very easy to achieve, but not for deletion. 
(For insertion, one simple strategy that satisfies the bound is to 
repeatedly move the maximum to the next heap [1].)


Deletion leaves a hole in one heap, which should be filled by the 
minimum element of the next heap, etc. The minimum cannot be extracted 
efficiently from a max-heap.




[1]
struct Fancy(T){
  T[] a;
  void insert(T t){
    size_t i=1,j=0;
    for(;j+i<=a.length;j+=i,i++){
      if(!(t
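
The listing above is cut off mid-comparison in the archive (the '<' was 
presumably eaten by the HTML archiver). A speculative completion of the 
strategy described above, with heap i holding i slots: replace each 
following full heap's maximum with the carried value, sift to repair, 
and append what remains to the last, partial heap. This is my 
reconstruction, not the original code:

import std.algorithm : swap;

struct Fancy(T){
  T[] a;
  void insert(T t){
    size_t i=1,j=0;
    for(;j+i<=a.length;j+=i,i++){
      if(!(t < a[j])) continue;    // t >= this heap's max: belongs later
      swap(t, a[j]);               // place t here, carry the old max on
      siftDown(a[j .. j+i]);       // repair the max heap we disturbed
    }
    a ~= t;                        // append to the last, partial heap
    siftUp(a[j .. $]);             // restore its heap property
  }
private:
  static void siftDown(T[] h){
    size_t p = 0;
    for(;;){
      auto c = 2*p + 1;
      if(c >= h.length) break;
      if(c + 1 < h.length && h[c] < h[c+1]) ++c; // pick the larger child
      if(!(h[p] < h[c])) break;
      swap(h[p], h[c]);
      p = c;
    }
  }
  static void siftUp(T[] h){
    for(auto c = h.length - 1; c && h[(c-1)/2] < h[c]; c = (c-1)/2)
      swap(h[(c-1)/2], h[c]);
  }
}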

Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

On 11/30/2015 06:19 PM, Titus Nicolae wrote:

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:
...

smallest organized in a max heap, ... etc. There are a total of O(sqrt
n) heaps, and each has O(sqrt n) elements.

...

Hi Andrei,
If I'm not mistaken, the number of heaps is proportional to n^(1/3) not
n^(1/2)
1+2^2+..+n^2 is proportional to n^3
https://en.wikipedia.org/wiki/Square_pyramidal_number
If n heaps have n^3 elements, rewrite n^3=m, then m elements will be
stored in m^(1/3) heaps
Interesting data structure! need to look over the details some more
Titus


Yes, thanks! I just wrote about this on reddit (before having seen 
this). -- Andrei




Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

On 11/30/2015 06:32 PM, Titus Nicolae wrote:

On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:

On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu wrote:
...

smallest organized in a max heap, ... etc. There are a total of
O(sqrt n) heaps, and each has O(sqrt n) elements.

...

Hi Andrei,
If I'm not mistaken, the number of heaps is proportional to n^(1/3)
not n^(1/2)
1+2^2+..+n^2 is proportional to n^3
https://en.wikipedia.org/wiki/Square_pyramidal_number
If n heaps have n^3 elements, rewrite n^3=m, then m elements will be
stored in m^(1/3) heaps
Interesting data structure! need to look over the details some more
Titus


This might give a better balanced structure.

Sum of odd numbers 1+3+5+..+(2n-1) = n^2
For N total elements, there are sqrt(n) heaps, and 2*sqrt(n) elements in
the largest heap
@Andrei: Many happy returns! :)


This is very interesting. Thanks, and thanks! -- Andrei



Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Torin via Digitalmars-d

On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei 
Alexandrescu wrote:

...
smallest organized in a max heap, ... etc. There are a total 
of O(sqrt n) heaps, and each has O(sqrt n) elements.

...

Hi Andrei,
If I'm not mistaken, the number of heaps is proportional to 
n^(1/3) not n^(1/2)

1+2^2+..+n^2 is proportional to n^3
https://en.wikipedia.org/wiki/Square_pyramidal_number
If n heaps have n^3 elements, rewrite n^3=m, then m elements 
will be stored in m^(1/3) heaps
Interesting data structure! need to look over the details some 
more

Titus


I think Titus and I noticed the same thing at the same time.

You could achieve O(sqrt n) lookup time by having each heap 
contain ceil(sqrt(n)) elements (except for the last one), giving 
you O(sqrt n) heaps of size O(sqrt n), but it may make insertion 
trickier as you try to keep the heaps the same size.


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Titus Nicolae via Digitalmars-d
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei Alexandrescu 
wrote:

...
smallest organized in a max heap, ... etc. There are a total of 
O(sqrt n) heaps, and each has O(sqrt n) elements.

...

Hi Andrei,
If I'm not mistaken, the number of heaps is proportional to 
n^(1/3) not n^(1/2)

1+2^2+..+n^2 is proportional to n^3
https://en.wikipedia.org/wiki/Square_pyramidal_number
If n heaps have n^3 elements, rewrite n^3=m, then m elements will 
be stored in m^(1/3) heaps
Interesting data structure! need to look over the details some 
more

Titus


Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Titus Nicolae via Digitalmars-d

On Monday, 30 November 2015 at 23:19:44 UTC, Titus Nicolae wrote:
On Monday, 30 November 2015 at 20:13:15 UTC, Andrei 
Alexandrescu wrote:

...
smallest organized in a max heap, ... etc. There are a total 
of O(sqrt n) heaps, and each has O(sqrt n) elements.

...

Hi Andrei,
If I'm not mistaken, the number of heaps is proportional to 
n^(1/3) not n^(1/2)

1+2^2+..+n^2 is proportional to n^3
https://en.wikipedia.org/wiki/Square_pyramidal_number
If n heaps have n^3 elements, rewrite n^3=m, then m elements 
will be stored in m^(1/3) heaps
Interesting data structure! need to look over the details some 
more

Titus


This might give a better balanced structure.

Sum of odd numbers 1+3+5+..+(2n-1) = n^2
For N total elements, there are sqrt(n) heaps, and 2*sqrt(n) 
elements in the largest heap

@Andrei: Many happy returns! :)



Re: An interesting data structure with search time O(sqrt n)

2015-11-30 Thread Andrei Alexandrescu via Digitalmars-d

On 11/30/15 9:47 PM, Timon Gehr wrote:

On 12/01/2015 03:33 AM, Timon Gehr wrote:

On 11/30/2015 09:57 PM, Andrei Alexandrescu wrote:

So now consider my square heaps. We have O(n) build time (just a bunch
of heapifications) and O(sqrt n) search.


How do you build in O(n)? (The initial array is assumed to be completely
unordered, afaict.)


(I meant to say: There aren't any assumptions on the initial ordering of
the array elements.)


That's quite challenging. (My O(n) estimate was off the cuff and 
possibly wrong.) Creating the structure entails simultaneously solving 
the selection problem (find the k smallest elements) for several values 
of k. I'll post here if I come up with something. -- Andrei