Re: [RFC] Adding a real HashTable implementation to gnulib

2019-08-26 Thread Tim Rühsen
Hi Bruno,

sorry for the delay. This somewhat dropped off the tip of my list, but
lurks at the edge of my consciousness, coming up from time to time.

On 12/5/18 12:43 AM, Bruno Haible wrote:
> Hi Tim,
> 
>> Currently we have
>> - positive integer (N): resize by old value * N
>> - negative integer (N): resize by old value + (-N)
> 
> It's currently the opposite:
> - positive integer (N): resize to old size + N
> - negative integer (N): resize to old size * (-N)
> 
>>>   - The ability to set a growth policy > 0. This leads to quadratic
>>> run time behaviour. I mean, you make such an effort to have O(1)
>>> access on average, and then the fixed-size increment of the size
>>> turns the insertion operation into an average-O(n) operation.
>>
>> In Wget2 we use reasonable default sizes for the hashmap to make
>> resizing more or less a corner case.
> 
> The problem is that when you guessed wrong about the "corner case",
> the rehash will consume 99.9% of your program's run time.
> 
>> How can we do it better ?
> 
> 1. Remove the ability to "resize to old size + N",
> 2. (optional) When "resize to old size * FACTOR", allow the factor
>to be 1.5 or 1.3 or something like that.

That's what I did meanwhile, FACTOR is now a float and that's it.


>>>   - Some functions take keysize and valuesize arguments, some don't.
>>> I'm confused.
>>
>> The "noalloc" functions just store the given pointers for key and value,
>> that means without allocation. The hashmap takes ownership of the
>> objects and has to release them when being removed/destroyed (or use a
>> NULL destructor when the values are static). This can be used to
>> generate a search index for an existing bunch of values (objects /
>> structures). Like a "unique index" for a field in a database table.
>> Example function: wget_hashmap_put_noalloc()
>>
>> Then we have the functions that do (shallow) allocations/clones, like
>> wget_hashmap_put(). This is for using the hashmap as a container, the
>> original objects have to be (shallow) released outside of the container.
> 
> Hmm. What you describe here is whether ownership is passed to the container
> or kept outside the container. My question was about the keysize and
> valuesize arguments, that is, about the nature of the key and value objects
> (a. a pointer, pointing to anything in memory, or
>  b. a region of memory, with a start address and an end address).
> 
> I think both questions are orthogonal:
>   - In either a or b, you can keep the ownership outside the container
> by using a NULL destructor function.
>   - For container-owned objects:
> In case a, you need a clone or copy function indeed,
> In case b, the container needs to clone precisely the given region of
> memory.
> 
> But I haven't thought long about it. What do you think?

I thought a while about this and finally decided to leave memory
allocation outside the container, in the responsibility of the caller.

That reduces the number of functions, makes things easier to understand,
reduce overall code. That moves the API usage more into the direction of
a no-brainer. The additional code complexity in the application (wget2)
was much less than I feared.

Thank you for your hints / thoughts and please excuse me for taking so
long to answer.

Regards, Tim



signature.asc
Description: OpenPGP digital signature


Re: [RFC] Adding a real HashTable implementation to gnulib

2019-01-04 Thread Florian Weimer
* Tim Rühsen:

> But I still wonder why
>
>  - nested functions need an executable stack (why does the code have to
> be run on the stack ?)

The static chain pointer needs to be passed to the nest function, but
most ABIs only have a code pointer.  So a trampoline is written to the
stack which sets up the static chain (in a hidden function argument) and
calls the implementation.  The address of the trampoline is used as the
code pointer for the nested function.

>  - there are no efforts to standardize either nested functions or blocks
> (clang has 'blocks' instead of nested functions) or both. (There is a
> C22 panel, but I couldn't find anything useful on their task list.)

You can simply use C++.  C++11 and later have std::function and lambdas.

The standardization of lambdas for C has been abandoned, as far as I
know.

Thanks,
Florian



Re: [RFC] Adding a real HashTable implementation to gnulib

2019-01-03 Thread Tim Rühsen
On 12/4/18 3:32 AM, Bruno Haible wrote:
> Hi Tim,
> 
>> We have 'hashmap' [1] in GNU Wget2.
> 
> Things I like about the implementation:
> 
>   - Collision resolution through linked list (a robust algorithm).
> 
>   - Reasonably generic API.
> 
>   - The user-definable destructor functions.
> 
> Things that could be improved:
> 
>   - The ability to set a growth policy > 0. This leads to quadratic
> run time behaviour. I mean, you make such an effort to have O(1)
> access on average, and then the fixed-size increment of the size
> turns the insertion operation into an average-O(n) operation.

Changed now, only supporting a float number as factor now.


>   - Some functions take keysize and valuesize arguments, some don't.
> I'm confused.

That has a reason, I'll answer with examples in a second email. Not
enough time right now.


>   - NULL values are special. The documentation says "If there are NULL
> values in the stringmap, you should use wget_stringmap_get_null() to
> distinguish between 'not found' and 'found NULL value'." I would
> prefer an API that does not require me to think about whether I have
> null values in the map or not.

API has been changed to handle NULL values without thinking (put and get).


>   - To iterate through the table, you need to define an instance of the
> wget_hashmap_browse_t function. I love functional programming, but
> C is not the right language for that. If the inner part (the body
> of the 'browse' function) needs to access outer variables, the
> programmer needs to manually create a "context" struct and fill it -
> things that the compiler would be doing in other programming languages.
> Some people say that the solution to this problem are the nested
> functions supported by GCC, but
>   1. This is not portable; only GCC supports this.
>   2. On many CPUs (including x86_64), the use of nested functions
>  requires to disable on the entire process a security feature
>  (namely, stacks without execute permission).
> I therefore prefer the concept of "iterator" objects that allow
> the programmer to step through the hash table without contorting
> their code and without compromising on security.

Iterators are now implemented with three functions (...alloc, ...free,
...next).

BTW, nested functions are pretty nice to use.

But I still wonder why

 - nested functions need an executable stack (why does the code have to
be run on the stack ?)

 - there are no efforts to standardize either nested functions or blocks
(clang has 'blocks' instead of nested functions) or both. (There is a
C22 panel, but I couldn't find anything useful on their task list.)

Regards, Tim



signature.asc
Description: OpenPGP digital signature


Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-04 Thread Bruno Haible
Hi Tim,

> Currently we have
> - positive integer (N): resize by old value * N
> - negative integer (N): resize by old value + (-N)

It's currently the opposite:
- positive integer (N): resize to old size + N
- negative integer (N): resize to old size * (-N)

> >   - The ability to set a growth policy > 0. This leads to quadratic
> > run time behaviour. I mean, you make such an effort to have O(1)
> > access on average, and then the fixed-size increment of the size
> > turns the insertion operation into an average-O(n) operation.
> 
> In Wget2 we use reasonable default sizes for the hashmap to make
> resizing more or less a corner case.

The problem is that when you guessed wrong about the "corner case",
the rehash will consume 99.9% of your program's run time.

> How can we do it better ?

1. Remove the ability to "resize to old size + N",
2. (optional) When "resize to old size * FACTOR", allow the factor
   to be 1.5 or 1.3 or something like that.

> >   - Some functions take keysize and valuesize arguments, some don't.
> > I'm confused.
> 
> The "noalloc" functions just store the given pointers for key and value,
> that means without allocation. The hashmap takes ownership of the
> objects and has to release them when being removed/destroyed (or use a
> NULL destructor when the values are static). This can be used to
> generate a search index for an existing bunch of values (objects /
> structures). Like a "unique index" for a field in a database table.
> Example function: wget_hashmap_put_noalloc()
> 
> Then we have the functions that do (shallow) allocations/clones, like
> wget_hashmap_put(). This is for using the hashmap as a container, the
> original objects have to be (shallow) released outside of the container.

Hmm. What you describe here is whether ownership is passed to the container
or kept outside the container. My question was about the keysize and
valuesize arguments, that is, about the nature of the key and value objects
(a. a pointer, pointing to anything in memory, or
 b. a region of memory, with a start address and an end address).

I think both questions are orthogonal:
  - In either a or b, you can keep the ownership outside the container
by using a NULL destructor function.
  - For container-owned objects:
In case a, you need a clone or copy function indeed,
In case b, the container needs to clone precisely the given region of
memory.

But I haven't thought long about it. What do you think?

Bruno




Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-04 Thread Tim Rühsen


bin_yeG1YHEY1.bin
Description: PGP/MIME version identification


encrypted.asc
Description: OpenPGP encrypted message


Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-03 Thread Bruno Haible
Hi Tim,

> We have 'hashmap' [1] in GNU Wget2.

Things I like about the implementation:

  - Collision resolution through linked list (a robust algorithm).

  - Reasonably generic API.

  - The user-definable destructor functions.

Things that could be improved:

  - The ability to set a growth policy > 0. This leads to quadratic
run time behaviour. I mean, you make such an effort to have O(1)
access on average, and then the fixed-size increment of the size
turns the insertion operation into an average-O(n) operation.

  - Some functions take keysize and valuesize arguments, some don't.
I'm confused.

  - NULL values are special. The documentation says "If there are NULL
values in the stringmap, you should use wget_stringmap_get_null() to
distinguish between 'not found' and 'found NULL value'." I would
prefer an API that does not require me to think about whether I have
null values in the map or not.

  - To iterate through the table, you need to define an instance of the
wget_hashmap_browse_t function. I love functional programming, but
C is not the right language for that. If the inner part (the body
of the 'browse' function) needs to access outer variables, the
programmer needs to manually create a "context" struct and fill it -
things that the compiler would be doing in other programming languages.
Some people say that the solution to this problem are the nested
functions supported by GCC, but
  1. This is not portable; only GCC supports this.
  2. On many CPUs (including x86_64), the use of nested functions
 requires to disable on the entire process a security feature
 (namely, stacks without execute permission).
I therefore prefer the concept of "iterator" objects that allow
the programmer to step through the hash table without contorting
their code and without compromising on security.

Bruno




Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-03 Thread Tim Rühsen
On 12/2/18 2:41 PM, Bruno Haible wrote:
> Hi,
> 
> Darshit Shah wrote:
>> I recently tried to use the hash table implementation in gnulib which resides
>> in the "hash" module. However, I quickly realised that the hash table in 
>> gnulib
>> seems to be what is otherwise popularly known as a hash set, i.e., it 
>> supports
>> storing and retrieving just values from the structure. 
>>
>> On the other hand, a hash table is usually expected to have a key->value
>> mapping that is stored.
> 
> I agree that the gnulib 'hash' module is just a particular case, and
> probably the module name is not very descriptive.
> 
>> Within GNU Wget, we have a fairly portable version of a hash table 
>> implemented
>> which I think would be a good addition for gnulib. What do you think?
> 
> There's not only the one from wget
>   https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.h
>   https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.c
> 
> but also the one from gettext
>   
> https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.h
>   
> https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.c
> 
> and the one from glib
>   https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.h
>   https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.c
> 
> and the one from libxml
>   https://gitlab.gnome.org/GNOME/libxml2/blob/master/include/libxml/hash.h
>   https://gitlab.gnome.org/GNOME/libxml2/blob/master/hash.c
> 
> and the ones from CLN
>   https://www.ginac.de/CLN/cln.git/?p=cln.git;a=tree;f=src/base/hash
> 
> and many more.
> 
> The implementation you are proposing is an "open-addressed table with linear
> probing collision resolution". To me that is unacceptable. When I used Kyoto
> Common Lisp (KCL) many years ago, I got an endless loop during a hash table
> access, and it was precisely because of this open-addressed table structure.
> I don't want a code which requires careful setting of parameters in order
> not to run into an endless loop. Instead better have a code that cannot run
> into an endless loop *by design*.
> 
> The hash_string function that you propose shifts by 5 bits at each step;
> I suspect that it has the same problem as the one I tested and discussed in
> https://haible.de/bruno/hashfunc.html .
> 
> For Gnulib, I would want a generic container, i.e. a "map", like we have
> "list" and "ordered set" already (modules 'list' and 'oset'). Other GNU
> maintainers have reported that they like this approach.

We have 'hashmap' [1] in GNU Wget2. It does not work with linear probing
collision resolution, but with a linked list as fallback for collisions
(that could be replaced by a different algorithm).

So it's a key/value store where you can have NULL as value - in case you
just want to know if a key exists or not.

You apply your own hash and compare function when you create a hashmap
instance. So it's up to the dev to decide for an optimal hash strategy
for a certain use case.

You can add entries (key+value) with or without malloc/copying.

You can easily make wrappers around hashmap for special purposes, e.g.
we have a 'stringmap' API for case sensitive and insensitive keys [2].

The API is documented at [3] (hashmap) and [4] (stringmap).

It's a naive implementation but fast enough for out purposes (70ms to
read in the german translation of the holy bible (both parts) and count
unique words).

Surely many details have to change, but I am absolutely willing to
accept critics and work together with you - and anyone else who likes -
to get it into shape. The API hasn't been released, so we are currently
pretty flexible with changes.


[1] https://gitlab.com/gnuwget/wget2/blob/master/libwget/hashmap.c
[2] https://gitlab.com/gnuwget/wget2/blob/master/libwget/stringmap.c
[3] https://gnuwget.gitlab.io/wget2/reference/group__libwget-hashmap.html
[4] https://gnuwget.gitlab.io/wget2/reference/group__libwget-stringmap.html

> 
> However, this will still not fit all possible needs because there are
> special cases that people want to see optimized:
>   - The case when the key is a string; additionally when the key is
> allocated in an obstack and there is no remove.
>   - The struniq function (as in localename.c).
> 
> Then, what about extra requirements?
>   - The existing gnulib 'hash' module is pretty unique: it keeps statistics.
> But is anyone really using this feature?
>   - malloc vs. xmalloc.
>   - Multithread-safety should IMO not be considered as an extra requirement.
> This is better done in application logic, because typically in the scope
> of the lock the application will do more than just the hash table lookup.

There are no stats in Wget2's implementation. The malloc/free functions
are currently not configurable (but that's trivial).

Regards, Tim



signature.asc
Description: OpenPGP digital signature


Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-02 Thread LRN
On 02.12.2018 16:41, Bruno Haible wrote:
> Hi,
> 
> Darshit Shah wrote:
>> I recently tried to use the hash table implementation in gnulib which
>> resides in the "hash" module. However, I quickly realised that the hash
>> table in gnulib seems to be what is otherwise popularly known as a hash
>> set, i.e., it supports storing and retrieving just values from the
>> structure.
>> 
>> On the other hand, a hash table is usually expected to have a key->value 
>> mapping that is stored.
> 
> I agree that the gnulib 'hash' module is just a particular case, and 
> probably the module name is not very descriptive.
> 
>> Within GNU Wget, we have a fairly portable version of a hash table
>> implemented which I think would be a good addition for gnulib. What do you
>> think?
> 
> There's not only the one from wget but also the one from gettext and the one
> from glib https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.h 
> https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.c
> 
> and the one from libxml and the ones from CLN and many more.
> 

There was a hashtable shootout[1] recently, with a followup[2] (although that
one is glib-specific):

[1]: https://hpjansson.org/blag/2018/07/24/a-hash-table-re-hash/
[2]: https://hpjansson.org/blag/2018/08/29/what-ails-ghashtable/



signature.asc
Description: OpenPGP digital signature


Re: [RFC] Adding a real HashTable implementation to gnulib

2018-12-02 Thread Bruno Haible
Hi,

Darshit Shah wrote:
> I recently tried to use the hash table implementation in gnulib which resides
> in the "hash" module. However, I quickly realised that the hash table in 
> gnulib
> seems to be what is otherwise popularly known as a hash set, i.e., it supports
> storing and retrieving just values from the structure. 
> 
> On the other hand, a hash table is usually expected to have a key->value
> mapping that is stored.

I agree that the gnulib 'hash' module is just a particular case, and
probably the module name is not very descriptive.

> Within GNU Wget, we have a fairly portable version of a hash table implemented
> which I think would be a good addition for gnulib. What do you think?

There's not only the one from wget
  https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.h
  https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.c

but also the one from gettext
  
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.h
  
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.c

and the one from glib
  https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.h
  https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.c

and the one from libxml
  https://gitlab.gnome.org/GNOME/libxml2/blob/master/include/libxml/hash.h
  https://gitlab.gnome.org/GNOME/libxml2/blob/master/hash.c

and the ones from CLN
  https://www.ginac.de/CLN/cln.git/?p=cln.git;a=tree;f=src/base/hash

and many more.

The implementation you are proposing is an "open-addressed table with linear
probing collision resolution". To me that is unacceptable. When I used Kyoto
Common Lisp (KCL) many years ago, I got an endless loop during a hash table
access, and it was precisely because of this open-addressed table structure.
I don't want a code which requires careful setting of parameters in order
not to run into an endless loop. Instead better have a code that cannot run
into an endless loop *by design*.

The hash_string function that you propose shifts by 5 bits at each step;
I suspect that it has the same problem as the one I tested and discussed in
https://haible.de/bruno/hashfunc.html .

For Gnulib, I would want a generic container, i.e. a "map", like we have
"list" and "ordered set" already (modules 'list' and 'oset'). Other GNU
maintainers have reported that they like this approach.

However, this will still not fit all possible needs because there are
special cases that people want to see optimized:
  - The case when the key is a string; additionally when the key is
allocated in an obstack and there is no remove.
  - The struniq function (as in localename.c).

Then, what about extra requirements?
  - The existing gnulib 'hash' module is pretty unique: it keeps statistics.
But is anyone really using this feature?
  - malloc vs. xmalloc.
  - Multithread-safety should IMO not be considered as an extra requirement.
This is better done in application logic, because typically in the scope
of the lock the application will do more than just the hash table lookup.

Bruno




Re: [RFC] Adding a real HashTable implementation to gnulib

2018-11-26 Thread Ben Pfaff
Much as I like the PSPP hmaps, it probably makes sense for any hash
table implementation in gnulib to match the existing code.

On Tue, Nov 27, 2018 at 02:02:17AM +0100, Darshit Shah wrote:
> Here are the links to the sources in the GNU Wget tree:
> 
> http://git.savannah.gnu.org/cgit/wget.git/tree/src/hash.h
> http://git.savannah.gnu.org/cgit/wget.git/tree/src/hash.c
> 
> At first sight, the implementation in PSPP looks a lot more concise.
> Also, it's usage of fewer preprocessor statements makes me already like it.
> 
> However, it does seem that this implementation is not as general or, complete
> as the existing hash table implementation in gnulib. The version available in
> GNU Wget does implement all the same interfaces with very similar usage
> characteristics.
> 
> One of the things I noticed missing in the implementation in GNU PSPP is the
> ability to grow the size of the table based on a certain threshold of
> "fullness".
> 
> * Ben Pfaff  [181127 00:41]:
> > On Tue, Nov 27, 2018 at 12:16:16AM +0100, Darshit Shah wrote:
> > > I recently tried to use the hash table implementation in gnulib which 
> > > resides
> > > in the "hash" module. However, I quickly realised that the hash table in 
> > > gnulib
> > > seems to be what is otherwise popularly known as a hash set, i.e., it 
> > > supports
> > > storing and retrieving just values from the structure. 
> > > 
> > > On the other hand, a hash table is usually expected to have a key->value
> > > mapping that is stored.
> > > 
> > > Within GNU Wget, we have a fairly portable version of a hash table 
> > > implemented
> > > which I think would be a good addition for gnulib. What do you think?
> > > 
> > > If I get a positive response here, I could extract that code and turn it 
> > > into a
> > > hash table module for gnulib. We should be able to reuse some part of the
> > > existing code in "hash.c" for this purpose as well
> > 
> > Can you point to the Wget hash table?
> > 
> > I'm pretty fond of the hash table implementation we have in PSPP:
> > http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.h
> > http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.c
> > 
> > 
> 
> -- 
> Thanking You,
> Darshit Shah
> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6





Re: [RFC] Adding a real HashTable implementation to gnulib

2018-11-26 Thread Darshit Shah
Here are the links to the sources in the GNU Wget tree:

http://git.savannah.gnu.org/cgit/wget.git/tree/src/hash.h
http://git.savannah.gnu.org/cgit/wget.git/tree/src/hash.c

At first sight, the implementation in PSPP looks a lot more concise.
Also, it's usage of fewer preprocessor statements makes me already like it.

However, it does seem that this implementation is not as general or, complete
as the existing hash table implementation in gnulib. The version available in
GNU Wget does implement all the same interfaces with very similar usage
characteristics.

One of the things I noticed missing in the implementation in GNU PSPP is the
ability to grow the size of the table based on a certain threshold of
"fullness".

* Ben Pfaff  [181127 00:41]:
> On Tue, Nov 27, 2018 at 12:16:16AM +0100, Darshit Shah wrote:
> > I recently tried to use the hash table implementation in gnulib which 
> > resides
> > in the "hash" module. However, I quickly realised that the hash table in 
> > gnulib
> > seems to be what is otherwise popularly known as a hash set, i.e., it 
> > supports
> > storing and retrieving just values from the structure. 
> > 
> > On the other hand, a hash table is usually expected to have a key->value
> > mapping that is stored.
> > 
> > Within GNU Wget, we have a fairly portable version of a hash table 
> > implemented
> > which I think would be a good addition for gnulib. What do you think?
> > 
> > If I get a positive response here, I could extract that code and turn it 
> > into a
> > hash table module for gnulib. We should be able to reuse some part of the
> > existing code in "hash.c" for this purpose as well
> 
> Can you point to the Wget hash table?
> 
> I'm pretty fond of the hash table implementation we have in PSPP:
> http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.h
> http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.c
> 
> 

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6


signature.asc
Description: PGP signature


Re: [RFC] Adding a real HashTable implementation to gnulib

2018-11-26 Thread Ben Pfaff
On Tue, Nov 27, 2018 at 12:16:16AM +0100, Darshit Shah wrote:
> I recently tried to use the hash table implementation in gnulib which resides
> in the "hash" module. However, I quickly realised that the hash table in 
> gnulib
> seems to be what is otherwise popularly known as a hash set, i.e., it supports
> storing and retrieving just values from the structure. 
> 
> On the other hand, a hash table is usually expected to have a key->value
> mapping that is stored.
> 
> Within GNU Wget, we have a fairly portable version of a hash table implemented
> which I think would be a good addition for gnulib. What do you think?
> 
> If I get a positive response here, I could extract that code and turn it into 
> a
> hash table module for gnulib. We should be able to reuse some part of the
> existing code in "hash.c" for this purpose as well

Can you point to the Wget hash table?

I'm pretty fond of the hash table implementation we have in PSPP:
http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.h
http://git.savannah.gnu.org/cgit/pspp.git/tree/src/libpspp/hmap.c