Christian,
Thx for your query. I have tested this code against a JAVA binding with
HashSet, running a BaseX instance (i.e. declare variable $my-big-DataBase
:= db:open(..) ).
I've taken this function, more representative of my needs :
declare function local:get-some-stuff-within($element) {
generate-id($element)
};
Here are the result (I ran the test 4/5 times to be sure that my
environment did not perturb the experiment) :
Java Hashset :
start : 21:34:21.98
size HashSet : 17568843
end : 21:35:05.93
BaseX map based hashset
start : 21:37:56.59
size map : 17568843
end : 21:38:50.43
Some remarks :
- Without surprise, storing docs into BaseX accelerate the query.
- Surpringly for me, this map code behave nicely (mine was terrible ! I
don't know why). Is there something I should know about fold-left ? Well,
clearly, I have a lot to learn about this langage.
- HashSet seems to be 20 % faster, with a smaller memory footprint (10%,
(4,3 against 4, unprecise measure, since I was looking to my Task Manager
!) ).
Thus I could use maps to simulate HashSet, it not a very big overload.
However, is there any incentive to trade off 20% performance ?
Cheers,
Jean-Marc
2013/11/12 Christian Grün <[email protected]>
> Hi Jean-Marc,
>
> I’m not sure how your actual data looks like, but you could have a look
> the attached query, which uses XQuery maps, and which allowe
> d us to
> inser
> t
> 10 million strings in about 40 seconds
> .
> Most
> presumably, it’s
> not as memory efficient as a Java map, but my guess is that most memory is
> actually spent by the XQuery representation of string items, and not
> necessarily the map itself.
>
> Hope this helps,
> Christian
> ___________________________
> ___________________________
>
> declare variable $my-big-DataBase := doc('db')/root/*;
>
> declare function local:do-some-stuff-with-set($map) {
> count(map:keys($map))
> };
>
> declare function local:init() {
> local:do-some-stuff-with-set(local:tree-traversal($my-big-DataBase,
> map:new()))
> };
>
> declare function local:tree-traversal($elements, $map) {
> fold-left($elements, $map,
> function($map2, $element) {
> let $map3 := map:new(($map2, { local:get-some-stuff-within($element)
> : true() }))
> return local:tree-traversal($element/element(), $map3)
> }
> )
> };
>
> declare function local:get-some-stuff-within($element) {
> $element/@name
> };
>
> local:init()
> ___________________________
> ___________________________
>
> On Tue, Nov 12, 2013 at 5:48 PM, jean-marc Mercier <
> [email protected]> wrote:
> > Dear Christian,
> >
> > Thanks for your answer.
> >>It may be interesting to do some testing in order to find out what’s
> >>the actual bottleneck in your query. How man entries is your hash set
> >>supposed to contain?
> >
> > The HashSet contains 10M strings. That's not a big deal. The problems
> comes
> > from filling the HashSet.
> > Well, let us be more technacal. I am using hashset for tree traversal
> > operations over a DataBase of 10M nodes, 1Go. This HashSet collects some
> > informations during the traversal, mainly a string at each node.
> Actually,
> > my code binds JAVA with something like
> >
> > import module namespace set = "java.util.HashSet";
> >
> > declare function init(){set:clear(), tree_traversal($my_big_DataBase),
> > do_some_stuff_with_hash_set()}
> > declare function tree_traversal($elements)
> > {
> > for $element in $elements return (
> > set:add(get_some_stuff_within($element) ),
> > tree_traversal($element/element())
> > )
> > }
> > and the whole traversal takes a minute.
> >
> > I don't know how to achieve this with pure XQUERY. The only thing that I
> > tried is something like :
> >
> > declare function init(){let $my_hash_set :=
> tree_traversal($my_big_DataBase,
> > map:new()) return do_some_stuff_with($my_hash_set) }
> >
> > declare function tree_traversal($elements, $hash_set) as function(*)
> (return
> > a HashSet)
> > {
> > for $element in $elements return (
> > tree_traversal($nodes/element(), $hash_set),
> > map:entry( get_some_stuff_within($element),boolean)
> > )
> > }
> > that performed very badly, compared to the first version (JAVA binding).
> I
> > explained this thinking that the first version is O(N) operations, and
> the
> > second is O(N^2) operations due to in-memory copy.
> >
> >>> - second : it lacks powerful libraries, as complete math modules.
> >
> >> What kind of functions would you be interested in?
> >
> > I need a stochastic modulus, as
> > http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html
> > also a linear Algebra modulus as UBLAS
> > http://www.boost.org/doc/libs/1_54_0/libs/numeric/ublas/doc/index.htm
> >
> > I guess that I could find JAVA equivalent and bind them to BaseX. Such
> > functionalities are available directly as XQUERY modules ?
> >
> > Cheers,
> >
> > Jean-Marc
> >
> >
> >
> >
> > 2013/11/12 Christian Grün <[email protected]>
> >>
> >> Dear JohnLeM,
> >>
> >> thanks for your mail. As you already noted, XQuery is a functional
> >> language, and this is the reason why XQuery maps are not exactly
> >> comparable to maps and sets, as they are used in imperative languages:
> >>
> >> All maps in XQuery are persisent (immutable): Once a map has been
> >> generated, it is not possible to change its contents. Instead, a new
> >> map is to be created by each insertion or deletion [1]. This sounds
> >> like a huge memory killer, but it’s not as bad as you might guess.
> >> Various efficient solutions exist for persistent map, such as the
> >> mapped trie that has been implemented in BaseX [2]. It will only
> >> create copies of parts of the data structure that are to be changed.
> >> The following query is an example for a query which creates 100.000
> >> map with a single entry, and a large map containing all the entries;
> >> on my system, it requires 200 ms to run:
> >>
> >> map:size(map:new(
> >> for $i in 1 to 100000
> >> return map { $i := true() }
> >> ))
> >>
> >> In short: persistent maps may not be as efficient as mutable maps, but
> >> they are usually not the bottleneck in XQuery applications, because
> >> deleted entries (or obsolete maps) will automatically be discarded by
> >> the garbage collector as soon as they are not in the query scope
> >> anymore. If you want to enforce this, you can put your map operations
> >> into FLWOR expresions or user-declared functions.
> >>
> >> Back to your original questions:
> >>
> >> > 2) This module must expose a method "hashset:clear($hashset)" to
> >> > de-allocate
> >> > memory dynamically. The map:module provides the function map:remove,
> and
> >> > I
> >> > could remove all elements. [...] It does not deallocate memory,
> leading
> >> > to poor
> >> > overall performances.
> >>
> >> It may be interesting to do some testing in order to find out what’s
> >> the actual bottleneck in your query. How man entries is your hash set
> >> supposed to contain?
> >>
> >> > 3) must expose a method "hashset:add($hashset,$element)" to add memory
> >> > dynamically. Through map:module, the only possibility that I see is to
> >> > wrap
> >> > it as map:new($hashset, map:entry($element,$dummyboolean)).
> >>
> >> Right, using true() is probably the best choice (booleans are only
> >> instantiated once in memory).
> >>
> >> > - first : its dynamic memory management (the in-memory printfoot of my
> >> > XQUERY executables are usually tremendous).
> >>
> >> This can in fact be a problem, and is mostly due to various decisions
> >> taken in the specification, and the complexity of XML nodes in
> >> general.
> >>
> >> > - second : it lacks powerful libraries, as complete math modules.
> >>
> >> What kind of functions would you be interested in?
> >>
> >> Christian
> >>
> >> [1] http://en.wikipedia.org/wiki/Persistent_data_structure
> >> [2] http://en.wikipedia.org/wiki/Hash_array_mapped_trie
> >
> >
>
>
_______________________________________________
BaseX-Talk mailing list
[email protected]
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk