Indexed in-memory graph
-----------------------
Key: CLEREZZA-683
URL: https://issues.apache.org/jira/browse/CLEREZZA-683
Project: Clerezza
Issue Type: New Feature
Components: rdf.core
Reporter: Rupert Westenthaler
# Indexed in-memory graph
Implementation of a TripleCollection that internally manages SPO, POS, OSP
indexes for fast filtered iterators. The current state of development is hosted
at
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/.
However the intention is that this module becomes direct part of clerezza.
## Background:
For Apache Stanbol having fast filtered iterators over in-memory graphs is
really important, because Stanbol uses in-memory graph to store extracted
metadata for parsed ContentItems.
When enhancing longer texts with EnhancementChain configurations that produce a
lot of enhancements (e.g. keyword extraction based on dbpedia) such in-memory
graphs can get bigger than 100k triples. Especially if also triples for
suggested entities are included within the result.
## Implementation:
Because of that I started to implement an TripleCollection that used TreeMaps
to manage SPO, POS, OSP indexes.
For fast sorting (comparator) I use the same Resource#hashCode
Resource#toString based solution as used in the rdf.rdfjson serializer. I hope
this is also sufficient for Literals (someone should check that).
The implementation of the "filter(..)" method is purely based on
"NavigableSet.subSet(..).iterator()". I only need to wrap the iterator to
ensure that by calls to Iterator.remove():
1) Triples are removed from all three indexes
2) GraphEvents are dispatched correctly
Note also the trick with the two static fields UriRef MIN and UriRef MAX used
to generate lower/upper bound triples as parsed to "NavigableSet.subSet(..)".
The implementation is currently hosted on
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/
It has no dependencies to Apache Stanbol. However users that do not want to
check-out Stanbol as a whole will need to edit the pom.xml file and provide
information usually imported from the parent poms.
## Tests:
This implementation passes all MGraphTest UnitTests.
In addition I have copied the tests define for SimpleTripleCollection
To compare the performance I also implemented code that
* allows to create a random Graph with n Triples
* create a TestCase with configurable numbers of Subjects, Predicates and
Objects
* performs than m calls to #filter(...)
This performance test runs also as UnitTest
1. by using the SimpleMGraph implementation
2. by using the IndexedMGraph implementation
NOTE: While implementing this I recognized that the SimpleTripleCollectionTest
does not extend MGraphTest and therefore the SimpleTripleCollection class is
not checked against the tests defined by MGraphTest. This might actually an
Issue!
## Performance
This is a copy from a run of the above described PerformanceTest
2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
Filter Performance Test (graph size 100000 triples, iterations 1000)
2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
--- TEST SimpleMGraph with 100000 triples ---
10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,P,O] in 8321ms with 2 results
18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,P,n] in 7358ms with 734 results
25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,n,O] in 7266ms with 100 results
31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,P,O] in 6519ms with 232 results
39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,n,n] in 7398ms with 8030 results
45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,P,n] in 5934ms with 8318000 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,n,O] in 10666ms with 2260 results
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
--- TEST completed in 53463ms
55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
--- TEST IndexedMGraph 100000 triples ---
55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,P,O] in 20ms with 2 results
55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,P,n] in 19ms with 734 results
55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,n,O] in 33ms with 100 results
55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,P,O] in 28ms with 232 results
55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [S,n,n] in 21ms with 8030 results
57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,P,n] in 1065ms with 8318000 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
... run [n,n,O] in 8ms with 2260 results
57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
--- TEST completed in 1194ms
best
Rupert
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira