[
https://issues.apache.org/jira/browse/CLEREZZA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218196#comment-13218196
]
Rupert Westenthaler commented on CLEREZZA-683:
----------------------------------------------
Recent Jenkis build on Stanbol indicate [1] that there are some issues with the
Filtered Iterator implementation with filter(null,null,resource).
Up to now I was not able to reproduce this locally and the test on the server
do not provide sufficient logging. Will update the according test to use
Random#setSeed
when generating the test data. Using the same seed should allow to
reproduce/debug the Issue
[1]
https://builds.apache.org/job/stanbol-trunk-1.6/765/org.apache.stanbol$org.apache.stanbol.commons.indexedgraph/testReport/junit/org.apache.stanbol.commons.indexedgraph/IndexedGraphTest/testPerformance/
> Indexed in-memory graph
> -----------------------
>
> Key: CLEREZZA-683
> URL: https://issues.apache.org/jira/browse/CLEREZZA-683
> Project: Clerezza
> Issue Type: New Feature
> Components: rdf.core
> Reporter: Rupert Westenthaler
>
> # Indexed in-memory graph
> Implementation of a TripleCollection that internally manages SPO, POS, OSP
> indexes for fast filtered iterators. The current state of development is
> hosted at
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/.
> However the intention is that this module becomes direct part of clerezza.
> ## Background:
> For Apache Stanbol having fast filtered iterators over in-memory graphs is
> really important, because Stanbol uses in-memory graph to store extracted
> metadata for parsed ContentItems.
> When enhancing longer texts with EnhancementChain configurations that produce
> a lot of enhancements (e.g. keyword extraction based on dbpedia) such
> in-memory graphs can get bigger than 100k triples. Especially if also triples
> for suggested entities are included within the result.
> ## Implementation:
> Because of that I started to implement an TripleCollection that used TreeMaps
> to manage SPO, POS, OSP indexes.
> For fast sorting (comparator) I use the same Resource#hashCode
> Resource#toString based solution as used in the rdf.rdfjson serializer. I
> hope this is also sufficient for Literals (someone should check that).
> The implementation of the "filter(..)" method is purely based on
> "NavigableSet.subSet(..).iterator()". I only need to wrap the iterator to
> ensure that by calls to Iterator.remove():
> 1) Triples are removed from all three indexes
> 2) GraphEvents are dispatched correctly
> Note also the trick with the two static fields UriRef MIN and UriRef MAX used
> to generate lower/upper bound triples as parsed to "NavigableSet.subSet(..)".
> The implementation is currently hosted on
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/indexedgraph/
> It has no dependencies to Apache Stanbol. However users that do not want to
> check-out Stanbol as a whole will need to edit the pom.xml file and provide
> information usually imported from the parent poms.
> ## Tests:
> This implementation passes all MGraphTest UnitTests.
> In addition I have copied the tests define for SimpleTripleCollection
> To compare the performance I also implemented code that
> * allows to create a random Graph with n Triples
> * create a TestCase with configurable numbers of Subjects, Predicates and
> Objects
> * performs than m calls to #filter(...)
> This performance test runs also as UnitTest
> 1. by using the SimpleMGraph implementation
> 2. by using the IndexedMGraph implementation
> NOTE: While implementing this I recognized that the
> SimpleTripleCollectionTest does not extend MGraphTest and therefore the
> SimpleTripleCollection class is not checked against the tests defined by
> MGraphTest. This might actually an Issue!
> ## Performance
> This is a copy from a run of the above described PerformanceTest
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> Filter Performance Test (graph size 100000 triples, iterations 1000)
> 2373 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> --- TEST SimpleMGraph with 100000 triples ---
> 10694 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,P,O] in 8321ms with 2 results
> 18052 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,P,n] in 7358ms with 734 results
> 25318 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,n,O] in 7266ms with 100 results
> 31837 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,P,O] in 6519ms with 232 results
> 39236 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,n,n] in 7398ms with 8030 results
> 45170 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,P,n] in 5934ms with 8318000 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,n,O] in 10666ms with 2260 results
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> --- TEST completed in 53463ms
> 55836 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> --- TEST IndexedMGraph 100000 triples ---
> 55856 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,P,O] in 20ms with 2 results
> 55875 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,P,n] in 19ms with 734 results
> 55908 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,n,O] in 33ms with 100 results
> 55936 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,P,O] in 28ms with 232 results
> 55957 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [S,n,n] in 21ms with 8030 results
> 57022 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,P,n] in 1065ms with 8318000 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> ... run [n,n,O] in 8ms with 2260 results
> 57030 [main] INFO org.apache.stanbol.commons.indexedgraph.IndexedGraphTest -
> --- TEST completed in 1194ms
> best
> Rupert
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira