Hi,

On Tue, Feb 26, 2013 at 2:19 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> I modified the importer logic to use a custom nodeType similar to
> SlingFolder (no orderable nodes) and following are the results

Thanks! It indeed looks like orderability is an issue here. With the
oak:unstructured type I added in OAK-657 and a few more improvements
and fixes to the SegmentMK, I can now also import the Simplified
English wiki, with 167k pages:

    $ java -Xmx500m -DOAK-652=true -jar oak-run/target/oak-run-0.7-SNAPSHOT.jar \
          benchmark --wikipedia=simplewiki-20130214-pages-articles.xml --cache=200 \
          WikipediaImport Oak-Segment
    Apache Jackrabbit Oak 0.7-SNAPSHOT
    Oak-Segment: Wikipedia import benchmark
    Importing simplewiki-20130214-pages-articles.xml...
    Added 1000 pages in 1 seconds (1.35ms/page)
    [...]
    Added 167000 pages in 467 seconds (2.80ms/page)
    Imported 167404 pages in 1799 seconds (10.75ms/page)

Transient operations slow down slightly over time, mostly because
everything is cached at first and cache misses become more frequent
later on. Note the new --cache option that can be used to control
the size (in MB) of the segment cache. Ideally, for better comparison,
we'd also make it control the cache used by the MongoMK.

There are still a few problems, most notably the fact that the index
update hook operates directly on the plain MemoryNodeBuilder used by
the current SegmentMK, so it won't benefit from the automatic purging
of large change-sets and thus ends up requiring lots of memory during
the massive final save() call. Something like a SegmentNodeBuilder
with internal purge logic similar to what we already prototyped in
KernelNodeState should solve that issue.
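
To make that a bit more concrete, here's a rough sketch of the
purge-on-threshold idea. All names below (PurgingNodeBuilder,
SegmentStore, UPDATE_LIMIT) are made up for illustration; they're not
actual SegmentMK classes, and the threshold value is arbitrary:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only, not actual SegmentMK API.
    class PurgingNodeBuilder {

        // Purge transient changes once this many updates have
        // accumulated; the value is just an example.
        private static final int UPDATE_LIMIT = 10000;

        private final SegmentStore store; // assumed persistence component
        private Object base;              // last state persisted to segments
        private final Map<String, Object> changes =
                new HashMap<String, Object>();

        PurgingNodeBuilder(SegmentStore store, Object base) {
            this.store = store;
            this.base = base;
        }

        void setProperty(String name, Object value) {
            changes.put(name, value);
            if (changes.size() >= UPDATE_LIMIT) {
                purge();
            }
        }

        // Write the accumulated changes to segment storage and drop them
        // from memory, so even a huge commit never needs to hold the full
        // change-set on the heap, similar to the purge logic we
        // prototyped in KernelNodeState.
        private void purge() {
            base = store.write(base, changes);
            changes.clear();
        }

        interface SegmentStore {
            Object write(Object base, Map<String, Object> changes);
        }
    }

The key point is the purge() call in the write path: memory use is
bounded by the threshold regardless of how large the commit gets.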

The other big issue is the large amount of time spent processing the
commit hooks. The one-hook approach I outlined earlier should help us
there.
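
For reference, the basic shape of that approach could look something
like the sketch below: a single hook diffs the content tree once and
fans each change event out to all interested parties, instead of every
hook re-diffing the whole tree on its own. Editor and CompositeEditor
are just placeholder names here, not a committed API:

    import java.util.List;

    // Illustrative sketch only, not a committed Oak API.
    interface Editor {
        void propertyChanged(String path, Object before, Object after);
    }

    class CompositeEditor implements Editor {

        private final List<Editor> editors;

        CompositeEditor(List<Editor> editors) {
            this.editors = editors;
        }

        // The content diff is computed once by a single hook and each
        // change event is dispatched to all registered editors.
        public void propertyChanged(String path, Object before, Object after) {
            for (Editor editor : editors) {
                editor.propertyChanged(path, before, after);
            }
        }
    }

That way the cost of traversing a large commit is paid once, no matter
how many hooks are registered.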

BR,

Jukka Zitting
