Hi All,

As a solution to this issue I created a wrapper around RandomAccessFile
that has a readLine() method that performs like the one in
BufferedReader.
In fact, I copied the source from BufferedReader and made adjustments
where needed. It can therefore deal with files that have different line
separators.
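
To illustrate the idea, here is a minimal sketch of such a buffered
wrapper. It is not the actual OptimizedRandomAccessFile source; the
class name BufferedLineFile, the 8 KB buffer size and the field names
are assumptions made for this example. The point is that bytes are read
in blocks, while getFilePointer() still reports the logical position
just after the last line returned.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch of a buffered readLine() over RandomAccessFile. */
class BufferedLineFile implements AutoCloseable {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192]; // assumed buffer size
    private int pos = 0;       // next unread index in buf
    private int limit = 0;     // number of valid bytes in buf
    private long bufStart = 0; // file offset of buf[0]

    BufferedLineFile(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
    }

    /** Loads the next block; returns false at end of file. */
    private boolean fill() throws IOException {
        bufStart += limit;
        pos = 0;
        int n = raf.read(buf, 0, buf.length);
        limit = Math.max(n, 0);
        return n > 0;
    }

    /** Logical position: offset of the next byte readLine() would consume. */
    long getFilePointer() {
        return bufStart + pos;
    }

    void seek(long offset) throws IOException {
        raf.seek(offset);
        bufStart = offset;
        pos = 0;
        limit = 0; // discard the buffered block
    }

    /** Reads one line terminated by \n, \r or \r\n; null at end of file. */
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAny = false;
        while (true) {
            if (pos >= limit && !fill()) {
                return sawAny ? sb.toString() : null;
            }
            sawAny = true;
            byte b = buf[pos++];
            if (b == '\n') {
                return sb.toString();
            }
            if (b == '\r') {
                // swallow a directly following \n (Windows line ending)
                if ((pos < limit || fill()) && buf[pos] == '\n') {
                    pos++;
                }
                return sb.toString();
            }
            sb.append((char) (b & 0xFF)); // bytes mapped 1:1 to chars
        }
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }
}
```

Like RandomAccessFile.readLine(), this sketch maps each byte directly to
a char, so it is only suitable for ASCII or Latin-1 content.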

So if RandomAccessFile is replaced with this wrapper class, called
OptimizedRandomAccessFile, the performance of indexing in CDK's
RandomAccessReader

https://github.com/egonw/cdk/blob/master/src/main/org/openscience/cdk/io/random/RandomAccessReader.java

should increase about 100-fold.

This wrapper can be found here:

https://bitbucket.org/kienerj/io



As an example, a file with 131'299 molecules (540 MB) can be indexed in
about 2 seconds:

    private void index() throws IOException {

        logger.debug("Generating Index...");
        sdfIndex.put(0, 0L); // first record starts at offset 0
        int recordIndex = 1;
        String line;

        // raf is an OptimizedRandomAccessFile
        while ((line = raf.readLine()) != null) {
            if (line.equals(DELIMITER)) { // DELIMITER = "$$$$"
                // getFilePointer() returns the start of the next line
                long recordOffset = raf.getFilePointer();
                sdfIndex.put(recordIndex, recordOffset);
                recordIndex++;
            }
        }
        // SD files terminate with DELIMITER, hence the last entry in the
        // index must be removed as no record will be there.
        sdfIndex.remove(recordIndex - 1);
        logger.debug("Index generated");
    }

Using that index, 10 records from 100'000 to 100'010 can be accessed
and returned in less than 2 milliseconds.
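
For illustration, the lookup could be sketched roughly as follows. This
is not the code I measured; the class SdfRecordFetcher and the method
readRecords() are made-up names, and a plain java.io.RandomAccessFile
is used so the example is self-contained. Because records are stored
back to back, one seek to the first record is enough, and the rest are
read sequentially up to each $$$$ delimiter.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: fetches records [from, to) using an offset index. */
class SdfRecordFetcher {
    static final String DELIMITER = "$$$$";

    static List<String> readRecords(RandomAccessFile raf,
                                    Map<Integer, Long> sdfIndex,
                                    int from, int to) throws IOException {
        List<String> records = new ArrayList<>();
        Long start = sdfIndex.get(from);
        if (start == null) {
            return records; // no such record in the index
        }
        raf.seek(start); // jump straight to the first requested record

        StringBuilder record = new StringBuilder();
        String line;
        int current = from;
        while (current < to && (line = raf.readLine()) != null) {
            record.append(line).append('\n');
            if (line.equals(DELIMITER)) { // end of one record
                records.add(record.toString());
                record.setLength(0);
                current++;
            }
        }
        return records;
    }
}
```

With the index from above, readRecords(raf, sdfIndex, 100000, 100010)
would return the 10 records in question. Swapping in the buffered
wrapper for raf should only require changing the parameter type.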

Since CDK's RandomAccessReader saves the index to disk, indexing has to
be run only once. As an estimate, it would take maybe 20-30 seconds to
create the index for 1 million records, assuming the makeIndex() method
does not have some unforeseen overhead. Anyway, I would say that falls
well within the range of acceptable performance for that amount of data.

Best Regards,

Joos


_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user
