Hi All, as a solution to this issue I created a wrapper around RandomAccessFile that has a readLine() method that performs like the one from BufferedReader. In fact I copied the source from BufferedReader and made adjustments where needed. Hence it can deal with files that have different line separators.
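To illustrate the idea, here is a much simplified sketch. It is not the actual code from the repository: the class name BufferedLineFile and all of its fields are made up for this example, and it only supports reading. The point it shows is that bytes are fetched from the RandomAccessFile in large blocks, while getFilePointer() reports the logical position, i.e. the start of the next unread line, rather than the position of the underlying file.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a buffered, read-only wrapper around RandomAccessFile. */
class BufferedLineFile implements Closeable {
    private final RandomAccessFile raf;
    private final byte[] buf = new byte[8192];
    private long bufStart = 0; // file offset of buf[0]
    private int bufLen = 0;    // number of valid bytes in buf
    private int bufPos = 0;    // index of the next unread byte in buf

    BufferedLineFile(String path) throws IOException {
        raf = new RandomAccessFile(path, "r");
    }

    /** Logical position: offset of the next byte that readLine() will consume. */
    long getFilePointer() {
        return bufStart + bufPos;
    }

    /** Moves to an absolute offset and discards the buffered bytes. */
    void seek(long pos) throws IOException {
        raf.seek(pos);
        bufStart = pos;
        bufLen = 0;
        bufPos = 0;
    }

    /** Returns the next byte (0..255), refilling the buffer if needed, or -1 at EOF. */
    private int read() throws IOException {
        if (bufPos >= bufLen) {
            bufStart += bufLen; // the underlying file is positioned here
            bufLen = raf.read(buf, 0, buf.length);
            bufPos = 0;
            if (bufLen <= 0) {
                bufLen = 0;
                return -1;
            }
        }
        return buf[bufPos++] & 0xFF;
    }

    /** Reads one line terminated by \n, \r or \r\n; returns null at EOF. */
    String readLine() throws IOException {
        int b = read();
        if (b < 0) {
            return null;
        }
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b >= 0 && b != '\n' && b != '\r') {
            line.write(b);
            b = read();
        }
        if (b == '\r') {
            int next = read();
            if (next >= 0 && next != '\n') {
                bufPos--; // lone \r: push the byte back for the next line
            }
        }
        // Same byte-to-char mapping as RandomAccessFile.readLine()
        return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }
}
```

Because getFilePointer() accounts for the bytes still sitting in the buffer, the offset it returns right after reading a delimiter line can be stored in an index and passed to seek() later on.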
So if RandomAccessFile is replaced with this wrapper class, called OptimizedRandomAccessFile, the indexing performance of CDK's RandomAccessReader https://github.com/egonw/cdk/blob/master/src/main/org/openscience/cdk/io/random/RandomAccessReader.java should increase about 100-fold. The wrapper can be found here: https://bitbucket.org/kienerj/io

As an example, a file with 131'299 molecules (540 MB) can be indexed in about 2 seconds:

    private void index() throws IOException {
        logger.debug("Generating Index...");
        sdfIndex.put(0, 0L); // first record
        int recordIndex = 1;
        String line;
        while ((line = raf.readLine()) != null) { // raf = OptimizedRandomAccessFile
            if (line.equals(DELIMITER)) { // DELIMITER = $$$$
                long recordOffset = raf.getFilePointer(); // returns start of next line!
                sdfIndex.put(recordIndex, recordOffset);
                recordIndex++;
            }
        }
        // SD files terminate with DELIMITER, hence the last entry
        // in the index must be removed as no record will be there.
        sdfIndex.remove(recordIndex - 1);
        logger.debug("Index generated");
    }

Using that index, the 10 records from 100'000 to 100'010 can be accessed and returned in less than 2 milliseconds.

Since CDK's RandomAccessReader saves the index to disk, indexing has to be run only once. As an estimate, it would take maybe 20-30 seconds to create the index for 1 million records, assuming the makeIndex() method does not have some unforeseen overhead. Anyway, I would say that surely falls in the range of acceptable performance for that amount of data.

Best Regards,
Joos
_______________________________________________
Cdk-user mailing list
https://lists.sourceforge.net/lists/listinfo/cdk-user

