The RandomAccessReader mentioned above is exactly such a low-level indexer -
it does not perceive chemistry when building the index.
Lochana may run into problems when trying to load all these molecules into
memory - this is definitely not the way to go.
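
To make the idea concrete, here is a minimal sketch of an offset-only
indexer. It is not the RandomAccessReader code and not a CDK API: the class
name SdfOffsetIndex and both method names are made up for illustration. It
assumes a line consisting of exactly "$$$$" only ever ends a record, so the
data-item oddity Andrew describes below would still split a record in the
wrong place. It also assumes a single-byte encoding when turning bytes back
into text.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of CDK): byte offsets of SD records. */
class SdfOffsetIndex {

    /**
     * Scans the file once and returns one {start, end} pair per record.
     * "end" is exclusive and points just past the "$$$$" line.
     * No chemistry is parsed; only line boundaries are inspected.
     */
    static List<long[]> index(Path sdf) throws IOException {
        List<long[]> offsets = new ArrayList<>();
        try (InputStream in =
                 new BufferedInputStream(Files.newInputStream(sdf), 1 << 16)) {
            long pos = 0;             // bytes consumed so far
            long recordStart = 0;     // offset of the current record's first byte
            int lineLen = 0;          // bytes on the current line, '\r' ignored
            boolean delimiter = true; // line so far consists only of '$'
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    if (delimiter && lineLen == 4) {
                        offsets.add(new long[] {recordStart, pos});
                        recordStart = pos;
                    }
                    lineLen = 0;
                    delimiter = true;
                } else if (b != '\r') {
                    if (lineLen >= 4 || b != '$') {
                        delimiter = false;
                    }
                    lineLen++;
                }
            }
            // Last record whose "$$$$" line has no trailing newline.
            if (delimiter && lineLen == 4) {
                offsets.add(new long[] {recordStart, pos});
            }
        }
        return offsets;
    }

    /** Reads the raw text of one record by seeking straight to its offset. */
    static String readRecord(Path sdf, long start, long end) throws IOException {
        byte[] buf = new byte[(int) (end - start)];
        try (RandomAccessFile raf = new RandomAccessFile(sdf.toFile(), "r")) {
            raf.seek(start);
            raf.readFully(buf);
        }
        // Assumption: single-byte encoding; adjust if the file is UTF-8.
        return new String(buf, StandardCharsets.ISO_8859_1);
    }
}
```

With the offsets in hand, only the 5-10 records that are actually displayed
need to be read and passed to a molecule reader; the rest of the file stays
on disk.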
Nina
On 19 September 2013 03:07, Andrew Dalke <[email protected]> wrote:
> On Sep 18, 2013, at 7:46 PM, lochana menikarachchi wrote:
> > I need to quickly load 5-10 molecules to a jTable from a large SD
> > file (say 1 million structures). ... MarvinView can load 5-10 structures
> > from extremely large files in few seconds. I wonder how marvin does this??
> > Any suggestions to replicate this functionality with CDK??
>
> I was curious about what the absolute fastest indexing time could be. If you
> can assume that the string "$$$$" is only found at the end of the record,
> and not in oddities like:
>
> > <price>
> $$$$
>
> or the extreme edge-case "S SKP" section, then you might be able to get an
> indexing time of about 3 seconds.
>
> % ls -lh chembl_14.sdf
> -rw-r--r-- 1 dalke admin 2.6G Sep 18 22:47 chembl_14.sdf
> % time fgrep -c '$$$$' chembl_14.sdf
> 1212539
> 2.629u 0.397s 0:03.05 98.6% 0+0k 8+1io 0pf+0w
>
> This depends much on your system: before I restarted my computer this took
> around 25 seconds because I had very little free memory left. This was also
> from the second time I ran the test, so the disk cache was hot.
>
>
>
> However, that grep is a bad solution for a general-purpose SD record
> tokenizer because valid SD records will break that simple scanner.
>
> A few months ago I tried writing a correct one in C. The best I could do
> takes 14 seconds for this case.
>
> I don't see how MarvinView can be much faster than this. Since you say "a
> few seconds", I wonder if it has an indexing thread in the background,
> which scans the rest of the file while you are looking at the first 10 or
> so records.
>
> I do think that a fast low-level indexer, which reports the record id and
> start/end byte positions but does not perceive chemistry, is a very useful
> tool to have. It sounds like CDK doesn't have such a thing, and the other
> toolkits I know of (Open Babel, RDKit, and OEChem) don't have one either.
>
>
> On Sep 18, 2013, at 9:57 PM, John May wrote:
> > In terms of reading sections of a file - if it's uncompressed it would
> > be nice to have a utility to do something with memory mapping (
> > http://javarevisited.blogspot.co.uk/2012/01/memorymapped-file-and-io-in-java.html
> > ).
>
> Egon reported times of:
>
> real 9m58.781s
> user 9m14.000s
> sys 0m8.528s
>
> This doesn't suggest that there's an I/O bottleneck that could be improved
> by memory mapping.
>
>
> > For the faster basic reader I've been hacking on/off at a
> > reimplementation for the last year or so. ... There may be a faster
> > implementation in future versions but as Joos says this requires some
> > significant effort.
>
> I've had the same experience.
>
> Cheers,
>
> Andrew
> [email protected]
>
>
>
>
> _______________________________________________
> Cdk-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/cdk-user
>