Hi Joos,
On 20 September 2013 16:50, Joos Kiener <[email protected]> wrote:
> Hi Nina,
>
> yes, I agree it's non-trivial, because Java's file IO is not a very good
> API.
>
Java NIO is considered the better way, but it is quite a different API.
> getLineSeparator(String data) in my code is a private method that checks
> whether the passed-in string contains \r\n; if yes, that is chosen as the line
> separator, else \n.
>
Yes, sorry, I was referring to the Java "OS independent" way to get the
line separator:
System.getProperty("line.separator");
System.lineSeparator() (since JDK 7)
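
To make the distinction concrete, here is a small sketch (LineSeparatorCheck and countTerminators are made-up names for illustration only). The platform separator comes from the JVM, while the terminators actually present in a file have to be counted from its content.

```java
/** Illustrative helper, not part of CDK or the JDK. */
final class LineSeparatorCheck {

    /** Counts terminators in data: [0] = \r\n pairs, [1] = lone \n, [2] = lone \r. */
    static int[] countTerminators(String data) {
        int[] counts = new int[3];
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\r') {
                if (i + 1 < data.length() && data.charAt(i + 1) == '\n') {
                    counts[0]++;
                    i++; // skip the \n that belongs to this \r\n pair
                } else {
                    counts[2]++;
                }
            } else if (c == '\n') {
                counts[1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // What the JVM reports depends only on the OS it runs on.
        String platform = System.lineSeparator();
        System.out.println("platform separator length: " + platform.length());

        // What a file contains is independent of that, and may be mixed.
        int[] counts = countTerminators("a\r\nb\nc\rd\n");
        System.out.println("CRLF=" + counts[0] + " LF=" + counts[1] + " CR=" + counts[2]);
    }
}
```

A file for which more than one of the three counts is non-zero is exactly the inconsistent kind.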
> So it works for all files that have a consistent separator.
>
Right. My comments come from previous experience with SD files that are
inconsistent in this respect.
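
A minimal sketch of an indexer that tolerates such files, assuming a record ends with a line that is exactly "$$$$" and that \n, \r\n and \r may all occur in the same file. SdfOffsetScanner is a hypothetical helper, not CDK code. It works on raw bytes and keeps its state between buffers, so a delimiter split across two reads is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not CDK code): finds the byte offset at which each SD record starts. */
final class SdfOffsetScanner {
    private final List<Long> offsets = new ArrayList<>();
    private long pos = 0;                // absolute offset of the byte being examined
    private boolean atLineStart = true;  // the next byte begins a new line
    private boolean startsRecord = true; // that new line begins a new record
    private boolean pendingCr = false;   // previous byte was \r; a following \n belongs to it
    private boolean onlyDollars = true;  // current line consists of '$' only so far
    private int dollars = 0;             // number of '$' seen on the current line

    /** Consumes the first len bytes of buf; call repeatedly with consecutive chunks. */
    void feed(byte[] buf, int len) {
        for (int i = 0; i < len; i++, pos++) {
            byte b = buf[i];
            if (pendingCr) {
                pendingCr = false;
                if (b == '\n') continue; // second half of a \r\n terminator
            }
            if (atLineStart) {
                if (startsRecord) offsets.add(pos);
                atLineStart = false;
                startsRecord = false;
                onlyDollars = true;
                dollars = 0;
            }
            if (b == '\n' || b == '\r') {
                // The line just ended; it is a delimiter only if it was exactly "$$$$".
                startsRecord = onlyDollars && dollars == 4;
                atLineStart = true;
                pendingCr = (b == '\r');
            } else if (b == '$' && onlyDollars && dollars < 4) {
                dollars++;
            } else {
                onlyDollars = false;
            }
        }
    }

    List<Long> offsets() {
        return offsets;
    }

    /** Reads the whole stream in 8 KB chunks and returns all record start offsets. */
    static List<Long> index(InputStream in) throws IOException {
        SdfOffsetScanner scanner = new SdfOffsetScanner();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            scanner.feed(buffer, n);
        }
        return scanner.offsets();
    }

    public static void main(String[] args) throws IOException {
        // Four tiny "records" using \n, \r\n and \r in one file.
        byte[] sd = "a\n$$$$\nb\r\n$$$$\r\nc\r$$$$\rd\n$$$$\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(index(new ByteArrayInputStream(sd))); // prints [0, 7, 16, 23]
    }
}
```

No offset is recorded after the final $$$$, because an offset is only added once a byte of the next line is actually seen. Blank lines after the final $$$$ would still be reported as one more record start, so a real implementation would have to filter that case.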
> But anyway, it's just an example, not meant for any production use (there
> are other issues I'm aware of).
>
It might indeed be a good idea to improve the CDK readers. Not sure I have
time for that at the moment.
Without checking the cdk-io code, I would say almost all readers in CDK use
readLine() to read text files (with XML readers perhaps an exception).
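
For reference, a small sketch of the behaviour those readers rely on: BufferedReader.readLine() treats \n, \r and \r\n alike as line terminators. MixedSeparatorDemo is only an illustrative name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative demo, not CDK code. */
final class MixedSeparatorDemo {

    /** Splits text into lines the way BufferedReader.readLine() does. */
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                result.add(line);
            }
        } catch (IOException e) {
            // Cannot really happen with a StringReader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three different terminators in one input.
        System.out.println(lines("M  END\r\n$$$$\rnext\nlast")); // prints [M  END, $$$$, next, last]
    }
}
```

The flip side is that readLine() drops the terminator, so the byte length of a line, and hence a byte offset, cannot be recovered from the returned strings alone.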
Best regards,
Nina
> Best Regards,
>
> Joos
>
> On 20.09.2013 14:23, Nina Jeliazkova wrote:
>
> Hi Joos, All,
>
>
> On 20 September 2013 15:08, Joos Kiener <[email protected]> wrote:
>
>> Hi all,
>>
>> the issue with RandomAccessSDFReader, more specifically the underlying
>> RandomAccessReader, is that it uses RandomAccessFile's readLine() method.
>> That method performs very, very badly (because it's badly
>> written). Hence the indexing takes very long. A solution would be to
>> rewrite the index method without using the readLine() method.
>>
>>
>>
> True. With a caveat (as I've already pointed out in a personal reply to
> Andrew Dalke, who commented with the same reasoning == readline is bad).
>
> Java readLine-based readers are slow, but they are transparent with
> respect to the different line separators (e.g. \n, \r\n, \r) that originate
> from different operating systems.
>
> Specifically, it is not correct to search only for \n$$$$\n, nor only for
> "\r\n".
>
> In SD files all combinations of line separators can occur, not only
> within one file, but even within one record. I can only guess how these files
> were constructed, but they do exist in the wild. And this is exactly the reason
> for using the Java line reader. Of course it could be rewritten to
> match bytes explicitly without relying on existing Java classes.
>
> Unrelated to performance - using getLineSeparator() in this context is
> not quite right, as this method will return the OS specific line separator,
> while the SD file being read may have been generated on a different OS.
>
>
>> I have not looked at the indexing method and what exactly it does, but
>> here is a way to index the start (as a byte offset) of every record in an
>> SD file into a Map<Integer,Long>, see below.
>> Indexing takes 3-4 seconds (on an SSD...) for a 540 MB SD file
>> (from ZINC) containing approx. 131'000 records.
>> Hence that would be about 30 sec. for 1 million compounds.
>>
>>
>> private void index() throws IOException {
>>     // Uses the fields raf (RandomAccessFile), sdfIndex (Map<Integer, Long>)
>>     // and DELIMITER ("$$$$"); needs java.nio.charset.StandardCharsets.
>>     sdfIndex.put(0, 0L); // first record starts at byte 0
>>     byte[] buffer = new byte[8192];
>>     int recordIndex = 1; // index 0 is already taken by the first record
>>     int newLineOffset = 1; // 1 or 2 depending on whether it is \n or \r\n
>>     int carry = 0; // tail of the previous chunk, so a split "$$$$" is found
>>     boolean firstChunk = true;
>>     int bytesRead;
>>
>>     while ((bytesRead = raf.read(buffer, carry, buffer.length - carry)) != -1) {
>>         int length = carry + bytesRead;
>>         // decode only the valid bytes, not stale data from an earlier read
>>         String data = new String(buffer, 0, length, StandardCharsets.US_ASCII);
>>
>>         // determine the new line delimiter once; it can be \n for SD files
>>         // on Windows too if they were generated on Linux or by certain toolkits
>>         if (firstChunk && getLineSeparator(data).equals("\r\n")) {
>>             newLineOffset = 2;
>>         }
>>         firstChunk = false;
>>
>>         long chunkStart = raf.getFilePointer() - length;
>>         int index = data.indexOf(DELIMITER);
>>         while (index >= 0) {
>>             // we want to start reading after the delimiter
>>             // on the next new line
>>             // aaaaaa
>>             // $$$$
>>             // bbbbbb <- get offset where this line starts
>>             sdfIndex.put(recordIndex++, chunkStart + index
>>                     + DELIMITER.length() + newLineOffset);
>>             index = data.indexOf(DELIMITER, index + DELIMITER.length());
>>         }
>>         // keep the last bytes in case a delimiter straddles two chunks
>>         carry = Math.min(DELIMITER.length() - 1, length);
>>         System.arraycopy(buffer, length - carry, buffer, 0, carry);
>>     }
>>     // SD files terminate with DELIMITER, hence the last entry in the
>>     // index must be removed as no record will be there.
>>     if (recordIndex > 1) {
>>         sdfIndex.remove(recordIndex - 1);
>>     }
>> }
>>
>>
>> Appended is a full implementation of the above idea. Note that it returns
>> text data only, no chemistry. (It works, but is not really tested, use at
>> your own risk!!!)
>>
>>
>>
> It is just great that you took the time to improve that code. It's quite
> old already (written ~2007 and originally intended for indexing a file of about
> 40K compounds, so never tested on 1 million...). And in fact it is not
> really used anymore, at least by the original author :)
>
>
>
>
>>
>> 2013/9/19 lochana menikarachchi <[email protected]>
>>
>>>
>>> Joos,
>>>
>>> You are right. I should use a local database instead of SD files...
>>>
>>
>
> Indeed. This number of records is good for testing performance, but for
> real use I would vote for a database.
>
> Best regards,
> Nina
>
>
>
>>
>>> Lochana
>>> ------------------------------
>>> *From:* Joos Kiener <[email protected]>
>>> *To:* lochana menikarachchi <[email protected]>
>>> *Cc:* "[email protected]" <[email protected]>
>>> *Sent:* Thursday, September 19, 2013 9:37 AM
>>> *Subject:* Re: [Cdk-user] Reading large SD Files
>>>
>>> I played around a bit and came up with a crude solution, as I
>>> mentioned in my initial response.
>>>
>>> index all occurrences of "$$$$" -> takes 3-4 seconds for a file with
>>> 131'000 records
>>>
>>> use a separate thread for indexing to increase performance, but the current
>>> implementation requires that the index is fully built. This is an issue as you
>>> need to have 2 access mechanisms, index-based and non-index-based.
>>>
>>>
>>> Use BufferedReader to go to the indexed line, e.g.
>>>
>>> for (int i = 0; i < linesToRead; i++) {
>>> bufferedReader.readLine();
>>> }
>>>
>>> yeah, not ideal but it actually is faster than I expected.
>>>
>>> add caching to it.
>>>
>>>
>>> But a question remains:
>>>
>>> What is your actual goal? Why can't you use Marvin, for commercial use?
>>> 1 million is a lot. Using a real database comes to mind.
>>>
>>>
>>>
>>> 2013/9/19 lochana menikarachchi <[email protected]>
>>>
>>> Hi Nina,
>>>
>>> I did try the RandomAccessSDFReader. It took a few minutes to build the
>>> index for an SD file with 50,000 structures. What I am saying is that whatever
>>> MarvinView does to build its index (if it is using an index) is much faster. I
>>> am wondering how it does that.
>>>
>>> Lochana
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
>>> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8,
>>> SharePoint
>>> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack
>>> includes
>>> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/20/13.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Cdk-user mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/cdk-user
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
>