Re: [Cdk-user] Reading large SD Files

Nina Jeliazkova Fri, 20 Sep 2013 07:03:40 -0700

Hi Joos,

On 20 September 2013 16:50, Joos Kiener <[email protected]> wrote:


>  Hi Nina,
>
> yes, I agree it's non-trivial, because Java's file I/O is not a very good
> API.
>

Java NIO is considered a better way, but it is quite a different API.


> getLineSeparator(String data) in my code is a private method that checks
> whether the passed-in string contains \r\n; if yes, that is chosen as the
> line separator, else \n.
>


Yes, sorry, I was referring to the Java "OS independent" way to get the
line separator

System.getProperty("line.separator");
System.lineSeparator()   (JDK 7)
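
To illustrate why the OS value and the separator actually used in a file are two different things, here is a minimal sketch. The class SeparatorCheck and the helper detectSeparator are hypothetical names for illustration only; they are not part of CDK or of Joos's code.

```java
class SeparatorCheck {

    /**
     * Hypothetical helper: returns the first line separator found in the
     * text ("\n", "\r\n" or "\r"), or null if the text contains none.
     */
    static String detectSeparator(String data) {
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < data.length() && data.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // A fragment as it might be written by a Windows tool.
        String fragment = "M  END\r\n$$$$\r\n";

        // Depends on the machine running this code, not on the file.
        boolean osUsesCrLf = System.lineSeparator().equals("\r\n");
        // Depends only on the data that was read.
        boolean fileUsesCrLf = "\r\n".equals(detectSeparator(fragment));

        System.out.println("OS uses CRLF:   " + osUsesCrLf);
        System.out.println("File uses CRLF: " + fileUsesCrLf);
    }
}
```

Note that this only reports the first separator it sees, so it says nothing about files that mix separators.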


> So it works for all files that have a consistent separator.
>

Right. My comments are due to previous experience with SD files that were
inconsistent in this respect.
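
For such inconsistent files, one possible approach is to scan the bytes directly and treat \n, \r\n and \r all as line ends. The following is only a sketch under that assumption, not the CDK implementation and not a tested replacement for it; the class name SdfOffsetIndexer and its methods are made up for illustration. It records the byte offset at which each record starts, where a record ends with a line consisting of exactly "$$$$".

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SdfOffsetIndexer {

    /** Returns the byte offset of the first byte of every record in the stream. */
    static List<Long> indexRecordStarts(InputStream in) throws IOException {
        List<Long> starts = new ArrayList<>();
        InputStream buffered = new BufferedInputStream(in, 1 << 16);
        long pos = 0;                 // offset of the next byte to be read
        boolean pendingRecord = true; // the next byte begins a new record
        boolean prevCR = false;       // previous byte was \r
        int lineLen = 0;              // bytes seen on the current line
        boolean onlyDollars = true;   // current line contains only '$' so far
        int b;
        while ((b = buffered.read()) != -1) {
            long here = pos++;
            if (b == '\n' && prevCR) {
                // second half of a \r\n pair, already handled at the \r
                prevCR = false;
                continue;
            }
            prevCR = false;
            if (pendingRecord) {
                // a record may start with a blank line, so any byte counts
                starts.add(here);
                pendingRecord = false;
            }
            if (b == '\n' || b == '\r') {
                if (onlyDollars && lineLen == 4) {
                    pendingRecord = true; // the line just ended was "$$$$"
                }
                lineLen = 0;
                onlyDollars = true;
                prevCR = (b == '\r');
            } else {
                if (b != '$') {
                    onlyDollars = false;
                }
                lineLen++;
            }
        }
        return starts;
    }

    /** Convenience overload for in-memory data. */
    static List<Long> indexRecordStarts(byte[] data) {
        try {
            return indexRecordStarts(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three records, each terminated with a different separator style.
        String sdf = "mol A\n$$$$\r\nmol B\r$$$$\rmol C\r\n$$$$\n";
        System.out.println(indexRecordStarts(sdf.getBytes(java.nio.charset.StandardCharsets.US_ASCII)));
    }
}
```

For the small test string in main this yields the offsets 0, 12 and 23. Blank lines after the final "$$$$" would be indexed as an extra record, so a real implementation would need to handle that case, and I have not measured how this compares in speed to the existing reader.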


> But anyway it's just an example not meant for any production use. (there
> are other issues I'm aware of).
>

It might indeed be a good idea to improve the CDK readers. I am not sure I
have time for that at the moment.
Without checking the cdk-io code, I would say almost all readers in CDK use
readLine() to read text files (with the XML readers perhaps an exception).
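
As a small illustration of what readLine() buys in terms of separator transparency, the sketch below feeds text with mixed separators to java.io.BufferedReader. The class name MixedSeparatorLines and the helper readLines are hypothetical and not taken from cdk-io.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class MixedSeparatorLines {

    /** Splits the text into lines using BufferedReader.readLine(). */
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() accepts \n, \r and \r\n as line terminators
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Three different separators in one short text.
        String mixed = "mol A\n$$$$\r\nmol B\r$$$$\r";
        for (String line : readLines(mixed)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The lines come back identical regardless of the separator used, but the reader does not report how many bytes each terminator occupied, which is why it cannot be used as-is to compute byte offsets for an index.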

Best regards,
Nina


> Best Regards,
>
> Joos
>
> Am 20.09.2013 14:23, schrieb Nina Jeliazkova:
>
> Hi Joos, All,
>
>
>  On 20 September 2013 15:08, Joos Kiener <[email protected]> wrote:
>
>>  Hi all,
>>
>>  the issue with RandomAccessSDFReader, more specifically the underlying
>> RandomAccessReader, is that it uses RandomAccessFile's readLine() method.
>> That method performs very poorly, because it reads the file one byte at a
>> time without any buffering. Hence the indexing takes very long. A solution
>> would be to rewrite the index method without using readLine().
>>
>>
>>
>  True, with a caveat (as I've already pointed out in a personal reply to
> Andrew Dalke, who commented with the same reasoning: readLine is bad).
>
>  Java readLine-based readers are slow, but they provide transparency with
> respect to the different line separators (e.g. \n, \r\n, \r) that originate
> from different operating systems.
>
>  Specifically, it is not correct to search only for \n$$$$\n, nor only
>  for the "\r\n" variant.
>
>  In SD files all combinations of line separators can occur, not only
> within one file but even within one record. I can only guess how these
> files were constructed, but they do exist in the wild. This is exactly the
> reason for using the Java line reader. Of course, it could be rewritten to
> match bytes explicitly without relying on existing Java classes.
>
>  Unrelated to performance - using getLineSeparator() in this context is
> not quite right, as this method will return the OS specific line separator,
> while the SD file being read may have been generated on a different OS.
>
>
>>  I have not looked at the indexing method and what exactly it does, but
>> here is a way to index the start (as a byte offset) of every record in an
>> SD file into a Map<Integer,Long>; see below.
>>  Indexing takes 3-4 seconds (on an SSD...) for a 540 MB SD file
>> (from ZINC) containing approx. 131'000 records.
>>  Hence that would be about 30 seconds for 1 million compounds.
>>
>>
>> private void index() throws IOException {
>>
>>         sdfIndex.put(0, 0L); // first record starts at byte 0
>>         byte[] buffer = new byte[8192];
>>         int recordIndex = 1; // index 0 is already taken by the first record
>>         int newLineOffset = 1; // 1 or 2 depending if it is \n or \r\n
>>         int bytesRead;
>>
>>         // Known limitation: a DELIMITER that is split across two reads
>>         // is not found by this simple per-buffer search.
>>         while ((bytesRead = raf.read(buffer)) != -1) {
>>
>>             // decode only the bytes actually read, not stale buffer content
>>             String data = new String(buffer, 0, bytesRead, "US-ASCII");
>>
>>             // determine the new line delimiter once, before the first
>>             // delimiter has been indexed;
>>             // can be \n for sd-files also on Windows if
>>             // they were generated on linux or by certain toolkits
>>             if (recordIndex == 1) {
>>                 if (getLineSeparator(data).equals("\r\n")) {
>>                     newLineOffset = 2;
>>                 }
>>             }
>>
>>             // collect the positions of all delimiters in this buffer
>>             ArrayList<Integer> recordEnds = new ArrayList<>();
>>             int index = data.indexOf(DELIMITER);
>>             while (index >= 0) {
>>                 recordEnds.add(index);
>>                 index = data.indexOf(DELIMITER, index + 1);
>>             }
>>             long offsetBeforeRead = raf.getFilePointer() - bytesRead;
>>             for (int position : recordEnds) {
>>                 // we want to start reading after the delimiter
>>                 // on the next new line
>>                 // aaaaaa
>>                 // $$$$
>>                 // bbbbbb <- get offset where this line starts
>>                 long recordOffset = offsetBeforeRead + position
>>                         + DELIMITER.length() + newLineOffset;
>>                 sdfIndex.put(recordIndex, recordOffset);
>>                 recordIndex++;
>>             }
>>         }
>>         // sd-files terminate with DELIMITER
>>         // hence the last entry in index must be removed as no
>>         // record will be there.
>>         sdfIndex.remove(recordIndex - 1);
>>     }
>>
>>
>>  See the attached full implementation of the above idea. Note that it
>> returns text data only, no chemistry. (It works, but is not really tested;
>> use at your own risk!)
>>
>>
>>
>  It is just great that you took the time to improve that code. It's quite
> old already (written ~2007 and originally intended for indexing a file of
> about 40K compounds, so never tested on 1 million...). And in fact it is
> not really used anymore, at least by the original author :)
>
>
>
>
>>
>> 2013/9/19 lochana menikarachchi <[email protected]>
>>
>>>
>>>  Joos,
>>>
>>>  You are right. I should use a local database instead of SD files...
>>>
>>
>
>  Indeed. This number of records is good for testing performance, but for
> real use I would vote for a database.
>
>  Best regards,
> Nina
>
>
>
>>
>>>  Lochana
>>>   ------------------------------
>>>  *From:* Joos Kiener <[email protected]>
>>> *To:* lochana menikarachchi <[email protected]>
>>> *Cc:* "[email protected]" <[email protected]>
>>> *Sent:* Thursday, September 19, 2013 9:37 AM
>>> *Subject:* Re: [Cdk-user] Reading large SD Files
>>>
>>>   I played around a bit and came up with a crude solution, as I
>>> mentioned in my initial response.
>>>
>>>  index all occurrences of "$$$$" -> takes 3-4 seconds for a file with
>>> 131'000 records
>>>
>>>  use separate thread to index to increase performance but current
>>> implementation requires that index is fully built. This is an issue as you
>>> need to have 2 access mechanisms, index based and not-index based.
>>>
>>>
>>>  Use BufferedReader to go to the indexed line, e.g.
>>>
>>> for (int i = 0; i < linesToRead; i++) {
>>>                 bufferedReader.readLine();
>>>             }
>>>
>>>  yeah, not ideal but it actually is faster than I expected.
>>>
>>>  add caching to it.
>>>
>>>
>>>  But a question remains:
>>>
>>> What is your actual goal? Why can't you use Marvin, for commercial use?
>>>  1 million is a lot. Using a real database comes to mind.
>>>
>>>
>>>
>>> 2013/9/19 lochana menikarachchi <[email protected]>
>>>
>>>  Hi Nina,
>>>
>>>  I did try the RandomAccessSDFReader. It took a few minutes to build the
>>> index for an SD file with 50,000 structures. What I am saying is that
>>> whatever MarvinView does to build its index (if it is using an index) is
>>> much faster. I am wondering how it does that.
>>>
>>>  Lochana
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> LIMITED TIME SALE - Full Year of Microsoft Training For Just $49.99!
>>> 1,500+ hours of tutorials including VisualStudio 2012, Windows 8,
>>> SharePoint
>>> 2013, SQL 2012, MVC 4, more. BEST VALUE: New Multi-Library Power Pack
>>> includes
>>> Mobile, Cloud, Java, and UX Design. Lowest price ever! Ends 9/20/13.
>>>
>>> http://pubads.g.doubleclick.net/gampad/clk?id=58041151&iu=/4140/ostg.clktrk
>>> _______________________________________________
>>> Cdk-user mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/cdk-user
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>
>

