Hello again,

First, let me thank those who responded to my previous question regarding performance issues. Your responses were a great help! After we implemented the changes needed to improve performance when adding files, we almost instantly ran into another problem. This time it is search that is slow.
The questions are located at the end of this e-mail, so anyone wishing to skip the introduction can scroll down ...

The application we are developing will add or modify thousands of files every day, which in a relatively short time will lead to a system holding hundreds of thousands of files. All files are versioned using DeltaV versioning, and we add a few properties of our own in addition to the properties added by Slide. The files are stored in a file system and all other data, such as properties, is stored in a MySQL database (version 4.1).

During the development phase all tests are performed on a normal desktop PC running Windows 2000. The computer is equipped with an AMD 64 3200+ CPU and 1 GB of RAM.

The search operation we are running (for this particular test) looks for files with the value "1.xml" in the property "filename", which is one of our custom properties (that property should contain the real file name, while the property "displayname" should contain a more obscure name). The search query string would look something like this:

  <D:searchrequest xmlns:D="DAV:">
    <D:basicsearch>
      <D:select>
        <D:prop><D:filename/></D:prop>
      </D:select>
      <D:from>
        <D:scope>
          <D:href></D:href>
        </D:scope>
      </D:from>
      <D:where>
        <D:and>
          <D:eq>
            <D:prop><D:filename/></D:prop>
            <D:literal>1.xml</D:literal>
          </D:eq>
        </D:and>
      </D:where>
    </D:basicsearch>
  </D:searchrequest>

When doing the search above on a system that contains 100 files, it takes about 200 ms. With 1000 files it takes about 4 seconds, and with 10000 files it takes about 220 seconds. The final test we did was with 40000 files; it ended in an OutOfMemoryError.

We haven't done any tweaking of parameters as far as searching goes. We haven't looked into how the XML-based query above is translated to an SQL query, but we suspect that part of the problem lies there. Another part of the problem is the design of the MySQL database.
The table "PROPERTIES" has a column named "PROPERTY_VALUE" which holds the value of any given property; in our example above the value would be "1.xml". That column is not indexed, even though I believe it should be in order to speed up database queries. I know it could be a problem to index that column, as property values can be large at times. I guess the best thing would be to design the database in a better way, and this is something we will give some thought.

Going back to the problem at hand: because the column isn't indexed, even direct queries to the MySQL database are slow. I have done some simple queries that don't check for user access rights, etc., and with 10000 files those queries take 25-30 seconds. I'm by no means a database expert, so it is possible that my queries are poorly constructed.

I also did some further testing on this. I changed the column "PROPERTY_VALUE" to the type VARCHAR(255) and added an index on it. This brought the time for my SQL query down to 0.02 seconds. So my conclusion is that an index would be good. However, this change does nothing to improve performance when searching through Slide.

To see what happens when using Slide, I turned on query logging in MySQL. Doing the same search as before on 10000 files yielded a log file of 55 MB. Each one of the 10000 files generated 18 different SELECT queries, which is roughly 180,000 queries for a single search! There simply must be a better way to do this!

As I mentioned before, we haven't looked at the actual code that generates these queries, but I guess we will start now. My guess is that these files should be looked at:

  org.apache.slide.webdav.method.SearchMethod
  org.apache.slide.store.impl.rdbms.StandardRDBMSAdapter
  org.apache.slide.store.impl.rdbms.MySqlRDBMSAdapter
  org.apache.slide.store.impl.rdbms.MySql41RDBMSAdapter

and most of the files in the package org.apache.slide.store.impl.rdbms.expression. Are there any more classes we should look at?
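For anyone who wants to reproduce the effect of the index on "PROPERTY_VALUE" without a MySQL installation, here is a minimal sketch. It is only an illustration under several assumptions: it uses SQLite through Python's sqlite3 module as a stand-in rather than MySQL 4.1, the table is a simplified, hypothetical version of "PROPERTIES" (only the column name PROPERTY_VALUE comes from the description above; VERSION_ID and PROPERTY_NAME are assumed), and the index name idx_property_value is made up. It shows the query plan for the lookup before and after the index is created, not Slide's actual queries.

```python
import sqlite3

# Stand-in for the MySQL database: an in-memory SQLite database (assumption).
conn = sqlite3.connect(":memory:")

# Simplified, hypothetical version of the PROPERTIES table. Only the
# PROPERTY_VALUE column name is taken from the description in this e-mail.
conn.execute(
    "CREATE TABLE PROPERTIES ("
    " VERSION_ID INTEGER,"
    " PROPERTY_NAME VARCHAR(255),"
    " PROPERTY_VALUE VARCHAR(255))"
)

# One 'filename' property per file, 10000 files as in the test above.
rows = [(i, "filename", f"{i}.xml") for i in range(10000)]
conn.executemany("INSERT INTO PROPERTIES VALUES (?, ?, ?)", rows)

QUERY = "SELECT VERSION_ID FROM PROPERTIES WHERE PROPERTY_VALUE = ?"


def query_plan(connection, sql, params):
    """Return SQLite's plan for the statement as a single string."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in plan_rows)


# Without an index the lookup has to scan the whole table.
plan_before = query_plan(conn, QUERY, ("1.xml",))

# Hypothetical index name; the indexed column is the one discussed above.
conn.execute("CREATE INDEX idx_property_value ON PROPERTIES (PROPERTY_VALUE)")

# With the index the same lookup becomes an index search.
plan_after = query_plan(conn, QUERY, ("1.xml",))
matches = conn.execute(QUERY, ("1.xml",)).fetchall()

print("without index:", plan_before)
print("with index:   ", plan_after)
print("matching rows:", matches)
```

The exact wording of the plan depends on the SQLite version, so I have not included expected output. On MySQL itself the comparable check would be to run EXPLAIN on the SELECT before and after adding the index; absolute timings will of course differ from this stand-in.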
During our quest for knowledge in this area we found several references to Lucene. We are, however, reluctant to use something that is still under development in our application. Is there a roadmap or prognosis on when Lucene could be ready? Could Lucene be helpful for us?

I hope that the description given above is helpful, but now it is time for our questions. We can't be the first to use Slide to store hundreds of thousands of files and still want to search the properties, so how has this issue been handled by others? Have you used different databases? Have you modified the search queries or search methods? Are there any parameters in Domain.xml or similar files that can be added and/or modified to increase performance? Any help at all would be welcome, even if it is just asking "obvious" questions!

Finally, I would like to add that I have found Slide to be poorly documented. As a developer I understand just how boring it is to write documentation, but I also know how important detailed and correct documentation is in order to use a program or an API. Given that Slide is an open-source project, I can accept a little less documentation and a little more community support. I do, however, strongly believe that there should be more readily available documentation on how to configure and use Slide in a setting where large quantities of files are stored. We are documenting our work internally, and this document can hopefully be converted into a useful base for such documentation. But I don't want to make any promises, as I'm a developer who really dislikes writing documents ;-)

Sorry for the length of this e-mail...

Regards,
Pontus Strand