Hello again,

First, let me thank those who responded to my previous question regarding
performance issues. Your responses were a great help! After we implemented
the changes needed to improve performance when adding files, we ran into
another problem almost immediately. This time it is search that is slow.

The questions are located at the end of this e-mail, so anyone wishing to
skip the introduction can scroll down ...

The application we are developing will add or modify thousands of files
every day, so within a relatively short time the system will hold hundreds
of thousands of files. All files are versioned using Delta-V versioning, and
we add a few properties of our own on top of the properties added by Slide.
The files are stored in a file system and all other data, such as
properties, are stored in a MySQL database (version 4.1). During the
development phase all tests are performed on a normal desktop PC running
Windows 2000. The computer is equipped with an AMD 64 3200+ CPU and 1 GB of
RAM.

The search operation we are running (for this particular test) looks for
files with the property value "1.xml" in the property "filename", which is
one of our custom properties (that property should contain the real file
name while the property "displayname" should contain a more obscure name).
The search query string would look something like this:

<D:searchrequest xmlns:D ="DAV:">
  <D:basicsearch>
    <D:select>
      <D:prop><D:filename/></D:prop>
    </D:select>
    <D:from>
      <D:scope>
        <D:href></D:href>
      </D:scope>
    </D:from>
    <D:where>
      <D:and>
        <D:eq>
          <D:prop><D:filename/></D:prop>
          <D:literal>1.xml</D:literal>
        </D:eq>
      </D:and>
    </D:where>
  </D:basicsearch>
</D:searchrequest>
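
For anyone who wants to reproduce the test, here is a minimal sketch of how
the same request body could be built and sent from a small Python script. It
is not how our application issues the search. The helper names
(build_search_request, send_search) and any host, port or path passed to
send_search are assumptions made for illustration only.

```python
import http.client
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
# Serialize elements in the DAV: namespace with the "D" prefix, as in the query above.
ET.register_namespace("D", DAV_NS)


def dav(tag):
    """Return the namespace-qualified ElementTree name for a DAV: element."""
    return f"{{{DAV_NS}}}{tag}"


def build_search_request(prop_name, literal, scope_href=""):
    """Build a basicsearch body that matches prop_name == literal."""
    root = ET.Element(dav("searchrequest"))
    basic = ET.SubElement(root, dav("basicsearch"))

    # <D:select><D:prop><D:prop_name/></D:prop></D:select>
    select_prop = ET.SubElement(ET.SubElement(basic, dav("select")), dav("prop"))
    ET.SubElement(select_prop, dav(prop_name))

    # <D:from><D:scope><D:href>...</D:href></D:scope></D:from>
    scope = ET.SubElement(ET.SubElement(basic, dav("from")), dav("scope"))
    ET.SubElement(scope, dav("href")).text = scope_href

    # <D:where><D:and><D:eq>...</D:eq></D:and></D:where>
    where = ET.SubElement(basic, dav("where"))
    eq = ET.SubElement(ET.SubElement(where, dav("and")), dav("eq"))
    ET.SubElement(ET.SubElement(eq, dav("prop")), dav(prop_name))
    ET.SubElement(eq, dav("literal")).text = literal

    return ET.tostring(root, encoding="unicode")


def send_search(host, port, path, body):
    """Send the body with the SEARCH method; host, port and path are hypothetical."""
    conn = http.client.HTTPConnection(host, port)
    try:
        conn.request(
            "SEARCH",
            path,
            body=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()


request_body = build_search_request("filename", "1.xml")
print(request_body)
```

Only build_search_request runs here; send_search needs a reachable server and
is included just to show where the body would go.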

When running the search above on a system that contains 100 files it takes
about 200 ms. With 1000 files it takes about 4 seconds, and with 10000 files
it takes about 220 seconds. The final test we did was with 40000 files, and
it ended in an OutOfMemoryError. We haven't tuned any search-related
parameters.
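
To put a number on how badly this scales, here is a small sketch that
computes the growth exponent between consecutive measurements (the slope on
a log-log plot). It uses only the three timings quoted above. An exponent of
1 would mean search time grows linearly with the number of files.

```python
import math

# Timings reported above: (number of files, seconds per search).
measurements = [(100, 0.2), (1000, 4.0), (10000, 220.0)]


def growth_exponents(points):
    """Return the log-log slope between each pair of consecutive measurements."""
    slopes = []
    for (n1, t1), (n2, t2) in zip(points, points[1:]):
        slopes.append(math.log(t2 / t1) / math.log(n2 / n1))
    return slopes


exponents = growth_exponents(measurements)
for (n1, _), (n2, _), k in zip(measurements, measurements[1:], exponents):
    print(f"{n1} -> {n2} files: time grows roughly as n^{k:.2f}")
```

The exponents come out at about 1.3 and 1.7, so the cost per file is itself
increasing as the repository grows, which is worse than a plain full scan
would explain.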

We haven't looked into how the XML-based query above is translated into an
SQL query, but we suspect that part of the problem lies there. Another part
of the problem is the design of the MySQL database. The table "PROPERTIES"
has a column named "PROPERTY_VALUE" which holds the value of any given
property; in our example above the value would be "1.xml". That column is
not indexed, even though I believe it should be in order to speed up
database queries. I know that indexing the column could be a problem, as
property values can be large at times. I guess the best solution would be a
better database design, which is something we will give some thought.

Going back to the problem at hand, the fact that the column isn't indexed
makes even direct queries against the MySQL database slow. I have run some
simple queries that don't check for user access rights, etc., and with 10000
files those queries take 25-30 seconds. I'm by no means a database expert,
so it is possible that my queries are poorly constructed. I also did some
further testing on this. I changed the column "PROPERTY_VALUE" to the type
VARCHAR(255) and added an index on it. This cut the run time of my SQL query
to 0.02 seconds. So my conclusion is that an index would be good.
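
The effect of such an index can be illustrated with a minimal sketch. Note
the assumptions: it uses an in-memory SQLite database as a stand-in for
MySQL 4.1, the table is a simplified two-column version of "PROPERTIES" (the
PROPERTY_NAME column and the index name idx_property_value are made up for
the example), and it only shows how the query plan changes, not the timings
quoted above.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for sql as a single readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


# Simplified stand-in for the PROPERTIES table (not the real Slide schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PROPERTIES ("
    "PROPERTY_NAME VARCHAR(255), PROPERTY_VALUE VARCHAR(255))"
)
conn.executemany(
    "INSERT INTO PROPERTIES VALUES (?, ?)",
    [("filename", f"{i}.xml") for i in range(10000)],
)

lookup = "SELECT * FROM PROPERTIES WHERE PROPERTY_VALUE = ?"

# Without an index the database has to scan every row.
plan_before = query_plan(conn, lookup, ("1.xml",))

# With an index on the value column the lookup can use the index instead.
conn.execute("CREATE INDEX idx_property_value ON PROPERTIES (PROPERTY_VALUE)")
plan_after = query_plan(conn, lookup, ("1.xml",))

matches = conn.execute(lookup, ("1.xml",)).fetchall()
print("before:", plan_before)
print("after: ", plan_after)
print("rows:  ", matches)
```

For long property values, MySQL also allows an index on only the first N
characters of a column (a prefix index), which might avoid having to
shorten the column itself. We have not tested that.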

However, this change does nothing to improve performance when going through
Slide. To see what happens in that case I turned on query logging in MySQL.
Running the same search as before on 10000 files produced a log file of
55 MB. Each one of the 10000 files generated 18 different SELECT queries,
which is roughly 180000 queries for a single search! There simply must be a
better way to do this. As I mentioned before, we haven't looked at the
actual code that generates these queries, but I guess we will start now. My
guess is that these files should be looked at:

org.apache.slide.webdav.method.SearchMethod
org.apache.slide.store.impl.rdbms.StandardRDBMSAdapter
org.apache.slide.store.impl.rdbms.MySqlRDBMSAdapter
org.apache.slide.store.impl.rdbms.MySql41RDBMSAdapter

and most of the files in the package
org.apache.slide.store.impl.rdbms.expression. Are there any more classes we
should look at?
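
Before reading the code it may help to know which statements dominate the
query log mentioned above. The following is a rough sketch for grouping the
logged SELECT statements by shape. It assumes each logged statement sits on
one line containing the word "Query" followed by the SQL text, which may not
match every MySQL log layout. The sample lines and table names are invented
and are not taken from our log or from the Slide schema.

```python
import re
from collections import Counter

# Assumed log layout: "... Query <sql text>" on a single line.
QUERY_RE = re.compile(r"\bQuery\s+(select\b.*)", re.IGNORECASE)
STRING_RE = re.compile(r"'(?:[^'\\]|\\.)*'")  # quoted string literals
NUMBER_RE = re.compile(r"\b\d+\b")            # standalone numeric literals


def normalize(statement):
    """Replace literals with '?' so statements of the same shape group together."""
    statement = STRING_RE.sub("?", statement)
    statement = NUMBER_RE.sub("?", statement)
    return " ".join(statement.split()).lower()


def summarize(lines):
    """Count SELECT statements per normalized shape."""
    counts = Counter()
    for line in lines:
        match = QUERY_RE.search(line)
        if match:
            counts[normalize(match.group(1))] += 1
    return counts


# Invented sample lines, only to show the grouping.
sample = [
    "   3 Query   SELECT value FROM example_table WHERE id = 1",
    "   3 Query   SELECT value FROM example_table WHERE id = 2",
    "   3 Query   select name from other_table where name = '1.xml'",
    "   3 Connect root@localhost on example",
]

summary = summarize(sample)
for shape, count in summary.most_common():
    print(count, shape)
```

Since summarize accepts any iterable of lines, an open file object can be
passed in directly, so a 55 MB log does not have to be loaded into memory at
once.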

During our quest for knowledge in this area we found several references to
Lucene. We are, however, reluctant to use something in our application that
is still under development. Is there a roadmap or an estimate of when
Lucene could be ready? Could Lucene be helpful for us?

I hope that the description given above is helpful, but now it is time for
our questions.

We can't be the first to use Slide to store hundreds of thousands of files
and still want to search the properties, so how has this issue been handled
by others? Have you used different databases? Have you modified the search
queries or search methods? Are there any parameters in Domain.xml or similar
files that can be added and/or modified to increase performance? Any help at
all would be welcome, even if it is just asking "obvious" questions!

Finally, I would like to add that I have found Slide to be poorly
documented. As a developer I understand just how boring it is to write
documentation, but I also know how important detailed and correct
documentation is in order to use a program or an API. Given that Slide is an
open-source project I can accept a little less documentation and a little
more community support. I do, however, strongly believe that there should be
more readily available documentation on how to configure and use Slide in a
setting where large quantities of files are stored in it. We are documenting
our work internally, and that document can hopefully be turned into a useful
basis for such documentation. But I don't want to make any promises, as I'm
a developer who really dislikes writing documents ;-)

Sorry for the length of this e-mail...

Regards,
Pontus Strand
