On Oct 8, 2009, at 8:28 PM, David Melgar wrote:

I read a little on ICU and now understand that sqlite by default does not handle case insensitive unicode. Is there an easy way to make sqlite use ICU on the Mac, or do I have to build it myself with ICU enabled?

Probably the easiest thing to do (besides just using Core Data) is register custom functions with the SQLite API that go through CFString. The SQLite APIs for this are fairly straight forward. The ICU dylib on Mac OS X is not public API. There are a variety of references to building ICU or otherwise addressing this issue on the web. Should you do that, you can register custom functions leveraging ICU with SQLite as one would using CFString. I'd strongly recommend against building and statically linking SQLite yourself.

Based on the derived property example, it seems that I would need to duplicate any text fields I commonly search on where I want to support case insensitive queries.

Yes, for data sets of this size. For smaller data sets, in the 10^2 or 10^3 range, you might decide to skip this and accept "slower" queries since they'll probably be around 5ms - 20ms.

How does spotlight manage to index so much text and respond quickly? Are those queries can insensitive?

Spotlight supports insensitive queries. Spotlight also duplicates much of the material for its index. The importing process is not free. You're certainly welcome to put all your data into Spotlight and test its query performance for your purposes. Spotlight is fundamentally a full text search index. It's optimized heavily for prefix and word searching across very large numbers of documents. It's very good for its intended usage pattern, but that pattern is not the same as the usage of relational databases. The technologies are complementary.

I've done a bunch more performance tests. I don't understand some of the differences. I hadn't realized that coredata in the data model has a flag to index a field. When I did that for Coredata, it dramatically sped it up. Although adding an index to my sqlite test slowed it down.

Your test results are very odd, in that they are still 50x to 100x slower than I would expect. These tests have not yet applied the derived property optimization I suggested, correct ?

As literally noted here, none of those queries are eligible for an index, so I assume your actual code is doing something slightly different, or the timing difference are due to configuration issues. Hot I/O and cold I/O will be very different. For example:

search, without an index (cold I/O)
add an index (page everything into the UBC cache)
search, with an index, but a query that doesn't use the index (hot I/O)

The difference between hot and cold I/O for query performance will be huge. As much as 100x. You can use /usr/bin/purge to force everything to be cold I/O. Not necessary representative of real world performance, as the whole disk cache will be flushed, including material relevant to the system that would normally exist when you run your app.

Coredata, without an index, predicate "name like foo", 20 databases 400k records each. 2m17s. Coredata with an index, predicate "name like foo", 20 databases 1mil records each, 25 seconds. SQLite no index, sql "where name like foo%bar", 20 databases 1mil records each 11 seconds. SQLite index, sql "where name like foo%bar", 20 databases 1 mil records each 30 seconds.

Why the 20 database files ? These queries will be faster with a single database file and its unified index.

In any event, the derived property optimization should make all the prefix/equality queries subsecond. On my laptop, for 20 million records, that should be in the 500ms ballpark (hot I/O)

As you said, with an index, and testing for equality, both coredata and sqlite queries responded in <1 sec. SQLite without an index was 9 seconds for same query.

That sounds about right. It'll be faster if you use a single database with 20 million rows and a unified index table.

I can only presume SQLite slowed down with an index because the database file was physically much larger and took longer to read from disk.

I suspect its a flaw in the testing methodology, actually. Do you run the queries multiple times, throw out the best & worst and average the rest ? Are you running other apps at the same time you do performance tests ?

It is possible for the presence of an index to negatively impact a non- indexed query, but that would be a few %, not 3x slower An indexed query would read many fewer pages from disk than a non-indexed query, regardless of the file size. O(lg(N)) instead of O(N) and at 20 million that's a big difference. There are other possible issues in play, but you said the result set is small, so they seem unlikely.

- Ben

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Reply via email to