On Oct 6, 2009, at 8:29 PM, David Melgar wrote:

Hello,
Thanks for the response. It seems that it's straying somewhat from my original question.

Sure, your original question is that you have a serious performance issue, and you'd like to hide it from the user by adding threads. I'm proposing fixing the performance issue instead, and not bothering with the additional complexity of threads, at least until you have 100 million rows or so.

For the 1.4 million row db I have handy, the indexed == query runs over 100x faster than the LIKE query. == returns 4 rows out of 1.4M in 4ms, and LIKE returns 4 rows in 450ms. So, on my 2007 Mac Pro, your 10 million row database would run its query in less than 100ms. That's too fast for meaningful human perception. Do we really need to add threads for this? The code to incrementally and asynchronously display the results will probably take longer to write than Just Do It.

Searching based on prefix matching is fine. The predicate I'm using really is of the form "SELF like foo", no wildcard, so it doesn't seem that it should be that expensive.

Locale-aware Unicode regex is very expensive. Unicode is the worst possible text encoding system ever conceived, except for the others. Core Data insulates you from this so that your searches behave the way OS X customers around the world expect. You're welcome to learn all about Unicode and ICU, and work with it directly in SQLite if you prefer. It'll take a lot of code to make searching and sorting work for every locale.

You say it's possible to structure this to use a binary index. How? I don't see any mention of indices in the Core Data documentation.

See the DerivedProperty example on the ADC web site that I've referenced repeatedly. However, if you're not using any wildcards and your search is case-sensitive, then you might as well just use == and be done. Be sure to add an index to the attribute in your model.
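For reference, a minimal sketch of that kind of fetch (the entity name "Record", the attribute "name", and the variables moc and searchText are placeholders here; "name" would have the Indexed box checked in the model editor):

NSFetchRequest *request = [[NSFetchRequest alloc] init];
[request setEntity:[NSEntityDescription entityForName:@"Record"
                               inManagedObjectContext:moc]];
// Exact, case-sensitive match against an indexed attribute; no regex involved.
[request setPredicate:[NSPredicate predicateWithFormat:@"name == %@", searchText]];

NSError *error = nil;
NSArray *results = [moc executeFetchRequest:request error:&error];
[request release];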

If I use SQLite directly, presumably I can set indices on the fields I want and more closely manage the data model.

You would presume incorrectly. Generally, LIKE queries are not eligible for indices. There are some special circumstances where they can be, but that won't work with Unicode. You're welcome to verify that for yourself.

I don't see how setBatchFetchSize helps. Doesn't it just limit the number of results returned?

No. It's closer to an in-memory cursor. It requires the entire WHERE clause to execute, which unfortunately is your primary problem, but it will not restart the query as you stream through the results.
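For example (a sketch; the entity, attribute, and variable names are placeholders):

NSFetchRequest *request = [[NSFetchRequest alloc] init];
[request setEntity:[NSEntityDescription entityForName:@"Record"
                               inManagedObjectContext:moc]];
[request setPredicate:[NSPredicate predicateWithFormat:@"name like %@", searchText]];
[request setFetchBatchSize:50];   // the WHERE clause still runs once, in full, but row data
                                  // is faulted in from SQLite 50 rows at a time as you iterate

NSError *error = nil;
NSArray *results = [moc executeFetchRequest:request error:&error];
[request release];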

I have no idea how quickly the results will come in. Setting a size >1 is therefore indeterminate and may take the full 3 minutes. If I set it to one, and I want to try and get the second row as well, it appears that it starts the query all over again, worst case resulting in 6 minutes before the 2nd result shows up. Doesn't seem that it scales reasonably if I want to display the first 10-20 entries.

No, -setFetchBatchSize: does not restart the query. That's what using fetchOffset does (in the database itself, not Core Data, which is why we wrote fetchBatchSize ourselves).

My issue with Core Data is that NSFetchRequest always returns ALL the results of a particular query at one time. If I use SQLite directly... assuming it supports cursors, I can get each result one at a time as it shows up and display it to the user without slowing down the query as it continues to find other results.

If you try using -com.apple.CoreData.SQLDebug you will see both the SQL we pass to SQLite and some performance annotations like:

2009-10-07 17:52:15.107 Address Book[13949:5403] CoreData: annotation: sql connection fetch time: 0.0013s
2009-10-07 17:52:15.108 Address Book[13949:5403] CoreData: annotation: total fetch execution time: 0.0020s for 14 rows.

The first line is how much time was spent in SQLite. If you run this with your text queries, you'll see most of your time spent there. Switching to use SQLite directly is not going to change that. Again, you should verify that for yourself.
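To turn that on, pass the flag as a launch argument, for example from Terminal (assuming your app is called MyApp; the name here is a placeholder):

    ./MyApp.app/Contents/MacOS/MyApp -com.apple.CoreData.SQLDebug 1

or add "-com.apple.CoreData.SQLDebug 1" to the executable's launch arguments in Xcode.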

NSFetchRequest could support a delegate, invoking some method for each item as it is found, rather than blocking until all the results are received. It also could have been implemented as a virtual queue: an object that could be read from while being written to on another thread.

That would make an excellent feature request. Please file it with bugreport.apple.com.

But if you take my advice and make the query run in 1.8s instead of 180s, how important is this to you?

- Ben

On Oct 6, 2009, at 4:08 AM, Ben Trumbull wrote:


On Oct 5, 2009, at 7:00 PM, enki1...@gmail.com wrote:

I am doing a simple query search for a text string pattern (i.e. 'SELF like foo') on ~10 million small records stored persistently using SQLite. This is a performance test to make sure I get reasonable performance from my database engine before I commit too much code to it.

Well, @"self like 'foo'" is a different problem than @"self like '*foo*'". LIKE queries require Unicode compliant regex and are intrinsically expensive. If you do not have a wildcard, you are better off use an == query. The DerivedProperty ADC example shows how to transform the text to make it much faster to search.

If you do need to use a wildcard, you'll really want to stick with one of the two anchored forms: either prefix matching or suffix matching. The DerivedProperty example shows prefix matching. It's possible to structure this to use a binary index and make the query extremely fast even for millions of records. There is a huge difference in computational complexity: prefix matching can use an index, and therefore can run in O(lg(N)).
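One way to get an index-eligible prefix search, in the spirit of the DerivedProperty example (a sketch only; it assumes you maintain a separate indexed attribute, called normalizedName here, holding a lowercased copy of the text, and it glosses over normalization details and edge cases like an empty search string):

NSString *prefix = [searchText lowercaseString];   // fold the same way normalizedName was folded
unichar last = [prefix characterAtIndex:[prefix length] - 1];
NSString *upperBound = [[prefix substringToIndex:[prefix length] - 1]
                           stringByAppendingFormat:@"%C", (unichar)(last + 1)];

// Two binary comparisons instead of a regex, so the index on normalizedName can be used.
NSPredicate *predicate = [NSPredicate predicateWithFormat:
    @"normalizedName >= %@ AND normalizedName < %@", prefix, upperBound];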

*foo* (contains) searches are slow, and cannot use an index. You really want to avoid these. Even Spotlight does not do arbitrary substring matching. Compare "help" with "elp" in your Spotlight results. If you want word matching, you can use Spotlight or SearchKit to build a supplemental FTS index.

The query is taking over 3 minutes with a small result set. This is on a new 13" MacBook Pro with 4 GB of memory.

... a full table scan executing a regex on each of 10 million rows on a 5400 rpm drive? Well, for doing all that, 3 minutes sounds pretty fast.

Just as a reference point, if you grab the objectIDs from the result set, and execute an IN query selecting those objects, how long does it take? 50ms? 100ms?
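Something along these lines (a sketch; it assumes slowResults is the array the LIKE fetch returned and that request targets the same entity):

NSArray *objectIDs = [slowResults valueForKey:@"objectID"];
[request setPredicate:[NSPredicate predicateWithFormat:@"self IN %@", objectIDs]];

NSDate *start = [NSDate date];
NSError *error = nil;
NSArray *fastResults = [moc executeFetchRequest:request error:&error];
NSLog(@"IN fetch of %lu rows took %.3f seconds",
      (unsigned long)[fastResults count], -[start timeIntervalSinceNow]);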

The query is taking too long for a user to sit and wait for it. Is there a way to speed it up? Can indexing be applied to it?

I had thought that if I could display results as they are found, that might be reasonable. In my tests, if I use setFetchBatchSize and setFetchOffset to restart it, then it ends up repeating the query, taking that many times longer to get a result. Not reasonable. It does not seem to start the query where it left off, as a database cursor would.

You can use -com.apple.CoreData.SQLDebug 1 to see the SQL we pass to the database. This behavior also has nothing to do with Core Data; it is simply how offset queries behave. I realize it's not what you expected, which is why I recommended using -setFetchBatchSize: instead.

My impression is that my usage scenario is not an appropriate use of Core Data.

Core Data is just passing the query off to the database. I'm not sure why you think going to the database directly will do anything for the 179.9 / 180.0 seconds it takes to evaluate the query in the database.

I was planning to try SQLite directly. Would it be more appropriate?

You can try it directly, but it won't have any meaningful effect on your performance results, except that SQLite's built-in LIKE operator doesn't support Unicode. It'll be a tiny bit faster for that, but still the same order of magnitude. And then either you'll have to integrate ICU support as Core Data does, in which case it'll be exactly the same, or you'll be stuck with ASCII.

Regardless, you'll need to make your searches eligible for an index. The DerivedProperty example shows how to do that.

- Ben


Thanks

On Oct 5, 2009 7:14pm, Ben Trumbull <trumb...@apple.com> wrote:
> Is there a way to do an asynchronous fetch request against Core Data
> returning partial results?
>
> That depends on whether it's the query part that's expensive (e.g. WHERE clause with complex text searching and table scans) or simply the quantity of the row data that's your problem. For the latter, you can just use -setFetchBatchSize: and be done.
>
>
> You can use a separate MOC on a background thread to perform asynchronous work. You can then pass results over to the main thread to display to the user. However, unless your search terms are very expensive, it's usually easier and faster to use -setFetchBatchSize: synchronously. For well-indexed queries, it can handle a million or two rows per second. I'm not sure why you'd subject your users to that kind of experience, though. It's common to use fetch limits, count requests, and only show the top N results. What's your user going to do with a hundred thousand results anyway?
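> A sketch of the limit-plus-count pattern (entity, attribute, and variable names are placeholders):
>
> NSError *error = nil;
> NSUInteger total = [moc countForFetchRequest:request error:&error];
>
> [request setSortDescriptors:[NSArray arrayWithObject:
>     [[[NSSortDescriptor alloc] initWithKey:@"name" ascending:YES] autorelease]]];
> [request setFetchLimit:20];
> NSArray *topMatches = [moc executeFetchRequest:request error:&error];
> // show topMatches, plus something like "20 of <total> matches" in the UI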
>
>
> If you need to attack the computational expense of your query terms, that's more complicated. Obviously it would be best to optimize the queries and ensure they are using an index. But if that's not enough, you can execute the queries in a background MOC, fetching objectIDs + row data (put in the row cache), and then have the other MOC materialize the objects by ID from the row cache. There's a BackgroundFetching example in /Developer/Examples/CoreData that shows how to do this. Returning partial results incrementally would require some creativity on your part to subdivide the query into several. Since most expensive queries are text searches, it's usually possible to subdivide the result set naturally, such as by the first letter of 'title', similar to the thumb bar index on the side of the Contacts app on the iPhone.
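> A rough sketch of that hand-off (psc is the shared NSPersistentStoreCoordinator, mainMOC is the main-thread context, and the other names are placeholders):
>
> // --- background thread ---
> NSManagedObjectContext *bgMOC = [[NSManagedObjectContext alloc] init];
> [bgMOC setPersistentStoreCoordinator:psc];
>
> NSFetchRequest *request = [[NSFetchRequest alloc] init];
> [request setEntity:[NSEntityDescription entityForName:@"Record"
>                                inManagedObjectContext:bgMOC]];
> [request setPredicate:expensivePredicate];
> [request setResultType:NSManagedObjectIDResultType];
> [request setIncludesPropertyValues:YES];   // pull the row data into the row cache too
>
> NSError *error = nil;
> NSArray *objectIDs = [bgMOC executeFetchRequest:request error:&error];
> [request release];
> [bgMOC release];
>
> // objectIDs are thread-safe; hand them to the main thread, then:
>
> // --- main thread ---
> for (NSManagedObjectID *objectID in objectIDs) {
>     NSManagedObject *object = [mainMOC objectWithID:objectID];   // a fault, filled from the row cache
>     // ... display object ...
> }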
>
>
> There's also a DerivedProperty example on ADC for optimizing text queries.
>
>
> Obviously, Apple's own Spotlight could not use something like
> Core Data, since it heavily relies on returning asynchronous partial
> results.
>
> Which is neither here nor there. Most Cocoa applications wouldn't want Spotlight to be the sole persistence back end of their data. The latency of putting all your data in a full text index instead of a relational database or keyed archive would be pretty absurd. Now, if you're writing an app that's primarily structured around full text searching, you might instead prefer to focus on putting your data in Spotlight via small files, and using the Spotlight APIs. But it's not suitable for apps interested in an OOP view of their data.
>
>
> Frankly, this is the second application in which I've attempted to use Core Data,
> only to find it come up surprisingly short. The first time, the issue was
> Core Data not being thread-safe.
>
> Core Data can be used efficiently with multiple threads. It might help to think of each MOC as a separate writable view. If you'd like to know more, you can search the archives for my posts.
>
>
> What is the target market for Core Data? What sort of application is
> ideal for its use? What size data store? Right now it escapes me.
>
>
> Cocoa and Cocoa Touch applications, particularly ones done in an MVC style with an OO perspective on their data. Some people also use it as a persistent cache for data stored in another canonical format, such as XML files. On the Mac side, we've had customers with 3+ million row (multi-GB) databases, and on the embedded side, roughly 400,000 rows (100s of MB). However, it does take some care and feeding to handle data sets like that, and most developers find it straightforward up to about 10% of those numbers.
>
>
>
> It sounds like you're having performance issues. What kinds of queries are you trying to accomplish? How much data are you working with? How have you modeled your primary entities?
>
>
> You can fetch back just NSManagedObjectIDs, with -setIncludesPropertyValues: set to NO, to effectively create your own cursor if you prefer.
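> For instance (a sketch; the entity name is a placeholder):
>
> NSFetchRequest *request = [[NSFetchRequest alloc] init];
> [request setEntity:[NSEntityDescription entityForName:@"Record"
>                                inManagedObjectContext:moc]];
> [request setResultType:NSManagedObjectIDResultType];
> [request setIncludesPropertyValues:NO];   // just the IDs, no row data pulled back
>
> NSError *error = nil;
> NSArray *objectIDs = [moc executeFetchRequest:request error:&error];
> // later, fault in or re-fetch only the handful of objects you actually need to display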
>
>
> - Ben