I'm not sure exactly what your search server does, but presumably the user enters keywords and you search your pre-prepared, file-based indexes of keywords and file offsets to locate each word within the text?
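For concreteness, here is a minimal sketch of the kind of index entry I'm assuming: a keyword mapped to the byte offsets where it occurs in a book file. The class and method names are hypothetical, not from your code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndex {
    // keyword -> byte offsets within the book file (assumed layout)
    private final Map<String, List<Long>> entries = new HashMap<>();

    public void add(String keyword, long offset) {
        entries.computeIfAbsent(keyword, k -> new ArrayList<>()).add(offset);
    }

    /** Returns the offsets for a keyword, or an empty list if absent. */
    public List<Long> lookup(String keyword) {
        return entries.getOrDefault(keyword, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("appengine", 1024L);
        idx.add("appengine", 4096L);
        idx.add("datastore", 2048L);
        System.out.println(idx.lookup("appengine")); // [1024, 4096]
    }
}
```

At ~90 B per entry, this is roughly the record size your 90 MB / 400-book index implies.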
Since you don't have direct access to Bigtable, and you don't want to reformat your data for it anyway (you already have code to read your current format), it sounds as if you'd want to place the indexes in the datastore to get the features of a distributed database. With App Engine, you can choose the number of active instances to spread the load from your problem set.

For data modeling on App Engine, you might need to empirically determine the largest number of entities that can be searched on a field within a reasonable amount of time (and that information would be useful to post back here if you have it). That number would set the size of your queries, and hence the number of index entities that belong to one entity group. The finest level of partitioning could use prefix bins such as starts-with [a-d], starts-with [e-o], etc. (or bin ranges within bin ranges).

To partition your data on App Engine while keeping it searchable, you could consider the two extreme cases.

The most entities: each file-based index entry becomes a separate entity — roughly 1 million entities for your 400 books (assuming each entry is about 90 B and you have 90 MB in one group of 400 books). Then add "table" entities holding the bin ranges. The advantage of this approach is that if the number of index entities grows over time, you only need to change the table entities (references) that point to the ranges in use. The queries would be simple, but you would presumably need many of them to complete one request.

The fewest entities: all index entries whose primary value falls within one bin range are aggregated into one large entity, and likewise for the next bin range, etc. Again, add "table" entities holding the bin ranges. The advantage is that you'd need very few queries. The disadvantages are many, though: any growth in your indexes may force all entities to be rewritten, and returning a result returns far more data than is needed...
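To illustrate the prefix-binning idea, here is a sketch with hard-coded bin boundaries. The ranges and names are purely illustrative; in practice you would read the boundaries from the "table" entities so they can be re-split as the index grows.

```java
public class KeywordBins {
    // Illustrative first-letter bin boundaries: [a-d], [e-o], [p-z].
    // In a real deployment these would come from "table" entities in
    // the datastore, not constants.
    private static final char[] UPPER_BOUNDS = {'d', 'o', 'z'};

    /** Returns the index of the bin a keyword's entities would live in. */
    public static int binFor(String keyword) {
        char first = Character.toLowerCase(keyword.charAt(0));
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (first <= UPPER_BOUNDS[i]) {
                return i;
            }
        }
        return UPPER_BOUNDS.length - 1; // fallback for characters above 'z'
    }

    public static void main(String[] args) {
        System.out.println(binFor("appengine")); // 0 ([a-d])
        System.out.println(binFor("index"));     // 1 ([e-o])
        System.out.println(binFor("search"));    // 2 ([p-z])
    }
}
```

A search request would map each query keyword to its bin, then query only the entities (or the single aggregate entity) belonging to that bin.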
Presumably a partitioning scheme closer to the most-entities extreme is closer to what you want.

With respect to the user's query: any progress on a user's search could be stored in their session. Repeated searches of the exact same query that would otherwise use an offset into the result range can instead use a cursor, which is more efficient. Features such as background processing or the already available task queue can continue a search asynchronously...

On Oct 5, 8:35 am, "jacek.ambroziak" <jacek.ambroz...@gmail.com> wrote:
> I have full-text indexed all of O'Reilly eBooks (1,600 of them) using
> my own search engine. You can see how the search works if you have an
> Android tablet and install (free app) eCarrel (O'Reilly bookstore). To
> make searching manageable in the context of GAE I have originally
> partitioned all the books into 4 groups, each with its own index. That
> way searches can be performed in parallel (merging results when done),
> individual (per book group) indexes are smaller, and w/i group search
> is faster.
>
> An index for 400 books is about 90 MB in size. To implement the search
> engine on GAE I would dedicate 4 applications to the task
> (eg. search1.appspot.com, through search4....).
> Each application would run exactly the same code, but would have
> different "application files" containing index data. (I wasn't sure if
> the index data should be stored in DataStore entities, Blobstore blobs,
> or application files; at the time the SE was first implemented it seemed
> that application files was the only option even if it meant that they
> had to be split into 10 MB chunks (1.5.5 supposedly raises the limit to
> 32 MB but I got an error attempting that)).
>
> One problem with that approach is that multiple GAE applications are
> used to implement parallel "search servers." Another is that it takes
> time and resources to read in the index from application files into RAM
> before search results can be computed.
> When an instance is killed all this work goes to waste and will have
> to be repeated on next search. When the number of groups was too small
> and therefore indexes too big, I was getting OutOfMemory errors just
> loading index data to RAM.
>
> Do you guys think it is a good idea to use application files to store
> index data?
>
> Since each "search server" runs the same code (and only accesses
> different application files), can it be implemented via a single
> (versioned?) GAE application? (I will run out of applications when
> adding more search servers, and it will become more costly to run
> the search engine).
>
> http://ecarrel.com

--
You received this message because you are subscribed to the Google Groups "Google App Engine for Java" group.