I'm not sure exactly what your search server does, but presumably the user enters keywords and you search through your file-based, pre-prepared 'indexes' of keywords and file offsets to return information locating each word within the text?

Since you don't have direct access to Bigtable, and you don't want to reformat your data in any case (you already have code that reads your current format), it sounds as if you'd want to place the indexes in the datastore to get the features of a distributed database. With App Engine you can also choose the number of active instances, which takes load management off your problem set.

For data modeling with App Engine, you might need to determine empirically the largest number of entities that can be searched on a field within a reasonable amount of time (and that figure would be useful to post back here if you get it). That number would set the size to use for your queries, and hence the number of index entities belonging to one entity group. Presumably the finest level of partitioning here could use bin ranges such as starts-with [a-d], starts-with [e-o], etc. (or bin ranges within bin ranges).
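
If you went with bin-range "table" entities like that, the lookup itself is cheap. A minimal sketch of what I mean (my assumption of the scheme, with hypothetical names, and a plain sorted array standing in for the table entities):

```java
import java.util.*;

// Sketch only: a "table" of bin ranges maps a keyword to the partition
// (entity group) that holds its index entities. Bin i covers keywords
// from starts[i] (inclusive) up to starts[i+1] (exclusive).
public class BinRanges {
    private final String[] starts;    // sorted lower bounds of each bin
    private final String[] binNames;  // name of the partition for each bin

    public BinRanges(String[] starts, String[] binNames) {
        this.starts = starts;
        this.binNames = binNames;
    }

    // Binary-search the sorted bin starts for the last start <= keyword.
    public String binFor(String keyword) {
        int i = Arrays.binarySearch(starts, keyword);
        if (i < 0) i = -i - 2;  // not found: insertion point minus one
        if (i < 0) i = 0;       // clamp keywords sorting below the first bin
        return binNames[i];
    }

    public static void main(String[] args) {
        BinRanges table = new BinRanges(
            new String[] {"a", "e", "p"},
            new String[] {"bin-a-d", "bin-e-o", "bin-p-z"});
        System.out.println(table.binFor("datastore")); // bin-a-d
        System.out.println(table.binFor("index"));     // bin-e-o
        System.out.println(table.binFor("query"));     // bin-p-z
    }
}
```

The nice property is that re-binning only means rewriting this small table, not the index entities themselves.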

To partition your data in App Engine while keeping it searchable, you could consider the two extreme cases.
The largest number of entities: storing the content of each file-based index entry as a separate entity gives roughly 1 million index entities for your 400 books (assuming each entry is about 90 bytes and you have 90 MB in one group of 400 books). Then add "table" entities holding the bin ranges. The advantage of this approach is that if the number of index entities grows over time, you only need to change the table entities (the references) that point to the ranges in use. The queries would be simple, but presumably you would need many of them to complete one request.
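
As a rough sketch of that extreme (hypothetical names, with an in-memory map standing in for datastore queries): each keyword's postings live in a small entity of their own, and one request fans out into many simple lookups whose results get merged.

```java
import java.util.*;

// Sketch of the "many small entities" extreme: one small index entity per
// keyword, so a multi-keyword request becomes several simple lookups plus
// a merge. The Map is a stand-in for per-keyword datastore queries.
public class FanOutSearch {
    // keyword -> postings ("bookId:offset"), one small entity per keyword
    static final Map<String, List<String>> INDEX = new HashMap<>();
    static {
        INDEX.put("engine", Arrays.asList("book12:4096", "book77:300"));
        INDEX.put("search", Arrays.asList("book12:1024"));
    }

    // One simple "query" per keyword; merge the postings for the request.
    static List<String> search(List<String> keywords) {
        List<String> merged = new ArrayList<>();
        for (String kw : keywords) {
            merged.addAll(INDEX.getOrDefault(kw, Collections.emptyList()));
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(search(Arrays.asList("search", "engine")));
    }
}
```

Each lookup is trivial, but a request with several keywords spanning several partitions multiplies the query count.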

The fewest number of entities: all indexes with a primary value falling in one bin range would be aggregated into a single large entity, and likewise for the next bin range, and so on. Then add "table" entities holding the bin ranges. The advantage of this is that you'd need very few queries. The disadvantages would be many, though: any growth in your indexes may require all entities to be rewritten, and returning a result fetches far more data than is needed...

Presumably a partitioning scheme closer to the largest-number-of-entities extreme is closer to what you want.

With respect to the user's query: any progress of a user's search could be stored in their session. Repeated searches of the exact same query that would otherwise re-fetch using an offset into the result range can instead resume from a cursor, which is more efficient. Features such as background processing (backends) or the already available task queue can continue a search asynchronously...
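
To illustrate the cursor-versus-offset point (a toy sketch, not the datastore API; the list stands in for query results, and the "cursor" is just a resume position you'd keep in the session):

```java
import java.util.*;

// Sketch: an offset-based fetch re-scans and discards the skipped rows on
// every repeat of the query, while a cursor resumes exactly where the
// previous page ended.
public class CursorSketch {
    static int scanned = 0;  // counts result rows touched, to compare costs

    static List<String> fetchWithOffset(List<String> results, int offset, int limit) {
        List<String> page = new ArrayList<>();
        for (int i = 0; i < Math.min(offset + limit, results.size()); i++) {
            scanned++;                       // offset rows are scanned, then thrown away
            if (i >= offset) page.add(results.get(i));
        }
        return page;
    }

    // "Cursor" = position saved from the previous page, resumed directly.
    static List<String> fetchWithCursor(List<String> results, int cursor, int limit) {
        List<String> page = new ArrayList<>();
        for (int i = cursor; i < Math.min(cursor + limit, results.size()); i++) {
            scanned++;
            page.add(results.get(i));
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) results.add("hit" + i);

        scanned = 0;
        fetchWithOffset(results, 50, 10);
        System.out.println("offset scanned " + scanned);   // 60 rows touched

        scanned = 0;
        fetchWithCursor(results, 50, 10);
        System.out.println("cursor scanned " + scanned);   // 10 rows touched
    }
}
```

Both return the same page; the offset version just pays for the 50 skipped rows again on every repeat.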

On Oct 5, 8:35 am, "jacek.ambroziak" <jacek.ambroz...@gmail.com>
wrote:
> I have full-text indexed all of O'Reilly eBooks (1,600 of them) using
> my own search engine. You can see how the search works if you have an
> Android tablet and install (free app) eCarrel (O'Reilly bookstore). To
> make searching manageable in the context of GAE I have originally
> partitioned all the books into 4 groups, each with its own index. That
> way searches can be performed in parallel (merging results when done),
> individual (per book group) indexes are smaller, and w/i group search
> is faster.
>
> An index for 400 books is about 90 MB in size. To implement the search
> engine on GAE
> I would dedicate 4 applications to the task (eg. search1.appspot.com,
> through search4....).
> Each application would run exactly the same code, but would have
> different "application files"
> containing index data.  (I wasn't sure if the index data should be
> stored in DataStore entities,
> Blobstore blobs, or application files; at the time the SE was first
> implemented it seemed
> that application files was the only option even if it meant that they
> had to be split into
> 10 MB chunks (1.5.5 supposedly raises the limit to 32 MB but I got an
> error attempting that)).
>
> One problem with that approach is that multiple GAE applications are
> used to implement
> parallel "search servers." Another is that it takes time and resources
> to read in the index
> from application files into RAM before search results can be computed.
> When an instance
> is killed all this work goes to waste and will have to be repeated on
> next search.
> When the number of groups was too small and therefore indexes too big,
> I was getting OutOfMemory errors just loading index data to RAM.
>
> Do you guys think it is a good idea to use application files to store
> index data?
>
> Since each "search server" runs the same code (and only accesses
> different application
> files), can it be implemented via a single (versioned?) GAE
> application? (I will run out of applications when adding more search
> servers, and it will become more costly to run
> the search engine).
>
> http://ecarrel.com

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine for Java" group.
To post to this group, send email to google-appengine-java@googlegroups.com.
To unsubscribe from this group, send email to 
google-appengine-java+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/google-appengine-java?hl=en.
