[appengine-java] Re: Full-text search engine and its indexes

Nichole Thu, 06 Oct 2011 00:28:06 -0700

By the way, if you choose a method closer to 'the most number of
entities' above, keep in mind that your key names should be scalable
too if you will be adding entities to a range.


The two parts needed for your binning are a parent key and your
entity key.
Your parent (ancestor) key defines the entity group, so its name could
be the name of your entity group. (That key should be used to limit
your searches to an entity group. If you need access to the group name
in your search query filters, you may want to store the entity group
name as a field in your 'index entities'.)
Your entity key could have a scalable name that is usable in an
order by clause (the order by is needed if you're using cursors, for
example).
A pattern like object000000001 for the entity name means you can keep
adding names in order, because the number is zero-padded to a fixed
width and so sorts correctly as a string.
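
A rough sketch of that naming idea in plain Java, with no datastore
calls. The class name, the 9-digit width, the "bin-..." group names and
the [p-z] range are assumptions made up for illustration; only the
object000000001 pattern and the [a-d] / [e-o] ranges come from this
thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class IndexKeyNames {
    // Assumed width: 9 digits, matching the object000000001 example.
    static final int WIDTH = 9;

    /** Builds a name such as "object000000001" from a prefix and a sequence number. */
    public static String entityName(String prefix, long sequence) {
        if (sequence < 0 || String.valueOf(sequence).length() > WIDTH) {
            throw new IllegalArgumentException("sequence out of range: " + sequence);
        }
        // Zero-padding keeps string order identical to numeric order.
        return prefix + String.format(Locale.ROOT, "%0" + WIDTH + "d", sequence);
    }

    /** Maps a keyword to a hypothetical entity group (bin) name by its first letter. */
    public static String binName(String keyword) {
        char c = Character.toLowerCase(keyword.charAt(0));
        if (c >= 'a' && c <= 'd') return "bin-a-d";
        if (c >= 'e' && c <= 'o') return "bin-e-o";
        if (c >= 'p' && c <= 'z') return "bin-p-z"; // assumed third range
        return "bin-other";                         // digits, punctuation, etc.
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (long n : new long[] {10, 2, 1}) {
            names.add(entityName("object", n));
        }
        Collections.sort(names); // plain string sort, as an order by on the name would do
        System.out.println(names);
        System.out.println(binName("cursor"));
    }
}
```

The bin name would be used as the parent key name and the padded name
as the entity key name. Without the padding, "object10" would sort
before "object2", which would break an order by on the key name.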


On Oct 5, 11:56 pm, Nichole <[email protected]> wrote:
> Not sure what your search server does, but presumably the user enters
> keywords
> and you search through your file based pre-prepared 'indexes' of file
> offset and keywords to return
> information to locate the word within the text?
>
> Since you don't have direct access to Bigtable, and don't want to
> reformat your data for it in any case because you already have code
> to read your current format, it sounds as if you'd want to place the
> indexes in a database to use the features of a distributed database.
> With App Engine, you can choose the number of active instances to
> take load off your problem set.
>
> For data modeling w/ appengine, you might need to empirically
> determine the largest number
> of entities that can be searched for a field within a reasonable
> amount of time
> (and that information would be useful to post back if you have it.)
> That number would be a size to use for your queries and hence the
> number of index entities
> that belong to one entity group.  Presumably, the finest level of
> partitioning here could use
> starts with [a-d], starts with [e-o], etc. for bin ranges (or bin
> ranges within bin ranges), for example.
>
> To partition your data in App Engine while keeping it searchable,
> you could consider the two extreme cases:
> The most number of entities:
>     Content of each file based index as a separate entity is ~1
> million indexes for your 400 books
> (presuming each index was 90B, and you have 90MB in 1 group of 400
> books). Then add
> "table" entities holding the bin ranges.  The advantage to this
> approach is that if the number
> of "index entities" will grow over time, you only need to change the
> tables (references) that
> point to the ranges to be used.  The queries would be simple, but
> presumably you would need
> many in order to complete one request.
>
> The fewest number of entities:
>    All indexes with primary value something through something would be
> aggregated into one large
> entity, and the same for the next bin range etc.  Then add "table"
> entities holding the bin ranges.
> The advantage to this is that you'd need very few queries.  The
> disadvantages would be many though!
> Any growth in your indexes may need all entities to be re-written.
> Returning a result returns far more
> than is needed...
>
> Presumably, partitioning using a method closer to the  most number of
> entities is closer to what
> you want.
>
> With respect to the user's query:
> Any progress of a user's search could be stored in their session.
> Searches of the exact same query
> that would be repeated using an offset in range can be replaced with a
> cursor to make it more
> efficient.  Features such as the background processing task queue or
> the already available task
> queue can continue a search asynchronously...
>
> On Oct 5, 8:35 am, "jacek.ambroziak" <[email protected]>
> wrote:
>
> > I have full-text indexed all of O'Reilly eBooks (1,600 of them) using
> > my own search engine. You can see how the search works if you have an
> > Android tablet and install (free app) eCarrel (O'Reilly bookstore). To
> > make searching manageable in the context of GAE I have originally
> > partitioned all the books into 4 groups, each with its own index. That
> > way searches can be performed in parallel (merging results when done),
> > individual (per book group) indexes are smaller, and w/i group search
> > is faster.
>
> > An index for 400 books is about 90 MB in size. To implement the search
> > engine on GAE
> > I would dedicate 4 applications to the task (eg. search1.appspot.com,
> > through search4....).
> > Each application would run exactly the same code, but would have
> > different "application files"
> > containing index data.  (I wasn't sure if the index data should be
> > stored in DataStore entities,
> > Blobstore blobs, or application files; at the time the SE was first
> > implemented it seemed
> > that application files was the only option even if it meant that they
> > had to be split into
> > 10 MB chunks (1.5.5 supposedly raises the limit to 32 MB but I got an
> > error attempting that)).
>
> > One problem with that approach is that multiple GAE applications are
> > used to implement
> > parallel "search servers." Another is that it takes time and resources
> > to read in the index
> > from application files into RAM before search results can be computed.
> > When an instance
> > is killed all this work goes to waste and will have to be repeated on
> > next search.
> > When the number of groups was too small and therefore indexes too big,
> > I was getting OutOfMemory errors just loading index data to RAM.
>
> > Do you guys think it is a good idea to use application files to store
> > index data?
>
> > Since each "search server" runs the same code (and only accesses
> > different application
> > files), can it be implemented via a single (versioned?) GAE
> > application? (I will run out of applications when adding more search
> > servers, and it will become more costly to run
> > the search engine).
>
> >http://ecarrel.com
