Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-18 Thread Keith Jenkins
Thanks for sharing that, Doug.  It's not mentioned at all in the
Developer's Guide ("contains everything you need to know"):
http://code.google.com/appengine/docs/

I'll have to take a closer look at the src docs...

Keith


On Thu, Jul 17, 2008 at 3:01 PM, Doug Chestnut [EMAIL PROTECTED] wrote:
 There is a Searchable Entity with GAE.  Refer to the src for docs.  It is
 fairly straight forward, it takes the text of the properties, removes stop
 words, creates a new list property that contains the words.  An index on
 this property allows fast retrieval.  It is fairly limited, you don't want
 list properties to get too large (this is what the devs told me at google
 io).  From the docs:

 Don't expect too much. First, there's no ranking, which is a killer
 drawback.
 There's also no exact phrase match, substring match, boolean operators,
 stemming, or other common full text search features. Finally, support for
 stop
 words (common words that are not indexed) is currently limited to English.

 I have been playing with reverse indexes in GAE with some potential success
 for search and faceted browse.

 --Doug



Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-18 Thread Doug Chestnut
For an example, check out the bulk loader article (it includes a simple
program that searches the datastore by keyword using the SearchableEntity):
http://code.google.com/appengine/articles/bulkload.html

The last time I tried the bulkloader it had problems with utf8.

--Doug

On Fri, Jul 18, 2008 at 9:39 AM, Keith Jenkins [EMAIL PROTECTED] wrote:

 Thanks for sharing that, Doug.  It's not mentioned at all in the
 Developer's Guide (contains everything you need to know):
 http://code.google.com/appengine/docs/

 I'll have to take a closer look at the src docs...

 Keith


 On Thu, Jul 17, 2008 at 3:01 PM, Doug Chestnut [EMAIL PROTECTED] wrote:
  There is a Searchable Entity with GAE.  Refer to the src for docs.  It is
  fairly straight forward, it takes the text of the properties, removes stop
  words, creates a new list property that contains the words.  An index on
  this property allows fast retrieval.  It is fairly limited, you don't want
  list properties to get too large (this is what the devs told me at google
  io).  From the docs:
 
  Don't expect too much. First, there's no ranking, which is a killer
  drawback. There's also no exact phrase match, substring match, boolean
  operators, stemming, or other common full text search features. Finally,
  support for stop words (common words that are not indexed) is currently
  limited to English.
 
  I have been playing with reverse indexes in GAE with some potential success
  for search and faceted browse.
 
  --Doug
 



Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-17 Thread Doug Chestnut
On Wed, Jul 16, 2008 at 6:29 AM, Keith Jenkins [EMAIL PROTECTED] wrote:

 [...]

 So it's a bit of a hack just to get a left-anchored search.  Querying
 for a particular keyword anywhere within a string value would be even
 more work.  For small datasets, I guess you could iterate through
 every record.  But for anything larger, you'd probably want to figure
 out a way to manually build an index within the Google datastore, or
 else keep the indexing outside GAE, and just use GAE for fetching
 specified records.  Any ideas on how that might work?

 Keith


Hi Keith,
There is a SearchableEntity in GAE.  Refer to the src for docs.  It is
fairly straightforward: it takes the text of the properties, removes stop
words, and creates a new list property that contains the words.  An index
on this property allows fast retrieval.  It is fairly limited, and you
don't want list properties to get too large (this is what the devs told me
at Google I/O).  From the docs:

"Don't expect too much. First, there's no ranking, which is a killer
drawback. There's also no exact phrase match, substring match, boolean
operators, stemming, or other common full text search features. Finally,
support for stop words (common words that are not indexed) is currently
limited to English."

I have been playing with reverse indexes in GAE with some potential success
for search and faceted browse.
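
To make the keyword-list idea above concrete, here is a minimal sketch
in plain Python. It is not the SearchableEntity source and does not use
the GAE datastore; the stop-word list, the function names, and the
sample records are all assumptions made for illustration. It extracts
keywords from text, drops stop words, and builds a reverse (inverted)
index from keyword to record ids. As in the quoted docs, there is no
ranking, phrase matching, or stemming.

```python
import re
from collections import defaultdict

# Illustrative stop-word list only; not the list SearchableEntity uses.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is"}


def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(set(words) - STOP_WORDS)


def build_index(records):
    """Map each keyword to the set of record ids whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in keywords(text):
            index[word].add(record_id)
    return index


def search(index, query):
    """Return ids of records that contain every keyword in the query."""
    terms = keywords(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample records (id -> text).
records = {
    1: "The History of the Book in America",
    2: "A History of Reading",
    3: "Cataloging and Classification: An Introduction",
}
index = build_index(records)
```

With these sample records, search(index, "history") finds records 1 and
2, while a query made up only of stop words finds nothing.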

--Doug


Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-16 Thread Keith Jenkins
On Wed, Jul 16, 2008 at 12:21 AM, Godmar Back [EMAIL PROTECTED] wrote:
 Aside from the limitations imposed by the index model, the problem
 then is fundamentally similar to how you index MARC data for use in
 any discovery system.

I think Godmar is referring to GAE's lack of keyword searching.  To
elaborate, the following is from
http://code.google.com/appengine/docs/datastore/queriesandindexes.html


Tip: Query filters do not have an explicit way to match just part of a
string value, but you can fake a prefix match using inequality
filters:

db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2",
            "abc", "abc" + "\xEF\xBF\xBD")

This matches every MyModel entity with a string property prop that
begins with the characters "abc". The byte string "\xEF\xBF\xBD"
represents the largest possible Unicode character. When the property
values are sorted in an index, the values that fall in this range are
all of the values that begin with the given prefix.
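
The range trick in the tip can be tried out without the datastore. The
following is a minimal sketch in plain Python, assuming a sorted list
stands in for the property index; prefix_range and the sample titles
are hypothetical names and data. It uses U+FFFD, the character encoded
by the UTF-8 bytes EF BF BD from the tip, as the upper sentinel. Note
that values whose next character sorts above U+FFFD would fall outside
the range.

```python
from bisect import bisect_left

# U+FFFD is the character encoded by the UTF-8 bytes EF BF BD in the tip.
UPPER_SENTINEL = "\ufffd"


def prefix_range(sorted_values, prefix):
    """Return the values v with prefix <= v < prefix + sentinel.

    This mirrors the two inequality filters in the tip, applied to a
    sorted list instead of a datastore index.
    """
    lo = bisect_left(sorted_values, prefix)
    hi = bisect_left(sorted_values, prefix + UPPER_SENTINEL)
    return sorted_values[lo:hi]


# Hypothetical sample values, sorted the way an index would hold them.
titles = sorted(["abacus", "abc", "abcdef", "abc of reading", "abd", "zebra"])
matches = prefix_range(titles, "abc")
```

Because the values are sorted, everything starting with the prefix sits
in one contiguous slice, which is why two inequality filters suffice.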


So it's a bit of a hack just to get a left-anchored search.  Querying
for a particular keyword anywhere within a string value would be even
more work.  For small datasets, I guess you could iterate through
every record.  But for anything larger, you'd probably want to figure
out a way to manually build an index within the Google datastore, or
else keep the indexing outside GAE, and just use GAE for fetching
specified records.  Any ideas on how that might work?

Keith


Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-16 Thread Mark A. Matienzo
On Wed, Jul 16, 2008 at 12:21 AM, Godmar Back [EMAIL PROTECTED] wrote:
 Aside from the limitations imposed by the index model, the problem
 then is fundamentally similar to how you index MARC data for use in
 any discovery system.  Presumably, you could learn from the
 experiences of the many projects that have done that - some in Python,
 such as http://code.google.com/p/fac-back-opac/  (though they use
 Django, they don't appear to be using its object-relational db model
 for MARC records; I say this from a 2 min examination of parts of
 their code; I may be wrong. PyMarc itself doesn't support it.)

Fac-Back-OPAC doesn't use the Django ORM because we're using Solr for indexing.

Mark


Re: [CODE4LIB] a brief summary of the Google App Engine

2008-07-15 Thread Godmar Back
On Tue, Jul 15, 2008 at 2:16 PM, Fernando Gomez [EMAIL PROTECTED] wrote:

 Any thoughts about a convenient way of storing and (more importantly)
 indexing & retrieving MARC records using GAE's Bigtable?


GAE's datastore API resembles Django's object-relational model. You
define a Python class that inherits from db.Model and declare the
properties of your model; instances can then be created, stored,
retrieved, and updated. GAE performs automatic indexing on some fields,
and you can tell it to index others, or certain combinations of fields.

Aside from the limitations imposed by the index model, the problem
then is fundamentally similar to how you index MARC data for use in
any discovery system.  Presumably, you could learn from the
experiences of the many projects that have done that - some in Python,
such as http://code.google.com/p/fac-back-opac/  (though they use
Django, they don't appear to be using its object-relational db model
for MARC records; I say this from a 2 min examination of parts of
their code; I may be wrong. PyMarc itself doesn't support it.)

 - Godmar


[CODE4LIB] a brief summary of the Google App Engine

2008-07-13 Thread Godmar Back
Hi,

Since I brought up the issue of the Google App Engine (GAE) (or
similar services, such as Amazon's Elastic Compute Cloud, EC2), I
thought I'd give a brief overview of what it can and cannot do, so
that we may judge its potential use for library services.

GAE is a cloud infrastructure into which developers can upload
applications. These applications are replicated among Google's network
of data centers and they have access to its computational resources.
Each application has access to a certain amount of resources at no
fee; Google recently announced the pricing for applications whose
resource use exceeds the no-fee threshold [1]. The no-fee threshold
is rather substantial: 500MB of persistent storage, and, according to
Google, enough bandwidth and cycles to serve about 5 million page
views per month.

GAE applications must be written in Python. They run in a sandboxed
environment. This environment limits what applications can do and how
they communicate with the outside world.  Overall, the sandbox is very
flexible - in particular, application developers have the option of
uploading additional Python libraries of their choice with their
application. The restrictions lie primarily in security and resource
management. For instance, you cannot use arbitrary socket connections
(all outside world communication must go through GAE's fetch service,
which supports HTTP/HTTPS only), you cannot fork processes or threads
(which would use up CPU cycles), and you cannot write to the
filesystem (instead, you must store all of your persistent data in
Google's scalable datastore, which is built on BigTable).

All resource usage (CPU, Bandwidth, Persistent Storage - though not
memory) is accounted for and you can see your use in the application's
dashboard control panel. Resources are replenished on the fly where
possible, as in the case of CPU and Bandwidth. Developers are
currently restricted to 3 applications per account. Making
applications in multiple accounts work in tandem to work around quota
limitations is against Google's terms of use.

Applications are described by a configuration file that maps URI paths
to scripts in a manner similar to how you would use Apache
mod_rewrite.  URIs can also be mapped to explicitly named static
resources such as images. Static resources are uploaded along with
your application and, like the application, are replicated in Google's
server network.
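
The path-to-script mapping can be pictured with a minimal sketch in
plain Python. In GAE itself the mapping is declared in the
application's configuration file, not in code; the handler table, the
script names, and the resolve function below are hypothetical, and the
sketch assumes that patterns are tried in order and the first match
wins.

```python
import re

# Hypothetical handler table: (URL pattern, target) pairs, tried in order.
HANDLERS = [
    (r"/images/.*", "static:images"),  # static resources such as images
    (r"/search", "search.py"),         # a specific script for one path
    (r"/.*", "main.py"),               # catch-all script
]


def resolve(path):
    """Return the target of the first pattern matching the whole path."""
    for pattern, target in HANDLERS:
        if re.fullmatch(pattern, path):
            return target
    return None
```

Ordering matters in this sketch: if the catch-all pattern came first,
the more specific entries would never be reached.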

The programming environment is CGI 1.1.  Google suggests, but doesn't
require, the use of supporting libraries for this model, such as WSGI.
Such high-level libraries allow applications to be written in a very
compact, high-level style, the way one is used to from Python. In
addition to the WSGI framework, this allows the use of several
template libraries, such as Django.  Since the model is CGI 1.1, there
are few or no restrictions on what can be returned: you can return,
for instance, XML or JSON, and you have full control over the
Content-Type header returned.
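
As a rough illustration of the CGI model, a script writes header
lines, a blank line, and then the body to standard output. The sketch
below is plain Python and is not GAE-specific; cgi_response and the
sample payload are hypothetical, and it uses the json module from
current Python versions, which is an assumption about the environment.

```python
import json
import sys


def cgi_response(payload, content_type="application/json"):
    """Build a CGI response: header lines, a blank line, then the body."""
    body = json.dumps(payload, sort_keys=True)
    return "Content-Type: %s\r\n\r\n%s" % (content_type, body)


def main():
    # A CGI program sends its response by writing to standard output.
    sys.stdout.write(cgi_response({"hits": 3, "status": "ok"}))


if __name__ == "__main__":
    main()
```

Returning XML instead would only mean changing the content_type
argument and the way the body is serialized.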

The execution model is request-based.  If a client request arrives,
GAE will start a new instance (or reuse an existing instance if
possible), then invoke the main() method. At this point, you have a
fixed time limit to process the request and return a result to the
client (though not explicitly stated in Google's docs, the limit
currently appears to be 9 seconds). Note that this per-request limit
is a maximum; you should usually respond much more quickly. Also note
that any CPU cycles you use during those 9 seconds (but not time you
spend waiting to fetch results from other application tiers) count
against your overall CPU budget.

The key service the GAE runtime libraries provide is the Google
datastore, which is built on BigTable [2]. You can think of this
service as a highly efficient, persistent store for structured data,
or as a simplified database that allows the creation, retrieval,
updating, and deletion (CRUD) of entries using keys and, optionally,
indices. It provides limited support for transactions as well. Though
it is less powerful than
conventional relational databases - which aren't nearly as scalable -
it can be accessed using GQL, a query language that's similar in
spirit to SQL.  Notably, GQL (or BigTable) does not support JOINs,
which means that you will have to adjust your traditional approach to
database normalization.
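
What "adjusting normalization" can mean in practice is sketched below
in plain Python, with dictionaries standing in for stored entities.
All names and values are hypothetical. The first layout needs a second
lookup (the work a JOIN would do); the second copies the author name
into each item at write time, so a single-kind query returns everything
needed for display.

```python
# Normalized layout: items refer to authors by id, so display needs a join.
authors = {"a1": {"name": "Example Author"}}
items_normalized = [{"title": "Example Title", "author_id": "a1"}]


def titles_with_authors_joined(items, authors_by_id):
    """What a JOIN would compute: look up each item's author separately."""
    return [(item["title"], authors_by_id[item["author_id"]]["name"])
            for item in items]


# Denormalized layout: the author name is copied into each item when it
# is written, at the cost of updating every copy if the name changes.
items_denormalized = [
    {"title": "Example Title", "author_id": "a1",
     "author_name": "Example Author"},
]


def titles_with_authors(items):
    """Read display data from one kind of entity, with no second lookup."""
    return [(item["title"], item["author_name"]) for item in items]
```

The trade-off is redundant storage and more work on writes in exchange
for reads that need only one query.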

The Python binding for the structured data is intuitive and seamless.
You simply declare a Python class for the properties of objects you
wish to store, along with the types of the properties you wish
included, and you can subsequently use a put() or delete() method to
write and delete. Queries will return instances of the objects you
placed in a given table.  Tables are named using the Python classes.

Google provides a number of additional runtime libraries, such as for
simple Image processing a la Google Picasa, for the sending of email
(subject to resource limits), and for user authentication, solely
using Google