Re: [CODE4LIB] a brief summary of the Google App Engine
Thanks for sharing that, Doug. It's not mentioned at all in the Developer's Guide (contains everything you need to know): http://code.google.com/appengine/docs/

I'll have to take a closer look at the src docs...

Keith

On Thu, Jul 17, 2008 at 3:01 PM, Doug Chestnut [EMAIL PROTECTED] wrote:

There is a Searchable Entity with GAE. Refer to the src for docs. It is fairly straightforward: it takes the text of the properties, removes stop words, and creates a new list property that contains the words. An index on this property allows fast retrieval. It is fairly limited; you don't want list properties to get too large (this is what the devs told me at Google I/O).

From the docs: Don't expect too much. First, there's no ranking, which is a killer drawback. There's also no exact phrase match, substring match, boolean operators, stemming, or other common full text search features. Finally, support for stop words (common words that are not indexed) is currently limited to English.

I have been playing with reverse indexes in GAE with some potential success for search and faceted browse.

--Doug
Re: [CODE4LIB] a brief summary of the Google App Engine
For an example, check out the bulk loader article (it includes a simple program that searches the datastore by keyword using the SearchableEntity): http://code.google.com/appengine/articles/bulkload.html

The last time I tried the bulkloader it had problems with UTF-8.

--Doug

On Fri, Jul 18, 2008 at 9:39 AM, Keith Jenkins [EMAIL PROTECTED] wrote:

Thanks for sharing that, Doug. It's not mentioned at all in the Developer's Guide (contains everything you need to know): http://code.google.com/appengine/docs/

I'll have to take a closer look at the src docs...

Keith

On Thu, Jul 17, 2008 at 3:01 PM, Doug Chestnut [EMAIL PROTECTED] wrote:

There is a Searchable Entity with GAE. Refer to the src for docs. It is fairly straightforward: it takes the text of the properties, removes stop words, and creates a new list property that contains the words. An index on this property allows fast retrieval. It is fairly limited; you don't want list properties to get too large (this is what the devs told me at Google I/O).

From the docs: Don't expect too much. First, there's no ranking, which is a killer drawback. There's also no exact phrase match, substring match, boolean operators, stemming, or other common full text search features. Finally, support for stop words (common words that are not indexed) is currently limited to English.

I have been playing with reverse indexes in GAE with some potential success for search and faceted browse.

--Doug
Re: [CODE4LIB] a brief summary of the Google App Engine
On Wed, Jul 16, 2008 at 6:29 AM, Keith Jenkins [EMAIL PROTECTED] wrote:

[...] So it's a bit of a hack just to get a left-anchored search. Querying for a particular keyword anywhere within a string value would be even more work. For small datasets, I guess you could iterate through every record. But for anything larger, you'd probably want to figure out a way to manually build an index within the Google datastore, or else keep the indexing outside GAE, and just use GAE for fetching specified records. Any ideas on how that might work?

Keith

Hi Keith,

There is a Searchable Entity with GAE. Refer to the src for docs. It is fairly straightforward: it takes the text of the properties, removes stop words, and creates a new list property that contains the words. An index on this property allows fast retrieval. It is fairly limited; you don't want list properties to get too large (this is what the devs told me at Google I/O).

From the docs: Don't expect too much. First, there's no ranking, which is a killer drawback. There's also no exact phrase match, substring match, boolean operators, stemming, or other common full text search features. Finally, support for stop words (common words that are not indexed) is currently limited to English.

I have been playing with reverse indexes in GAE with some potential success for search and faceted browse.

--Doug
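The keyword approach described in this message (take the text, drop stop words, keep the remaining words as a list, index that list) can be sketched in plain Python. This is only an illustration of the idea, not the SearchableEntity source: the names KeywordIndex and keywords, the tiny stop-word list, and the rule that every query word must match are all assumptions made for the example. As with the limitations quoted from the docs, there is no ranking, no phrase match, and no stemming.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny English stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def keywords(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(set(words) - STOP_WORDS)


class KeywordIndex:
    """In-memory stand-in for an indexed list property of keywords."""

    def __init__(self):
        # word -> set of record ids whose text contains that word
        self._postings = defaultdict(set)

    def add(self, record_id, text):
        for word in keywords(text):
            self._postings[word].add(record_id)

    def search(self, query):
        """Return ids of records containing every non-stop word in the query."""
        terms = keywords(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = KeywordIndex()
index.add(1, "The History of the Library of Congress")
index.add(2, "A Guide to Library Cataloging")
print(index.search("library"))
print(index.search("the library history"))
```

The set of keywords per record plays the role of the list property, which is also why large texts are a concern: every distinct word adds one more entry to that list.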
Re: [CODE4LIB] a brief summary of the Google App Engine
On Wed, Jul 16, 2008 at 12:21 AM, Godmar Back [EMAIL PROTECTED] wrote:

Aside from the limitations imposed by the index model, the problem then is fundamentally similar to how you index MARC data for use in any discovery system.

I think Godmar is referring to GAE's lack of keyword searching. To elaborate, the following is from http://code.google.com/appengine/docs/datastore/queriesandindexes.html

Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:

    db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", "abc" + "\xEF\xBF\xBD")

This matches every MyModel entity with a string property prop that begins with the characters abc. The byte string \xEF\xBF\xBD represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.

So it's a bit of a hack just to get a left-anchored search. Querying for a particular keyword anywhere within a string value would be even more work. For small datasets, I guess you could iterate through every record. But for anything larger, you'd probably want to figure out a way to manually build an index within the Google datastore, or else keep the indexing outside GAE, and just use GAE for fetching specified records. Any ideas on how that might work?

Keith
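To see why the pair of inequality filters behaves like a prefix match, here is a small stand-alone sketch that applies the same range idea to a sorted Python list, which stands in for the datastore index. It does not use the GAE API. The function names are made up for the example, and it uses the Unicode character U+FFFD as the upper sentinel, which is what the byte string in the quoted tip decodes to as UTF-8.

```python
import bisect

# Assumption: U+FFFD plays the role of "the largest possible character",
# mirroring the "\xEF\xBF\xBD" byte string in the quoted tip.
UPPER_SENTINEL = "\ufffd"


def prefix_range(prefix):
    """Return (low, high) such that low <= value < high selects the prefix."""
    return prefix, prefix + UPPER_SENTINEL


def prefix_match(sorted_values, prefix):
    """Return all values in a sorted list that start with the given prefix."""
    low, high = prefix_range(prefix)
    start = bisect.bisect_left(sorted_values, low)   # first value >= low
    end = bisect.bisect_left(sorted_values, high)    # first value >= high
    return sorted_values[start:end]


titles = sorted(["abacus", "abc", "abcdef", "abd", "xyz"])
print(prefix_match(titles, "abc"))
```

Because the values are kept in sorted order, everything that begins with the prefix sits in one contiguous run, so two range bounds are enough. The same trick cannot find a keyword in the middle of a string, which is the limitation raised above.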
Re: [CODE4LIB] a brief summary of the Google App Engine
On Wed, Jul 16, 2008 at 12:21 AM, Godmar Back [EMAIL PROTECTED] wrote:

Aside from the limitations imposed by the index model, the problem then is fundamentally similar to how you index MARC data for use in any discovery system. Presumably, you could learn from the experiences of the many projects that have done that - some in Python, such as http://code.google.com/p/fac-back-opac/ (though they use Django, they don't appear to be using its object-relational db model for MARC records; I say this from a 2 min examination of parts of their code; I may be wrong. PyMarc itself doesn't support it.)

Fac-Back-OPAC doesn't use the Django ORM because we're using Solr for indexing.

Mark
Re: [CODE4LIB] a brief summary of the Google App Engine
On Tue, Jul 15, 2008 at 2:16 PM, Fernando Gomez [EMAIL PROTECTED] wrote:

Any thoughts about a convenient way of storing and (more importantly) indexing and retrieving MARC records using GAE's Bigtable?

GAE uses an object model similar to Django's object-relational model. You can define a Python class, inherit from db.Model, and declare the properties of your model; then instances can be created, stored, retrieved and updated. GAE performs automatic indexing on some fields, and you can tell it to index on others, or on certain combinations.

Aside from the limitations imposed by the index model, the problem then is fundamentally similar to how you index MARC data for use in any discovery system. Presumably, you could learn from the experiences of the many projects that have done that - some in Python, such as http://code.google.com/p/fac-back-opac/ (though they use Django, they don't appear to be using its object-relational db model for MARC records; I say this from a 2 min examination of parts of their code; I may be wrong. PyMarc itself doesn't support it.)

- Godmar
[CODE4LIB] a brief summary of the Google App Engine
Hi,

since I brought up the issue of the Google App Engine (GAE) (or similar services, such as Amazon's Elastic Compute Cloud, EC2), I thought I'd give a brief overview of what it can and cannot do, so that we can judge its potential use for library services.

GAE is a cloud infrastructure into which developers can upload applications. These applications are replicated among Google's network of data centers and they have access to its computational resources. Each application has access to a certain amount of resources at no fee; Google recently announced the pricing for applications whose resource use exceeds the no-fee threshold [1]. The no-fee threshold is rather substantial: 500MB of persistent storage, and, according to Google, enough bandwidth and cycles to serve about 5 million page views per month.

GAE applications must be written in Python. They run in a sandboxed environment. This environment limits what applications can do and how they communicate with the outside world. Overall, the sandbox is very flexible - in particular, application developers have the option of uploading additional Python libraries of their choice with their application. The restrictions lie primarily in security and resource management. For instance, you cannot use arbitrary socket connections (all outside world communication must be through GAE's fetch service, which supports http/https only), you cannot fork processes or threads (which would use up CPU cycles), and you cannot write to the filesystem (instead, you must store all of your persistent data in Google's scalable data storage, which is also known as BigTable).

All resource usage (CPU, bandwidth, persistent storage - though not memory) is accounted for and you can see your use in the application's dashboard control panel. Resources are replenished on the fly where possible, as in the case of CPU and bandwidth.

Developers are currently restricted to 3 applications per account.
Making applications in multiple accounts work in tandem to work around quota limitations is against Google's terms of use.

Applications are described by a configuration file that maps URI paths to scripts in a manner similar to how you would use Apache mod_rewrite. URIs can also be mapped to explicitly named static resources such as images. Static resources are uploaded along with your application and, like the application, are replicated in Google's server network.

The programming environment is CGI 1.1. Google suggests, but doesn't require, the use of supporting libraries for this model, such as WSGI. This use of high-level libraries allows applications to be written in a very compact, high-level style, the way one is used to from Python. In addition to the WSGI framework, this allows the use of several template libraries, such as Django. Since the model is CGI 1.1, there are few or no restrictions on what can be returned - you can return, for instance, XML or JSON, and you have full control over the Content-Type header returned.

The execution model is request-based. When a client request arrives, GAE will start a new instance (or reuse an existing instance if possible), then invoke the main() method. At this point, you have a fixed time limit to process this request (though not explicitly stated in Google's documentation, the limit currently appears to be 9 seconds) and return a result to the client. Note that this per-request limit is a maximum; you should usually be much quicker in your response. Also note that any CPU cycles you use during those 9 seconds (but not time you spend waiting to fetch results from other application tiers) count against your overall CPU budget.

The key service the GAE runtime libraries provide is the Google datastore, aka BigTable [2]. You can think of this service as a highly efficient, persistent store for structured data.
You may think of it as a simplified database that allows the creation, retrieval, updating, and deletion (CRUD) of entries using keys and, optionally, indices. It provides limited support for transactions as well. Though it is less powerful than conventional relational databases - which aren't nearly as scalable - it can be accessed using GQL, a query language that's similar in spirit to SQL. Notably, GQL (or BigTable) does not support JOINs, which means that you will have to adjust your traditional approach to database normalization.

The Python binding for the structured data is intuitive and seamless. You simply declare a Python class for the properties of objects you wish to store, along with the types of the properties you wish included, and you can subsequently use a put() or delete() method to write and delete. Queries will return instances of the objects you placed in a given table. Tables are named after the Python classes.

Google provides a number of additional runtime libraries, such as for simple image processing à la Google Picasa, for the sending of email (subject to resource limits), and for user authentication, solely using Google