Hi Simon, I have to admit that I'm not sure what exactly you are asking here. :) But I do have a comment/question about this:
> The next step is to retrieve the data from within the elements.
> Elements have three types of content relevant for indexing plain text,
> html, xhtml (binary content might be tough to index :)

I looked at http://code.google.com/apis/gdata/protocol.html#Inserting-a-new-entry

Examples in that section contain elements like these:

<title type="text">Entry 1</title>
<content type="text">This is my entry</content>

Is that type="text" what you are referring to above, and are you saying this could also be type="html" and type="xhtml", and that the actual content between those container tags could be (X)HTML? Is that described somewhere in the protocol as supported? What if you wanted some other format, say some XML? Just curious.

Thanks,
Otis

----- Original Message ----
From: Simon Willnauer <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 19, 2006 5:52:22 PM
Subject: GData - Server, Indexing entries

Hello everyone,

well, the last mailing about distributed indexing / searching did not receive many answers, maybe because the topic is very tough. Anyway, I will try to kick off the indexing / searching milestone with another mailing.

The GData server has to index all incoming entries on inserts or updates, and mark already indexed entries as deleted on delete requests. So the format of incoming data will be XML in the first place. How and which XML elements are supposed to be indexed will be defined in the server configuration. I guess it would be quite handy to configure which elements to index using xpath expressions. That's fairly generic, and most developers and admins are more or less familiar with xpath. Analyzer etc. will also come from the configuration file.

The next step is to retrieve the data from within the elements.
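
A minimal sketch of that retrieval step, written in Python only to keep it short. It assumes entries are Atom XML in the http://www.w3.org/2005/Atom namespace and that the type attribute follows the Atom convention (text, html with escaped markup, xhtml with real child elements). The FIELD_PATHS mapping and all function names are hypothetical stand-ins for the server configuration, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical configuration: index field name -> path of the element to index.
FIELD_PATHS = {
    "title": "atom:title",
    "content": "atom:content",
}


class _TextCollector(HTMLParser):
    """Collects character data and drops all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.parts).split())


def element_text(element):
    """Return indexable plain text for one element, based on its type attribute."""
    kind = element.get("type", "text")
    if kind == "xhtml":
        # XHTML arrives as real child elements; itertext() skips the tags.
        return " ".join("".join(element.itertext()).split())
    if kind == "html":
        # HTML arrives as escaped markup inside the element's text.
        return strip_tags(element.text or "")
    return (element.text or "").strip()


def extract_fields(entry_xml, field_paths=FIELD_PATHS):
    """Map each configured field name to the plain text found at its path."""
    entry = ET.fromstring(entry_xml)
    fields = {}
    for name, path in field_paths.items():
        element = entry.find(path, ATOM)
        if element is not None:
            fields[name] = element_text(element)
    return fields
```

Fields whose path matches nothing are simply left out of the result, so a configuration can list optional elements without special handling.
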
Elements have three types of content relevant for indexing: plain text, html, and xhtml (binary content might be tough to index :) I have to remove the tags from the HTML and XHTML content. I'm aware that there are several APIs around for doing that, but it would be quite helpful to have some recommendations.

GData defines a kind of query "language" for querying a specific feed via GET parameters and / or defined endings of the query string (http://code.google.com/apis/gdata/protocol.html#Queries). I do have some experience with building parsers (not JavaCC but yacc / gentle), so I will try to parse the so-called "GData Query" and translate it into a Lucene query string. Using JavaCC I can build a fast and clean translation from incoming "GData Queries" to Lucene queries.

I do have lots of ideas to extend the search capabilities described in the GData protocol, but I guess I will leave that until after SoC has finished.

I just want to ask you guys to let me know if you have some ideas about all that. Every comment will be highly appreciated!

regards
Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
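
To make the query translation described in the message above more concrete, here is a rough sketch in Python. It uses plain string handling rather than a JavaCC-generated parser, and it covers only the q and author parameters plus categories given after a "/-/" marker in the path. The Lucene field names (content, author, category) and the function names are assumptions, and the q text is passed through unchanged on the assumption that its syntax is close enough to Lucene's.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def gdata_to_lucene(url):
    """Build a Lucene query string from a GData query URL (sketch only)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    clauses = []

    # Categories follow a "/-/" marker in the path, e.g. /feeds/jo/-/Fritz/Laurie.
    _, marker, tail = parts.path.partition("/-/")
    if marker:
        for category in filter(None, tail.split("/")):
            clauses.append("category:" + quote_phrase(unquote(category)))

    # Full-text query: handed to the (hypothetical) content field as-is.
    for text in params.get("q", []):
        clauses.append("content:(%s)" % text)

    for author in params.get("author", []):
        clauses.append("author:" + quote_phrase(author))

    # All restrictions must hold at once, so the clauses are joined with AND.
    return " AND ".join(clauses)
```

A grammar-based parser would be the place to handle what this sketch ignores, such as the remaining query parameters and escaping of Lucene special characters inside q.
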
