Re: GData - Server, Indexing entries

Otis Gospodnetic Wed, 19 Jul 2006 21:46:10 -0700

Hi Simon,

I have to admit that I'm not sure what exactly you are asking here. :)
But I do have a comment/question about this:


> The next step is to retrieve the data from within the elements.
> Elements have three types of content relevant for indexing: plain text,
> HTML and XHTML (binary content might be tough to index :)

I looked at 
http://code.google.com/apis/gdata/protocol.html#Inserting-a-new-entry
Examples in that section contain elements like these:
  <title type="text">Entry 1</title>
  <content type="text">This is my entry</content>

Is that type="text" what you are referring to above, and are you saying this
could also be type="html" and type="xhtml", and that the actual content between
those container tags could be (X)HTML?  Is that described somewhere in the
protocol as supported?
What if you wanted some other format, say some XML?

Just curious.

Thanks,
Otis


----- Original Message ----
From: Simon Willnauer <[EMAIL PROTECTED]>
To: [email protected]
Sent: Wednesday, July 19, 2006 5:52:22 PM
Subject: GData - Server, Indexing entries

Hello everyone,

Well, the last mail about distributed indexing / searching did not
receive many answers, maybe because the topic is very tough. Anyway,
I'll try to kick off the indexing / searching milestone with another
mail.
The GData server has to index all incoming entries on inserts or
updates and mark already indexed entries as deleted on delete
requests. The incoming data will be XML in the first
place. How and which XML elements are indexed will be
defined in the server configuration. I guess it would be quite handy
to configure which elements to index using XPath expressions. That's
fairly generic, and most developers and admins are more or less
familiar with XPath. The analyzer etc. will also come from the
configuration file.
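
To make the XPath idea a bit more concrete, here is a minimal sketch using only
the JDK's javax.xml.xpath API. The class name XPathFieldExtractor and the field
names are made up for illustration, and the map stands in for whatever the
server configuration would provide; it is not part of any existing server code.
The expressions use local-name() so they match both namespaced Atom entries and
the un-namespaced snippets from the protocol page.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: maps index field names to XPath expressions.
final class XPathFieldExtractor {
    // field name -> XPath expression (assumed to come from the server config)
    private final Map<String, String> fieldToXPath;

    XPathFieldExtractor(Map<String, String> fieldToXPath) {
        this.fieldToXPath = new LinkedHashMap<>(fieldToXPath);
    }

    // Returns field name -> string value of the first node matched by its XPath.
    Map<String, String> extract(String entryXml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> values = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fieldToXPath.entrySet()) {
            try {
                // The entry is re-parsed per expression to keep the sketch short.
                String value = xpath.evaluate(
                        field.getValue(),
                        new InputSource(new StringReader(entryXml)));
                values.put(field.getKey(), value.trim());
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException(
                        "Bad XPath or entry for field " + field.getKey(), e);
            }
        }
        return values;
    }
}
```

A configuration entry such as title -> /*[local-name()='entry']/*[local-name()='title']
would then yield the text that gets handed to the analyzer for the "title" field.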
The next step is to retrieve the data from within the elements.
Elements have three types of content relevant for indexing: plain text,
HTML and XHTML (binary content might be tough to index :)
I have to remove the tags from the HTML and XHTML content. I'm aware
that there are several APIs around for doing that, but it would be quite
helpful to have some recommendations.
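
For the XHTML case, one option that needs no extra library is to parse the
content as XML and take its text content. The sketch below assumes the XHTML
content is well-formed; the class name XhtmlTextExtractor is made up for
illustration. It does not cover type="html", which is not necessarily
well-formed and would need a lenient HTML parser instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: strips markup from well-formed XHTML content.
final class XhtmlTextExtractor {
    static String toPlainText(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // getTextContent() concatenates all descendant text nodes,
            // which drops every tag and attribute.
            String text = doc.getDocumentElement().getTextContent();
            // Collapse runs of whitespace left over from the markup layout.
            return text.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Content is not well-formed XHTML", e);
        }
    }
}
```

Calling XhtmlTextExtractor.toPlainText on the XHTML div of an entry gives a
plain string that can go through the same analyzer as type="text" content.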

GData defines a kind of query "language" to query a specific
feed via GET parameters and / or defined endings of the query string
(http://code.google.com/apis/gdata/protocol.html#Queries).
I do have some experience with building parsers (not JavaCC but yacc /
Gentle), so I'll try to parse the so-called "GData query" and translate it
into a Lucene query string. Using JavaCC I can build a fast and
clean way to create Lucene queries from incoming "GData queries".
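
Just to illustrate the kind of translation meant here, below is a rough
hand-written sketch rather than a JavaCC grammar. It only handles the q and
author parameters, assumes whitespace-separated q terms are all required and
that a leading "-" excludes a term, and ignores quoted phrases. The class name
GDataQueryTranslator and the index field names "content" and "author" are
assumptions made for this example only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a raw GData query string into Lucene query syntax.
final class GDataQueryTranslator {
    static String toLuceneQuery(String rawQuery) {
        List<String> clauses = new ArrayList<>();
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // no parameter name or no value separator
            }
            String name = decode(pair.substring(0, eq));
            String value = decode(pair.substring(eq + 1)).trim();
            if (value.isEmpty()) {
                continue;
            }
            if (name.equals("q")) {
                for (String term : value.split("\\s+")) {
                    boolean excluded = term.startsWith("-") && term.length() > 1;
                    String text = excluded ? term.substring(1) : term;
                    clauses.add((excluded ? "-" : "+") + "content:" + quote(text));
                }
            } else if (name.equals("author")) {
                clauses.add("+author:" + quote(value));
            }
            // Other parameters (e.g. max-results) are paging hints, not
            // search criteria, so this sketch ignores them.
        }
        return String.join(" ", clauses);
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Wraps a value in double quotes, escaping backslashes and quotes.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```

The resulting string could then be handed to Lucene's query parser; a generated
parser would mainly add proper handling of phrases and the category path.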

I do have lots of ideas to extend the search capabilities described in
the GData protocol, but I guess I will leave that until after SoC has
finished.

I just wanna ask you guys to let me know if you have some ideas about all that.
Every comment will be highly appreciated!!!!

regards Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




