Hi Christoph

Sphinx can strip out HTML tags (that may work with XML as well, though I'm 
not sure). It also supports stemming (configured through the morphology 
setting in Sphinx) and stop words:

http://sphinxsearch.com/docs/manual-1.10.html#conf-html-strip
http://sphinxsearch.com/docs/manual-1.10.html#conf-morphology
http://sphinxsearch.com/docs/manual-1.10.html#conf-stopwords

You can set all of these in config/sphinx.yml, which works much like Rails' 
database.yml (i.e. settings are grouped per environment):
http://freelancing-god.github.com/ts/en/advanced_config.html
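
As a rough sketch of what that file might hold, here is a small Ruby script
that embeds a hypothetical config/sphinx.yml and reads the settings for one
environment. The key names are assumed to mirror Sphinx's own option names
(html_strip, morphology, stopwords), the stop word paths are made up, and the
sphinx_settings helper is only for illustration, so check the docs linked
above before relying on any of it:

```ruby
require "yaml"

# Hypothetical contents of config/sphinx.yml. Key names are assumed to match
# Sphinx's option names; the stopwords paths are placeholders.
SPHINX_YML = <<~YAML
  development:
    html_strip: true
    morphology: stem_en
    stopwords: config/stopwords.txt
  production:
    html_strip: true
    morphology: stem_en
    stopwords: /var/www/app/shared/config/stopwords.txt
YAML

# Illustrative helper (not part of Thinking Sphinx): return the settings
# hash for a single environment, the same way database.yml is keyed.
def sphinx_settings(yaml_text, environment)
  YAML.load(yaml_text).fetch(environment)
end

puts sphinx_settings(SPHINX_YML, "development").inspect
```

Each environment gets its own block, so development and production can point
at different stop word files while sharing the same stemmer.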

Also, Sphinx 2.0.x (and 1.10-beta) can read files in directly as part of the 
indexing process when there's a column holding the full file path. Support 
for this is very new in Thinking Sphinx and I've only tested it briefly: it 
was added in the latest release (1.4.6 for Rails 2, 2.0.5 for Rails 3), so 
there's no documentation just yet, but it works much like a normal field 
definition:

  indexes file_path_column, :file => true

Here's Sphinx's documentation on that setting:
http://sphinxsearch.com/docs/manual-1.10.html#conf-sql-file-field

Hopefully some combination of all of these will help :)

Cheers

-- 
Pat
e: [email protected]      || m: +614 1327 3337
w: http://freelancing-gods.com   || t: twitter.com/pat
bounce: http://trampolineday.com || skype: patallan

On 27/05/2011, at 5:31 PM, sol wrote:

> Hey there,
> 
> I'd like to ask for your input on the following scenario:
> 
> I've got a webcrawler that retrieves xml and html files.
> The files can be stored either in the database or on the filesystem;
> they are identified by a unique id in the db.
> 
> Now I want to index these files to enable a better search.
> Ideally this would include some feature extraction from the files.
> 
> I'm not sure if sphinx or thinking sphinx is suitable for this or how
> I have to prepare the data for it. I can for example strip html/xml
> tags, put them into the db as text and index it.
> Does sphinx then do some kind of stemming, stop word removal etc?
> 
> It's kind of hard to find out where to get started, as I'm not very
> experienced with full text search in rails/ruby.
> 
> I'd be very happy if someone could help me a bit.
> Thank you,
> Christoph
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "Thinking Sphinx" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to 
> [email protected].
> For more options, visit this group at 
> http://groups.google.com/group/thinking-sphinx?hl=en.
> 
