Hi Christoph,

Sphinx has support for stripping out HTML tags (that may work with XML as well, though I'm not sure). It also supports stemming (configured through the morphology setting in Sphinx) and stop words:
http://sphinxsearch.com/docs/manual-1.10.html#conf-html-strip
http://sphinxsearch.com/docs/manual-1.10.html#conf-morphology
http://sphinxsearch.com/docs/manual-1.10.html#conf-stopwords

You can set all of these in config/sphinx.yml, which works much like Rails' database.yml (e.g. settings per environment):
http://freelancing-god.github.com/ts/en/advanced_config.html

Also, Sphinx 2.0.x (and 1.10-beta) has support for reading in files directly as part of the indexing process, when there's a column with the full file path. I've only tested this briefly, and it's very new in Thinking Sphinx: it was just added in the latest release (1.4.6 for Rails 2, 2.0.5 for Rails 3), so there isn't any documentation just yet, but it works much like a normal field definition:

  indexes file_path_column, :file => true

Here's Sphinx's documentation on that setting:
http://sphinxsearch.com/docs/manual-1.10.html#conf-sql-file-field

Hopefully some combination of all of these will help :)

Cheers

--
Pat
e: [email protected] || m: +614 1327 3337
w: http://freelancing-gods.com || t: twitter.com/pat
bounce: http://trampolineday.com || skype: patallan

On 27/05/2011, at 5:31 PM, sol wrote:

> Hey there,
>
> I'd like to ask for your input on the following scenario:
>
> I've got a webcrawler that retrieves xml and html files.
> The files can be either stored in the database or the filesystem, they
> are identified with a unique id in the db.
>
> Now I want to index these files to enable a better search.
> Ideally this would include some feature extraction from the files.
>
> I'm not sure if sphinx or thinking sphinx is suitable for this or how
> I have to prepare
> the data for it. I can for example strip html/xml tags, put them into
> the db as text and index it.
> Does sphinx then do some kind of stemming, stop word removal etc?
>
> It's kind of hard to find out where to get started, as I'm not very
> experienced with full text search in rails/ruby.
>
> I'd be very happy if someone could help me a bit.
> Thank you,
> Christoph
>
> --
> You received this message because you are subscribed to the Google Groups
> "Thinking Sphinx" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/thinking-sphinx?hl=en.
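
For anyone following along, here is a rough sketch of what the per-environment settings from the reply might look like in config/sphinx.yml. It is written as a small Ruby script that builds the structure and prints it as YAML, so the shape can be checked without a running Sphinx. The setting names (html_strip, morphology, stopwords) come from the Sphinx manual sections linked in the reply. The stem_en value, the helper name, and both stopwords paths are assumptions chosen for illustration, not values from the thread.

```ruby
require 'yaml'

# Hypothetical helper: returns the Sphinx settings discussed in the reply
# for a single environment. The stopwords path is a placeholder.
def sphinx_settings(stopwords_path)
  {
    'html_strip' => 1,             # strip HTML markup before indexing
    'morphology' => 'stem_en',     # English stemmer (assumed choice)
    'stopwords'  => stopwords_path # plain-text file of words to ignore
  }
end

# One block per environment, mirroring the layout of database.yml.
# Both paths are made up for this example.
config = {
  'development' => sphinx_settings('config/stopwords.txt'),
  'production'  => sphinx_settings('/path/to/shared/stopwords.txt')
}

# Print the YAML that would go into config/sphinx.yml.
puts config.to_yaml
```

The printed YAML has one top-level key per environment, each holding the three settings. Whether html_strip also handles the XML files is untested here, as the reply itself notes.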
