I got this output:

[EMAIL PROTECTED]:~# rdig -c configfile
RDig version 0.3.4
using Ferret 0.10.14
added url file:///home/myaccount/documents/
waiting for threads to finish...
[EMAIL PROTECTED]:~# rdig -c configfile -q "Ruby"
RDig version 0.3.4
using Ferret 0.10.14
executing query >Ruby<
Query:
total results: 0
[EMAIL PROTECTED]:~#
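
For reference, here is a minimal sketch for checking whether the crawl
wrote anything into the index at all (assuming Ferret 0.10.x and the
index path from the config below; Index#size is the number of stored
documents):

  require 'rubygems'
  require 'ferret'

  # open the index RDig wrote; the path must match cfg.index.path
  index = Ferret::Index::Index.new(:path => '/home/myaccount/index')
  puts "documents in index: #{index.size}"

If that prints 0, the problem is on the indexing side, not in the query.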



Here is my config file. In case I had mistyped something, I changed the
block variable from config to cfg throughout, e.g.:

cfg.index.create = false

RDig.configuration do |cfg|

  ##################################################################
  # options you really should set

  # provide one or more URLs for the crawler to start from
  # cfg.crawler.start_urls = [ 'http://www.example.com/' ]

  # use something like this for crawling a file system:
  cfg.crawler.start_urls = [ 'file:///home/myaccount/documents/' ]
  # beware, mixing file and http crawling is not possible and might
  # result in unpredictable results.

  # limit the crawl to these hosts. The crawler will never
  # follow any links pointing to hosts other than those given here.
  # ignored for file system crawling
  cfg.crawler.include_hosts = [ 'www.example.com' ]

  # this is the path where the index will be stored
  # caution, existing contents of this directory will be deleted!
  cfg.index.path        = '/home/myaccount/index'

  ##################################################################
  # options you might want to set, the given values are the defaults

  # set to true to get stack traces on errors
  cfg.verbose = true

  # content extraction options
  cfg.content_extraction = OpenStruct.new(

  # HPRICOT configuration
  # this is the html parser used by default from RDig 0.3.3 upwards.
  # Hpricot by far outperforms Rubyful Soup, and is at least as flexible
  # when it comes to selecting portions of the html documents.
    :hpricot      => OpenStruct.new(
      # css selector for the element containing the page title
      :title_tag_selector => 'title',
      # might also be a proc returning either an element or a string:
      # :title_tag_selector => lambda { |hpricot_doc| ... }
      :content_tag_selector => 'body'
      # might also be a proc returning either an element or a string:
      # :content_tag_selector => lambda { |hpricot_doc| ... }
    )
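  # for instance, a hypothetical proc-based selector (the div id
  # "content" is made up here; Hpricot's at() method is assumed)
  # could look like:
  # :content_tag_selector => lambda { |hpricot_doc|
  #   hpricot_doc.at('div#content')
  # }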

  # RUBYFUL SOUP
  # This is a powerful, but somewhat slow, ruby-only html parsing lib
  # which was RDig's default html parser up to version 0.3.2. To use it,
  # comment out the hpricot config above, and uncomment the following:
  #
  #  :rubyful_soup => OpenStruct.new(
  #    # provide a method that selects the tag containing the page
  #    # content you want to index. Useful to avoid indexing common
  #    # elements like navigation and page footers for every page.
  #    :content_tag_selector => lambda { |tagsoup|
  #      tagsoup.html.body
  #    },
  #    # provide a method that returns the title of an html document.
  #    # this method may either return a tag to extract the title from,
  #    # or a ready-to-index string.
  #    :title_tag_selector         => lambda { |tagsoup|
  #      tagsoup.html.head.title
  #    }
  #  )
  )

  # crawler options

  # Notice: for file system crawling the include/exclude_document patterns
  # are applied to the full path of _files_ only (like /home/bob/test.pdf),
  # for http to full URIs (like http://example.com/index.html).

  # nil (include all documents) or an array of Regexps
  # matching the URLs you want to index.
  cfg.crawler.include_documents = nil

  # nil (no documents excluded) or an array of Regexps
  # matching URLs not to index.
  # this filter is used after the one above, so you only need
  # to exclude documents here that aren't wanted but would be
  # included by the inclusion patterns.
  # cfg.crawler.exclude_documents = nil

  # number of document fetching threads to use. Should be raised only if
  # your CPU has idle time when indexing.
  # cfg.crawler.num_threads = 2
  # suggested setting for file system crawling:
  cfg.crawler.num_threads = 1

  # maximum number of http redirections to follow
  # cfg.crawler.max_redirects = 5

  # number of seconds to wait with an empty url queue before finishing
  # the crawl. Set to a higher number when experiencing incomplete
  # crawls on slow sites. Don't set to 0, even when crawling a local fs.
  cfg.crawler.wait_before_leave = 10

  # indexer options

  # create a new index on each run. Will append to the index if false.
  # Use when building a single index from multiple runs, e.g. one across
  # a website and the other a tree in a local file system.
  cfg.index.create = false

  # rewrite document uris before indexing them. This is useful if you're
  # indexing on disk, but the documents should be accessible via http,
  # e.g. from a web based search application. By default, no rewriting
  # takes place. Example:
  # cfg.index.rewrite_uri = lambda { |uri|
  #   uri.path.gsub!(/^\/base\//, '/virtual_dir/')
  #   uri.scheme = 'http'
  #   uri.host = 'www.mydomain.com'
  # }

end
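
To rule out the rdig query front end, one could also run the same query
straight against the Ferret index. This is only a sketch: it assumes
Ferret 0.10.x, the index path from above, and that RDig stores the
document location in a field named :url (the field name is a guess;
inspect a stored document if it doesn't match):

  require 'rubygems'
  require 'ferret'

  index = Ferret::Index::Index.new(:path => '/home/myaccount/index')
  # search_each yields the internal doc id and score for every match
  index.search_each('Ruby') do |doc_id, score|
    puts "#{score}: #{index[doc_id][:url]}"  # :url is an assumed field name
  end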
