document support for file system crawling
Hi there, browsing through the message archives I tried to find a thread addressing file system crawls. I want to implement an enterprise search over a networked filesystem, crawling all sorts of documents, such as HTML, DOC, PPT, and PDF. Nutch provides plugins enabling it to read proprietary formats. Is there support for the same functionality in Solr? Bruno
Can't use SnowballAnalyzer
Hi All, I'm trying to use the SnowballAnalyzer and for some strange reason I cannot. I got the following error message in the log file:

org.apache.solr.core.SolrException: Error instantiating class class org.apache.lucene.analysis.snowball.SnowballAnalyzer
        at org.apache.solr.core.Config.newInstance(Config.java:213)
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:466)
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:294)
        at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:67)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:189)
        at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:170)
        at org.apache.solr.servlet.SolrServlet.init(SolrServlet.java:74)
        at javax.servlet.GenericServlet.init(GenericServlet.java:211)
        at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1105)
        at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:932)
        at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:3917)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4201)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:759)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:739)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:524)
        at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:904)
        at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:867)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:474)
        at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021)
        at org.apache.catalina.core.StandardHost.start(StandardHost.java:718)
        at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)
        at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442)
        at org.apache.catalina.core.StandardService.start(StandardService.java:450)
        at org.apache.catalina.core.StandardServer.start(StandardServer.java:709)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:551)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432)
Caused by: java.lang.InstantiationException: org.apache.lucene.analysis.snowball.SnowballAnalyzer
        at java.lang.Class.newInstance0(Class.java:335)
        at java.lang.Class.newInstance(Class.java:303)
        at org.apache.solr.core.Config.newInstance(Config.java:211)
        ... 33 more

Has anyone used it before? Thank you, Diogo
Re: document support for file system crawling
On Aug 30, 2006, at 2:42 AM, Bruno wrote: browsing through the message thread I tried to find a trail addressing file system crawls. I want to implement an enterprise search over a networked filesystem, crawling all sorts of documents, such as html, doc, ppt and pdf. Nutch provides plugins enabling it to read proprietary formats. Is there support for the same functionality in solr? No. Solr is strictly a search server that takes plain text for the fields of documents added to it. The client is responsible for parsing the text out of these types of documents. You could borrow the document-parsing pieces from Lucene's contrib and Nutch and glue them together into your client that speaks to Solr, or perhaps Solr isn't the right approach for your needs? It certainly is possible to add these capabilities into Solr, but it would be awkward to have to stream binary data into XML documents such that Solr could parse them on the server side. Erik
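Erik's suggested client-side flow — parse the binary formats yourself, then hand Solr plain text over its XML update interface — could be sketched as follows. This is a minimal illustration, not from the thread: the field names ("id", "title", "body") are hypothetical and would have to match your schema.xml, and text extraction (e.g. via the Lucene contrib or Nutch parsers Erik mentions) is left as a stub.

```python
# Build an <add> message for Solr's /update handler from field dicts.
# Assumes text has already been extracted from PDF/DOC/PPT by your own
# parser; Solr itself only ever sees plain text here.
from xml.sax.saxutils import escape

def solr_add_xml(docs):
    """Render a list of {field_name: text} dicts as a Solr <add> document."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            # escape() handles &, <, > so the text is safe inside XML
            parts.append('<field name="%s">%s</field>' % (name, escape(value)))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

xml = solr_add_xml([{"id": "doc1", "title": "Quarterly Report",
                     "body": "text your parser extracted..."}])
# POST `xml` to the Solr update URL (e.g. http://localhost:8983/solr/update)
# with Content-type: text/xml
```

The point of the sketch is the division of labor: everything binary-format-specific lives in the client, and Solr only receives the resulting text.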
Re: Add doc limit - Follow Up
On 8/29/06, sangraal aiken [EMAIL PROTECTED] wrote: The problem only occurs when adding docs that contain <![CDATA[ ]]> sections in the body of the field tag. The problem also only seems to cause an add limit on an individual post. I limited the size of my HTTP posts to 5000 documents per post, and the problem never showed up. You do not need to do a commit after each batch as I previously thought. That's very interesting... it sounds like perhaps an XPP (the XML parser) bug that Tomcat manages to tickle. I looked through the XPP changelogs quickly - no mention of a problem like this being fixed. -Yonik
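The workaround sangraal describes — capping each HTTP post at 5000 documents, and (since the bug is tied to CDATA sections) escaping field text instead of wrapping it in CDATA — could be sketched like this. The batch size comes from the thread; the helper names and field names are illustrative assumptions.

```python
# Sketch of the two workarounds discussed above: small batches per POST,
# and XML-escaping field text rather than relying on <![CDATA[ ]]>.
from xml.sax.saxutils import escape

def batches(docs, size=5000):
    """Split a document list into posts of at most `size` docs each."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def field_xml(name, text):
    """Escape &, < and > in field text so no CDATA section is needed."""
    return '<field name="%s">%s</field>' % (name, escape(text))

chunks = list(batches([{"id": str(n)} for n in range(12000)]))
# 12000 docs split into posts of 5000, 5000, and 2000
```

Escaping is functionally equivalent to CDATA for the parser's purposes, so it sidesteps the suspected XPP bug without changing what gets indexed.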
Re: Can't use SnowballAnalyzer
: constructor requires the language parameter. I see SnowballAnalyzer : mentioned in a comment in the example schema.xml, but there is no : specification for language. My guess is you'll need to construct Whoops ... I just changed that example so as not to mislead people. FYI: the SnowballFilter uses reflection, so it's not recommended for performance... http://incubator.apache.org/solr/docs/api/org/apache/solr/analysis/SnowballPorterFilterFactory.html -Hoss
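The factory Hoss links to is configured in schema.xml with a language attribute, which addresses the "no specification for language" problem above. A minimal sketch of a field type using it (the "text_en" name and the tokenizer/lowercase filters are illustrative choices, not from the thread):

```xml
<fieldtype name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the language attribute selects the Snowball stemmer -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldtype>
```

Configuring the filter factory this way also avoids trying to instantiate SnowballAnalyzer directly, which fails (as in Diogo's stack trace) because that class has no no-arg constructor.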
Re: acts_as_solr
You might want to look at acts_as_searchable for Ruby: http://rubyforge.org/projects/ar-searchable That's a similar plugin for the Hyper Estraier search engine using its REST interface. On 8/28/06, Erik Hatcher [EMAIL PROTECTED] wrote: I've spent a few hours tinkering with a Ruby ActiveRecord plugin to index, delete, and search database-backed models with Solr.
Re: acts_as_solr
On Aug 28, 2006, at 10:25 PM, Erik Hatcher wrote: I'd like to commit this to the Solr repository. Any objections? Once committed, folks will be able to use script/plugin install ... to install the Ruby side of things, and using a binary distribution of Solr's example application and a custom solr/conf directory (just for schema.xml) they'd be up and running quite quickly. If ok to commit, what directory should I put things under? How about just ruby? Ok, /client/ruby it is. I'll get this committed in the next day or so. I have to admit that the stuff Seth did with Searchable (linked to from http://wiki.apache.org/solr/SolRuby) is very well done, so hopefully he can work with us to integrate that work into what lives in Solr's repository. Having the Searchable abstraction is interesting, but it might be a bit limiting in terms of leveraging fancier return values from Solr, like the facets and highlighting - or maybe it's just an unnecessary abstraction for those always working with Solr. I like it, though, and will certainly borrow ideas from it on how to do slick stuff with Ruby. While I'm at it, I'd be happy to commit the Java client into /client/java. I'll check the status of that contribution when I can. Erik