data import scheduling
Hi, has anyone gotten Solr to schedule data imports at a certain time interval through configuration alone? I tried setting interval=1, which should mean an import every minute, but I don't see it happening. I'm trying to avoid cron jobs. Thanks, Tri
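For what it's worth, stock Solr (as of 1.4) ships no built-in DataImportHandler scheduler, so an interval property on its own won't trigger anything; the usual alternatives are cron or an in-process timer that hits the DIH endpoint. A minimal in-process sketch, assuming a hypothetical local DIH URL (the task below only prints the request it would make; a real version would open an HTTP connection, e.g. with java.net.HttpURLConnection):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DataImportScheduler {
    // Assumed DIH endpoint; adjust host, port, and core name to your setup.
    static final String DIH_URL = "http://localhost:8983/solr/dataimport";

    static String importCommand(boolean clean) {
        return DIH_URL + "?command=full-import&clean=" + clean;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Fire once a minute, which is what interval=1 was meant to do.
        Runnable importTask = () -> System.out.println("GET " + importCommand(false));
        scheduler.scheduleAtFixedRate(importTask, 0, 1, TimeUnit.MINUTES);
        Thread.sleep(100);   // let the first run happen (demo only)
        scheduler.shutdownNow();
    }
}
```

clean=false keeps existing documents in the index; with clean=true each run would wipe the index first.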
Re: solr dynamic core creation
Does anyone have any idea on how to do this? -- View this message in context: http://lucene.472066.n3.nabble.com/solr-dynamic-core-creation-tp1867705p1881374.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: To cache or to not cache
Jonathan, thanks for your comment. In fact, you are quite right: a lot of people have developed great caching mechanisms. However, the solution I had in mind was something like an HTTP cache, in most cases on the same box. I talked to some experts who told me that Squid would be a relatively large monster, since we only want it for HTTP caching. Do you know of any benchmarks on responses per second when most of the queried data is in the cache? Regards
Re: How to use polish stemmer - Stempel - in schema.xml?
Hi! Sorry for such a break, but I was moving house... anyway:

1. I took the ~/apache-solr/src/java/org/apache/solr/analysis/StandardFilterFactory.java file and modified it (saved as StempelFilterFactory.java) in Vim this way:

    package org.getopt.solr.analysis;

    import org.apache.lucene.analysis.TokenStream;
    import org.getopt.stempel.lucene.StempelFilter;

    public class StempelTokenFilterFactory extends BaseTokenFilterFactory {
        public StempelFilter create(TokenStream input) {
            return new StempelFilter(input);
        }
    }

2. Then I put the file into the extracted stempel-1.0.jar under ./org/getopt/solr/analysis/
3. Then I created a class from it: jar -cf StempelTokenFilterFactory.class StempelFilterFactory.java
4. Then I created a new stempel-1.0.jar archive: jar -cf stempel-1.0.jar -C ./stempel-1.0/ .
5. Then in schema.xml I put:

    <fieldType name="text_pl" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="org.getopt.solr.analysis.StempelTokenFilterFactory"/>
      </analyzer>
    </fieldType>

6. I started the Solr server and received the following error:

    2010-11-11 11:50:56 org.apache.solr.common.SolrException log
    SEVERE: java.lang.ClassFormatError: Incompatible magic value 1347093252 in class file org/getopt/solr/analysis/StempelTokenFilterFactory
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        ...

Question: What is wrong? :) I use jar (fastjar) 0.98 to create the jars. I googled that error, but no answer gave me an idea of what is wrong in my .java file. Please help, as I believe I am close to the end of this subject. Cheers, Jakub Godawa. 2010/11/3 Lance Norskog goks...@gmail.com: Here's the problem: Solr is a little dumb about these Filter classes, and so you have to make a Factory object for the Stempel Filter. There are a lot of other FilterFactory classes.
You would have to just copy one and change the names to Stempel, and it might actually work. This will take some Solr programming - perhaps the author can help you? On Tue, Nov 2, 2010 at 7:08 AM, Jakub Godawa jakub.god...@gmail.com wrote: Sorry, I am not a Java programmer at all. I would appreciate more verbose (or step by step) help. 2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: So you call org.getopt.solr.analysis.StempelTokenFilterFactory. In this case I would assume a file StempelTokenFilterFactory.class in your directory org/getopt/solr/analysis/. And a class which extends BaseTokenFilterFactory, right? ... public class StempelTokenFilterFactory extends BaseTokenFilterFactory implements ResourceLoaderAware { ... On 02.11.2010 14:20, Jakub Godawa wrote: This is what stempel-1.0.jar consists of after jar -xf:

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R org/
    org/: egothor getopt
    org/egothor: stemmer
    org/egothor/stemmer: Cell.class Diff.class Gener.class MultiTrie2.class Optimizer2.class Reduce.class Row.class TestAll.class TestLoad.class Trie$StrEnum.class Compile.class DiffIt.class Lift.class MultiTrie.class Optimizer.class Reduce$Remap.class Stock.class Test.class Trie.class
    org/getopt: stempel
    org/getopt/stempel: Benchmark.class lucene Stemmer.class
    org/getopt/stempel/lucene: StempelAnalyzer.class StempelFilter.class

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R META-INF/
    META-INF/: MANIFEST.MF

    jgod...@ubuntu:~/apache-solr-1.4.1/ifaq/lib$ ls -R res
    res: tables
    res/tables: readme.txt stemmer_1000.out stemmer_100.out stemmer_2000.out stemmer_200.out stemmer_500.out stemmer_700.out

2010/11/2 Bernd Fehling bernd.fehl...@uni-bielefeld.de: Hi Jakub, if you unzip your stempel-1.0.jar do you have the required directory structure and file in there? org/getopt/stempel/lucene/StempelFilter.class Regards, Bernd On 02.11.2010 13:54, Jakub Godawa wrote: Erick, I've put the jar files like that before.
I also added the <lib> directive and put the file in instanceDir/lib. What is still a problem is that even though the file is loaded:

    2010-11-02 13:20:48 org.apache.solr.core.SolrResourceLoader replaceClassLoader
    INFO: Adding 'file:/home/jgodawa/apache-solr-1.4.1/ifaq/lib/stempel-1.0.jar' to classloader

I am not able to use the FilterFactory... maybe I am attempting it in a wrong way? Cheers, Jakub Godawa. 2010/11/2 Erick Erickson erickerick...@gmail.com: The Polish stemmer jar file needs to be findable by Solr; if you copy it to solr_home/lib and restart Solr you should be set. Alternatively, you can add another lib directive to the solrconfig.xml file (there are several examples in that file already). I'm a little confused about not being able to find TokenFilter - is that still a problem? HTH Erick On Tue, Nov 2, 2010 at
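A hint on the ClassFormatError above: a valid class file starts with the magic number 0xCAFEBABE, while the reported value 1347093252 is 0x504B0304 — the ZIP local-file-header signature "PK\3\4". In other words, the file packed as StempelTokenFilterFactory.class is itself a zip/jar archive, which is what step 3's `jar -cf` produces; compiling with javac is what yields bytecode. A small sketch that decodes the reported value:

```java
public class MagicCheck {
    public static void main(String[] args) {
        int magic = 1347093252; // from the ClassFormatError
        System.out.printf("0x%08X%n", magic); // prints 0x504B0304
        // 0x504B0304 is the ZIP signature "PK\3\4":
        byte[] b = { (byte) (magic >>> 24), (byte) (magic >>> 16),
                     (byte) (magic >>> 8), (byte) magic };
        System.out.println((char) b[0] + "" + (char) b[1]); // prints PK
        // A real class file would begin with 0xCAFEBABE instead.
    }
}
```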
Error while indexing files with Solr
Hi, I am trying to index documents (PDF, DOC, XLS, RTF) using the ExtractingRequestHandler. I am following the tutorial at http://wiki.apache.org/solr/ExtractingRequestHandler but when I run the following command

    curl "http://localhost:8983/solr/update/extract?literal.id=mydoc.doc&uprefix=attr_&fmap.content=attr_content" -F myfile=@/home/system/Documents/mydoc.doc

I am getting the following error:

    HTTP ERROR 500: lazy loading error
    org.apache.solr.common.SolrException: lazy loading error
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:249)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
    Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:375)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
        at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
        ... 21 more
    Caused by: java.lang.ClassNotFoundException: org.apache.solr.handler.extraction.ExtractingRequestHandler not found in java.net.URLClassLoader{urls=[], parent=contextloa...@null}
        at java.net.URLClassLoader.findClass(libgcj.so.90)
        at java.lang.ClassLoader.loadClass(libgcj.so.90)
        at java.lang.ClassLoader.loadClass(libgcj.so.90)
        at java.lang.Class.forName(libgcj.so.90)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
        ... 24 more

    RequestURI=/solr/update/extract (Powered by Jetty://)

I am running Debian Lenny and java version 1.6.0_22. I am running apache-solr-1.4.1 from the examples directory. Please point me in the right direction and help me solve the problem. -- --- Regards, Kaustuv Royburman Senior Software Developer infoservices.in DLF IT Park, Rajarhat, 1st Floor, Tower - 3 Major Arterial Road, Kolkata - 700156, India
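The root cause in that trace is a plain ClassNotFoundException: the Solr Cell (extraction) contrib jars are not on Solr's classpath. In the stock 1.4.1 layout they live in dist/apache-solr-cell-1.4.1.jar plus contrib/extraction/lib/, and need to be copied into solr_home/lib or referenced via lib directives in solrconfig.xml (paths below assume the standard distribution layout). Separately, libgcj.so.90 in the trace suggests the JVM is GCJ rather than the Sun JDK, which has caused Solr trouble before. A small probe one could run to check whether the handler class is loadable:

```java
public class HandlerCheck {
    /** Returns true if the named class is loadable from the current classpath. */
    static boolean present(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String handler = "org.apache.solr.handler.extraction.ExtractingRequestHandler";
        if (present(handler)) {
            System.out.println("Solr Cell is on the classpath");
        } else {
            // Remediation for the assumed 1.4.1 layout:
            //   cp dist/apache-solr-cell-1.4.1.jar contrib/extraction/lib/*.jar <solr_home>/lib/
            System.out.println("Solr Cell missing - copy the extraction contrib jars into solr_home/lib");
        }
    }
}
```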
index just new articles from rss feeds - Data Import Request Handler
Hello, I'd like to use Solr to index some documents coming from an RSS feed, like the example at [1], but it seems that the configuration used there is just for one-time indexing, fetching all the articles exposed in the website's RSS feed. Is it possible to manage and index just the new articles coming from the RSS source? I found that the delta-import may be useful but, from what I understand, delta-import just updates the index with the contents of documents that have been modified since the last indexing: this is obviously useful, but I'd like to index just the new articles coming from an RSS feed. Is this managed automatically by Solr, or do I have to deal with it separately? Maybe a full import with the clean=false parameter? Are there any solutions you would suggest? Maybe storing the article feeds in a table like [2] and having a module that periodically sends each row to Solr for indexing? Thanks, Matteo [1] http://wiki.apache.org/solr/DataImportHandler#HttpDataSource_Example [2] http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
IndexTank technology...
Does anyone know what technology they are using: http://www.indextank.com/ Is it Lucene under the hood? Thanks, and apologies for cross-posting. -Glen http://zzzoot.blogspot.com
solr 1.3 how to parse rich documents
Hi, I use Solr 1.3 with the patch for parsing rich documents, and when uploading, for example, a PDF file, the only thing I see in solr.log is the following:

    INFO: [] webapp=/solr path=/update/rich params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf} status=0 QTime=12656

solrconfig.xml contains the line:

    <requestHandler name="/update/rich" class="solr.RichDocumentRequestHandler" startup="lazy"/>

What else am I missing? Since I am running Solr standalone, I do not need to build it with ant, do I? Regards, Nikola -- Nikola Garafolic SRCE, Sveucilisni racunski centar tel: +385 1 6165 804 email: nikola.garafo...@srce.hr
Re: Adding new field after data is already indexed
@Jerry Li What version of Solr were you using? And was there any data in the new field? I have no problems here with a quick test I ran on trunk... Best Erick On Thu, Nov 11, 2010 at 1:37 AM, Jerry Li | 李宗杰 zongjie...@gmail.com wrote: But if I use this field to do sorting, an error occurs and an ArrayIndexOutOfBounds exception is thrown. On Thursday, November 11, 2010, Robert Petersen rober...@buy.com wrote: 1) Just put the new field in the schema and stop/start Solr. Documents in the index will not have the field until you reindex them, but it won't hurt anything. 2) Just turn off their handlers in solrconfig is all I think that takes. -Original Message- From: gauravshetti [mailto:gaurav.she...@tcs.com] Sent: Monday, November 08, 2010 5:21 AM To: solr-user@lucene.apache.org Subject: Adding new field after data is already indexed Hi, I had a few questions regarding Solr. Say my schema file looks like

    <field name="folder_id" type="long" indexed="true" stored="true"/>
    <field name="indexed" type="boolean" indexed="true" stored="true"/>

and I index data on the basis of these fields. Now, in case I need to add a new field, is there a way I can add the field without corrupting the previous data? Is there any feature which adds a new field with a default value to the existing records? 2) Is there any security mechanism/authorization check to restrict URLs like /admin and /update to only a few users? -- Best Regards. Jerry. Li | 李宗杰
Re: solr dynamic core creation
Hi, nizan. I didn't realize that just replying to a thread from my email client wouldn't get back to you. Here's some info on this thread since your original post: On Nov 10, 2010, at 12:30pm, Bob Sandiford wrote: Why not use replication? Call it inexperience... We're really early into working with and fully understanding Solr and the best way to approach various issues. I did mention that this was a prototype and non-production code, so I'm covered, though :) We'll take a look at the replication feature... Replication doesn't replicate the top-level solr.xml file that defines available cores, so if dynamic cores is a requirement then your custom code isn't wasted :) -- Ken -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, November 10, 2010 3:26 PM To: solr-user@lucene.apache.org Subject: Re: Dynamic creating of cores in solr You could use the actual built-in Solr replication feature to accomplish that same function -- complete re-index to a 'master', and then when finished, trigger replication to the 'slave', with the 'slave' being the live index that actually serves your applications. I am curious if there was any reason you chose to roll your own solution using SolrJ and dynamic creation of cores, instead of simply using the replication feature. Were there any downsides of using the replication feature for this purpose that you ameliorated through your solution? Jonathan Bob Sandiford wrote: We also use SolrJ, and have a dynamically created Core capability - where we don't know in advance what the Cores will be that we require. We almost always do a complete index build, and if there's a previous instance of that index, it needs to be available during a complete index build, so we have two cores per index, and switch them as required at the end of an indexing run.
Here's a summary of how we do it (we're in an early prototype / implementation right now - this isn't production quality code - as you can tell from our voluminous javadocs on the methods...)

1) Identify if the core exists, and if not, create it:

    /**
     * This method instantiates two SolrServer objects, solr and indexCore. It requires that
     * indexName be set before calling.
     */
    private void initSolrServer() throws IOException {
        String baseUrl = "http://localhost:8983/solr/";
        solr = new CommonsHttpSolrServer(baseUrl);
        String indexCoreName = indexName + SolrConstants.SUFFIX_INDEX; // SUFFIX_INDEX = "_INDEX"
        String indexCoreUrl = baseUrl + indexCoreName;
        // Here we create two cores for the indexName, if they don't already exist - the live core used
        // for searching and a second core used for indexing. After indexing, the two will be switched so the
        // just-indexed core will become the live core. The way that core swapping works, the live core will always
        // be named [indexName] and the indexing core will always be named [indexName]_INDEX, but the
        // dataDir of each core will alternate between [indexName]_1 and [indexName]_2.
        createCoreIfNeeded(indexName, indexName + "_1", solr);
        createCoreIfNeeded(indexCoreName, indexName + "_2", solr);
        indexCore = new CommonsHttpSolrServer(indexCoreUrl);
    }

    /**
     * Create a core if it does not already exist. Returns true if a new core was created, false otherwise.
     */
    private boolean createCoreIfNeeded(String coreName, String dataDir, SolrServer server) throws IOException {
        boolean coreExists = true;
        try {
            // SolrJ provides no direct method to check if a core exists, but getStatus will
            // return an empty list for any core that doesn't.
            CoreAdminResponse statusResponse = CoreAdminRequest.getStatus(coreName, server);
            coreExists = statusResponse.getCoreStatus(coreName).size() > 0;
            if (!coreExists) {
                // Create the core
                LOG.info("Creating Solr core: " + coreName);
                CoreAdminRequest.Create create = new CoreAdminRequest.Create();
                create.setCoreName(coreName);
                create.setInstanceDir(".");
                create.setDataDir(dataDir);
                create.process(server);
            }
        } catch (SolrServerException e) {
            e.printStackTrace();
        }
        return !coreExists;
    }

2) Do the index, clearing it first if it's a complete rebuild:

    [snip]
    if (fullIndex) {
        try {
            indexCore.deleteByQuery("*:*");
        } catch (SolrServerException e) {
            e.printStackTrace(); // To change body of catch statement use File | Settings | File Templates.
        }
    }
    [snip]

various logic, then (we submit batches of 100): [snip]
Issue with facet fields
I am facing this weird issue with facet fields. Within the config XML, under

    <requestHandler name="standard" class="solr.SearchHandler">
      <!-- default values for query parameters -->
      <lst name="defaults">

I have defined the fl as

    <str name="fl">file_id folder_id display_name file_name priority_text content_type last_upload upload_by business indexed</str>

But my output XML doesn't contain the elements upload_by and business, though I am able to search by upload_by: and business:. Even when I add fl=* to the URL I do not get these fields in the response. Any idea what I am doing wrong?
Boosting
Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc. I want to boost the title field. I tried index-time boosting, but it did not work. I also tried query-time boosting, but with no luck. Can someone help me with how to implement boosting on a specific field like title? Thanks, Solr User
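For query-time boosting, the usual route is the dismax request handler, whose qf parameter weights fields, e.g. qf=title^2.0 description^1.0 (the weights below are illustrative, not from the original post). A sketch that assembles such a request string; note that the spaces inside qf would still need URL-encoding before being sent:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostedQuery {
    /** Builds dismax query parameters from a field -> boost map. */
    static String dismaxParams(String q, Map<String, Double> boosts) {
        StringBuilder qf = new StringBuilder();
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (qf.length() > 0) qf.append(' ');
            qf.append(e.getKey()).append('^').append(e.getValue());
        }
        return "defType=dismax&q=" + q + "&qf=" + qf;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0);       // boost title matches hardest
        boosts.put("description", 1.0);
        boosts.put("isbn", 0.5);
        System.out.println(dismaxParams("solr", boosts));
        // prints: defType=dismax&q=solr&qf=title^2.0 description^1.0 isbn^0.5
    }
}
```

Index-time boosts, by contrast, are baked into the documents and require a reindex after every change, which is one reason query-time qf weights are easier to experiment with.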
Re: solr dynamic core creation
Hi, Thanks for the offers, I'll take a deeper look into them. In the offers you showed me, if I understand correctly, the call for creation is done on the client side. I need the mechanism to work on the server side. I know it sounds stupid, but I need the client side not to know which cores exist; on the server side (maybe with a handler?) Solr should detect that the core is not created, and create it if needed. Thanks, nizan
problem with wildcard
Hi All, I'm having some trouble with a query using a wildcard and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem.

1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0

    <result name="response" numFound="302" start="0"/>
    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <str name="QParser">LuceneQParser</str>
      <arr name="filter_queries">
        <str>title:( lowe')</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>title:low</str>
      </arr>

2) select?q=*:*&fq=title:(+lowe'*)&debugQuery=on&rows=0

    <result name="response" numFound="0" start="0"/>
    <lst name="debug">
      <str name="rawquerystring">*:*</str>
      <str name="querystring">*:*</str>
      <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
      <str name="parsedquery_toString">*:*</str>
      <lst name="explain"/>
      <arr name="filter_queries">
        <str>title:( lowe'*)</str>
      </arr>
      <arr name="parsed_filter_queries">
        <str>title:lowe'*</str>
      </arr>
      ...
    </lst>

The title field is defined as:

    <field name="title" type="text" indexed="true" stored="true" required="false"/>

where the text type is:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query
             analyzers to leave a 'gap' for more accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>
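The parsed_filter_queries in the debug output point at the cause: the plain query is run through the analysis chain (lowercasing, word-delimiter splitting on the apostrophe, Snowball stemming), ending up as title:low, whereas wildcard queries in Lucene/Solr bypass analysis, so title:lowe'* searches for indexed terms literally beginning with lowe' — terms that never made it into the index in that form. A toy illustration of the mismatch (the "stemming" here is faked for the demo, not the real Snowball algorithm):

```java
public class WildcardMismatch {
    // Stand-in for the index-time chain: the real chain (WordDelimiter +
    // LowerCase + SnowballPorter) turns "Lowe'" into the term "low",
    // as the debug output's parsed_filter_queries shows.
    static String analyze(String raw) {
        String t = raw.toLowerCase().replaceAll("[^a-z]", "");
        return t.endsWith("e") ? t.substring(0, t.length() - 1) : t; // fake stem
    }

    public static void main(String[] args) {
        String indexedTerm = analyze("Lowe'");
        System.out.println(indexedTerm);                     // low
        System.out.println(indexedTerm.startsWith("low"));   // true  -> the 302 hits
        System.out.println(indexedTerm.startsWith("lowe'")); // false -> the 0 hits
    }
}
```

A common workaround is a lighter-analyzed copyField (no stemming, apostrophes stripped consistently) dedicated to prefix/wildcard searches.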
Re: solr dynamic core creation
Hmmm. Maybe you need to define what you mean by 'server' and what you mean by 'client'.
Re: solr dynamic core creation
Hi, Maybe I just don't understand the whole concept and I'm mixing up server and client... Client - the place where I make the HTTP calls (for index, search etc.) - where I use CommonsHttpSolrServer as the Solr server. This machine isn't defined as master or slave; it just uses Solr as a search engine. Server - the HTTP calls I make on the client go to another server, the master Solr server (or one of the slaves), where I have an EmbeddedSolrServer, don't they? thanks, nizan
Re: Crawling with nutch and mapping fields to solr
I'm going down the route of patching Nutch so I can use this ParseMetaTags plugin: https://issues.apache.org/jira/browse/NUTCH-809 Also wondering whether I will be able to use the XMLParser to allow me to parse well-formed XHTML; using XPath would be a bonus: https://issues.apache.org/jira/browse/NUTCH-185 Any thoughts appreciated...
EdgeNGram relevancy
Hi, consider the following fieldtype (used for autocompletion):

    <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
      </analyzer>
    </fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird, though. Example: Query string: Bill Cl Result (in that order): Clyde Phillips, Clay Rogers, Roger Cloud, Bill Clinton. Bill Clinton should have the highest rank in that case. Does anyone have an idea how to configure this fieldtype so that matches in both tokens rank higher than those that match only one token? thanks! -robert
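One common fix is to reward or require matches on all query tokens — e.g. the dismax handler with a high mm (minimum-should-match) value, or a phrase boost (pf) on a less-tokenized copy of the field — so that "Bill Clinton" (two tokens matched) outranks "Clyde Phillips" (one). A toy model of the edge-ngram matching, not Solr's actual scoring, that shows counting matched tokens separates the candidates:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGrams {
    // Mimics EdgeNGramFilterFactory with minGramSize=1, maxGramSize=25.
    static List<String> edgeNGrams(String token) {
        List<String> grams = new ArrayList<>();
        for (int i = 1; i <= Math.min(token.length(), 25); i++) {
            grams.add(token.substring(0, i));
        }
        return grams;
    }

    // Toy relevance signal: how many query tokens prefix-match some word of the name?
    static int matchedTokens(String name, String[] queryTokens) {
        int matched = 0;
        for (String qt : queryTokens) {
            for (String word : name.toLowerCase().split("\\s+")) {
                if (edgeNGrams(word).contains(qt.toLowerCase())) { matched++; break; }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] query = { "Bill", "Cl" };
        System.out.println(matchedTokens("Bill Clinton", query));   // 2
        System.out.println(matchedTokens("Clyde Phillips", query)); // 1
    }
}
```

With mm=100% the one-token matches would be filtered out entirely; with a pf boost they would merely rank below the full match.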
Re: solr dynamic core creation
No - in reading what you just wrote, and what you originally wrote, I think the misunderstanding was mine, based on the architecture of my code. In my code, it is our 'server' level that does the SolrJ indexing calls, but you meant 'server' to be the Solr instance, and what you mean by 'client' is what I was thinking of (without thinking) as the 'server'... Sorry about that. Hopefully someone else can chime in on your specific issue...
Re: Any Copy Field Caveats?
I've noticed that using camelCase in field names causes problems. On 11/5/2010 11:02 AM, Will Milspec wrote: Hi all, we're moving from an old Lucene version to Solr and plan to use the Copy Field functionality. Previously we had rolled our own implementation, sticking title, description, etc. in a field called 'content'. We lose some flexibility (i.e. the java layer can no longer control what gets into the new copied field) but gain simplicity. A fair tradeoff IMO. My question: has anyone found any subtle issues or gotchas with copy fields? (From the subject line: caveat - pronounced 'kah-VEY-AT' - is Latin, as in Caveat Emptor... let the buyer beware.) thanks, will
Re: Concatenate multiple tokens into one
Hi Robert, All, I have a similar problem; here is my fieldType: http://paste.pocoo.org/show/289910/ I want to include stopword removal and lowercase the incoming terms. The idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNGram filter factory. If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful. Many thanks Nick On 11 Nov 2010, at 00:23, Robert Gründler wrote: On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote: Are you sure you really want to throw out stopwords for your use case? I don't think autocompletion will work how you want if you do. In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion. And if you don't... then why use the WhitespaceTokenizer and then try to jam the tokens back together? Why not just NOT tokenize in the first place? Use the KeywordTokenizer, which really should be called the NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates one token from the entire input string. I started out with the KeywordTokenizer, which worked well, except for the stopword problem.
For now, i've come up with a quick-and-dirty custom ConcatFilter, which does what i'm after:

public class ConcatFilter extends TokenFilter {
    private TokenStream tstream;

    protected ConcatFilter(TokenStream input) {
        super(input);
        this.tstream = input;
    }

    @Override
    public Token next() throws IOException {
        Token token = new Token();
        StringBuilder builder = new StringBuilder();
        TermAttribute termAttribute = (TermAttribute) tstream.getAttribute(TermAttribute.class);
        TypeAttribute typeAttribute = (TypeAttribute) tstream.getAttribute(TypeAttribute.class);
        boolean incremented = false;
        while (tstream.incrementToken()) {
            if (typeAttribute.type().equals("word")) {
                builder.append(termAttribute.term());
            }
            incremented = true;
        }
        token.setTermBuffer(builder.toString());
        if (incremented) {
            return token;
        }
        return null;
    }
}

I'm not sure if this is a safe way to do this, as i'm not familiar with the whole solr/lucene implementation after all. best -robert Then lowercase, remove whitespace (or not), do whatever else you want to do to your single token to normalize it, and then edgengram it. If you include whitespace in the token, then when making your queries for auto-complete, be sure to use a query parser that doesn't do pre-tokenization; the 'field' query parser should work well for this.
Jonathan From: Robert Gründler [rob...@dubture.com] Sent: Wednesday, November 10, 2010 6:39 PM To: solr-user@lucene.apache.org Subject: Concatenate multiple tokens into one Hi, i've created the following filterchain in a field type, the idea is to use it for autocompletion purposes:

<tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
<filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <!-- throw out stopwords -->
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/> <!-- throw out everything except a-z -->
<!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/> <!-- create edgeNGram tokens for autocomplete matches -->

With that kind of filterchain, the EdgeNGramFilterFactory will receive multiple tokens on input strings with whitespace in them. This leads to the following results:

Input Query: George Cloo
Matches:
- George Harrison
- John Clooridge
- George Smith
- George Clooney
- etc.

However, only George Clooney should match in the autocompletion use case. Therefore, i'd like to add a filter before the EdgeNGramFilterFactory which concatenates all the tokens generated by the WhitespaceTokenizerFactory. Are there filters which can do such a thing? If not, are there examples of how to implement a custom TokenFilter? thanks! -robert
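For reference, the chain discussed in this thread (lowercase, drop stopwords, strip non-[a-z], concatenate, then edge n-gram) can be sketched outside Solr as plain Java. This is a toy simulation, not Lucene's API; the class name, helper names, and stopword list are made up for illustration:

```java
import java.util.*;
import java.util.regex.Pattern;

// Toy simulation of the autocomplete analysis chain from the thread:
// whitespace-split -> lowercase -> stopword removal -> strip non-[a-z]
// -> concatenate into one token -> edge n-grams.
public class AutocompleteChain {

    // Illustrative stopword list (the real one lives in stopwords.txt).
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "dj", "featuring"));
    static final Pattern NON_AZ = Pattern.compile("[^a-z]");

    // Normalize each whitespace token, drop stopwords, re-join into the
    // single token the EdgeNGram filter should receive.
    static String concatenate(String input) {
        StringBuilder sb = new StringBuilder();
        for (String tok : input.split("\\s+")) {
            String t = NON_AZ.matcher(tok.toLowerCase(Locale.ROOT)).replaceAll("");
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                sb.append(t);
            }
        }
        return sb.toString();
    }

    // Edge n-grams of the concatenated token, as EdgeNGramFilterFactory
    // with minGramSize=1, maxGramSize=25 would emit them.
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        String joined = concatenate("The Beastie Boys");
        System.out.println(joined);                   // beastieboys
        System.out.println(edgeNgrams(joined, 1, 25));
    }
}
```

With the tokens joined first, only prefixes of the whole name ("georgec", "georgecl", ...) are indexed, so "George Cloo" no longer matches "George Harrison".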
Rollback can't be done after committing?
Hi, all. I have a question about Solr and SolrJ's rollback. I try to rollback like below:

try {
    server.addBean(dto);
    server.commit();
} catch (Exception e) {
    if (server != null) { server.rollback(); }
}

I expected that if any Exception is thrown, the rollback runs, so no data would be updated. But once commit has run, rollback is not done. Is rollback only done correctly when commit has not yet run? Is Solr and SolrJ's rollback system not the same as an RDB's rollback?
Re: Rollback can't be done after committing?
What you say is true. Solr is not an RDBMS. Kouta Osabe wrote: [original question quoted in full, snipped]
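The semantics being asked about can be sketched with a toy model (plain Java, not SolrJ; class and method names are invented for illustration): Solr's rollback discards only documents added since the last commit, and cannot undo a commit that has already succeeded.

```java
import java.util.*;

// Toy model (not SolrJ) of Solr's commit/rollback semantics: rollback
// discards only the documents added since the last commit; it cannot
// undo a commit that has already succeeded.
public class CommitModel {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();

    void add(String doc) { pending.add(doc); }                       // like server.addBean(dto)
    void commit()        { committed.addAll(pending); pending.clear(); }
    void rollback()      { pending.clear(); }                        // like server.rollback()

    int visibleDocs()    { return committed.size(); }

    public static void main(String[] args) {
        CommitModel solr = new CommitModel();
        solr.add("doc1");
        solr.commit();     // doc1 is now permanent
        solr.add("doc2");
        solr.rollback();   // discards doc2 only -- doc1 stays
        System.out.println(solr.visibleDocs());  // 1
    }
}
```

This is exactly the difference from an RDBMS: there is one shared uncommitted buffer rather than per-transaction isolation, and commit is a point of no return.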
using CJKTokenizerFactory for Japanese language
I am exploring support for the Japanese language in solr. Solr seems to provide CJKTokenizerFactory. How useful is this module? Has anyone been using this in production for Japanese? One shortfall it seems to have, from what I have been able to read up on, is that it can generate a lot of false matches, for example matching kyoto when searching for tokyo. I did not see many questions related to this module, so I wonder if people are actively using it. If not, are there any other solutions in the market that are recommended by solr users? Thanks Kumar
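The false-match behaviour comes from overlapping character bigrams, which is the approach CJKTokenizer takes. A minimal sketch (plain Java, not the actual tokenizer code) shows the classic case: the text 東京都 ("Tokyo Metropolis") yields the bigrams 東京 ("Tokyo") and 京都 ("Kyoto"), so a query for Kyoto matches a document about Tokyo:

```java
import java.util.*;

// Sketch of overlapping CJK bigram tokenization: every adjacent pair of
// characters becomes a token, so substrings of longer words produce
// spurious matches (the kyoto/tokyo example from the mail).
public class CjkBigrams {
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("東京都"));  // [東京, 京都]
    }
}
```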
Re: EdgeNGram relevancy
You can add an additional field using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator:

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first. --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example:

Query String: Bill Cl
Result (in that order):
- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in that case. Has anyone an idea how to configure this fieldtype so that matches in both tokens rank higher than those that match only one token? thanks! -robert
Re: Issue with facet fields
Are you storing the upload_by and business fields? You will not be able to retrieve a field from your index if it is not stored. Check that you have stored="true" for both of those fields. - Paige On Thu, Nov 11, 2010 at 10:23 AM, gauravshetti gaurav.she...@tcs.com wrote: I am facing this weird issue in facet fields. Within the config xml, under

<requestHandler name="standard" class="solr.SearchHandler">
  <!-- default values for query parameters -->
  <lst name="defaults">

I have defined the fl as:

<str name="fl">file_id folder_id display_name file_name priority_text content_type last_upload upload_by business indexed</str>

But my output xml doesn't contain the elements upload_by and business, although I am able to search by upload_by: and business:. Even when I add fl=* to the url I do not get these fields in the response. Any idea what I am doing wrong? -- View this message in context: http://lucene.472066.n3.nabble.com/Issue-with-facet-fields-tp1883106p1883106.html Sent from the Solr - User mailing list archive at Nabble.com.
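For reference, a minimal schema.xml sketch of the fix being suggested (field names taken from the mail; the type and other attribute values are assumptions):

```xml
<!-- schema.xml: both attributes matter -- indexed="true" makes the
     field searchable, stored="true" makes it returnable via fl -->
<field name="upload_by" type="string" indexed="true" stored="true"/>
<field name="business"  type="string" indexed="true" stored="true"/>
```

A field that is indexed but not stored explains the observed behaviour exactly: you can filter on it, but it never appears in the response, regardless of fl.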
Re: EdgeNGram relevancy
thanks a lot, that setup works pretty well now. the only problem now is that the StopWords do not work that well anymore. I'll provide an example, but first the 2 fieldtypes:

<!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->
<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<!-- autocomplete field which finds startsWith matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->
<fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This setup now makes troubles regarding StopWords, here's an example: Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.
Query: edgytext:"Mr Scorsese" OR edgytext2:"Mr Scorsese"^2.0

This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why in this case "Martin Scorsese" is not in the result at all? thanks again! -robert On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote: [previous message quoted in full, snipped]
Re: Concatenate multiple tokens into one
I've posted a ConcatFilter in my previous mail which does concatenate tokens. This works fine, but i realized that what i wanted to achieve is implemented easier in another way (by using 2 separate field types). Have a look at a previous mail i wrote to the list and the reply from Ahmet Arslan (topic: EdgeNGram relevancy). best -robert On Nov 11, 2010, at 5:27 PM, Nick Martin wrote: [earlier messages quoted in full, snipped]
Memory used by facet queries
Hello All. My first time post so be kind. We are developing a document store with lots and lots of very small documents (200 million at the moment; the final size will probably be double this, at 400 million documents). This is proof-of-concept development, so we are seeing what a single node can do for us before we consider sharding. We'd rather not shard if we don't have to. I'm using SOLR 4.0 (for the simple facet pivots and groups, which work well). We're into week 4 of our development and have the production servers etc. set up. Everything worked very well until we started to test queries with production volumes of data. I'm running into Java Heap Space exceptions during simple faceting on inverted fields. The fields we are currently faceting on are names - Country / Continent / City names, all stored as a Solr.StringField (there are other fields using tokenization to provide the initial search, but we want to use the simple StringFields to provide faceted navigation). In total we have 10 fields we'd ever want to facet on (8 name fields that are strings and 2 datepart fields (year and yearMonth) that are also strings). This is our first time using SOLR and I didn't realise that we'd need so much heap for facets! Solr is running in a tomcat container and I've currently set tomcat to use a max of JAVA_OPTS="$JAVA_OPTS -server -Xms512m -Xmx3g". I've been reading all I can find online and have seen advice to populate the facet caches as soon as we've started the solr service. However I'd really like to know if there are ways to reduce the memory footprint. We currently have 32g of physical ram. Adding more ram is an option but I'm being asked the (completely reasonable) question -- Why do you need so much? Please help! Charlie. -----Original Message----- From: Robert Gründler [mailto:rob...@dubture.com] Sent: 11 November 2010 18:14 To: solr-user@lucene.apache.org Subject: Re: Concatenate multiple tokens into one [earlier messages quoted in full, snipped]
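As a rough back-of-envelope check of why a 3 GB heap is not enough for Charlie's setup: assume roughly one 4-byte ordinal per document per faceted field (a simplification of what Solr's field cache actually allocates, ignoring term bytes and per-segment overhead, so this is only a lower bound):

```java
// Back-of-envelope heap estimate for simple string-field faceting.
// Assumption (simplified): one 4-byte ord per document per faceted
// field; real Solr data structures differ and add term bytes on top.
public class FacetHeapEstimate {
    static long ordBytes(long numDocs, int numFields) {
        return numDocs * 4L * numFields;
    }

    public static void main(String[] args) {
        long docs = 200_000_000L;          // current corpus size from the mail
        long bytes = ordBytes(docs, 10);   // 10 facet fields
        System.out.println(bytes / (1024 * 1024 * 1024) + " GB");  // prints 7 GB
    }
}
```

Even under this optimistic model, 200M docs times 10 string facet fields needs on the order of 8 GB for the ordinal arrays alone, which is why the 3 GB heap blows up and why per-segment faceting, fewer facet fields, or sharding are the usual answers.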
Re: EdgeNGram relevancy
This setup now makes troubles regarding StopWords, here's an example: Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list. Query: edgytext:"Mr Scorsese" OR edgytext2:"Mr Scorsese"^2.0 This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why in this case "Martin Scorsese" is not in the result at all? Did you run your query without using () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 If no, can you paste the output of debugQuery=on?
Search Result Differences a Puzzle
Hi, I cannot find out how this is occurring: Nolosearch/com/search/apachesolr_search/law You can see that the John Paul Stevens result yields more description in the search result because of the keyword relevancy, whereas the other results just give you a snippet of the title based on keywords found. I am trying to figure out how to get a standard-size search result no matter what the relevancy is. While application of this type of result would be irrelevant to many search engines, it is completely practical in a legal setting, as a keyword is only as good as how it is being referenced in the sentence or paragraph. What a dilemma I have! I have been trying to figure out if it is the actual schema.xml file or the solrconfig.xml file, and for the life of me I can't find it referenced anywhere. I tried changing the fragsize to 200 instead of the default of around 70. Didn't do any damage at re-index. This problem is super critical to my search results. Like I said, as an attorney, the keyword is superfluous until it is attached to a long sentence or two that describes whether the keyword we searched for is relevant, let alone worthy of a click. That is why my titles are set to open in a new window: faster access, and if the result is crud, then just close the window and get back to research. Eric
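If the snippet length is the issue, fragsize is a parameter of Solr's standard highlighter, set per request or in the request handler's defaults in solrconfig.xml. A hedged sketch (the field name is hypothetical, values illustrative):

```xml
<!-- solrconfig.xml, inside the request handler's <lst name="defaults">:
     standard highlighter parameters; "body" is a hypothetical field -->
<str name="hl">true</str>
<str name="hl.fl">body</str>        <!-- field(s) to highlight -->
<str name="hl.fragsize">200</str>   <!-- characters per snippet -->
<str name="hl.snippets">2</str>     <!-- snippets per document -->
```

Note these are query-time highlighter settings, not schema settings, which would explain why no re-index made a difference.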
Retrieving indexed content containing multiple languages
My Solr corpus is currently created by indexing metadata from a relational database as well as content pointed to by URLs from the database. I'm using a pretty generic out of the box Solr schema. The search results are presented via an AJAX enabled HTML page. When I perform a search the document title (for example) has a mix of english and chinese characters. Everything there is fine - I can see the english and chinese returned from a facet query on title. I can search against the title using english words it contains and I get back an expected result. I asked a chinese friend to perform the same search using chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language, I'll probably need to support more in the future. My thought is that the chinese characters are indexed as their unicode equivalent so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in english. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
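If the CJK route is taken for the title, a minimal schema.xml sketch might look like this (the field and type names here are made up for illustration; the same analyzer must be applied at both index and query time for the Chinese terms to match):

```xml
<!-- schema.xml sketch: a CJK-analyzed copy of the title field -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="title_cjk" type="text_cjk" indexed="true" stored="true"/>
<copyField source="title" dest="title_cjk"/>
```

With a generic out-of-the-box schema, Chinese text is typically indexed as whole undelimited runs, so individual Chinese query words never match; encoding the query correctly is necessary but not sufficient.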
Re: Concatenate multiple tokens into one
Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention. Best Nick On 11 Nov 2010, at 18:13, Robert Gründler wrote: [earlier messages quoted in full, snipped]
Re: Concatenate multiple tokens into one
this is the full source code, but be warned, i'm not a java developer, and i have no background in lucene/solr development:

// ConcatFilter
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ConcatFilter extends TokenFilter {

    protected ConcatFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next() throws IOException {
        Token token = new Token();
        StringBuilder builder = new StringBuilder();
        TermAttribute termAttribute = (TermAttribute) input.getAttribute(TermAttribute.class);
        TypeAttribute typeAttribute = (TypeAttribute) input.getAttribute(TypeAttribute.class);
        boolean hasToken = false;
        while (input.incrementToken()) {
            if (typeAttribute.type().equals("word")) {
                builder.append(termAttribute.term());
                hasToken = true;
            }
        }
        if (hasToken) {
            token.setTermBuffer(builder.toString());
            return token;
        }
        return null;
    }
}

// ConcatFilterFactory
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class ConcatFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream stream) {
        return new ConcatFilter(stream);
    }
}

and in your schema.xml, you can simply add the filterfactory using this element:

<filter class="com.example.ConcatFilterFactory"/>

Jar files i have included in the buildpath (can be found in the solr download package): apache-solr-core-1.4.1.jar, lucene-analyzers-2.9.3.jar, lucene-core-2.9.3.jar. good luck ;) -robert On Nov 11, 2010, at 8:45 PM, Nick Martin wrote: Thanks Robert, I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention.
Best Nick On 11 Nov 2010, at 18:13, Robert Gründler wrote: [earlier messages quoted in full, snipped]
Re: EdgeNGram relevancy
On 12 Nov 2010, at 01:46, Ahmet Arslan iori...@yahoo.com wrote: This setup now causes trouble with stopwords; here's an example: Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list. Query: edgytext:Mr Scorsese OR edgytext2:"Mr Scorsese"^2.0 This way, the only result I get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. Any idea why "Martin Scorsese" is not in the result at all in this case? Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 If not, can you paste the output of debugQuery=on? This would still not deal with the problem of removing stop words from the indexing and query analysis stages. I really need something that will allow that and give a single token, as in the example below. Best Nick
Re: Retrieving indexed content containing multiple languages
I look forward to the answers to this one. Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Tod listac...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, November 11, 2010 11:35:23 AM Subject: Retrieving indexed content containing multiple languages My Solr corpus is currently created by indexing metadata from a relational database, as well as content pointed to by URLs from the database. I'm using a pretty generic out-of-the-box Solr schema. The search results are presented via an AJAX-enabled HTML page. When I perform a search, the document title (for example) has a mix of English and Chinese characters. Everything there is fine - I can see the English and Chinese returned from a facet query on title. I can search against the title using English words it contains and I get back an expected result. I asked a Chinese friend to perform the same search using Chinese and nothing is returned. How should I go about getting this search to work? Chinese is just one language; I'll probably need to support more in the future. My thought is that the Chinese characters are indexed as their Unicode equivalents, so all I'll need to do is make sure the query is encoded appropriately and just perform a regular search as I would if the terms were in English. For some reason that sounds too easy. I see there is a CJK tokenizer that would help here. Do I need that for my situation? Is there a fairly detailed tutorial on how to handle these types of language challenges? Thanks in advance - Tod
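For Chinese titles, a dedicated field type using the CJK tokenizer that ships with Solr is the usual starting point. A minimal sketch; the field and type names (text_cjk, title_cjk) and the copyField wiring are made up for illustration, not from Tod's actual schema:

```xml
<!-- Bigram-style CJK analysis; solr.CJKTokenizerFactory ships with Solr -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="title_cjk" type="text_cjk" indexed="true" stored="true"/>
<copyField source="title" dest="title_cjk"/>
```

Queries would then search both title and title_cjk, with the query string URL-encoded as UTF-8 as Tod suspects.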
Re: EdgeNGram relevancy
Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first. --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion):

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>

This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: "Bill Cl" Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton "Bill Clinton" should have the highest rank in that case.
Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
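The second, strict field suggested in the reply above could look roughly like the following. This is a sketch only: the names edgytext2/title_edgy2 and the copyField wiring are assumptions for illustration, not part of Robert's posted schema:

```xml
<fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- KeywordTokenizer emits the whole input as a single token, so the
         grams are prefixes of the full title, e.g. "bill cl" for "Bill Clinton" -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_edgy2" type="edgytext2" indexed="true" stored="false"/>
<copyField source="title" dest="title_edgy2"/>
```

Boosting this field, as in edgytext2:"Bill Cl"^2.0, then ranks whole-prefix matches such as "Bill Clinton" above single-word matches.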
Re: problem with wildcard
I'm having some trouble with a query using some wildcards, and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem. 1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/
Re: EdgeNGram relevancy
according to the fieldtype I posted previously, I think it's because of: 1. WhiteSpaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips" 2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ... The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl". -robert On Nov 11, 2010, at 21:34 , Andy wrote: Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first.
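The gram expansion Robert describes can be sketched in a few lines of plain Java. This mimics what EdgeNGramFilterFactory with minGramSize=1/maxGramSize=25 produces per token; it is an illustration, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of EdgeNGramFilter: each whitespace token is expanded into all
// of its prefixes, so the query token "cl" matches a gram of "clyde" even
// though "bill" contributes no matching gram at all.
public class EdgeNGramDemo {
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "clyde" -> [c, cl, cly, clyd, clyde]: the query token "cl" matches
        System.out.println(edgeNGrams("clyde", 1, 25).contains("cl")); // true
        // "bill" -> [b, bi, bil, bill]: no gram matches "cl"
        System.out.println(edgeNGrams("bill", 1, 25).contains("cl"));  // false
    }
}
```

With the default OR operator, that single matching gram is enough to pull "Clyde Phillips" into the result set.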
--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): fieldType name=edgytext class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=25 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / /analyzer /fieldType This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: Bill Cl Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton Bill Clinton should have the highest rank in that case. Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
FAST ESP - Solr migration webinar
We're holding a free webinar on migration from FAST to Solr. Details below. -Yonik http://www.lucidimagination.com = Solr To The Rescue: Successful Migration From FAST ESP to Open Source Search Based on Apache Solr Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT) Hosted by SearchDataManagement.com For anyone concerned about the future of their FAST ESP applications since the purchase of Fast Search and Transfer by Microsoft in 2008, this webinar will provide valuable insights on making the switch to Solr. A three-person roundtable will discuss factors driving the need for FAST ESP alternatives, differences between FAST and Solr, a typical migration project lifecycle methodology, complementary open source tools, best practices, customer examples, and recommended next steps. The speakers for this webinar are: Helge Legernes, Founding Partner & CTO of Findwise; Michael McIntosh, VP Search Solutions for TNR Global; Eric Gaumer, Chief Architect for ESR Technology. For more information and to register, please go to: http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2 =
Re: problem with wildcard
On 2010-11-11, at 3:45 PM, Ahmet Arslan wrote: I'm having some trouble with a query using some wildcards, and I was wondering if anyone could tell me why these two similar queries do not return the same number of results. Basically, the query I'm making should return all docs whose title starts with (or contains) the string lowe'. I suspect some analyzer is causing this behaviour and I'd like to know if there is a way to fix this problem. 1) select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/ Yeah, I found out about this a couple of minutes after I posted my problem. If there is no analyzer, then why is Solr not finding any documents when a single quote precedes the wildcard?
facet+shingle in autosuggest
Hi, I am using a facet.prefix search with shingles in my autosuggest:

<fieldType name="shingle" class="solr.TextField" positionIncrementGap="100" stored="false" multiValued="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

Now I would like to prevent stop words from appearing in the suggestions:

<lst name="autosuggest_shingle">
  <int name="member states">52</int>
  <int name="member states experiencing">6</int>
  <int name="member states in">6</int>
  <int name="member states the">5</int>
  <int name="member states to">25</int>
  <int name="member states with">7</int>
</lst>

Here I would really like to filter out the last 4 suggestions. Is there a way I can sensibly bring in a stop word filter here? Actually, in theory the stop words could appear as the first or second word as well. So I guess when producing shingles I want to skip any stop word from being part of any shingle. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: problem with wildcard
select?q=*:*&fq=title:(+lowe')&debugQuery=on&rows=0 wildcard queries are not analyzed http://search-lucene.com/m/pnmlH14o6eM1/ Yeah, I found out about this a couple of minutes after I posted my problem. If there is no analyzer, then why is Solr not finding any documents when a single quote precedes the wildcard? Probably your index analyzer (WordDelimiterFilterFactory) is eating that single quote. You can verify this on the admin/analysis.jsp page. In other words, there is no term beginning with lowe' in your index. You can try searching for just lowe*
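Ahmet's point can be seen with a toy model of what a word-delimiting analyzer does to the apostrophe at index time. This is plain Java, not Lucene code, and the split rule is a deliberate simplification of WordDelimiterFilter:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified model: the index analyzer splits on non-alphanumerics, so a
// stored title like "Lowe's" yields the terms "lowe" and "s". Wildcard
// queries are NOT analyzed, so the raw prefix "lowe'" is compared directly
// against those terms and matches nothing.
public class WildcardDemo {
    static List<String> indexTerms(String input) {
        return Arrays.stream(input.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    static boolean prefixMatches(List<String> terms, String rawPrefix) {
        return terms.stream().anyMatch(t -> t.startsWith(rawPrefix));
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("Lowe's Home Improvement");
        System.out.println(prefixMatches(terms, "lowe'")); // false: no term keeps the quote
        System.out.println(prefixMatches(terms, "lowe"));  // true
    }
}
```

This is why searching lowe* works while lowe'* returns nothing.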
Re: EdgeNGram relevancy
Ah I see. Thanks for the explanation. Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must match, and that would exclude "Clyde Phillips". --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: Re: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 3:51 PM according to the fieldtype I posted previously, I think it's because of: 1. WhiteSpaceTokenizer splits the string "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips" 2. EdgeNGramFilter gets the 2 tokens, and creates an EdgeNGram for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ... The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer. This creates a match between the 2nd token of the query, "Cl", and one of the subtokens the EdgeNGramFilter created: "Cl". -robert On Nov 11, 2010, at 21:34 , Andy wrote: Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"? "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results? Thanks. --- On Thu, 11/11/10, Ahmet Arslan iori...@yahoo.com wrote: You can add an additional field, using KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both these fields with an OR operator: edgytext:(Bill Cl) OR edgytext2:"Bill Cl" You can even apply a boost so that begins-with matches come first.
--- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote: From: Robert Gründler rob...@dubture.com Subject: EdgeNGram relevancy To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 5:51 PM Hi, consider the following fieldtype (used for autocompletion): fieldType name=edgytext class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=25 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.PatternReplaceFilterFactory pattern=([^a-z]) replacement= replace=all / /analyzer /fieldType This works fine as long as the query string is a single word. For multiple words, the ranking is weird though. Example: Query String: Bill Cl Result (in that order): - Clyde Phillips - Clay Rogers - Roger Cloud - Bill Clinton Bill Clinton should have the highest rank in that case. Has anyone an idea how to to configure this fieldtype to make matches in both tokens rank higher than those who match in either token? thanks! -robert
Re: WELCOME to solr-user@lucene.apache.org
There's not much to go on here. Boosting works, and index time as opposed to query time boosting addresses two different needs. Could you add some detail? All you've really said is it didn't work, which doesn't allow a very constructive response. Perhaps you could review: http://wiki.apache.org/solr/HowToContribute Best Erick On Thu, Nov 11, 2010 at 10:32 AM, Solr User solr...@gmail.com wrote: Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc I want to boost the field title. I tried index time boosting but it did not work. I also tried Query time boosting but with no luck. Can someone help me on how to implement boosting on a specific field like title? Thanks, Solr User
Re: WELCOME to solr-user@lucene.apache.org
Erick, Thank you so much for the reply, and apologies for not providing all the details. The following are the field definitions in my schema.xml:

<field name="title" type="string" indexed="true" stored="true" omitNorms="false"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="authortype" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="isbn13" type="string" indexed="true" stored="true"/>
<field name="isbn10" type="string" indexed="true" stored="true"/>
<field name="material" type="string" indexed="true" stored="true"/>
<field name="pubdate" type="string" indexed="true" stored="true"/>
<field name="pubyear" type="string" indexed="true" stored="true"/>
<field name="reldate" type="string" indexed="false" stored="true"/>
<field name="format" type="string" indexed="true" stored="true"/>
<field name="pages" type="string" indexed="false" stored="true"/>
<field name="desc" type="string" indexed="true" stored="true"/>
<field name="series" type="string" indexed="true" stored="true"/>
<field name="season" type="string" indexed="true" stored="true"/>
<field name="imprint" type="string" indexed="true" stored="true"/>
<field name="bisacsub" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="bisacstatus" type="string" indexed="false" stored="true"/>
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="award" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="age" type="string" indexed="true" stored="true"/>
<field name="reading" type="string" indexed="true" stored="true"/>
<field name="grade" type="string" indexed="true" stored="true"/>
<field name="path" type="string" indexed="false" stored="true"/>
<field name="shortdesc" type="string" indexed="true" stored="true"/>
<field name="subtitle" type="string" indexed="true" stored="true" omitNorms="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="searchFields" type="textSpell" indexed="true" stored="true" multiValued="true" omitNorms="true"/>

Copy Fields:

<copyField source="title" dest="searchFields"/>
<copyField source="author" dest="searchFields"/>
<copyField source="isbn13" dest="searchFields"/>
<copyField source="isbn10" dest="searchFields"/>
<copyField source="format" dest="searchFields"/>
<copyField source="series" dest="searchFields"/>
<copyField source="season" dest="searchFields"/>
<copyField source="imprint" dest="searchFields"/>
<copyField source="bisacsub" dest="searchFields"/>
<copyField source="category" dest="searchFields"/>
<copyField source="award" dest="searchFields"/>
<copyField source="shortdesc" dest="searchFields"/>
<copyField source="desc" dest="searchFields"/>
<copyField source="subtitle" dest="searchFields"/>

<defaultSearchField>searchFields</defaultSearchField>

Before creating the indexes I feed an XML file to the Solr job to create index files. I added a boost attribute to the title field before creating indexes; an example is below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add><doc>
<field name="material">1785440</field>
<field boost="10.0" name="title">Each Little Bird That Sings</field>
<field name="price">16.0</field>
<field name="isbn10">0152051139</field>
<field name="isbn13">9780152051136</field>
<field name="format">Hardcover</field>
<field name="pubdate">2005-03-01</field>
<field name="pubyear">2005</field>
<field name="reldate">2005-02-22</field>
<field name="pages">272</field>
<field name="bisacstatus">Active</field>
<field name="season">Spring 2005</field>
<field name="imprint">Children's</field>
<field name="age">8.0-12.0</field>
<field name="grade">3-6</field>
<field name="author">Marla Frazee</field>
<field name="authortype">Jacket Illustrator</field>
<field name="author">Deborah Wiles</field>
<field name="authortype">Author</field>
<field name="bisacsub">Social Issues/Friendship</field>
<field name="bisacsub">Social Issues/General (see also headings under Family)</field>
<field name="bisacsub">General</field>
<field name="bisacsub">Girls &amp; Women</field>
<field name="category">Fiction/Middle Grade</field>
<field name="category">Fiction/Award Winners</field>
<field name="category">Coming of Age</field>
<field name="category">Social Situations/Death &amp; Dying</field>
<field name="category">Social Situations/Friendship</field>
<field name="path">/assets/product/0152051139.gif</field>
<field name="desc">&lt;div&gt;Ten-year-old Comfort Snowberger has attended 247 funerals. But that's not surprising, considering that her family runs the town funeral home. And even though Great-uncle Edisto keeled over with a heart attack and Great-great-aunt Florentine dropped dead--just like that--six months later, Comfort knows how to deal with loss, or so she thinks. She's more concerned with avoiding her crazy cousin Peach and trying to figure out why her best friend, Declaration, suddenly won't talk to her. Life is full of surprises. And the biggest one of all is learning what it takes to handle them.&lt;br&gt; &lt;br&gt;Deborah Wiles has created a unique, funny, and utterly real cast of characters in this heartfelt, and quintessentially Southern coming-of-age novel. Comfort will charm young readers with her wit, her warmth, and her struggles as she learns about life, loss, and ultimately, triumph.&lt;br&gt;&lt;/div&gt;</field>
<field name="shortdesc">Ten-year-old Comfort Snowberger
Re: WELCOME to solr-user@lucene.apache.org
There are several mistakes in your approach: copyField just copies data. Index time boost is not copied. There is no such boosting syntax. /select?q=Eachtitle^9fl=score You are searching on your default field. This is not your cause of your problem but omitNorms=true disables index time boosts. http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need. --- On Thu, 11/11/10, Solr User solr...@gmail.com wrote: From: Solr User solr...@gmail.com Subject: Re: WELCOME to solr-user@lucene.apache.org To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 11:54 PM Eric, Thank you so much for the reply and apologize for not providing all the details. The following are the field definitons in my schema.xml: field name=title type=string indexed=true stored=true omitNorms=false / field name=author type=string indexed=true stored=true multiValued=true omitNorms=true / field name=authortype type=string indexed=true stored=true multiValued=true omitNorms=true / field name=isbn13 type=string indexed=true stored=true / field name=isbn10 type=string indexed=true stored=true / field name=material type=string indexed=true stored=true / field name=pubdate type=string indexed=true stored=true / field name=pubyear type=string indexed=true stored=true / field name=reldate type=string indexed=false stored=true / field name=format type=string indexed=true stored=true / field name=pages type=string indexed=false stored=true / field name=desc type=string indexed=true stored=true / field name=series type=string indexed=true stored=true / field name=season type=string indexed=true stored=true / field name=imprint type=string indexed=true stored=true / field name=bisacsub type=string indexed=true stored=true multiValued=true omitNorms=true / field name=bisacstatus type=string indexed=false stored=true / field name=category type=string indexed=true stored=true multiValued=true omitNorms=true / field name=award type=string indexed=true stored=true multiValued=true 
omitNorms=true / field name=age type=string indexed=true stored=true / field name=reading type=string indexed=true stored=true / field name=grade type=string indexed=true stored=true / field name=path type=string indexed=false stored=true / field name=shortdesc type=string indexed=true stored=true / field name=subtitle type=string indexed=true stored=true omitNorms=true/ field name=price type=float indexed=true stored=true/ field name=searchFields type=textSpell indexed=true stored=true multiValued=true omitNorms=true/ Copy Fields: copyField source=title dest=searchFields/ copyField source=author dest=searchFields/ copyField source=isbn13 dest=searchFields/ copyField source=isbn10 dest=searchFields/ copyField source=format dest=searchFields/ copyField source=series dest=searchFields/ copyField source=season dest=searchFields/ copyField source=imprint dest=searchFields/ copyField source=bisacsub dest=searchFields/ copyField source=category dest=searchFields/ copyField source=award dest=searchFields/ copyField source=shortdesc dest=searchFields/ copyField source=desc dest=searchFields/ copyField source=subtitle dest=searchFields/ defaultSearchFieldsearchFields/defaultSearchField Before creating the indexes I feed XML file to the Solr job to create index files. 
I added Boost attribute to the title field before creating indexes and an example is below: ?xml version=1.0 encoding=UTF-8 standalone=no?adddocfield name=material1785440/fieldfield boost=10.0 name=titleEach Little Bird That Sings/fieldfield name=price16.0/fieldfield name=isbn100152051139/fieldfield name=isbn139780152051136/fieldfield name=formatHardcover/fieldfield name=pubdate2005-03-01/fieldfield name=pubyear2005/fieldfield name=reldate2005-02-22/fieldfield name=pages272/fieldfield name=bisacstatusActive/fieldfield name=seasonSpring 2005/fieldfield name=imprintChildren's/fieldfield name=age8.0-12.0/fieldfield name=grade3-6/fieldfield name=authorMarla Frazee/fieldfield name=authortypeJacket Illustrator/fieldfield name=authorDeborah Wiles/fieldfield name=authortypeAuthor/fieldfield name=bisacsubSocial Issues/Friendship/fieldfield name=bisacsubSocial Issues/General (see also headings under Family)/fieldfield name=bisacsubGeneral/fieldfield name=bisacsubGirls amp; Women/fieldfield name=categoryFiction/Middle Grade/fieldfield name=categoryFiction/Award Winners/fieldfield name=categoryComing of Age/fieldfield name=categorySocial Situations/Death amp; Dying/fieldfield name=categorySocial Situations/Friendship/fieldfield name=path/assets/product/0152051139.gif/fieldfield name=desclt;divgt;Ten-year-old Comfort Snowberger has attended 247 funerals. But that's not surprising, considering that her family runs the town funeral home. And even though Great-uncle Edisto
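To make the dismax pointer above concrete: with the dismax query parser, the title boost moves into the qf request parameter at query time instead of index-time boost attributes on the documents. A hypothetical request against this schema (request handler defaults and the sample query terms are assumed):

```
/select?defType=dismax&q=each little bird&qf=title^10 searchFields&fl=title,score
```

defType, q, qf, and fl are standard Solr/dismax parameters; the ^10 is a query-time boost on the title field, which also requires norms on that field (omitNorms="false") to influence scoring as expected.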
Re: facet+shingle in autosuggest
I don't know all the implications here, but can't you just insert the StopwordFilterFactory before the ShingleFilterFactory and turn it loose? Best Erick On Thu, Nov 11, 2010 at 4:02 PM, Lukas Kahwe Smith m...@pooteeweet.orgwrote: Hi, I am using a facet.prefix search with shingle's in my autosuggest: fieldType name=shingle class=solr.TextField positionIncrementGap=100 stored=false multiValued=true analyzer tokenizer class=solr.StandardTokenizerFactory / filter class=solr.LowerCaseFilterFactory / filter class=solr.RemoveDuplicatesTokenFilterFactory/ filter class=solr.ShingleFilterFactory maxShingleSize=3 outputUnigrams=true outputUnigramIfNoNgram=false / /analyzer /fieldType Now I would like to prevent stop words to appear in the suggestions: lst name=autosuggest_shingle int name=member states52/int int name=member states experiencing6/int int name=member states in6/int int name=member states the5/int int name=member states to25/int int name=member states with7/int /lst Here I would like to filter out the last 4 suggestions really. Is there a way I can sensibly bring in a stop word filter here? Actually in theory the stop words could appear as the first or second word as well. So I guess when producing shingle's I want to skip any stop word from being part of any shingle. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: using CJKTokenizerFactory for Japanese language
(10/11/12 1:49), Kumar Pandey wrote: I am exploring support for the Japanese language in Solr. Solr seems to provide CJKTokenizerFactory. How useful is this module? Has anyone been using this in production for the Japanese language? CJKTokenizer is used in a lot of places in Japan. One shortfall it seems to have, from what I have been able to read up on, is that it can generate a lot of false matches, for example matching "kyoto" when searching for "tokyo", etc. Yep, it is a well-known problem. I did not see many questions related to this module, so I wonder if people are actively using it. If not, are there any other solutions on the market that are recommended by Solr users? You may want to look at morphological analyzers. There are some of them in Japan. Search for MeCab, Sen, or GoSen on Google. Or in Lucene, there is a patch for a morphological-taste analyzer: https://issues.apache.org/jira/browse/LUCENE-2522 Koji -- http://www.rondhuit.com/en/
Re: facet+shingle in autosuggest
On 11.11.2010, at 17:42, Erick Erickson wrote: I don't know all the implications here, but can't you just insert the StopwordFilterFactory before the ShingleFilterFactory and turn it loose? I haven't tried this, but I would suspect that I would then get in trouble with phrases like "united states of america": it would then generate a shingle "united states america", which in turn wouldn't generate a proper phrase search string. One option of course would be to restrict the shingles to 2 words; then using the stop word filter would work as expected. regards, Lukas Kahwe Smith m...@pooteeweet.org
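Lukas's concern can be illustrated outside Solr. The following is a toy sketch in plain Java (not Lucene's ShingleFilter) of what happens when stop words are removed before shingling; the stopword set here is an assumption for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Toy illustration: removing stop words *before* building 3-word shingles
// produces a phrase like "united states america" that never occurs in the
// original text, so it cannot be used as a phrase query later.
public class ShingleDemo {
    static final Set<String> STOPWORDS = Set.of("of", "the", "in", "to", "with");

    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(Arrays.asList("united", "states", "of", "america"));
        tokens.removeIf(STOPWORDS::contains); // stop word removal first...
        System.out.println(shingles(tokens, 3)); // ...yields [united states america]
    }
}
```

Restricting maxShingleSize to 2, as Lukas suggests, avoids the problem because no shingle can bridge a removed middle word and also contain a third term.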
Re: EdgeNGram relevancy
Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Re: EdgeNGram relevancy
Without the parens, the edgytext: only applied to "Mr"; the default field still applied to "Scorsese". The double quotes are necessary in the second case (rather than parens) because, on a non-tokenized field, the standard query parser would otherwise pre-tokenize on whitespace before sending the individual whitespace-separated words to match against the index. If the index includes multi-word tokens with internal whitespace, those would never match. With the double quotes, the standard query parser doesn't pre-tokenize like this; it passes the whole phrase to the index intact. Robert Gründler wrote: Did you run your query without using the () and "" operators? If yes, can you try this? q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0 I didn't use () and "" in my query before. Using the query with those operators works now; stopwords are thrown out as they should be, thanks. However, I don't understand how the () and "" operators affect the StopWordFilter. Could you give a brief explanation for the above example? thanks! -robert
Re: WELCOME to solr-user@lucene.apache.org
Hi, If you are looking for query time boosting on title field you can do the following: /select?q=title:android^10 Also unless you have a very good reason to use string for date data (in your case pubdate and reldate), you should be using solr.DateField. regards, Ram On Fri, Nov 12, 2010 at 3:41 AM, Ahmet Arslan iori...@yahoo.com wrote: There are several mistakes in your approach: copyField just copies data. Index time boost is not copied. There is no such boosting syntax. /select?q=Eachtitle^9fl=score You are searching on your default field. This is not your cause of your problem but omitNorms=true disables index time boosts. http://wiki.apache.org/solr/DisMaxQParserPlugin can satisfy your need. --- On Thu, 11/11/10, Solr User solr...@gmail.com wrote: From: Solr User solr...@gmail.com Subject: Re: WELCOME to solr-user@lucene.apache.org To: solr-user@lucene.apache.org Date: Thursday, November 11, 2010, 11:54 PM Eric, Thank you so much for the reply and apologize for not providing all the details. 
The following are the field definitions in my schema.xml:

<field name="title" type="string" indexed="true" stored="true" omitNorms="false" />
<field name="author" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="authortype" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="isbn13" type="string" indexed="true" stored="true" />
<field name="isbn10" type="string" indexed="true" stored="true" />
<field name="material" type="string" indexed="true" stored="true" />
<field name="pubdate" type="string" indexed="true" stored="true" />
<field name="pubyear" type="string" indexed="true" stored="true" />
<field name="reldate" type="string" indexed="false" stored="true" />
<field name="format" type="string" indexed="true" stored="true" />
<field name="pages" type="string" indexed="false" stored="true" />
<field name="desc" type="string" indexed="true" stored="true" />
<field name="series" type="string" indexed="true" stored="true" />
<field name="season" type="string" indexed="true" stored="true" />
<field name="imprint" type="string" indexed="true" stored="true" />
<field name="bisacsub" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="bisacstatus" type="string" indexed="false" stored="true" />
<field name="category" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="award" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" />
<field name="age" type="string" indexed="true" stored="true" />
<field name="reading" type="string" indexed="true" stored="true" />
<field name="grade" type="string" indexed="true" stored="true" />
<field name="path" type="string" indexed="false" stored="true" />
<field name="shortdesc" type="string" indexed="true" stored="true" />
<field name="subtitle" type="string" indexed="true" stored="true" omitNorms="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="searchFields" type="textSpell" indexed="true" stored="true" multiValued="true" omitNorms="true"/>

Copy Fields:

<copyField source="title" dest="searchFields"/>
<copyField source="author" dest="searchFields"/>
<copyField source="isbn13" dest="searchFields"/>
<copyField source="isbn10" dest="searchFields"/>
<copyField source="format" dest="searchFields"/>
<copyField source="series" dest="searchFields"/>
<copyField source="season" dest="searchFields"/>
<copyField source="imprint" dest="searchFields"/>
<copyField source="bisacsub" dest="searchFields"/>
<copyField source="category" dest="searchFields"/>
<copyField source="award" dest="searchFields"/>
<copyField source="shortdesc" dest="searchFields"/>
<copyField source="desc" dest="searchFields"/>
<copyField source="subtitle" dest="searchFields"/>

<defaultSearchField>searchFields</defaultSearchField>

Before creating the indexes I feed an XML file to the Solr job to create the index files. I added a boost attribute to the title field before creating the indexes; an example is below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><add><doc>
<field name="material">1785440</field>
<field boost="10.0" name="title">Each Little Bird That Sings</field>
<field name="price">16.0</field>
<field name="isbn10">0152051139</field>
<field name="isbn13">9780152051136</field>
<field name="format">Hardcover</field>
<field name="pubdate">2005-03-01</field>
<field name="pubyear">2005</field>
<field name="reldate">2005-02-22</field>
<field name="pages">272</field>
<field name="bisacstatus">Active</field>
<field name="season">Spring 2005</field>
<field name="imprint">Children's</field>
<field name="age">8.0-12.0</field>
<field name="grade">3-6</field>
<field name="author">Marla Frazee</field>
<field name="authortype">Jacket Illustrator</field>
<field name="author">Deborah Wiles</field>
<field name="authortype">Author</field>
<field name="bisacsub">Social Issues/Friendship</field>
<field name="bisacsub">Social Issues/General (see also headings under Family)</field>
<field name="bisacsub">General</field>
<field name="bisacsub">Girls Women</field>
<field name="category">Fiction/Middle Grade</field>
<field name="category">Fiction/Award Winners</field>
<field name="category">Coming of Age</field>
<field name="category">Social Situations/Death Dying</field>
<field
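Following Ram's suggestion, the two date fields could be switched from string to solr.DateField — a sketch (the "date" fieldType shown is the one declared in the example schema.xml shipped with Solr 1.4; note that DateField values must be full ISO-8601 timestamps, so a value like 2005-03-01 would need to be fed as 2005-03-01T00:00:00Z):

```xml
<!-- fieldType as declared in the example schema.xml shipped with Solr 1.4 -->
<fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

<!-- the two date fields from the schema above, retyped -->
<field name="pubdate" type="date" indexed="true" stored="true" />
<field name="reldate" type="date" indexed="false" stored="true" />
```

With a real date type, range queries like pubdate:[2005-01-01T00:00:00Z TO 2005-12-31T23:59:59Z] sort and filter chronologically instead of lexically.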
Best practices to rebuild index on live system
Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database. The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
Re: Best practices to rebuild index on live system
You can do a similar thing to your case #1 with Solr replication, handling a lot of the details for you instead of you manually switching cores and such. Index to a new core, then tell your production solr to be a slave replicating from that master new core. It still may have some of the same downsides as your scenario #1, it's essentially the same thing, but with Solr replication taking care of some of the nuts and bolts for you. I haven't heard of any better solutions. In general, Solr seems not really so great at use cases where the index changes frequently in response to user actions, it doesn't seem to really have been designed that way. You could store all your user-created data in an external store (rdbms or no-sql), as well as indexing it, and then when you rebuild the index you can get it all from there, so you won't lose any. It seems to often work best, getting along with Solr's assumptions, to avoid considering a Solr index ever the canonical storage location of any data -- Solr isn't really designed to be storage, it's designed to be an index. Always have the canonical storage location of any data being some actual store, with Solr just being an index. That approach tends to make it easier to work out things like this, although there can still be some tricks. (Like, after you're done building your new index, but before you replicate it to production, you might have to check the actual canonical store for any data that changed in between the time you started your re-index and now -- and then re-index that. And then any data that changed between the time your second re-index began and... this could go on forever. ) Robert Gründler wrote: Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database.
The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
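Jonathan's replicate-from-a-build-core idea might look roughly like this in solrconfig.xml (a sketch only: the host name, core name, and poll interval are placeholders; the master section goes on the rebuild core, the slave section on the production core):

```xml
<!-- on the rebuild/master core (e.g. A1) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- publish a new index version after optimize -->
    <str name="replicateAfter">optimize</str>
  </lst>
</requestHandler>

<!-- on the production/slave core (e.g. A) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://buildhost:8983/solr/coreA1/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```

Once the build finishes, the slave pulls the new index on its next poll, so production never serves a half-built index.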
Re: Best practices to rebuild index on live system
If by corrupt index you mean an index that's just not quite up to date, could you do a delta import? In other words, how do you make your Solr index reflect changes to the DB even without a schema change? Could you extend that method to handle your use case? So the scenario is something like this: record the time, rebuild the index, import all changes since you recorded the original time, then switch cores or replicate. Best Erick 2010/11/11 Robert Gründler rob...@dubture.com Hi again, we're coming closer to the rollout of our newly created solr/lucene based search, and i'm wondering how people handle changes to their schema on live systems. In our case, we have 3 cores (ie. A,B,C), where the largest one takes about 1.5 hours for a full dataimport from the relational database. The Index is being updated in realtime, through post insert/update/delete events in our ORM. So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. Does Solr provide any built-in approaches to this problem? best -robert
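The record-the-time-then-import-changes step Erick describes is what DIH's delta-import automates. A sketch of the relevant entity in data-config.xml (the table and column names here are made up for illustration; ${dataimporter.last_index_time} is tracked by DIH itself in dataimport.properties):

```xml
<entity name="book"
        query="SELECT id, title FROM book"
        deltaQuery="SELECT id FROM book
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT id, title FROM book
                          WHERE id = '${dataimporter.delta.id}'">
</entity>
```

After the full rebuild on the new cores, a /dataimport?command=delta-import picks up whatever changed in the database during the 1.5-hour build window.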
Re: Spatial search in Solr 1.5
I just upgraded to a later version of the trunk and noticed my geofilter queries stopped working, apparently because the sfilt function was renamed to geofilt. I realize trunk is not stable, but other than looking at every change, is there an easy way to find changes that are not backward compatible so developers know what they need to update when upgrading? Thanks, Scott On Tue, Oct 12, 2010 at 17:42, Yonik Seeley yo...@lucidimagination.com wrote: On Tue, Oct 12, 2010 at 8:07 PM, PeterKerk vettepa...@hotmail.com wrote: Ok, so does this actually say: for now you have to do calculations based on bounding box instead of great circle? I tried to make the documentation a little simpler... there's - geofilt... filters within a radius of d km (i.e. great circle distance) - bbox... filters using a bounding box - geodist... function query that yields the distance (again, great circle distance) If you point out the part of the docs you found confusing, I can try and improve it. Did you try and step through the quick start? Those links actually work! And the fact that on top of the page it says Solr4.0, does that imply I can't use this right now? Or where could I find the latest trunk for this? The wiki says If you haven't already, get a recent nightly build of Solr4.0... and links to the Solr4.0 page, which points to http://wiki.apache.org/solr/FrontPage#solr_development for nightly builds. -Yonik http://www.lucidimagination.com
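For reference, the renamed filter is invoked like this on trunk (a sketch: the field name store and the point/radius values are placeholders, and sfield must be a spatial field type):

```
fq={!geofilt sfield=store pt=45.15,-93.85 d=5}
fq={!bbox sfield=store pt=45.15,-93.85 d=5}
```

Queries still using {!sfilt ...} need only the parser name changed; the parameters are the same.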
Re: index just new articles from rss feeds - Data Import Request Handler
On Thu, Nov 11, 2010 at 8:21 AM, Matteo Moci mox...@gmail.com wrote: Hello, I'd like to use solr to index some documents coming from an rss feed, like the example at [1], but it seems that the configuration used there is just for a one-time indexing, trying to get all the articles exposed in the rss feed of the website. Is it possible to manage and index just the new articles coming from the rss source? Each item in an RSS feed has a publishing date which you can use to ingest only the new articles. I found that maybe the delta-import can be useful but, from what I understand, the delta-import is used to just update the index with contents of documents that have been modified since the last indexing: this is obviously useful, but I'd like to index just the new articles coming from an rss feed. Is it something managed automatically by solr or I have to deal with it in a separate way? Maybe a full import with clean=false parameters? Are there any solutions that you would suggest? Maybe storing the article feeds in a table like [2] and have a module that periodically sends each row to solr for indexing it? The RSS import example is more of a proof-of-concept that it can be done, it may not be the best way to do it though. Storing the article feeds in a table is essential if you have multiple ones. You can use a parent entity for the table and a child entity to make the actual http calls to the RSS. Be sure to use onError=continue so that a bad RSS feed does not stop the whole process. It will probably work fine for a handful of feeds but if you are looking to develop a large feed ingestion system, I'd suggest looking into alternate methods. -- Regards, Shalin Shekhar Mangar.
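The parent/child layout with onError="continue" that Shalin describes could look something like this in data-config.xml (a sketch: the feeds table, its url column, and the JDBC connection details are all hypothetical):

```xml
<dataConfig>
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/feeds" />
  <dataSource name="web" type="HttpDataSource" />
  <document>
    <!-- parent: one row per registered feed -->
    <entity name="feed" dataSource="db" query="SELECT url FROM feeds">
      <!-- child: fetch and parse each feed; onError='continue' skips a bad
           feed instead of aborting the whole import -->
      <entity name="item" dataSource="web"
              processor="XPathEntityProcessor"
              url="${feed.url}"
              forEach="/rss/channel/item"
              onError="continue">
        <field column="title"   xpath="/rss/channel/item/title" />
        <field column="link"    xpath="/rss/channel/item/link" />
        <field column="pubDate" xpath="/rss/channel/item/pubDate" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Filtering on pubDate (to ingest only new articles, as suggested above) would then happen in a transformer or via clean=false on the import command.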
Re: Boosting
On Thu, Nov 11, 2010 at 10:35 AM, Solr User solr...@gmail.com wrote: Hi, I have a question about boosting. I have the following fields in my schema.xml: 1. title 2. description 3. ISBN etc I want to boost the field title. I tried index time boosting but it did not work. I also tried Query time boosting but with no luck. Can someone help me on how to implement boosting on a specific field like title? If you use index time boosting, you have to restart Solr and re-index the documents after making the change to the schema.xml. For debugging problems with query-time boosting, append debugQuery=on as a request parameter to see the parsed query and scoring information. -- Regards, Shalin Shekhar Mangar.
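Putting the two suggestions together — a query-time boost plus debugQuery=on to inspect the scoring — the request URL can be assembled like this (a sketch: the host, core path, and field name are assumptions, not from the original thread):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class BoostQuery {
    // Build a Solr select URL with a query-time boost on the title field
    // and debug output enabled; the boost marker '^' must be URL-encoded.
    static String buildUrl(String host, String term, int boost) {
        String q = URLEncoder.encode("title:" + term + "^" + boost, StandardCharsets.UTF_8);
        return host + "/select?q=" + q + "&debugQuery=on";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr", "android", 10));
        // -> http://localhost:8983/solr/select?q=title%3Aandroid%5E10&debugQuery=on
    }
}
```

The debugQuery section of the response then shows the parsed query and a per-document score explanation, which makes it obvious whether the boost was applied at all.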
Link to download solr4.0 is not working?
Hello, Does anyone know where to download solr4.0 source? I tried downloading from this page: http://wiki.apache.org/solr/FrontPage#solr_development but the link is not working... Best, Deche
importing from java
Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request is made to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj, but solrj is separate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
Re: Rollback can't be done after committing?
Hi, Kouta: No data store supports rollback AFTER commit; rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit(); }catch(Exception e){ if (server != null) { server.rollback(); } } I expected that if any Exception is thrown, the rollback runs, so no data would be updated. But once committed, rollback does not work. Does rollback only take effect when the commit has not yet run? Is Solr and SolrJ's rollback system not the same as an RDB's rollback?
Re: Rollback can't be done after committing?
In some cases you can rollback to a named checkpoint. I am not too sure but I think I read in the lucene documentation that it supported named checkpointing. On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang gengshaogu...@ceopen.cnwrote: Hi, Kouta: Any data store does not support rollback AFTER commit, rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit; }catch(Exception e){ if (server != null) { server.rollback();} } I wonder if any Exception thrown, rollback process is run. so all data would not be updated. but once commited, rollback would not be well done. rollback correctly will be done only when commit process will not? Solr and SolrJ's rollback system is not the same as any RDB's rollback?
A Newbie Question
Hi, Pardon me if this sounds very elementary, but I have a very basic question regarding Solr search. I have about 10 storage devices running Solaris with hundreds of thousands of text files (there are other files, as well, but my target is these text files). The directories on the Solaris boxes are exported and are available as NFS mounts. I have installed Solr 1.4 on a Linux box and have tested the installation, using curl to post documents. However, the manual says that curl is not the recommended way of posting documents to Solr. Could someone please tell me what is the preferred approach in such an environment? I am not a programmer and would appreciate some hand-holding here :o) Thanks in advance, Sesh
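If you end up scripting the NFS mounts yourself, the heart of any poster is turning a file's text into Solr's update-XML before sending it to /update. A minimal, Solr-agnostic sketch (the id and text field names are assumptions borrowed from the example schema; the HTTP POST itself is omitted):

```java
public class UpdateXml {
    // Escape the three characters that are unsafe in XML text content;
    // '&' must be replaced first so earlier escapes aren't double-escaped.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap an id and a body into a single-document <add> request.
    static String toAddXml(String id, String body) {
        return "<add><doc>"
             + "<field name=\"id\">" + escape(id) + "</field>"
             + "<field name=\"text\">" + escape(body) + "</field>"
             + "</doc></add>";
    }

    public static void main(String[] args) {
        System.out.println(toAddXml("doc1", "fish & chips"));
    }
}
```

That said, for hundreds of thousands of files a SolrJ client (which builds and streams these documents for you) is the usual recommendation over per-file curl calls.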
Re: importing from java
another question is, can I write my own DataImportHandler class? thanks, Tri From: Tri Nguyen tringuye...@yahoo.com To: solr user solr-user@lucene.apache.org Sent: Thu, November 11, 2010 7:01:25 PM Subject: importing from java Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj and but solrj is seaparate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
RE: importing from java
http://wiki.apache.org/solr/DIHQuickStart http://wiki.apache.org/solr/DataImportHandlerFaq http://wiki.apache.org/solr/DataImportHandler -Original Message- From: Tri Nguyen [mailto:tringuye...@yahoo.com] Sent: Thursday, November 11, 2010 9:34 PM To: solr-user@lucene.apache.org Subject: Re: importing from java another question is, can I write my own DataImportHandler class? thanks, Tri From: Tri Nguyen tringuye...@yahoo.com To: solr user solr-user@lucene.apache.org Sent: Thu, November 11, 2010 7:01:25 PM Subject: importing from java Hi, I'm restricted to the following in regards to importing. I have access to a list (Iterator) of Java objects I need to import into solr. Can I import the java objects as part of solr's data import interface (whenever an http request to solr to do a dataimport, it'll call my java class to get objects)? Before I had direct read only access to the db and specified the column mappings and things were fine with the data import. But now I am restricted to using a .jar file that has an api to get the records in the database and I need to publish these records in the db. I do see solrj and but solrj is seaparate from the solr webapp. Can I write my own dataimporthandler? Thanks, Tri
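Whichever route Tri takes — SolrJ's add calls or a custom DataImportHandler EntityProcessor built from the wiki pages above — the Iterator of Java objects is usually drained in fixed-size batches rather than one document per request. A Solr-agnostic sketch of that batching step (the batch size is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class Batches {
    // Split an iterator into consecutive batches of at most `size` elements,
    // so each batch can become one update request to Solr.
    static <T> List<List<T>> toBatches(Iterator<T> it, int size) {
        List<List<T>> out = new ArrayList<>();
        List<T> current = new ArrayList<>();
        while (it.hasNext()) {
            current.add(it.next());
            if (current.size() == size) {
                out.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> b = toBatches(Arrays.asList(1, 2, 3, 4, 5).iterator(), 2);
        System.out.println(b); // [[1, 2], [3, 4], [5]]
    }
}
```

Each batch would then be passed to something like SolrJ's addBeans, with a commit after the final batch.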
Re: Rollback can't be done after committing?
Oh, Pradeep: I don't think Lucene is an advanced storage app that supports rollback to a historical checkpoint (that would be supported only in a distributed system, such as with two-phase commit or transactional web services) yours On Friday, November 12, 2010 11:25:45 am Pradeep Singh wrote: In some cases you can rollback to a named checkpoint. I am not too sure but I think I read in the lucene documentation that it supported named checkpointing. On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang gengshaogu...@ceopen.cnwrote: Hi, Kouta: Any data store does not support rollback AFTER commit, rollback works only BEFORE. On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote: Hi, all I have a question about Solr and SolrJ's rollback. I try to rollback like below try{ server.addBean(dto); server.commit; }catch(Exception e){ if (server != null) { server.rollback();} } I wonder if any Exception thrown, rollback process is run. so all data would not be updated. but once commited, rollback would not be well done. rollback correctly will be done only when commit process will not? Solr and SolrJ's rollback system is not the same as any RDB's rollback?
Looking for help with Solr implementation
Hi, Not sure if this is the correct place to post but I'm looking for someone to help finish a Solr install on our LAMP based website. This would be a paid project. The programmer that started the project got too busy with his full-time job to finish the project. Solr has been installed and a basic search is working but we need to configure it to work across the site and also set-up faceted search. I tried posting on some popular freelance sites but haven't been able to find anyone with real Solr expertise / experience. If you think you can help me with this project please let me know and I can supply more details. Regards, Abe
Re: Best practices to rebuild index on live system
On 11/11/2010 4:45 PM, Robert Gründler wrote: So far, i can only think of 2 scenarios for rebuilding the index, if we need to update the schema after the rollout: 1. Create 3 more cores (A1,B1,C1) - Import the data from the database - After importing, switch the application to cores A1, B1, C1 This will most likely cause a corrupt index, as in the 1.5 hours of indexing, the database might get inserts/updates/deletes. 2. Put the Livesystem in a Read-Only mode and rebuild the index during that time. This will ensure data integrity in the index, with the drawback for users not being able to write to the app. I can tell you how we handle this. The actual build system is more complicated than I have mentioned here, involving replication and error handling, but this is the basic idea. This isn't the only possible approach, but it does work. I have 6 main static shards and one incremental shard, each on their own machine (Xen VM, actually). Data is distributed by taking the Did value (primary key in the database) and doing a mod 6 on it; the resulting value is the static shard number. The system tracks two values at all times - minDid and maxDid. The static shards have Did values <= minDid. The incremental has Did values > minDid and <= maxDid. Once an hour, I write the current Did value to an RRD. Once a day, I use that RRD to figure out the Did value corresponding to one week ago. All documents with Did > minDid and <= newMinDid are delta-imported into the static indexes and deleted from the incremental index, and minDid is updated. When it comes time to rebuild, I first rebuild the static indexes in a core named build which takes 5-6 hours. When that's done, I rebuild the incremental in its build core, which only takes about 10 minutes. Then on all the machines, I swap the build and live cores. While all the static builds are happening, the incremental continues to get new content, until it too is rebuilt. Shawn
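Shawn's routing rule is easy to sketch: the static shard a document eventually lives in is Did mod 6, and a document is served from the incremental shard while its Did is still above minDid. A minimal illustration (the numbers are made up for the example):

```java
public class ShardRouting {
    static final int STATIC_SHARDS = 6;

    // Static shard a document will eventually live in: Did mod 6.
    static int shardFor(long did) {
        return (int) (did % STATIC_SHARDS);
    }

    // True while the document should still be served from the incremental
    // shard, i.e. its Did lies in the (minDid, maxDid] window.
    static boolean inIncremental(long did, long minDid, long maxDid) {
        return did > minDid && did <= maxDid;
    }

    public static void main(String[] args) {
        System.out.println(shardFor(1785440));                        // 2
        System.out.println(inIncremental(1785440, 1700000, 1800000)); // true
    }
}
```

The daily maintenance job then just moves the window: raise minDid, delta-import the documents that fell out of the window into their static shards, and delete them from the incremental index.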
Re: Link to download solr4.0 is not working?
On 11/11/2010 7:44 PM, Deche Pangestu wrote: Hello, Does anyone know where to download solr4.0 source? I tried downloading from this page: http://wiki.apache.org/solr/FrontPage#solr_development but the link is not working... Your best bet is to use svn. http://lucene.apache.org/solr/version_control.html For Solr 4.0, you need to check out trunk: http://svn.apache.org/repos/asf/lucene/dev/trunk For Solr 3.1, you'd use branch_3x: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x Shawn