Re: CollapseFilter with the latest Solr in trunk
Ok, here is how I fixed this problem:

public DocListAndSet getDocListAndSet(Query query, List<Query> filterList, DocSet docSet,
                                      Sort lsort, int offset, int len, int flags) throws IOException {
    // DocListAndSet ret = new DocListAndSet();
    // getDocListC(ret, query, filterList, docSet, lsort, offset, len, flags |= GET_DOCSET);
    DocSet theFilt = getDocSet(filterList);
    if (docSet != null) theFilt = (theFilt != null) ? theFilt.intersection(docSet) : docSet;
    QueryCommand qc = new QueryCommand();
    qc.setQuery(query).setFilter(theFilt);
    qc.setSort(lsort).setOffset(offset).setLen(len).setFlags(flags |= GET_DOCSET);
    QueryResult result = new QueryResult();
    getDocListC(result, qc);
    return result.getDocListAndSet();
}

There is also an off-by-one error in CollapseFilter; you can find the fix on JIRA. Cheers, Cuong

On Sat, Apr 18, 2009 at 4:41 AM, Jeff Newburn jnewb...@zappos.com wrote: We are currently trying to do the same thing. With the patch unaltered we can use fq as long as collapsing is turned on. If we just send a normal document-level query with an fq parameter it blows up. Additionally, it does not appear that the collapse.facet option works at all. -- Jeff Newburn Software Engineer, Zappos.com jnewb...@zappos.com - 702-943-7562

From: climbingrose climbingr...@gmail.com Reply-To: solr-user@lucene.apache.org Date: Fri, 17 Apr 2009 16:53:00 +1000 To: solr-user solr-user@lucene.apache.org Subject: CollapseFilter with the latest Solr in trunk

Hi all, has anyone tried to use CollapseFilter with the latest version of Solr in trunk? It looks like Solr 1.4 doesn't allow calling setFilterList() and setFilter() on one instance of the QueryCommand. I modified the code in QueryCommand to allow this:

public QueryCommand setFilterList(Query f) {
    // if (filter != null) {
    //     throw new IllegalArgumentException("Either filter or filterList may be set in the QueryCommand, but not both.");
    // }
    filterList = null;
    if (f != null) {
        filterList = new ArrayList<Query>(2);
        filterList.add(f);
    }
    return this;
}

However, I still have a problem which prevents query filters from working when used in conjunction with CollapseFilter. In other words, query filters don't seem to have any effect on the result set when CollapseFilter is used.
The other problem is related to OpenBitSet:

java.lang.ArrayIndexOutOfBoundsException: 2183
        at org.apache.lucene.util.OpenBitSet.fastSet(OpenBitSet.java:242)
        at org.apache.solr.search.CollapseFilter.addDoc(CollapseFilter.java:202)
        at org.apache.solr.search.CollapseFilter.adjacentCollapse(CollapseFilter.java:161)
        at org.apache.solr.search.CollapseFilter.<init>(CollapseFilter.java:141)
        at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:217)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1333)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
        at java.lang.Thread.run(Thread.java:619)

I think CollapseFilter is rather an important function in Solr that gets used quite frequently. Does anyone have a solution for this? -- Regards, Cuong Hoang
Best way to return ExternalFileField in the results
Hi all, I've been trying to return a field of type ExternalFileField in the search result. Upon examining the XMLWriter class, it seems like Solr can't do this out of the box. Therefore, I've tried to hack Solr to enable this behaviour. The goal is to call ExternalFileField.getValueSource(SchemaField field, QParser parser) in the XMLWriter.writeDoc(String name, Document document, ...) method. There are two issues with doing this: 1) I need to create an instance of QParser in the writeDoc method. What is the best way to do this? What kind of overhead does creating a new QParser for every returned document incur? 2) I have to modify the writeDoc method to include the internal Lucene document id, because I need it to retrieve the ExternalFileField value: fileField.getValueSource(schemaField, qparser).getValues(request.getSearcher().getIndexReader()).floatVal(docId) The immediate effect is that it breaks the writeVal() method (because that method references writeDoc()). Any comments? Thanks in advance. -- Regards, Cuong Hoang
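For illustration, a minimal sketch of the lookup being described, assuming a SolrQueryRequest (req) is in scope, a field named rankValue declared as ExternalFileField in schema.xml (the name is a placeholder), and an internal Lucene docId already in hand. How best to obtain the QParser is exactly the open question of the post, so the getParser() call below is an assumption, not the established answer:

SchemaField schemaField = req.getSchema().getField("rankValue");
ExternalFileField fileField = (ExternalFileField) schemaField.getType();
// assumption: build a throwaway QParser from the request; its per-document cost is the overhead being asked about
QParser parser = QParser.getParser(null, QParserPlugin.DEFAULT_QTYPE, req);
float value = fileField.getValueSource(schemaField, parser)
        .getValues(req.getSearcher().getIndexReader())
        .floatVal(docId);

Caching the DocValues returned by getValues() once per request, rather than recreating it per document, would avoid most of that overhead.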
Re: Document rating/popularity and scoring
Hi Yonik, I have had a look at ExternalFileField. However, I couldn't figure out how to include the externally referenced field in the search results. Also, sorting on this type of field isn't possible, right? Thanks. On Sat, Jul 12, 2008 at 2:28 AM, climbingrose [EMAIL PROTECTED] wrote: Thanks Yonik. I will try it out. Btw, what cache should we use for multivalued, untokenised fields with a large number of terms? Faceted search on these fields seems to be noticeably slower even if I have allocated enough filterCache. There seems to be a lot of cache lookups for each query. On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley [EMAIL PROTECTED] wrote: See ExternalFileField and BoostedQuery -Yonik On Fri, Jul 11, 2008 at 11:47 AM, climbingrose [EMAIL PROTECTED] wrote: Hi all, Has anyone tried to factor rating/popularity into Solr scoring? For example, I want documents with more page views to be ranked higher in the search results. From what I can see, the most difficult thing is that we have to update the number of page views for each document. With SOLR-139, a document can be updated at field level. However, it still has to retrieve the document and then do a reindex. With high-traffic sites, the overhead might be too high. I'm thinking of using a relational database to track page views/ratings and then do a daily sync with Solr. Is there a way for Solr to retrieve data from external sources (a database server) and use the data for determining document ranking? Thanks. -- Regards, Cuong Hoang -- Regards, Cuong Hoang
Document rating/popularity and scoring
Hi all, Has anyone tried to factor rating/popularity into Solr scoring? For example, I want documents with more page views to be ranked higher in the search results. From what I can see, the most difficult thing is that we have to update the number of page views for each document. With SOLR-139, a document can be updated at field level. However, it still has to retrieve the document and then do a reindex. With high-traffic sites, the overhead might be too high. I'm thinking of using a relational database to track page views/ratings and then do a daily sync with Solr. Is there a way for Solr to retrieve data from external sources (a database server) and use the data for determining document ranking? Thanks. -- Regards, Cuong Hoang
Re: Document rating/popularity and scoring
Thanks Yonik. I will try it out. Btw, what cache should we use for multivalued, untokenised fields with a large number of terms? Faceted search on these fields seems to be noticeably slower even if I have allocated enough filterCache. There seems to be a lot of cache lookups for each query. On Sat, Jul 12, 2008 at 1:58 AM, Yonik Seeley [EMAIL PROTECTED] wrote: See ExternalFileField and BoostedQuery -Yonik On Fri, Jul 11, 2008 at 11:47 AM, climbingrose [EMAIL PROTECTED] wrote: Hi all, Has anyone tried to factor rating/popularity into Solr scoring? For example, I want documents with more page views to be ranked higher in the search results. From what I can see, the most difficult thing is that we have to update the number of page views for each document. With SOLR-139, a document can be updated at field level. However, it still has to retrieve the document and then do a reindex. With high-traffic sites, the overhead might be too high. I'm thinking of using a relational database to track page views/ratings and then do a daily sync with Solr. Is there a way for Solr to retrieve data from external sources (a database server) and use the data for determining document ranking? Thanks. -- Regards, Cuong Hoang -- Regards, Cuong Hoang
Re: Do I need Searcher on indexing machine
You do, I think. Have a look at the DirectUpdateHandler2 class. On Thu, Jul 10, 2008 at 9:16 PM, Gudata [EMAIL PROTECTED] wrote: Hi, I want (if possible) to dedicate one machine only to indexing and to have it optimized only for that. In solrconfig.xml, I have: - commented out all cache statements - set it to use cold searchers - set 1 In the log files I see this all the time: INFO: Registered new searcher [EMAIL PROTECTED] main Jul 10, 2008 12:49:59 PM org.apache.solr.search.SolrIndexSearcher close INFO: Closing [EMAIL PROTECTED] main Why is Solr registering a new searcher all the time? Is this overhead, and if yes, how do I stop it? -- View this message in context: http://www.nabble.com/Do-I-need-Searcher-on-indexing-machine-tp18380669p18380669.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Cuong Hoang
Re: Limit Porter stemmer to plural stemming only?
Attached is the modified Snowball source code for a plural-only English stemmer. You need to compile it to Java using the instructions here: http://snowball.tartarus.org/runtime/use.html. Essentially, you need to: 1) Download the Snowball code (compiler, algorithms, and libstemmer library) from http://snowball.tartarus.org/dist/snowball_code.tgz and compile the Snowball compiler itself using this command: gcc -O -o snowball compiler/*.c 2) Compile the attached file to Java: ./snowball stem_ISO_8859_1.sbl -java -o EnglishStemmer -name EnglishStemmer You can change EnglishStemmer to whatever you like, for example, PluralEnglishStemmer. After that, you need to modify the generated Java class so that it references the appropriate classes in the net.sf.snowball.* package instead of the ones from the Snowball website. I think the only 2 classes you need to import are Among and SnowballProgram. Once you have the new stemmer ready, write something similar to EnglishPorterFilterFactory to use it within Solr. Hope this helps. Cheers, Cuong On Tue, Jul 1, 2008 at 6:07 PM, Guillaume Smet [EMAIL PROTECTED] wrote: Hi Cuong, On Tue, Jul 1, 2008 at 4:45 AM, climbingrose [EMAIL PROTECTED] wrote: I modified the original English Stemmer written in the Snowball language and regenerated the Java implementation using the Snowball compiler. It's been working for me so far. I can certainly share the modified Snowball English Stemmer if anyone wants to use it. Yeah, it would be nice. A step-by-step explanation of how to regenerate the Java files would be nice too (or a pointer to such documentation if you found one). Thanks, -- Guillaume
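For the last step, a hedged sketch of the Solr glue class, assuming the generated stemmer was named PluralEnglishStemmer and placed in net.sf.snowball.ext, which is where Lucene's SnowballFilter(TokenStream, String) constructor looks it up by naming convention:

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class PluralEnglishStemFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
        // loads net.sf.snowball.ext.PluralEnglishStemmer by reflection
        return new SnowballFilter(input, "PluralEnglish");
    }
}

It would then be registered in the analyzer chain in schema.xml like any other filter factory.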
Limit Porter stemmer to plural stemming only?
Hi all, the Porter stemmer in general is really good. However, there are some cases where it doesn't work well. For example, "accountant" matches "Accountant" as well as "Account Manager", which isn't desirable. Is it possible to use this analyser for plural words only? For example: +Accountant -> accountant +Accountants -> accountant +Account -> account +Accounts -> account Thanks. -- Regards, Cuong Hoang
Re: Limit Porter stemmer to plural stemming only?
Ok, it looks like step 1a in the Porter algorithm does what I need. On Mon, Jun 30, 2008 at 6:39 PM, climbingrose [EMAIL PROTECTED] wrote: Hi all, the Porter stemmer in general is really good. However, there are some cases where it doesn't work well. For example, "accountant" matches "Accountant" as well as "Account Manager", which isn't desirable. Is it possible to use this analyser for plural words only? For example: +Accountant -> accountant +Accountants -> accountant +Account -> account +Accounts -> account Thanks. -- Regards, Cuong Hoang -- Regards, Cuong Hoang
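For reference, step 1a is just four ordered suffix rules; a standalone sketch in plain Java (independent of the Snowball route discussed elsewhere in this thread):

static String step1a(String word) {
    // rules are tried in order; only the first matching suffix applies
    if (word.endsWith("sses")) return word.substring(0, word.length() - 2); // caresses -> caress
    if (word.endsWith("ies"))  return word.substring(0, word.length() - 2); // ponies   -> poni
    if (word.endsWith("ss"))   return word;                                 // caress   -> caress
    if (word.endsWith("s"))    return word.substring(0, word.length() - 1); // accounts -> account
    return word;
}

Note that "ies" maps to "i" (ponies -> poni), so queries and indexed text must both pass through the same filter for matches to line up.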
Re: Limit Porter stemmer to plural stemming only?
I modified the original English Stemmer written in the Snowball language and regenerated the Java implementation using the Snowball compiler. It's been working for me so far. I can certainly share the modified Snowball English Stemmer if anyone wants to use it. Cheers, Cuong On Tue, Jul 1, 2008 at 4:12 AM, Mike Klaas [EMAIL PROTECTED] wrote: If you find a solution that works well, I encourage you to contribute it back to Solr. Plural-only stemming is probably a common need (I've definitely wanted to use it before). cheers, -Mike On 30-Jun-08, at 2:25 AM, climbingrose wrote: Ok, it looks like step 1a in the Porter algorithm does what I need. On Mon, Jun 30, 2008 at 6:39 PM, climbingrose [EMAIL PROTECTED] wrote: Hi all, the Porter stemmer in general is really good. However, there are some cases where it doesn't work well. For example, "accountant" matches "Accountant" as well as "Account Manager", which isn't desirable. Is it possible to use this analyser for plural words only? For example: +Accountant -> accountant +Accountants -> accountant +Account -> account +Accounts -> account Thanks. -- Regards, Cuong Hoang -- Regards, Cuong Hoang -- Regards, Cuong Hoang
Re: Suggestion for short text matching using dictionary
Thanks Grant. I did try SecondString before and found that it wasn't particularly good for doing a lot of text matching. I'm leaning toward a combination of Lucene and SecondString. Googling around a bit, I came across this project: http://datamining.anu.edu.au/projects/linkage.html. Looks interesting, but the implementation is in Python. I think they use a Hidden Markov Model to label training data and then match records probabilistically. On Fri, Jun 27, 2008 at 10:12 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: below On Jun 27, 2008, at 1:18 AM, climbingrose wrote: Firstly, my apologies for being off topic. I'm asking this question because I think there are some machine learning and text processing experts on this mailing list. Basically, my task is to normalize a fairly unstructured set of short texts using a dictionary. We have a pre-defined list of products and periodically receive product feeds from various websites. Basically, our site is similar to a shopping comparison engine, but in a different domain. We would like to normalize the products' names in the feeds using our pre-defined list. For example: Nokia N95 8GB Black --> Nokia N95 8GB Black Nokia N95, 8GB + Free bluetooth headset --> Nokia N95 8GB My original idea is to index the list of pre-defined names and then query the index using the product's name. The highest-scored result will be used to normalize the product. The problem with this is that sometimes you get wrong matches because of noise. For example, "Black Nokia N95, 8GB + Free bluetooth headset" can match "Nokia Bluetooth Headset", which is desirable. I assume you mean not desirable here given the context... Your approach is worth trying. At a deeper level, you may want to look into a topic called record linkage and an open source project called Second String by William Cohen's group at Carnegie Mellon (http://secondstring.sourceforge.net/) which has a whole bunch of implementations of fuzzy string matching algorithms like Jaro-Winkler, Levenshtein, etc. that can then be used to implement what you are after. You could potentially use the spell checking functionality to simulate some of this a bit better than just a pure vector match. Index your dictionary into a spelling index (see SOLR-572) and then send in spell checking queries. In fact, you could probably integrate Second String into the spell checker pretty easily, since one can now plug the distance measure into the spell checker. You may find some help on this by searching http://lucene.markmail.org for things like record linkage or record matching or various other related terms. Another option is to write a NormalizingTokenFilter that analyzes the tokens as they come in to see if they match your dictionary list. As with all of these, there is going to be some trial and error to come up with something that hits most of the time, as it will never be perfect. Good luck, Grant -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ -- Regards, Cuong Hoang
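Grant's spell-checker idea can be prototyped in a few lines against Lucene's contrib SpellChecker before touching Solr. A hedged sketch, assuming the pre-defined product names sit one per line in products.txt; the spell checker treats each full name as a single "word" and matches on character n-grams:

import java.io.File;
import org.apache.lucene.search.spell.PlainTextDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.FSDirectory;

public class ProductNameMatcher {
    public static void main(String[] args) throws Exception {
        // builds the n-gram spelling index over the pre-defined names
        SpellChecker checker = new SpellChecker(FSDirectory.getDirectory("spellindex"));
        checker.indexDictionary(new PlainTextDictionary(new File("products.txt")));
        // top 5 dictionary entries closest to a raw feed title
        String[] matches = checker.suggestSimilar("Nokia N95, 8GB + Free bluetooth headset", 5);
        for (String m : matches) {
            System.out.println(m);
        }
    }
}

Re-ranking the top candidates with a SecondString distance, as suggested above, is then a small post-processing step.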
Re: searching only within allowed documents
It depends on your query. The second query is better if you know that the fieldb:bar filter will be reused often, since it will be cached separately from the main query. The first query occupies one cache entry, while the second one occupies two cache entries: one in queryCache and one in filterCache. Therefore, if you're not going to reuse fieldb:bar, the second query is better. On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young [EMAIL PROTECTED] wrote: Solr allows you to specify filters in separate parameters that are applied to the main query, but cached separately. q=the user query&fq=folder:f13&fq=folder:f24 I've been wanting more explanation around this for a while, so maybe now is a good time to ask :) the cached separately verbiage here is the same as in the twiki, but I don't really understand what it means. more precisely, I'm wondering what the real performance, caching, etc differences are between q=fielda:foo+fieldb:bar&mm=100% and q=fielda:foo&fq=fieldb:bar my situation is similar to the original poster's in that documents matching fielda is very large and common (say theaters across the world) while fieldb would narrow it considerably (one by country, then one by zipcode, etc). thanks --Geoff -- Regards, Cuong Hoang
Re: searching only within allowed documents
Just to correct myself: in the last sentence, I meant the first query is better if fieldb:bar isn't reused often. On Thu, Jun 12, 2008 at 2:02 PM, climbingrose [EMAIL PROTECTED] wrote: It depends on your query. The second query is better if you know that the fieldb:bar filter will be reused often, since it will be cached separately from the main query. The first query occupies one cache entry, while the second one occupies two cache entries: one in queryCache and one in filterCache. Therefore, if you're not going to reuse fieldb:bar, the second query is better. On Wed, Jun 11, 2008 at 10:53 PM, Geoffrey Young [EMAIL PROTECTED] wrote: Solr allows you to specify filters in separate parameters that are applied to the main query, but cached separately. q=the user query&fq=folder:f13&fq=folder:f24 I've been wanting more explanation around this for a while, so maybe now is a good time to ask :) the cached separately verbiage here is the same as in the twiki, but I don't really understand what it means. more precisely, I'm wondering what the real performance, caching, etc differences are between q=fielda:foo+fieldb:bar&mm=100% and q=fielda:foo&fq=fieldb:bar my situation is similar to the original poster's in that documents matching fielda is very large and common (say theaters across the world) while fieldb would narrow it considerably (one by country, then one by zipcode, etc). thanks --Geoff -- Regards, Cuong Hoang -- Regards, Cuong Hoang
Re: Multiple Schema File
Hi Sachit, I think what you could do is create all the core fields of your models, such as username, role, title, body, images... You can name them with prefixes like user.username, user.role, article.title, article.body... If you want to dynamically add more fields to your schema, you can use dynamic fields and keep a mapping between your models' properties and these fields somewhere; a sketch follows below. Have a look at the default schema.xml for examples. I used this approach in a previous project and it worked fine for me. Cheers, Cuong On Thu, Jun 5, 2008 at 3:43 PM, Sachit P. Menon [EMAIL PROTECTED] wrote: Hi folks, I have a scenario as follows: I have a CMS where I'm storing all the contents. I need to index all these contents and have search on these indexes. For indexing, I can define a schema for all the contents. Some of the properties are title, headline, body, keywords, images, etc. Now I have a user management system wherein I store all the user information. I need to index this also. This may have properties like user name, role, joining date, etc. I want to use only one Solr instance. That means I can have only one schema file. How can I define all these totally different properties in one schema file? The unique id storage for content and user management may also be different. How can I achieve this? Thanks and Regards Sachit P. Menon | Programmer Analyst | MindTree Ltd. | West Campus, Phase-1, Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA | Voice +91 80 26264000 | Extn 64872 | Fax +91 80 26264100 | Mob: +91 9986747356 | www.mindtree.com -- Regards, Cuong Hoang
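To make the prefix-plus-dynamic-field idea concrete, schema.xml entries along these lines would do it (field names and types here are illustrative, not from the original thread):

  <dynamicField name="user.*"    type="string" indexed="true" stored="true"/>
  <dynamicField name="article.*" type="text"   indexed="true" stored="true"/>

Each document can then carry a discriminator field (say, docType) so user and article records can be told apart at query time, and the single uniqueKey can be built as a prefixed value like "user-42" or "article-7" to keep the two id spaces from colliding.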
Ideas on how to implement sponsored results
Hi all, I'm trying to implement sponsored results in Solr search results, similar to those of Google. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries, in addition to the normal search results. This part is easy. However, since the number of products indexed for each site can be very different (100, 1000, 1 or 6 products), we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this creates a series of buckets of results, each bucket representing results from one site. After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I only want to consider the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results are displayed. On the other hand, I also want to keep the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong
Re: Ideas on how to implement sponsored results
Hi Alexander, Thanks for your suggestion. I think my problem is a bit different from yours. We don't have any sponsored words; we have to retrieve sponsored results directly from the index. This is because a site can have 60,000 products, which makes it hard to insert/update keywords. I can live with that by issuing a separate query to fetch sponsored results. My problem is to equally distribute sponsored results between sites, so that each site has an opportunity to show their sponsored results no matter how many products they have. For example, if site A has 60,000 products and site B has only 2000, then sponsored products from site B will have a very small chance of being displayed. On Wed, Jun 4, 2008 at 2:56 AM, Alexander Ramos Jardim [EMAIL PROTECTED] wrote: Cuong, I have implemented sponsored words for a client. I don't know if my approach can help you, but I will describe it and let you decide. I have an index containing product entries, in which I created a field called sponsored words. What I do is boost this field, so when these words are matched in the query those products appear first in my results. 2008/6/3 climbingrose [EMAIL PROTECTED]: Hi all, I'm trying to implement sponsored results in Solr search results, similar to those of Google. We index products from various sites and would like to allow certain sites to promote their products. My approach is to query a slave instance to get sponsored results for user queries, in addition to the normal search results. This part is easy. However, since the number of products indexed for each site can be very different, we need a way to fairly distribute the sponsored results among sites. My initial thought is to utilise the field collapsing patch to collapse the search results on the siteId field. You can imagine that this creates a series of buckets of results, each bucket representing results from one site. After that, 2 or 3 buckets will be selected at random, from which I will randomly select one or two results. However, since I want these sponsored results to be relevant to user queries, I only want to consider the first 30 results in each bucket. Obviously, it's desirable that if the user refreshes the page, new sponsored results are displayed. On the other hand, I also want to keep the advantages of the Solr cache. What would be the best way to implement this functionality? Thanks. Cheers, Cuong -- Alexander Ramos Jardim -- Regards, Cuong Hoang
Re: Announcement of Solr Javascript Client
Hi Matthias, How would you prevent the Solr server from being exposed to the outside world with this javascript client? I prefer running Solr behind a firewall and accessing it from server-side code. Cheers. On Mon, May 26, 2008 at 7:27 AM, Matthias Epheser [EMAIL PROTECTED] wrote: Hi users, As initially described in this thread [1], I am currently working on a javascript client library for solr. The idea is based on a demo [2] that introduces a reusable javascript widget client. I spent the last weeks evaluating the best-fitting technologies to ensure a clean, generic port of the demo into the solr project. The goal is to make it easy to use and include in webpages on the one hand, and to create a clean interface to the solr server on the other hand. With this announcement, I want to ask the community for their experience with solr and javascript, and would appreciate feedback on this proposal: - javascript toolkit: JQuery, because it is already shipped with the solr webapp - Using a manager object on the client that holds all widgets and takes care of the communication with the solr server. - Using the JSONResponseWriter to get the data to the widgets so they can update their UI. These seem to be the best-fitting technologies at the moment IMHO; any feedback/experiences welcome. Regards, matthias [1] http://www.nabble.com/-GSOC-proposal-%3A-Solr-javascript-client-library-to16422808.html#a16430329 [2] http://lovo.test.dev.indoqa.com/mepheser/moobrowser/ -- Regards, Cuong Hoang
Re: query for number of field entries in a multivalued field?
Probably the easiest way to do this is to keep track of the number of items yourself at index time and then retrieve it later on. On Wed, May 21, 2008 at 7:57 AM, Brian Whitman [EMAIL PROTECTED] wrote: Any way to query how many items are in a multivalued field? (Or use a functionquery against that # or anything?) -- Regards, Cuong Hoang
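A hedged sketch of "keeping track yourself" with SolrJ: write the count into a sibling integer field at index time (the field names tag and tagCount are made up for illustration):

import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CountedIndexer {
    // tagCount mirrors the number of values in the multivalued "tag" field,
    // so it can later be filtered on, sorted on, or fed to a function query
    public static void index(SolrServer server, String id, List<String> tags) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        for (String tag : tags) {
            doc.addField("tag", tag);
        }
        doc.addField("tagCount", tags.size());
        server.add(doc);
    }
}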
Re: Simple Solr POST using java
Agreed. I've been using Solrj on a production site for 9 months without any problems at all. You should probably give it a try instead of dealing with all those low-level details. On Sun, May 11, 2008 at 4:14 AM, Chris Hostetter [EMAIL PROTECTED] wrote: : please post a snippet of Java code to add a document to the Solr index that : includes the URL reference as a String? you mean like this one... :) http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup FWIW: if you want to talk to Solr from a Java app, the SolrJ client API is probably worth looking into rather than dealing with the HTTP connections and XML formatting directly... http://wiki.apache.org/solr/Solrj -Hoss -- Regards, Cuong Hoang
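For reference, the SolrJ equivalent of the requested snippet is only a few lines. A sketch, assuming a Solr 1.3-era server at the default example URL and field names of my own choosing:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleAdd {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("url", "http://example.com/page.html"); // the URL reference as a plain String
        server.add(doc);
        server.commit();
    }
}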
Re: Minimum should match and PhraseQuery
Thanks Chris. I'll probably have to repost this on the Lucene mailing list. On Sun, Mar 23, 2008 at 9:49 AM, Chris Hostetter [EMAIL PROTECTED] wrote: the topic has come up before on the lucene java lists (although i can't think of any good search terms to find the old threads .. I can't really remember how people have described this idea in the past) I don't remember anyone ever suggesting/sharing a general purpose solution intrinsically more efficient than if you just generated all the permutations yourself : 2) I also want to relax PhraseQuery a bit so that it not only matches Senior : Java Developer~2 but also matches Java Developer~2, but of course with a : lower score. I can programmatically generate all the combinations but it's not : gonna be efficient if the user issues a query with many terms. -Hoss -- Regards, Cuong Hoang
Minimum should match and PhraseQuery
Hi all, I thought many people would have encountered the situation I'm having here. Basically, we'd like to have a PhraseQuery with a minimum should match property similar to BooleanQuery's. Consider the query Senior Java Developer: 1) I'd like to do a PhraseQuery on Senior Java Developer with a slop of, say, 2, so that the query only matches documents with these words located in proximity. I don't want to match documents like Senior Huge block of text Java Huge block of Text Developer. 2) I also want to relax the PhraseQuery a bit so that it not only matches Senior Java Developer~2 but also matches Java Developer~2, but of course with a lower score. I can programmatically generate all the combinations, but that's not going to be efficient if the user issues a query with many terms. Is it possible to do this with Solr and Lucene? -- Cheers, Cuong Hoang
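In the absence of built-in support, the permutation approach can at least be limited to contiguous sub-phrases, which grows quadratically rather than exponentially with the number of terms (it will not match "Senior ... Developer" with "Java" missing, though). A hedged Lucene sketch; the field name and boost scheme are illustrative only:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class RelaxedPhraseQueryBuilder {
    // e.g. {"senior","java","developer"} -> "senior java developer"~2 OR
    // "senior java"~2 OR "java developer"~2, shorter sub-phrases boosted lower
    public static BooleanQuery build(String field, String[] terms, int slop) {
        BooleanQuery bq = new BooleanQuery();
        for (int len = terms.length; len >= 2; len--) {
            for (int start = 0; start + len <= terms.length; start++) {
                PhraseQuery pq = new PhraseQuery();
                for (int i = start; i < start + len; i++) {
                    pq.add(new Term(field, terms[i]));
                }
                pq.setSlop(slop);
                pq.setBoost((float) len / terms.length); // longer match, higher score
                bq.add(pq, BooleanClause.Occur.SHOULD);
            }
        }
        return bq;
    }
}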
Re: Accented search
Hi Peter, It looks like a very promising approach for us. I'm going to implement a custom Tokeniser based on your suggestions and see how it goes. Thank you all for your comments! Cheers On Wed, Mar 12, 2008 at 2:37 AM, Binkley, Peter [EMAIL PROTECTED] wrote: We've done this in a pre-Solr Lucene context by using the position increment: when a token contains accented characters, you add a stripped version of that token with a zero increment, so that for matching purposes the original and the stripped version are at the same position. Accents are not stripped from queries. The effect is that an accented search matches your Doc A, and an unaccented search matches Docs A and B. We do that after lower-casing the token. There are some limitations: users might start to expect that they can freely add accents to restrict their search to accented hits, but if they don't match the accents exactly they won't get any hits: e.g. if a word contains two accented characters and the user only accents one of them in their query, they won't match the accented or the unaccented version. Peter Peter Binkley Digital Initiatives Technology Librarian Information Technology Services 4-30 Cameron Library University of Alberta Libraries Edmonton, Alberta Canada T6G 2J8 Phone: (780) 492-3743 Fax: (780) 492-9243 e-mail: [EMAIL PROTECTED] ~ The code is willing, but the data is weak. ~ -Original Message- From: climbingrose [mailto:[EMAIL PROTECTED] Sent: Monday, March 10, 2008 10:01 PM To: solr-user@lucene.apache.org Subject: Accented search Hi guys, I'm running into some problems with an accented (UTF-8) language. I'd love to hear some ideas about how to use Solr with such languages. Basically, I want to achieve what Google does with UTF-8 languages. My requirements include: 1) Accent-insensitive search and proper highlighting: For example, we have 2 documents: Doc A (title:Lập Trình Viên) Doc B (title:Lap Trinh Vien) If the user enters Lập Trình Viên, then Doc B is also matched and Lập Trình Viên is highlighted. On the other hand, if the query is Lap Trinh Vien, Doc A is also matched. 2) Assign proper scores to accented and non-accented searches: if the user enters Lập Trình Viên, then Doc A should be given a higher score than Doc B. If the query is Lap Trinh Vien, Doc A should be given a higher score. Any ideas guys? Thanks in advance! -- Regards, Cuong Hoang -- Regards, Cuong Hoang
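Peter's zero-increment trick translates to a small TokenFilter. A sketch against the Lucene 2.x Token API, assuming Java 6's java.text.Normalizer for the stripping step:

import java.io.IOException;
import java.text.Normalizer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AccentDuplicateFilter extends TokenFilter {
    private Token pending; // stripped twin waiting to be emitted

    public AccentDuplicateFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String text = t.termText();
        String stripped = Normalizer.normalize(text, Normalizer.Form.NFD)
                                    .replaceAll("\\p{M}", ""); // drop combining marks
        if (!stripped.equals(text)) {
            pending = new Token(stripped, t.startOffset(), t.endOffset());
            pending.setPositionIncrement(0); // same position as the accented original
        }
        return t;
    }
}

One caveat for Vietnamese specifically: đ/Đ has no combining-mark decomposition under NFD, so it needs an explicit replacement on top of this.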
Accented search
Hi guys, I'm running into some problems with an accented (UTF-8) language. I'd love to hear some ideas about how to use Solr with such languages. Basically, I want to achieve what Google does with UTF-8 languages. My requirements include: 1) Accent-insensitive search and proper highlighting: For example, we have 2 documents: Doc A (title:Lập Trình Viên) Doc B (title:Lap Trinh Vien) If the user enters Lập Trình Viên, then Doc B is also matched and Lập Trình Viên is highlighted. On the other hand, if the query is Lap Trinh Vien, Doc A is also matched. 2) Assign proper scores to accented and non-accented searches: if the user enters Lập Trình Viên, then Doc A should be given a higher score than Doc B. If the query is Lap Trinh Vien, Doc A should be given a higher score. Any ideas guys? Thanks in advance! -- Regards, Cuong Hoang
Re: solr 1.3
I don't think they (the Solr developers) have a time frame for the 1.3 release. However, I've been using the latest code from trunk and I can tell you it's quite stable. The only problem is that the documentation sometimes doesn't cover the latest changes in the code. You'll probably have to dig into the code itself, or post a question here and many people will be happy to help you. On Jan 21, 2008 12:07 PM, anuvenk [EMAIL PROTECTED] wrote: when will this be released? where can i find the list of improvements/enhancements in 1.3 if it's been documented already? -- View this message in context: http://www.nabble.com/solr-1.3-tp14989395p14989395.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Cuong Hoang
Re: solr 1.3
I'm using code pulled directly from Subversion. On Jan 21, 2008 12:34 PM, anuvenk [EMAIL PROTECTED] wrote: Thanks. Would this be the latest code from the trunk that you mentioned? http://people.apache.org/builds/lucene/solr/nightly/solr-2008-01-19.zip climbingrose wrote: I don't think they (the Solr developers) have a time frame for the 1.3 release. However, I've been using the latest code from trunk and I can tell you it's quite stable. The only problem is that the documentation sometimes doesn't cover the latest changes in the code. You'll probably have to dig into the code itself, or post a question here and many people will be happy to help you. On Jan 21, 2008 12:07 PM, anuvenk [EMAIL PROTECTED] wrote: when will this be released? where can i find the list of improvements/enhancements in 1.3 if it's been documented already? -- View this message in context: http://www.nabble.com/solr-1.3-tp14989395p14989395.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Cuong Hoang -- View this message in context: http://www.nabble.com/solr-1.3-tp14989395p14989689.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Cuong Hoang
Merry Christmas and happy new year
Good day all Solr users and developers, May I wish you and your families a merry Xmas and a happy new year. I hope the new year brings you all health, wealth and peace. It's been my pleasure to be on this mailing list and to work with Solr. Thank you all! -- Cheers, Cuong Hoang
Re: Issues with postOptimize
Make sure that the user running Solr has permission to execute snapshooter. Also, try ./snapshooter instead of snapshooter. Good luck. On Dec 18, 2007 10:57 AM, Sunny Bassan [EMAIL PROTECTED] wrote: I've set up solrconfig.xml to create a snapshot of an index after doing an optimize, but the snapshot cannot be created because of permission issues. I've set permissions on the bin, data and log directories to read/write/execute for all users. Even with these settings I cannot seem to be able to run snapshooter on the postOptimize event. Any ideas? Could it be a Java permissions issue? Thanks. Sunny

Config settings:

<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">/search/replication_test/0/index/solr/bin</str>
  <bool name="wait">true</bool>
</listener>

Error:

Dec 17, 2007 7:45:19 AM org.apache.solr.core.RunExecutableListener exec
FINE: About to exec snapshooter
Dec 17, 2007 7:45:19 AM org.apache.solr.core.SolrException log
SEVERE: java.io.IOException: Cannot run program "snapshooter" (in directory "/search/replication_test/0/index/solr/bin"): java.io.IOException: error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
        at java.lang.Runtime.exec(Runtime.java:593)
        at org.apache.solr.core.RunExecutableListener.exec(RunExecutableListener.java:70)
        at org.apache.solr.core.RunExecutableListener.postCommit(RunExecutableListener.java:97)
        at org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks(UpdateHandler.java:105)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:516)
        at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:214)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        ... 23 more
-- Regards, Cuong Hoang
Re: Replication hooks
I think there is an event listener interface for hooking into Solr events such as post-commit, post-optimize and opening a new searcher. I can't remember it off the top of my head, but if you do a search for *EventListener in Eclipse, you'll find it. The Wiki shows how to trigger snapshooter after each commit and optimize. You should be able to follow this example to create your own listener. On Dec 11, 2007 1:03 PM, Tracy Flynn [EMAIL PROTECTED] wrote: Hi, I'm interested in setting up simple replication. I've reviewed all the Wiki information, looked at the scripts etc. and understand most of what I see. There are some references to 'hooks in the code' for both the master and slave nodes for handling replication. I've searched the 1.2 and trunk code bases for obvious phrases, but I can't identify these hooks. Can someone please point me to the correct place(s) to look? Thanks, Tracy Flynn -- Regards, Cuong Hoang
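The interface being half-remembered here is org.apache.solr.core.SolrEventListener. A minimal sketch of a custom hook (imports follow the trunk layout of the time; the logging body is purely illustrative):

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

public class ReplicationHookListener implements SolrEventListener {
    public void init(NamedList args) {}

    public void postCommit() {
        // fired after a commit (and, when registered under the postOptimize
        // event, after an optimize) -- the place for snapshooter-style work
        System.out.println("commit finished; replication hook fired");
    }

    public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
        // fired when a new searcher is opened, e.g. for cache warming
    }
}

It would be registered in solrconfig.xml with a <listener event="postCommit" class="..."/> element, just like the RunExecutableListener examples on the Wiki.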
Re: solr + maven?
Hi Ryan, I'm using Solr with Maven 2 in our project. Here is what my pom.xml looks like:

<!-- Solrj -->
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>1.3.0</version>
</dependency>

Since I have all the solrj dependencies declared by other artifacts, I don't need to declare any of solrj's dependencies here. You'll probably have to add the commons-httpclient artifact. On Dec 5, 2007 10:08 AM, Ryan McKinley [EMAIL PROTECTED] wrote: Is anyone managing solr projects with maven? I see: https://issues.apache.org/jira/browse/SOLR-19 but that is 1 year old. If someone has a current pom.xml, can you post it on SOLR-19? I just started messing with maven, so I don't really know what I am doing yet. thanks ryan -- Regards, Cuong Hoang
Re: SOLR sorting - question
I don't think you have to. Just try the query on the REST interface and you will know. On Dec 5, 2007 9:56 AM, Kasi Sankaralingam [EMAIL PROTECTED] wrote: Do I need to select the fields in the query that I am trying to sort on? For example, if I want to sort on update date, do I need to select that field? Thanks, -- Regards, Cuong Hoang
Access to SolrIndexSearcher in UpdateProcessor
Hi all, I'm trying to implement a custom UpdateProcessor which requires access to a SolrIndexSearcher. However, I'm constantly running into a "Too many open files" exception. I'm confused about which is the correct way to get access to a SolrIndexSearcher in an UpdateProcessor: 1) req.getSearcher() 2) req.getCore().getSearcher() 3) req.getCore().newSearcher(MyCustomerProcessorFactory) I have tried 1) and 3), but both produce "Too many open files". The weird thing with 3) is that the SolrIndexSearcher created gets set to null automatically by Solr, so I didn't have a chance to call the searcher.close() method. I suspect all searchers opened this way are set to null when a commit is made. Any recommendations? -- Regards, Cuong Hoang
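For what it's worth, on trunk option 2 hands back a reference-counted holder, and forgetting to release it is a classic way to leak file descriptors. A hedged sketch of the borrow/release pattern (whether this is the actual bug here is an assumption):

import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.RefCounted;

// inside processAdd()/processDelete() of the custom UpdateProcessor:
RefCounted<SolrIndexSearcher> ref = req.getCore().getSearcher();
try {
    SolrIndexSearcher searcher = ref.get();
    // ... perform lookups against the index ...
} finally {
    ref.decref(); // skipping this leaks one open searcher per update
}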
Re: Get last updated/committed document
Assuming that you have the timestamp field defined: q=*:*&sort=timestamp desc On Nov 23, 2007 10:43 PM, Thorsten Scherler [EMAIL PROTECTED] wrote: Hi all, I need to ask solr to return me the id of the last committed document. Is there a way to achieve this via a standard lucene query, or do I need a custom connector that gives me this information? TIA for any information salu2 -- Thorsten Scherler thorsten.at.apache.org Open Source Java consulting, training and solutions -- Regards, Cuong Hoang
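The timestamp field referred to is the one shipped in the example schema.xml; for reference (and adding rows=1 to the query above returns just the newest document):

  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>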
Re: Near Duplicate Documents
The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You'll probably have to roll your own near-duplication detection algorithm. My advice is to have a look at the existing literature on near-duplication detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure if you can implement the algorithm because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free! On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote: Otis, Thanks for your response. I just gave a quick look at the Nutch forum and found that there is an implementation to detect duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly in Nutch I should be concentrating, regarding near-duplicate documents? Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any idea of implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution <g>. -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote: On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote: We have a scenario where we want to find documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. The example of this email chain in which we are interacting can best be used for illustrating the concept of near dupes (we are not getting confused with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). Similarly it goes on. The near dupes need not be limited to emails. I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however. -Stuart -- Regards, Cuong Hoang
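To make the minhash idea concrete (as a toy illustration, not Google's patented formulation): shingle each document, apply k salted hash functions, and keep the minimum hash per salt; the fraction of positions where two signatures agree estimates the Jaccard overlap of the shingle sets. Shingle size and the xor-salt hash family below are arbitrary choices:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHash {
    // k per-position salts; the same salts must be used for every document
    private final int[] seeds;

    public MinHash(int k) {
        Random rnd = new Random(42);
        seeds = new int[k];
        for (int i = 0; i < k; i++) seeds[i] = rnd.nextInt();
    }

    public int[] signature(String text) {
        Set<String> shingles = new HashSet<String>();
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i + 3 <= words.length; i++) {          // 3-word shingles
            shingles.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        int[] sig = new int[seeds.length];
        for (int i = 0; i < seeds.length; i++) {
            int min = Integer.MAX_VALUE;
            for (String s : shingles) {
                int h = s.hashCode() ^ seeds[i];               // cheap i-th hash function
                if (h < min) min = h;
            }
            sig[i] = min;
        }
        return sig;
    }

    // estimated Jaccard similarity of two documents from their signatures
    public static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }
}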
Re: Help with Debian solr/jetty install?
Make sure you have a JDK installed, not just a JRE. Also try setting JAVA_HOME. apt-get install sun-java5-jdk On Nov 21, 2007 5:50 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Phillip, I won't go into details, but I'll point out that the Java compiler is called javac and, if memory serves me well, it is defined in one of Jetty's XML config files in its etc/ dir. The Java compiler is used to compile the JSPs that Solr uses for the admin UI. So, make sure you have javac and make sure Jetty can find it. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Tuesday, November 20, 2007 5:55:27 PM Subject: Help with Debian solr/jetty install? Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing, which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr. But I get an error when I try to run http://localhost:8280/solr/admin: HTTP ERROR: 500 No Java compiler available I have the sun-java6-jre and sun-java6-jdk packages installed. I'm new to servlet containers and java webapps. What should I be looking for to fix this, or what information could I provide the list to get me moving forward from here? I've included the trace from the Jetty log, and the java properties dump from the example, below. Thanks, Phil

Java properties (from the example):

sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
java.vm.version = 1.6.0-b105
java.vm.name = Java HotSpot(TM) Client VM
user.dir = /tmp/apache-solr-1.2.0/example
java.runtime.version = 1.6.0-b105
os.arch = i386
java.io.tmpdir = /tmp
java.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
java.class.version = 50.0
jetty.home = /tmp/apache-solr-1.2.0/example
sun.management.compiler = HotSpot Client Compiler
os.version = 2.6.22-2-686
java.class.path = /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
java.version = 1.6.0
java.ext.dirs = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes

Jetty log (from the error under Debian Solr/Jetty):

org.apache.jasper.JasperException: No Java compiler available
        at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
        at org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
        at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
        at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
        at org.mortbay.jetty.servlet.Default.service(Default.java:223)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
        at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:830)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
        at org.mortbay.jetty.servlet.WebApplicationHandler$CachedChain.doFilter(WebApplicationHandler.java:821)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:471)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
Re: Near Duplicate Documents
Hi Ken, It's correct that uncommon words most likely won't show up in the signature. However, I was trying to say that if two documents have 99% of their tokens in common and differ in one token whose quantised frequency changes, the two resulting hashes are completely different. If you want true near-dup detection, what you would like to have is two hashes that differ in only 1-2 bytes. That way, the signatures truly reflect the content of the documents they represent. However, with this approach, you need a bit more work to cluster near-dup documents. Basically, once you have a hash function as I described above, finding similar documents comes down to a Hamming distance problem: two docs are near-dups if their hashes differ in at most k positions (with k small, maybe 3). On Nov 22, 2007 2:35 AM, Ken Krugler [EMAIL PROTECTED] wrote: The duplication detection mechanism in Nutch is quite primitive. I think it uses an MD5 signature generated from the content of a field. The generation algorithm is described here: http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/crawl/TextProfileSignature.html. The problem with this approach is that an MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. I'm confused by your answer, assuming it's based on the page referenced by the URL you provided. The approach taken by TextProfileSignature would only generate a different MD5 hash from a single-letter change if that change resulted in a change in the quantized frequency for that word. And if it's an uncommon word, then it wouldn't even show up in the signature. -- Ken You'll probably have to roll your own near-duplication detection algorithm. My advice is to have a look at the existing literature on near-duplication detection techniques and then implement one of them. I know Google has some papers that describe a technique called minhash. I read the paper and found it very interesting. I'm not sure if you can implement the algorithm because they have patented it. That said, there is plenty of literature on near-dup detection, so you should be able to get one for free! On Nov 21, 2007 6:57 PM, Rishabh Joshi [EMAIL PROTECTED] wrote: Otis, Thanks for your response. I just gave a quick look at the Nutch forum and found that there is an implementation to detect duplicate documents/pages, but none for near-duplicate documents. Can you guide me a little further as to where exactly in Nutch I should be concentrating, regarding near-duplicate documents? Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: To whomever started this thread: look at Nutch. I believe something related to this already exists in Nutch for near-duplicate detection. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Mike Klaas [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any idea of implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution <g>. -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra [EMAIL PROTECTED] wrote: On Nov 18, 2007 10:50 AM, Eswar K [EMAIL PROTECTED] wrote: We have a scenario where we want to find documents which are similar in content. To elaborate a little more on what we mean here, let's take an example.
The example of this email chain in which we are interacting can best be used for illustrating the concept of near dupes (we are not getting confused with threads; they are two different things). Each email in this thread is treated as a document by the system. A reply to the original mail also includes the original mail, in which case it becomes a near duplicate of the original mail (depending on the percentage of similarity). Similarly it goes on. The near dupes need not be limited to emails. I think this is what's known as shingling. See http://en.wikipedia.org/wiki/W-shingling Lucene (and therefore Solr) does not implement shingling. The MoreLikeThis query might be close enough, however. -Stuart -- Ken Krugler Krugle, Inc. +1 530-210-6378 "If you can't find it, you can't fix it" -- Regards, Cuong Hoang
Re: Pagination with Solr
Hi David, Do you use one of the Solr clients listed at http://wiki.apache.org/solr/IntegratingSolr? These clients should have done all the XML parsing work for you. I speak from Solrj experience. IMO, your approach is probably the most commonly used when it comes to pagination. Solr's caching mechanisms should speed up the request for the next page. Cheers, On Nov 20, 2007 10:27 AM, Dave C. [EMAIL PROTECTED] wrote: Hello again, I'm trying to accomplish very basic pagination with my Solr search results. What I'm trying is to parse the response for numFound:some number and, if this number is greater than the rows parameter, I send another search request to Solr with a new start parameter. Is there a better way to do this? Specifically, is there another way to obtain numFound rather than parsing the response stream/string? Thanks a lot, David -- Regards, Cuong Hoang
Re: Finding all possible synonyms for a word
One approach is to extend SynonymFilter so that it reads synonyms from a database instead of a file. SynonymFilter is just a Java class, so you can do whatever you want with it :D. From what I remember, the filter initialises a list of all input synonyms and stores them in memory. Therefore, you need to make sure that all the synonyms can fit into memory at runtime; a simplified sketch follows after this message. On Nov 20, 2007 1:54 AM, Kishore AVK. Veleti [EMAIL PROTECTED] wrote: Hi Eswar, Thanks for the update. I have gone through the link you provided, and what I understood from it is that we need to have all possible synonyms in a text file. This file needs to be given as input for SynonymFilterFactory to work. If my understanding is right, then the approach may not suit my requirement. The reason is that I need to find synonyms of all the keywords in the category description and store those synonyms in the above-said input file. The file may be too big. Let me know if my understanding is wrong. Thanks, Kishore Veleti A.V.K. -----Original Message----- From: Eswar K [mailto:[EMAIL PROTECTED] Sent: Monday, November 19, 2007 11:22 AM To: solr-user@lucene.apache.org Subject: Re: Finding all possible synonyms for a word Kishore, Solr has a SynonymFilterFactory which might be of use to you (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46) Regards, Eswar On Nov 18, 2007 10:39 PM, Kishore AVK. Veleti [EMAIL PROTECTED] wrote: Hi All, I am new to Lucene / SOLR and am developing a POC as part of research. Below are my requirement and problem statement. I need help on how I can index the data so that I have very good search functionality in my POC. -- Requirement: -- Assume my web application is an online book store and it sells all categories of books, like Computers, Social Studies, Physical Sciences etc. Each of these categories has sub-categories. For example, Computers has sub-categories like Software Engineering, Java, SQL Server etc. I have a database table called Categories and it contains both parent category descriptions and child category descriptions. The data structure of the Category table is: Category_ID_Primary_Key integer Parent_Category_ID integer Category_Name varchar(100) Category_Description varchar(1000) -- My Search UI: -- My search page is very simple. We have a text field with a Search button. -- User Action: -- The user enters the search text below in the text field and clicks on the Search button: Books on Data Center -- What is the expected behavior: -- Since the phrase Data Center is most relevant to computers, I should show books related to computers. -- My problem statement and question to you all: -- To have better search in my web application, what kind of strategy should I use, and how should I index the data accordingly in SOLR/Lucene? In my Lucene index I may or may not have the phrase data center. Still, I should be able to return data center results. One thought I have is as follows: modify the Category table by adding one more column to it: Category_ID_Primary_Key integer Parent_Category_ID integer Category_Name varchar(100) Category_Description varchar(1000) Category_Description_Keywords varchar(8000) Now take each word in Category_Description, find its synonyms, and store that data in the Category_Description_Keywords column. After doing that, index the Category table records in SOLR/Lucene. Below are my questions to you all: Question 1: I need your feedback on the above approach, or any other approach which helps me make my search better and return the most relevant results to the user.
Question 2: Can you suggest the best Java-based open source or commercial synonym engines? I want a synonym engine that gives me all possible synonyms of a word. Thanks in Advance, Kishore Veleti A.V.K. -- Regards, Cuong Hoang
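A minimal sketch of the database-backed SynonymFilter approach suggested at the top of this thread, assuming a table SYNONYMS(RULE) whose rows use the same comma-separated syntax as synonyms.txt; the class name, table name and JDBC URL are all hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class DbSynonymLoader {
  // loads synonym rules from a database instead of synonyms.txt
  public static List<String> loadRules(String jdbcUrl) throws SQLException {
    List<String> rules = new ArrayList<String>();
    Connection con = DriverManager.getConnection(jdbcUrl);
    try {
      Statement st = con.createStatement();
      ResultSet rs = st.executeQuery("SELECT rule FROM synonyms");
      while (rs.next()) {
        rules.add(rs.getString(1)); // e.g. "NSW, New South Wales"
      }
    } finally {
      con.close();
    }
    return rules; // hand these to the same rule parser the file-based factory uses
  }
}

Note the memory caveat above still applies: however the rules are loaded, they end up in an in-memory map.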
Re: multiple delete by id in one delete command?
The easiest solution I know is: <delete><query>id:1 OR id:2 OR ...</query></delete> If you know that all of these ids can be found by issuing a query, you can do a delete by query: <delete><query>YOUR_DELETE_QUERY_HERE</query></delete> Cheers On Nov 19, 2007 4:18 PM, Norberto Meijome [EMAIL PROTECTED] wrote: Hi everyone, I'm trying to issue, via curl to SOLR (testing at the moment), 3 deletes by id. I tried sending: <delete><id>1</id><id>2</id><id>3</id></delete> and solr didn't like it at all. When I changed it to: <delete><id>1</id></delete><delete><id>2</id></delete><delete><id>3</id></delete> as in: curl http://localhost:8983/vcs/update -H 'Content-Type: text/xml' --data-binary '<delete><id>816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3</id></delete><delete><id>53f3f80e65482a5be353e7110f5308949d51dfa93dbe3c1eca169edd19b3</id></delete>' only the 1st (id = 1, or id = 816bc47fd52ffb9c6059e6975eafa168949d51dfa93dbe3c1eca169edd19b3) gets deleted (after a commit, of course). So i figure I will have to issue a series of independent <delete><id>xxx</id></delete> commands. Is it not possible to bunch them all together, as is possible with <add><doc>..</doc><doc>...</doc></add>? thanks!! Beto _ {Beto|Norberto|Numard} Meijome Imagination is more important than knowledge. Albert Einstein, On Science I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned. -- Regards, Cuong Hoang
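As a concrete usage sketch of the delete-by-query suggestion above, with illustrative ids (id:(1 OR 2 OR 3) is just the grouped form of the same OR query):

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
  --data-binary '<delete><query>id:(1 OR 2 OR 3)</query></delete>'

followed by a <commit/> as usual.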
Re: Spell Check Handler
Hi all, I've been so busy the last few days that I haven't replied to this email. I modified SpellCheckerHandler a while ago to include support for multi-word queries. To be honest, I didn't have time to write unit tests for the code. However, I deployed it in a production environment and it has been working for me so far. My version, however, makes two assumptions: 1) I assume that when a user enters a misspelled multi-word query, we should only check the words that are actually misspelled. For example, if a user enters life expectancy calculatar, which has calculator misspelled, we should only spellcheck calculatar. 2) I only return the best string for a misspelled query. I guess I can just directly paste the code here so that others can adapt it for their own purposes. If you have any questions, just send me an email. I'll be happy to help you.

StringBuffer buf = null;
if (null != words && !"".equals(words.trim())) {
  Analyzer analyzer = req.getSchema().getField(field).getType().getAnalyzer();
  TokenStream source = analyzer.tokenStream(field, new StringReader(words));
  Token t;
  boolean hasSuggestion = false;
  boolean termExists = false;
  while (true) {
    try {
      t = source.next();
    } catch (IOException e) {
      t = null;
    }
    if (t == null) break;
    String termText = t.termText();
    String[] suggestions = spellChecker.suggestSimilar(termText, numSug,
        req.getSearcher().getReader(), restrictToField, true);
    if (suggestions != null && suggestions.length > 0) {
      if (!suggestions[0].equals(termText)) {
        hasSuggestion = true;
      }
      if (buf == null) {
        buf = new StringBuffer(suggestions[0]);
      } else buf.append(" ").append(suggestions[0]);
    } else if (spellChecker.exist(termText)) {
      termExists = true;
      if (buf == null) {
        buf = new StringBuffer(termText);
      } else buf.append(" ").append(termText);
    } else {
      hasSuggestion = false;
      termExists = false;
      break;
    }
  }
  try {
    source.close();
  } catch (IOException e) {
    // ignore
  }
  // String[] suggestions = spellChecker.suggestSimilar(words, numSug,
  //     nullReader, restrictToField, onlyMorePopular);
  if (hasSuggestion || (!hasSuggestion && termExists))
    rsp.add("suggestions", buf.toString());
  else
    rsp.add("suggestions", null);
}

On 10/11/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hoss, I had a feeling someone would be quoting Yonik's Law of Patches! ;-) For now, this is done. I created the changes, created JavaDoc comments on the various settings and their expected output, created a JUnit test for the SpellCheckerRequestHandler which tests various components of the handler, and I also created the supporting configuration files for the JUnit tests (schema and solrconfig files). I attached the patch to the JIRA issue so now we just have to wait until it gets added back into the main code stream. For anyone who is interested, here is a link to the JIRA: https://issues.apache.org/jira/browse/SOLR-375 Could someone please drop me a hint on how to update the wiki or any other documentation that could benefit from being updated; I'd like to help out as much as possible, but first I need to know how. ;-) When these changes do get committed back into the daily build, please review the generated JavaDoc for information on how to utilize these new features. If anyone has any questions, or comments, please do not hesitate to ask. As a general note of self-critique on these changes, I am not 100% sure of the way I implemented the nested structure when the multiWords parameter is used. My interest is that it should work smoothly with some other technology such as Prototype using the JSON output type.
Unfortunately, I will not be getting a chance to start on that coding until next week, so it is up in the air as to whether this structure will be conducive or not. I am planning on providing more details in the documentation about how to utilize these modifications in Prototype and AJAX when I get a chance (and even provide links to a production site so you can see it in action and view the source if interested). So stay tuned... Thanks for everyone's time, Scott Tabar Chris Hostetter [EMAIL PROTECTED] wrote: : If you like, I can post the source code changes that I made to the : SpellCheckerRequestHandler, but at
Re: Spell Check Handler
Just to clarify this line of code: String[] suggestions = spellChecker.suggestSimilar(termText, numSug, req.getSearcher().getReader(), restrictToField, true); I only return suggestions if they are more popular than termText. You probably need to use the code in Scott's patch to make this behaviour configurable.
Re: Solr replication
1) On solr.master: + Edit scripts.conf: solr_hostname=localhost solr_port=8983 rsyncd_port=18983 + Enable and start rsync: rsyncd-enable; rsyncd-start + Run snapshooter: snapshooter After running this, you should be able to see a new folder named snapshot.* in the data/index folder. You can configure solrconfig.xml to trigger snapshooter after a commit or optimise. 2) On the slave: + Edit scripts.conf: solr_hostname=solr.master solr_port=8986 rsyncd_port=18986 data_dir= webapp_name=solr master_host=localhost master_data_dir=$MASTER_SOLR_HOME/data/ master_status_dir=$MASTER_SOLR_HOME/logs/clients/ + Run snappuller: snappuller -P 18983 + Run snapinstaller: snapinstaller You should set up crontab to run snappuller and snapinstaller periodically. On 10/1/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi! I'm really new to Solr! Could anybody please explain to me with a short example how I can set up simple Solr replication with 3 machines (a master node and 2 slaves)? This is my conf: * master (linux 2.6.20): - Hostname solr.master with IP 192.168.1.1 * 2 slaves (linux 2.6.20): - Hostname solr.slave1 with IP 192.168.1.2 - Hostname solr.slave2 with IP 192.168.1.3 N.B: sorry if the question was already asked before, but I couldn't find anything better than the CollectionDistribution page on the Wiki. Regards Y. -- Regards, Cuong Hoang
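To trigger snapshooter from solrconfig.xml as mentioned above, the example configuration that ships with Solr includes a postCommit listener along these lines (the dir value depends on where the scripts live, so treat it as illustrative):

<listener event="postCommit" class="solr.RunExecutableListener">
  <str name="exe">snapshooter</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>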
Re: Re: Re: Solr replication
Running the bin/commit script should trigger a refresh. However, this command is executed as part of snapinstaller, so you shouldn't have to run it manually. On 10/1/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: One more question about replication. Now that the replication is working, how can I see the changes on the slave nodes? The statistics page http://solr.slave1:8983/solr/admin/stats.jsp doesn't reflect the correct number of indexed documents and still shows numDocs=0. Is there any command to tell Solr (on a slave node) to sync itself with the disk? cheers Y. -----Original message----- From: [EMAIL PROTECTED] To: solr-user@lucene.apache.org Subject: Re: Re: Solr replication Date: Mon, 1 Oct 2007 15:00:46 +0200 Works like a charm. Thanks very much. cheers Y. -- Regards, Cuong Hoang
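For the periodic snappuller/snapinstaller runs recommended earlier in this thread, a minimal crontab sketch (the five-minute interval and install path are arbitrary examples):

# pull and install a fresh snapshot from the master every 5 minutes
*/5 * * * * /opt/solr/bin/snappuller -P 18983 && /opt/solr/bin/snapinstaller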
Re: can solr do it?
I don't think you can with the current Solr, because each instance runs in a separate web app. On 9/25/07, James liu [EMAIL PROTECTED] wrote: if I use multiple solr instances with one index, each will cache individually. So I wonder: can they share their cache? (they have the same config) -- regards jl -- Regards, Cuong Hoang
Synchronize large number of records with Solr
Hi all, I've been struggling to find a good way to synchronize Solr with a large number of records. We collect our data from a number of sources, and each source produces around 50,000 docs. Each of these documents has a sourceId field indicating the source of the document. Now assume we're indexing all documents from SourceA (sourceId=SourceA); the majority of these docs are already in Solr and we don't want to update them. However, there might be some docs in Solr that are not in the batch, and we do want to delete those from the index. So in summary: 1) If a doc is already in Solr, do nothing 2) If a doc is in the batch but not in Solr, index it 3) If a doc is in Solr but not in the batch, remove it from Solr. The tricky part is 1), because if not for that requirement, I could simply delete all documents with sourceId=SourceA and reindex all documents from SourceA. Any suggestions? Thanks. -- Regards, Cuong Hoang
Re: Synchronize large number of records with Solr
Hi Erik, So in your case #1, documents are reindexed with this scheme - so if you truly need to skip a reindexing for some reason (why, though?) you'll need to come up with some other mechanism. [perhaps update could be enhanced to allow ignoring a duplicate id rather than reindexing?] It's pretty easy to ignore duplicate ids during indexing, but it won't solve my problem. I think the batch number works well in your case because you reindex existing documents, which then get the updated batch number. In my case, I can't update existing documents and therefore, even if I use this approach, there is no way to know whether a document should be deleted. I think I will need to store all ids in the batch in a DocSet and then compare with the list of all ids after indexing. That way I can at least get rid of all expired documents. It's just not as elegant as using a batch identifier.
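A minimal sketch of the id-set comparison described above; the abstract helpers are hypothetical stand-ins for however the batch and the index are actually queried and updated:

import java.util.HashSet;
import java.util.Set;

public abstract class SourceSync {
  // hypothetical hooks: implementations depend on your data source and Solr client
  protected abstract Set<String> fetchBatchIds(String sourceId);
  protected abstract Set<String> fetchIndexedIds(String sourceId);
  protected abstract void indexDoc(String id);
  protected abstract void deleteById(String id);

  public void sync(String sourceId) {
    Set<String> batchIds = fetchBatchIds(sourceId);
    Set<String> indexedIds = fetchIndexedIds(sourceId);

    // rule 2: in the batch but not in Solr -> index
    Set<String> toIndex = new HashSet<String>(batchIds);
    toIndex.removeAll(indexedIds);
    for (String id : toIndex) indexDoc(id);

    // rule 3: in Solr but not in the batch -> delete (the expired documents)
    Set<String> toDelete = new HashSet<String>(indexedIds);
    toDelete.removeAll(batchIds);
    for (String id : toDelete) deleteById(id);

    // rule 1: ids present in both sets are left untouched
  }
}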
Re: Searching Versioned Resources
I think you can use the CollapseFilter to collapse on a version field. However, I think you would need to modify the CollapseFilter code to sort by version and get the latest version returned. On 9/13/07, Adrian Sutton [EMAIL PROTECTED] wrote: Hi all, The documents we're indexing are versioned, and generally we only want search results to return the latest version of a document; however, there are a couple of scenarios where I'd like to be able to include previous versions in the search results. It feels like a straightforward case of a filter, but given that each document has independent version numbers, it's hard to know what to filter on. The only solution I can think of at the moment is to index each new version twice - once with the version and once with version=latest. We'd then tweak the ID field in such a way that there is only one version of each document with version=latest. It's then simple to use a filter for version=latest when we search. Is there a better way? Is there a way to achieve this without having to index the document twice? Thanks in advance, Adrian Sutton http://www.symphonious.net -- Regards, Cuong Hoang
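A sketch of the index-it-twice scheme from the question, with illustrative field names and values; because the _latest id never changes, each new version simply overwrites the previous latest document:

<add>
  <doc>
    <field name="id">doc42_v3</field>
    <field name="version">3</field>
  </doc>
  <doc>
    <field name="id">doc42_latest</field>
    <field name="version">latest</field>
  </doc>
</add>

Normal searches then filter with fq=version:latest, and dropping that filter brings the older versions back into the results.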
Re: Embedded about 50% faster for indexing
Agree. I was actually thinking of developing the embedded version early this year for one of my projects. I'm sure it will be needed in cases where running another web server is overkill. On 8/28/07, Jonathan Woods [EMAIL PROTECTED] wrote: I don't think you should apologise for highlighting embedded usage. For circumstances in which you're at liberty to run a Solr instance in the same JVM as an app which uses it, I find it very strange that you should have to use anything _other_ than embedded, and jump through all the unnecessary hoops (XML conversion, HTTP transport) that this implies. It's a bit like suggesting you should throw away Java method invocations altogether and write everything in XML-RPC. Bit of a pet issue of mine! I'll be creating a JIRA issue on the subject soon. Jon -Original Message- From: Sundling, Paul [mailto:[EMAIL PROTECTED] Sent: 28 August 2007 03:24 To: solr-user@lucene.apache.org Subject: RE: Embedded about 50% faster for indexing At this point I think I'm going to recommend against embedded, regardless of any performance advantage. The level of documentation is just too low, while the XML API is clearly documented. It's clear that XML is preferred. The embedded example on the wiki is pretty good, but until multiple core support comes out in the next version, you have to use multiple SolrCores. If they are accessed in the same webapp, then you can't just set JNDI (since you can only have one value). So you have to use a Config object as alluded to in the example. However, if you look at the code, there is no javadoc for the constructor. The constructor args are (String name, InputStream is, String prefix). I think name is a unique name for the solr core, but that is a guess. InputStream may be a stream to the solr home, but it could be anything. Prefix may be a URI prefix. These are all guesses without trying to read through the code. When I look at SolrCore, it looks like it's a singleton, so maybe I can't even access more than one SolrCore using embedded anyway. :( So I apologize for highlighting Embedded. Anyway, it's clear how to do multiple solr cores using XML: you just have a different post URI for the different cores. You can easily inject that with Spring and externalize the config. Simple and easy. So I concede XML is the way to go. :) Paul Sundling -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 5:50 PM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote: Whether embedded solr should give me a performance boost or not, it did. :) I'm not surprised, since it skips XML parsing. Although you never know where cycles are used for sure until you profile. It certainly is possible that XML parsing dwarfs indexing, but I'd expect that only to occur under very light analysis and field storage workloads. I tried doing more records per post (200) and it was actually slightly slower and seemed to require more memory. This makes sense because you have to take up more memory for the StringBuilder to store the much larger XML. For 10,000 it was much slower. For that size I would need XML streaming or something to make it work. The solr war was on the same machine, so network overhead was only from using loopback. The big question is still your connection handling strategy: are you using persistent HTTP connections? Are you indexing from multiple threads?
cheers, -Mike Paul Sundling -Original Message- From: climbingrose [mailto:[EMAIL PROTECTED] Sent: Monday, August 27, 2007 12:22 AM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing Haven't tried the embedded server, but I think I have to agree with Mike. We're currently sending batches of 2000 jobs to the SOLR server, and the amount of time required to transfer documents over http is insignificant compared with the time required to index them. So I do think that unless you are sending documents one by one, embedded SOLR shouldn't give you much more of a performance boost. On 8/25/07, Mike Klaas [EMAIL PROTECTED] wrote: On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, August 24, 2007 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Embedded about 50% faster for indexing One thing I'd like to avoid is everyone trying to embed just for performance gains. If there is really that much difference, then we need a better way for people to get that without resorting to Java code. -Yonik Theoretically and practically, embedded solution will be faster
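A minimal sketch of the batched XML posting discussed in this thread, using only the JDK; the update URL and the pre-built payload are illustrative:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BatchPoster {
  // posts one <add> containing a whole batch of <doc> elements
  public static int postBatch(String updateUrl, String addXml) throws Exception {
    HttpURLConnection con = (HttpURLConnection) new URL(updateUrl).openConnection();
    con.setDoOutput(true);
    con.setRequestMethod("POST");
    con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
    OutputStream out = con.getOutputStream();
    try {
      out.write(addXml.getBytes("UTF-8"));
    } finally {
      out.close();
    }
    return con.getResponseCode(); // 200 on success
  }

  public static void main(String[] args) throws Exception {
    String xml = "<add><doc><field name=\"id\">1</field></doc>"
               + "<doc><field name=\"id\">2</field></doc></add>";
    System.out.println(postBatch("http://localhost:8983/solr/update", xml));
  }
}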
Re: Spell Check Handler
Thanks Karl. I'll check it out! On 8/18/07, karl wettin [EMAIL PROTECTED] wrote: I updated LUCENE-626 last night. It should now run smoothly without LUCENE-550, but smoother with it. Perhaps it is something you can use. On 12 Aug 2007, at 14:24, climbingrose wrote: I'm happy to contribute code for the SpellCheckerRequestHandler. I'll post the code once I strip off stuff related to our product. On 8/12/07, Pieter Berkel [EMAIL PROTECTED] wrote: http://issues.apache.org/jira/browse/LUCENE-626 On 11/08/07, climbingrose [EMAIL PROTECTED] wrote: That's exactly what I did with my custom version of the SpellCheckerHandler. However, I didn't handle suggestionCount and only returned the one corrected phrase which contains the best corrected terms. There is an issue on the Lucene issue tracker regarding a multi-word spellchecker: https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel I'd be interested to take a look at your modifications to the SpellCheckerHandler; how did you handle phrase queries? Maybe we can open a JIRA issue to expand the spell checking functionality to perform analysis on multi-word input values. I did find http://issues.apache.org/jira/browse/LUCENE-626 after looking at LUCENE-550, but since these patches are not yet included in the Lucene trunk, it might be a little difficult to justify implementing them in Solr. -- Regards, Cuong Hoang
Re: Spell Check Handler
That's exactly what I did with my custom version of the SpellCheckerHandler. However, I didn't handle suggestionCount and only returned the one corrected phrase which contains the best corrected terms. There is an issue on the Lucene issue tracker regarding a multi-word spellchecker: https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel . On 8/11/07, Pieter Berkel [EMAIL PROTECTED] wrote: On 11/08/07, climbingrose [EMAIL PROTECTED] wrote: The spellchecker handler doesn't seem to work with multi-word queries. For example, when I tried to spellcheck Java developar, it returned nothing, while if I tried developar, the spellchecker correctly returned developer. I followed the setup on the wiki. While I suppose the general case for using the spelling checker would be a query containing a single misspelled word, it would be quite useful if the handler applied the analyzer specified by the termSourceField fieldType to the query input and then checked the spelling of each query token. This would seem to be the most flexible way of supporting multi-word queries (provided the termSourceField didn't use any stemmer filters, I suppose). Pieter -- Regards, Cuong Hoang
Re: FunctionQuery and boosting documents using date arithmetic
I'm having the date boosting problem as well. I'm using this function: F = recip(rord(creationDate),1,1000,1000)^10. However, since I have around 10,000 documents added in one day, rord(createDate) returns very different values for the same createDate. For example, the most recently added document will have rord(createdDate) = 1 while the first document added that day will have rord(createdDate) = 10,000. When rord(createdDate) > 10,000, the value of F approaches 0. Therefore, the boost query doesn't make any difference between the last document added today and a document added 10 days ago. Now if I replace the 1000 in F with a large number, the boost function suddenly gives the last few documents an enormous boost and makes the other query scores irrelevant. So in my case (and many others', I believe), the true date value would be more appropriate. I'm thinking along the same lines of adding a timestamp. It wouldn't add much overhead this way, would it? Regards, On 8/11/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Actually, just thinking about this a bit more, perhaps adding a function : call such as parseDate() might add too much overhead to the actual query, : perhaps it would be better to first convert the date to a timestamp at index : time and store it in a field of type slong? This might be more efficient. i would agree with you there, this is where a more robust (ie: less efficient) DateField-ish class that supports configuration options to specify: 1) the output format 2) the input format(s) 3) the indexed format ...as SimpleDateFormat pattern strings would be handy. The ValueSource it uses could return seconds (or some other unit based on another config option) since epoch as the intValue. it's been discussed before, but there are a lot of tricky issues involved, which is probably why no one has really tackled it. : that still leaves the problem of obtaining the current timestamp to use in : the boost function. it would be pretty easy to write a ValueSource that just knew about now as seconds since epoch. : While it seems to work pretty well, I've realised that this may not be : quite as effective as i had hoped given that the calculation is based on the : ordinal of the field value rather than the value of the field itself. In : cases where the field type is 'date' and the actual field values are not : distributed evenly across all documents in the index, the value returned by : rord() is not going to give a true reflection of document age. For example, be careful what you wish for. you are 100% correct that functions using the (r)ord value of a DateField aren't a function of true age, but depending on how you look at it, that may be better than using the real age (i think so anyway). While it sounds appealing to say that docA should score half as high as docB if it is twice as old, that typically isn't all that important when dealing with recent dates; and when dealing with older dates, the ordinal value tends to approximate it decently well ... where a true measure of age might screw you up is when you have situations where few/no new articles get published on weekends (or late at night). it's also very confusing to people when the ordering of documents changes even though no new documents have been published -- that can easily happen if you are heavily boosting on a true age calculation, but will never happen when dealing with an ordinal ranking of documents by age.
(although this could be compensated for by doing all of your true age calculations relative to the min age of all articles in your index -- but you would still get really weird 'big' shifts in scores as soon as that first article gets published on Monday morning.) -Hoss -- Regards, Cuong Hoang
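For reference, the function discussed above is typically wired into a dismax handler as a boost function; a minimal sketch, with the handler name and field taken as illustrative:

<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="bf">recip(rord(creationDate),1,1000,1000)^10</str>
  </lst>
</requestHandler>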
Re: Spell Check Handler
Yeah. How stable is the patch, Karl? Is it possible to use it in a production environment? On 8/12/07, karl wettin [EMAIL PROTECTED] wrote: On 11 Aug 2007, at 10:36, climbingrose wrote: There is an issue on the Lucene issue tracker regarding a multi-word spellchecker: https://issues.apache.org/jira/browse/LUCENE-550 I think you mean LUCENE-626, which sort of depends on LUCENE-550. -- karl -- Regards, Cuong Hoang
Re: Spell Check Handler
The spellchecker handler doesn't seem to work with multi-word queries. For example, when I tried to spellcheck Java developar, it returned nothing, while if I tried developar, the spellchecker correctly returned developer. I followed the setup on the wiki. Regards, Cuong Hoang On 7/10/07, Charles Hornberger [EMAIL PROTECTED] wrote: For what it's worth, I recently did a quick implementation of the spellchecker feature, and I simply created another field in my schema (like 'spell' in Tristan's example below). After feeding content into my search index, I used the spell field to add one single-field document for every distinct word in my document collection (I'm assuming the content folks have run spell-checkers :-)). E.g.:

<doc><field name="spell">aardvark</field></doc>
<doc><field name="spell">abacus</field></doc>
<doc><field name="spell">abbot</field></doc>
<doc><field name="spell">acacia</field></doc>

etc. I also added some extra documents for proper names that appear in my documents. For instance, there are a couple of fields that have comma-separated lists of names, so for each of those -- in addition to documents for john, doe, and jane, which were generated by the naive word-splitting done in the first pass -- I added documents like so:

<doc><field name="spell">john doe</field></doc>
<doc><field name="spell">jane doe</field></doc>

etc. You could do the same for other searchable multi-word tokens in your input -- song/album/book/movie titles, publisher names, geographic names (cities, neighborhoods, etc.), product names, and so on. -Charlie On 7/9/07, Tristan Vittorio [EMAIL PROTECTED] wrote: I think there is some confusion regarding how the spell checker actually uses the termSourceField. It is suggested that you use a simple field type such as string; however, since this field type does not tokenize or split words, it is only useful in situations where the whole field is considered a dictionary word:

<add>
  <doc>
    <field name="title">Accountant</field>
    <field name="title">Auditor</field>
    <field name="title">Solicitor</field>
  </doc>
</add>

The following example case will not work with the spell checker, since the whole field is considered a single word or string:

<add>
  <doc>
    <field name="title">Accountant reveals that Accounting is boring</field>
  </doc>
</add>

I might suggest that you create an additional field in your schema that takes advantage of the StandardTokenizer and StandardFilter, which don't perform a great deal of processing on the field yet should provide decent results when used with the spell checker:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

If you want this field to be automatically populated with the contents of the title field when a document is added to the index, simply use a copyField:

<copyField source="title" dest="spell"/>

Hope this helps; let me know if this is still not clear. I will probably add it to the wiki page soon.
cheers, Tristan On 7/9/07, climbingrose [EMAIL PROTECTED] wrote: Thanks for the quick reply. However, I'm still not able to set up the spellchecker. Solr does create the spell directory under data but doesn't seem to build the spellchecker index. Here are snippets of my schema.xml and solrconfig.xml:

<field name="title" type="string" indexed="true" stored="true"/>

<requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <int name="suggestionCount">1</int>
    <float name="accuracy">0.5</float>
  </lst>
  <!-- Main init params for handler -->
  <!-- The directory where your SpellChecker Index should live. -->
  <!-- May be absolute, or relative to the Solr dataDir directory. -->
  <!-- If this option is not specified, a RAM directory will be used -->
  <str name="spellcheckerIndexDir">spell</str>
  <!-- the field in your schema that you want to be able to build -->
  <!-- your spell index on. This should be a field that uses a very -->
  <!-- simple FieldType without a lot of Analysis (ie: string) -->
  <str name="termSourceField">title</str>
</requestHandler>

I tried this url: http://localhost:8984/solr/select/?q
Re: Spell Check Handler
After looking at the SpellChecker code, I realised that it only supports single words. I made a very naive modification of SpellCheckerHandler to get multi-word support. Now the other problem that I have is how to have different fields in the SpellChecker index. For example, since my query has two parts, description and location, I don't want to build a spellchecker index which combines both description and location into one termSourceField. I want to check the description part against the description field in the spellchecker index and the location part against the location field in the index. Otherwise I might get irrelevant suggestions for the location part, since the number of terms in location is generally much smaller compared with that of description. Any ideas? Thanks.
Re: Spell Check Handler
OK, I just need to define 2 spellcheckers in solrconfig.xml for my purpose.
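A sketch of the two-spellchecker setup mentioned above, reusing the init params from the earlier solrconfig.xml snippet; the handler names, index directories and source fields are illustrative:

<requestHandler name="spellcheck_description" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <str name="spellcheckerIndexDir">spell_description</str>
  <str name="termSourceField">description</str>
</requestHandler>

<requestHandler name="spellcheck_location" class="solr.SpellCheckerRequestHandler" startup="lazy">
  <str name="spellcheckerIndexDir">spell_location</str>
  <str name="termSourceField">location</str>
</requestHandler>

Each part of the query is then checked against its own handler via the qt parameter (qt=spellcheck_description or qt=spellcheck_location).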
Date rounding up
Hi all, I think there might be something wrong with the date-time rounding. I tried this query: q=*:*&fq=listedDate:[NOW/DAY-1DAY TO *] which I think should return results since yesterday. So if today is the 9th of August, it should return all results from the 8th of August onwards. However, Solr also returns results from the 7th of August. Any idea? -- Regards, Cuong Hoang
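For reference, a worked example of how that date math should evaluate (the time is illustrative): with NOW = 2007-08-09T10:30:00Z, NOW/DAY rounds down to 2007-08-09T00:00:00Z and NOW/DAY-1DAY gives 2007-08-08T00:00:00Z, so listedDate:[NOW/DAY-1DAY TO *] should only match documents dated at or after midnight on the 8th -- which is why results from the 7th look like a bug.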
Re: mandatory and optional fields in the dismaxrequesthandler
I think I have the same question as Arnaud. For example, my dismax query has qf=title^5 description^2. Now if I search for Java developer, I want to make sure that the results have at least java or developer in the title. Is this possible with a dismax query? On 7/30/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Is it possible to specify precisely one or more mandatory fields in a : DismaxRequestHandler? what would the semantics of making a field mandatory be? considering your specific example... : <str name="qf"> : text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 : </str> : <str name="bla"> : +text +feature name manu : </str> : : where 'text' and 'feature' are mandatory and 'name' and 'manu' are : optional fields. if text and feature are mandatory but name and manu are not, how are the other fields in the qf treated? if the q param is: albino elephant ... what would it mean that text and feature are mandatory? do both words have to appear in text and in feature, or just one in each? -Hoss -- Regards, Cuong Hoang
DisMax query and date boosting
Hi all, I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf is title^5 summary^1. However, what I really want is to allow documents with the latest listedDate to score higher. For example, documents with listedDate:[NOW-1DAY TO *] should get additional score over documents with listedDate:[* TO NOW-10DAY]. Any idea? -- Regards, Cuong Hoang
Re: DisMax query and date boosting
Thanks for both answers. Which one is better in terms of performance, bq or bf? On 7/20/07, Daniel Alheiros [EMAIL PROTECTED] wrote: Sorry, just correcting myself: <str name="bq">your_date_field:[NOW-24HOURS TO NOW]^10.0</str> Regards, Daniel On 19/7/07 15:25, Daniel Alheiros [EMAIL PROTECTED] wrote: I think in this case you can use a bq (Boost Query) so you can apply this boost to the range you want. <str name="bq">your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0</str> This example will boost your documents with a date within the last 24h. Regards, Daniel On 19/7/07 14:45, climbingrose [EMAIL PROTECTED] wrote: Hi all, I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf is title^5 summary^1. However, what I really want is to allow documents with the latest listedDate to score higher. For example, documents with listedDate:[NOW-1DAY TO *] should get additional score over documents with listedDate:[* TO NOW-10DAY]. Any idea? -- Regards, Cuong Hoang
Re: DisMax query and date boosting
Just tried the bq approach and it works beautifully. Exactly what I was looking for. Still, I'd like to know which approach is preferred. Thanks again guys. On 7/20/07, climbingrose [EMAIL PROTECTED] wrote: Thanks for both answers. Which one is better in terms of performance, bq or bf? On 7/20/07, Daniel Alheiros [EMAIL PROTECTED] wrote: Sorry, just correcting myself: <str name="bq">your_date_field:[NOW-24HOURS TO NOW]^10.0</str> Regards, Daniel On 19/7/07 15:25, Daniel Alheiros [EMAIL PROTECTED] wrote: I think in this case you can use a bq (Boost Query) so you can apply this boost to the range you want. <str name="bq">your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0</str> This example will boost your documents with a date within the last 24h. Regards, Daniel On 19/7/07 14:45, climbingrose [EMAIL PROTECTED] wrote: Hi all, I'm puzzling over how to boost a date field in a DisMax query. Atm, my qf is title^5 summary^1. However, what I really want is to allow documents with the latest listedDate to score higher. For example, documents with listedDate:[NOW-1DAY TO *] should get additional score over documents with listedDate:[* TO NOW-10DAY]. Any idea? -- Regards, Cuong Hoang
Re: DisMax query and date boosting
Thanks for the answer, Chris. The DisMax query handler is just amazing! On 7/20/07, Chris Hostetter [EMAIL PROTECTED] wrote: : Just tried the bq approach and it works beautifully. Exactly what I was : looking for. Still, I'd like to know which approach is preferred. Thanks : again guys. i personally recommend the function approach, because it gives you a more gradual falloff in terms of the scores of documents ... the BQ approach works great for simple boosting where things in the last N days should score really high, but 1 millisecond after that cutoff the score plummets immediately. side note... : Sorry, just correcting myself: : <str name="bq">your_date_field:[NOW-24HOURS TO NOW]^10.0</str> the first example is perfectly fine, and will be more efficient because it will take better advantage of the filterCache... : <str name="bq">your_date_field:[NOW/DAY-24HOURS TO NOW]^10.0</str> ...if you don't round down to the nearest day, then every request will generate a new query which will get put in the filterCache. if a day isn't granular enough for you, you can round to the nearest hour (or even minute), but i strongly suggest you round to something so you don't wind up using millisecond precision: <str name="bq">your_date_field:[NOW/HOUR-1DAY TO NOW]^10.0</str> -Hoss -- Regards, Cuong Hoang
Slow facet with custom Analyser
Hi all, My facet browsing performance was decent on my system until I added my custom Analyser. Initially, I faceted on the title field, which is of the default string type (no analysers, tokenisers...), and got quick responses (the first query is just under 1s, subsequent queries are 0.1s). I created a custom analyser which is not much different from the DefaultAnalyzer in the FieldType class. Essentially, this analyzer does not do any tokenisation; it only converts the value into lower case and removes spaces, unwanted chars and words. After I applied the analyser to the title field, facet performance degraded considerably. Every query is now 1.2s and the filterCache hit ratio is extremely small: lookups : 918485 hits : 23 hitratio : 0.00 inserts : 918487 evictions : 917971 size : 512 cumulative_lookups : 918485 cumulative_hits : 23 cumulative_hitratio : 0.00 cumulative_inserts : 918487 cumulative_evictions : 917971 Any idea? Here is my analyser code:

public class FacetTextAnalyser extends SolrAnalyzer {
  final int maxChars;
  final Set<Character> ignoredChars;
  final Set<String> ignoredWords;

  public final static char[] IGNORED_CHARS = {'/', '\\', '\'', '"', '#', '&', '!', '?', '*', '<', '>', ','};
  public static final String[] IGNORED_WORDS = { "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" };

  public FacetTextAnalyser() {
    maxChars = 255;
    ignoredChars = new HashSet<Character>();
    for (int i = 0; i < IGNORED_CHARS.length; i++) {
      ignoredChars.add(IGNORED_CHARS[i]);
    }
    ignoredWords = new HashSet<String>();
    for (int i = 0; i < IGNORED_WORDS.length; i++) {
      ignoredWords.add(IGNORED_WORDS[i]);
    }
  }

  public FacetTextAnalyser(int maxChars, Set<Character> ignoredChars, Set<String> ignoredWords) {
    this.maxChars = maxChars;
    this.ignoredChars = ignoredChars;
    this.ignoredWords = ignoredWords;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new Tokenizer(reader) {
      char[] cbuf = new char[maxChars];

      public Token next() throws IOException {
        int n = input.read(cbuf, 0, maxChars);
        if (n <= 0) return null;
        char[] temp = new char[n];
        int index = 0;
        boolean space = true;
        for (int i = 0; i < n; i++) {
          char c = cbuf[i];
          if (ignoredChars.contains(cbuf[i])) {
            c = ' ';
          }
          if (Character.isWhitespace(c)) {
            if (space) continue;
            else {
              temp[index] = ' ';
              if (index > 0) {
                // check whether the word that just ended is an ignored word
                int j = index - 1;
                while (temp[j] != ' ' && j > 0) {
                  j--;
                }
                String str = (j == 0) ? new String(temp, 0, index) : new String(temp, j + 1, index - j - 1);
                System.out.println(str); // debug output
                if (ignoredWords.contains(str)) index = j;
              }
              index++;
              space = true;
            }
          } else {
            temp[index] = Character.toLowerCase(c);
            index++;
            space = false;
          }
        }
        temp[0] = Character.toUpperCase(temp[0]);
        String s = new String(temp, 0, index);
        return new Token(s, 0, n);
      }
    };
  }
}

Here is how I declare the analyser:

<fieldType name="text_em" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="net.jseeker.lucene.FacetTextAnalyser"/>
</fieldType>

-- Regards, Cuong Hoang
Re: Slow facet with custom Analyser
Thanks Yonik. In my case, there is only one title field per document, so is there a way to force Solr to work the old way? My analyser doesn't break up the title field into multiple tokens. It only tries to normalise the field value (lower case, remove unwanted chars and words). Therefore, it's no different from using the single-valued string type. I'll try your first recommendation to see how it goes. Thanks again. On 7/17/07, Yonik Seeley [EMAIL PROTECTED] wrote: Since you went from a non-multi-valued string type (which Solr knows has at most one value per document) to a custom analyzer type (which could produce multiple tokens per document), Solr switched tactics from using the FieldCache for faceting to using the filterCache. Right now, you could try to 1) use facet.enum.cache.minDf=1000 (don't use the fieldCache except for large facets) 2) expand the size of the fieldcache to 100 if you have the memory Optimizing your index should also speed up faceting (but that is a lot of facets). -Yonik -- Regards, Cuong Hoang
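The first recommendation is just an extra request parameter; a usage sketch (host and field are illustrative):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=title&facet.enum.cache.minDf=1000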
Re: Slow facet with custom Analyser
I've tried both of your recommendations (using facet.enum.cache.minDf=1000 and optimising the index). The query time is around 0.4-0.5s now, but it's still slow compared to the old string type. I haven't tried increasing the filterCache, but 100 cached items looks a bit too much for my server atm. It's quite a pity that we can't force Solr to use the FieldCache. I think I might pre-process the title field and index it as string instead of using the analyser. However, that defeats the purpose of having pluggable analysers, tokenisers... On 7/17/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 7/16/07, climbingrose [EMAIL PROTECTED] wrote: Thanks Yonik. In my case, there is only one title field per document, so is there a way to force Solr to work the old way? My analyser doesn't break up the title field into multiple tokens. It only tries to normalise the field value (lower case, remove unwanted chars and words). Therefore, it's no different from using the single-valued string type. There is currently no way to force Solr to use the FieldCache method. Oh, and in 2) expand the size of the fieldcache to 100 if you have the memory should have been filterCache, not fieldcache. -Yonik I'll try your first recommendation to see how it goes. faceting typically proceeds much faster on an optimized index too. -Yonik -- Regards, Cuong Hoang
Re: Slow facet with custom Analyser
Thanks for the suggestion, Chris. I modified SimpleFacets to check for [f.foo.]facet.field.type==(single|multi) and the performance has improved significantly. On 7/17/07, Chris Hostetter [EMAIL PROTECTED] wrote: : ...but i don't understand why both checking isTokenized() ... shouldn't : multiValued() be enough? : : A field could return false for multiValued() and still have multiple : tokens per document for that field. ah .. right ... sorry: multiValued() indicates whether multiple discrete values can be added to the field (and stored if the field is stored) but says nothing about what the Analyzer may do with any single value. perhaps we should really have an [f.foo.]facet.field.type=(single|multi) param to let clients indicate when they know exactly which method they want used (getFacetTermEnumCounts vs getFieldCacheCounts) ... if the property is not set, the default can be determined using the sf.multiValued() || ft.isTokenized() || ft instanceof BoolField logic. -Hoss -- Regards, Cuong Hoang
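With that local modification, the counting method can presumably be forced per field straight from the request, along these lines (hypothetical, since the parameter only exists in the patched SimpleFacets described above):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=title&f.title.facet.field.type=single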
A few questions regarding multi-word synonyms and parameters encoding
Hi all, I've been using Solr for the last few projects and the experience has been great. I'll post the link to the website once it's finished. I just have a few questions regarding synonyms and parameter encoding:

1) Are multi-word synonyms possible now in Solr? For example, can I have synonyms like: I.T. & T, IT&T, Information Technologies, Computer science? I read a message on the mailing list some time ago (I think back in mid 2006) saying that there was no clean way to implement this. Is it possible now? In my case, I have two fields, category and location, in which category is of the default string type and location is of the default text type:

+ The category field is used only for faceting by category; therefore, no analysis needs to be done. Can I use the synonyms config above to do a facet query on the category field so that Solr combines items having any of these categories into one facet category? For example, given: I.T. & T (10), IT&T (20), Information Technologies (30), Computer science (40), can I have something like: I.T. & T (100)? Or do I have to manually filter query on each category, e.g. category:"I.T. & T", and count the results?

+ The location field is used for searching by city, state and post code. Since I collect the data from different sources, there might be mismatched information. For example, one record might have "Inner Sydney, NSW" while another might have "Inner Sydney, New South Wales". In Australia, NSW and New South Wales are used interchangeably, so when users search for NSW, I want New South Wales records to be returned and vice versa. How could I achieve this? The location field is of the default text type.

2) I'm having trouble using facet values in my URL. For example, I have a title facet field in my query and it returns something like: Software Engineer, C++ Programmer, C Programmer, PHP Developer. Now I want to create a link for each of these values so that users can filter the results by that title by clicking on the link. For example, if I click on Software Engineer, the results are narrowed down to just records with Software Engineer in their title. Since the title field can contain special chars like '+', '&'..., I really can't find a clean way to do this. At the moment, I replace all spaces with '+' and it seems to work for values like Software Engineer (converted to Software+Engineer). However, C++ Programmer is converted to C+++Programmer, and that doesn't seem to work (it returns no results). Any ideas?

Looking back, this is such a long email. If you've reached this point, thanks a lot for your time!!! -- Regards, Cuong Hoang
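Two notes on the questions above. On synonyms, the synonyms.txt file read by Solr's SynonymFilterFactory does accept comma-separated groups, one per line (e.g. NSW, New South Wales), though multi-word expansion at query time is exactly the awkward case the old thread referred to. On the URL problem, '+' is itself the URL escape for a space, so replacing spaces with '+' mangles any value that contains a literal '+'; the fix is to percent-encode the whole value. A minimal Java sketch (host, port and field name are made up):

  import java.net.URLEncoder;

  public class FacetLink {
      public static void main(String[] args) throws Exception {
          String title = "C++ Programmer";
          // Percent-encode the whole query: '+' becomes %2B, space becomes '+'.
          String url = "http://localhost:8983/solr/select?q="
                     + URLEncoder.encode("title:\"" + title + "\"", "UTF-8");
          System.out.println(url);
          // -> http://localhost:8983/solr/select?q=title%3A%22C%2B%2B+Programmer%22
      }
  }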
Re: history
Coincidentally, I have a very similar use case. Thanks for the advice.

On 7/8/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 7/7/07, Brian Whitman [EMAIL PROTECTED] wrote: I have been trying to plan out a history function for Solr. When I update a document with an existing unique key, I would like the older version to stay around and get tagged with the date and some metadata to indicate it's not live. Any normal search would not touch history documents.

Interesting... One might be able to accomplish this with the update processors that Ryan and I have been batting around for the last few days, in conjunction with updateable documents, which is on-deck. The first idea that comes to mind is that during an update, you could change the id of the older document to something like id_timestamp, and reindex it with the addition of a live:false field. For normal queries, use a filter of -live:false. For all old versions of a document, use a prefix query id:mydocid_*; for all versions of a document, use the query id:mydocid*. So if you can hold off a little bit, you shouldn't need a custom query handler. This will be a good use case to ensure that our request processors and updateable documents are powerful enough. -Yonik

-- Regards, Cuong Hoang
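To make the scheme concrete, the reindex-the-old-version step could be expressed with a plain XML update. Everything here (the timestamped id, the live field) illustrates Yonik's idea, not an existing Solr feature:

  <!-- Archive the previous version under a new id before re-adding
       the live document. -->
  <add>
    <doc>
      <field name="id">mydocid_20070708</field>
      <field name="live">false</field>
      <!-- plus a copy of the old document's remaining fields -->
    </doc>
  </add>

Normal searches then add fq=-live:false, and id:mydocid_* matches only the archived versions.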
Re: Dynamic fields performance question
Thanks Yonik. I think both of the conditions hold true for our application ;).

On 3/27/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 3/26/07, climbingrose [EMAIL PROTECTED] wrote: I'm developing an application that potentially creates thousands of dynamic fields. Does anyone know if a large number of dynamic fields will degrade Solr performance?

Thousands of fields won't be a problem as long as: - you don't sort on most of them (sorting by a field takes up memory) - you can omit norms on most of them. Provided the above is true, differences in searching and indexing performance shouldn't be noticeable. -Yonik

-- Regards, Cuong Hoang
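As a sketch of the second condition, norms are switched off per field (or per dynamic field pattern) in schema.xml; the pattern name below is an example, not something from the thread:

  <!-- schema.xml: dynamic string fields with norms omitted. Norms cost
       one byte per document per indexed field that carries them, even
       for documents with no value in that field, which is what hurts
       with thousands of fields. -->
  <dynamicField name="*_s" type="string" indexed="true" stored="true"
                omitNorms="true"/>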
Dynamic fields performance question
Hi all, I'm developing an application that potentially creates thousands of dynamic fields. Does anyone know if a large number of dynamic fields will degrade Solr performance? Thanks. -- Regards, Cuong Hoang
Solr use case
Hi all, Is it true that Solr is mainly used for applications that rarely change the underlying data? As I understand it, if you submit new data or modify existing data on the Solr server, you have to refresh the cache somehow to display the updated data. If my application frequently gets new data/updates from users, should I use Solr? I love faceted browsing and dynamic properties so much, but I need to justify the choice of Solr. Thanks. By the way, does anyone have any performance measurements that can be shared (apart from the ones on the Wiki)? As I estimate, my application will probably have half a million docs, each of which has around 15 properties; does anyone know what type of hardware I would need for reasonable performance? Thanks. -- Regards, Cuong Hoang
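For context on the cache refresh mentioned above: Solr picks up new and changed documents when a commit opens a new searcher, at which point the caches are rebuilt (and optionally autowarmed), so frequent updates simply mean frequent commits. A standard way to trigger one (default host and port assumed):

  curl http://localhost:8983/solr/update \
       -H 'Content-Type: text/xml' \
       --data-binary '<commit/>'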
Multiple schemas
Hi all, Am I right that we can only have one schema per solr server? If so, how would you deal with the issue of submitting completely different data models (such as clothes and cars)? Thanks. -- Regards, Cuong Hoang
Re: Mobile phone shop + Solr
I probably need to visualise my models: MobileInfo (1) --- (1..*) SellingItem. MobileInfo has many fields describing the characteristics of a mobile phone model (colour, size...). SellingItem is an instance of a MobileInfo that is currently being sold by a user. So in ERD terms, SellingItem would have a foreign key, say MobileInfoId, that references the primary key of MobileInfo. Now obviously, I need to index MobileInfo to support faceted browsing. How should I index SellingItem? The simplest way is probably to combine the mobile phone specs in MobileInfo with the fields in SellingItem, and then index all of them. In this case, if I have 1000 SellingItems referencing a particular MobileInfo, I have to repeat the MobileInfo fields a thousand times.

On 9/13/06, Chris Hostetter [EMAIL PROTECTED] wrote: : Because the mobile phone info has many fields (40), I don't want to : repeatedly submit it to Solr. i'm not really sure what you mean by "repeatedly submit to Solr" or how it relates to having more than 40 fields. 40 fields really isn't that many. To give you a basis of comparison: the last Solr index i built from scratch had 47 field declarations, and 4 dynamicField declarations... those 4 dynamic fields result in approximately 1200 'fields' in the index -- not every document has a value for every field, but the average is above 200 fields per document. -Hoss

-- Regards, Cuong Hoang
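A minimal sketch of the denormalised option being weighed here, with made-up field values: each SellingItem becomes one Solr document carrying copies of its MobileInfo fields, so faceting works without joins, at the cost of repeating those fields per item:

  <add>
    <doc>
      <field name="id">sellingitem-42</field>
      <field name="price">299.00</field>
      <field name="seller">some-user</field>
      <!-- copied (denormalised) from the referenced MobileInfo -->
      <field name="mobileInfoId">nokia-n95</field>
      <field name="color">silver</field>
    </doc>
  </add>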