RE: Newbie Question - getting search results from dataimport request handler
As part of the ETL effort, please consider how to integrate with these two open-source ETL systems. I'm not asking for an implementation, just suggesting that having a concrete context will help you in the architecture phase.

http://kettle.pentaho.org/
http://www.talend.com/products-data-integration/talend-open-studio.php

Thanks,
Lance

-----Original Message-----
From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 21, 2008 8:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Newbie Question - getting search results from dataimport request handler

On Sat, Nov 22, 2008 at 3:10 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : > it might be worth considering a new @attribute for fields to indicate
> : > that they are going to be used purely as "component" fields (ie: your
> : > first-name/last-name example) and then have DIH pass all non-component
> : > fields along and error if undefined in the schema just like other
> : > updating RequestHandlers do.
> : >
> : > either that, or require that people declare indexed="false"
> : > stored="false" fields in the schema for these intermediate component
> : > fields so that we can properly warn them when DIH is getting data it
> : > doesn't know what to do with -- protecting people from field name typos
> : > and returning errors instead of silently ignoring unexpected input is
> : > fairly important behavior -- especially for new users.
>
> : Actually it is done by DIH. When the dataconfig is loaded, DIH reports
> : this information on the console. Though it is limited, it helps to a
> : certain extent.
>
> Hmmm.
>
> Logging an error and returning successfully (without adding any docs)
> is still inconsistent with the way all other RequestHandlers work:
> fail the request.
>
> I know DIH isn't a typical RequestHandler, but some things (like
> failing on failure) seem like they should be a given.

SOLR-842. DIH is an ETL tool pretending to be a RequestHandler.
Originally it was built to run outside of Solr using SolrJ. For better integration and ease of use we changed it later. SOLR-853 aims to achieve the original goal. The goal of DIH is to become a full-featured ETL tool.

> -Hoss

--
--Noble Paul
DIH and repeated chunked input
In http://wiki.apache.org/solr/DataImportHandler there is this paragraph:

  If an API supports chunking (when the dataset is too large) multiple calls need to be made to complete the process. XPathEntityProcessor supports this with a transformer. If the transformer returns a row which contains a field $hasMore with the value "true", the processor makes another request with the same url template (the actual value is recomputed before invoking). A transformer can also pass a totally new url for the next call by returning a row which contains a field $nextUrl, whose value must be the complete url for the next call.

Does this translate as: "Nobody wrote this yet, but it would be really cool"?

Thanks,
Lance
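Whatever the implementation status, the transformer contract the wiki describes is just row-map manipulation: the transformer puts the special keys into the row it hands back. A standalone sketch of that idea (plain java.util.Map, not Solr's actual Transformer API; the class name, paging parameters, and URL are all illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class ChunkSignal {
    // Sketch: given a parsed row and some paging state, signal the
    // processor to fetch again. In real DIH, a Transformer would put
    // these keys into the row it returns to XPathEntityProcessor.
    static Map<String, Object> transformRow(Map<String, Object> row, int page, int lastPage) {
        if (page < lastPage) {
            row.put("$hasMore", "true"); // re-invoke with the same url template
            row.put("$nextUrl", "http://example.com/feed?page=" + (page + 1)); // or hand back an explicit next url
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("title", "doc1");
        // Page 1 of 3: row comes back carrying $hasMore and $nextUrl.
        System.out.println(transformRow(row, 1, 3));
    }
}
```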
RE: Regex Transformer Error
There is a nice HTML stripper inside Solr: "solr.HTMLStripStandardTokenizerFactory".

-----Original Message-----
From: Ahmed Hammad [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 05, 2008 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Regex Transformer Error

Hi,

It works with the attribute regex="&lt;(.|\n)*?&gt;"

Sorry for the disturbance.

Regards,
ahmd

On Wed, Nov 5, 2008 at 8:18 PM, Ahmed Hammad <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am using the Solr 1.3 data import handler. One of my table fields has
> HTML tags, and I want to strip them from the field text. So obviously I
> need the Regex Transformer.
>
> I added a transformer="RegexTransformer" attribute to my entity and a
> new field with:
>
> replaceWith="X"/>
>
> Everything works fine. The text is replaced without any problem. The
> problem happened with my regular expression to strip HTML tags. So I
> use regex="<(.|\n)*?>". Of course the characters '<' and '>' are not
> allowed in XML. I tried the following regex="<(.|\n)*?>" and
> regex="C;(.|\n)*?E;" but I get the following error:
>
> The value of attribute "regex" associated with an element type "field"
> must not contain the '<' character.
>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source) ...
>
> The full stack trace is the following:
>
> FATAL: Could not create importer. DataImporter config invalid
> org.apache.solr.common.SolrException: FATAL: Could not create importer.
> DataImporter config invalid
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:114)
>   at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:206)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>   at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
>   at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
>   at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
>   at java.lang.Thread.run(Unknown Source)
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> Exception occurred while initializing context Processing Document #
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:176)
>   at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:93)
>   at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106)
>   ... 17 more
> Caused by: org.xml.sax.SAXParseException: The value of attribute "regex"
> associated with an element type "field" must not contain the '<' character.
>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
>   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>   at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:166)
>   ... 19 more
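The escaped form works because the XML parser decodes &lt; back into a literal '<' before RegexTransformer ever compiles the pattern. The pattern itself can be exercised outside of DIH; a standalone sketch (plain Java, not DIH code):

```java
public class StripTags {
    // The regex from the thread: lazily match anything between '<' and '>',
    // with (.|\n) so tags broken across lines are still matched.
    static final String TAG = "<(.|\n)*?>";

    static String strip(String html) {
        return html.replaceAll(TAG, "");
    }

    public static void main(String[] args) {
        // Tags are removed; the text between them survives.
        System.out.println(strip("<p>Opened <b>Form</b> of a <i>diazepine</i></p>"));
    }
}
```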
RE: DIH Http input bug - problem with two-level RSS walker
The inner entity drills down and gets more detail about each item in the outer loop. It creates one document. -Original Message- From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 10:24 PM To: solr-user@lucene.apache.org Subject: Re: DIH Http input bug - problem with two-level RSS walker On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > I wrote a nested HttpDataSource RSS poller. The outer loop reads an > rss feed which contains N links to other rss feeds. The nested loop > then reads each one of those to create documents. (Yes, this is an > obnoxious thing to do.) Let's say the outer RSS feed gives 10 items. > Both feeds use the same > structure: /rss/channel with a node and then N nodes > inside the channel. This should create two separate XML streams with > two separate Xpath iterators, right? > > > > > > > > > > > This does indeed walk each url from the outer feed and then fetch the > inner rss feed. Bravo! > > However, I found two separate problems in xpath iteration. They may be > related. The first problem is that it only stores the first document > from each "inner" feed. Each feed has several documents with different > title fields but it only grabs the first. > The idea behind nested entities is to join them together so that one Solr document is created for each root entity and the child entities provide more fields which are added to the parent document. I guess you want to create separate Solr documents from the root entity as well as the child entities. I don't think that is possible with nested entities. Essentially, you are trying to crawl feeds, not join them. Probably an integration with Apache Droids can be thought about. http://incubator.apache.org/projects/droids.html http://people.apache.org/~thorsten/droids/ If you are going to crawl only one level, there may be a workaround. 
However, it may be easier to implement all this with your own Java program and just post results to Solr as usual. > The other is an off-by-one bug. The outer loop iterates through the 10 > items and then tries to pull an 11th. It then gives this exception > trace: > > INFO: Created URL to: [inner url] > Oct 31, 2008 11:21:20 PM > org.apache.solr.handler.dataimport.HttpDataSource > getData > SEVERE: Exception thrown while getting data > java.net.MalformedURLException: no protocol: null/account.rss >at java.net.URL.(URL.java:567) >at java.net.URL.(URL.java:464) >at java.net.URL.(URL.java:413) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour > ce.jav > a:90) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour > ce.jav > a:47) >at > > org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.j > ava:18 > 3) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPat > hEntit > yProcessor.java:210) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(X > PathEn > tityProcessor.java:180) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathE > ntityP > rocessor.java:160) >at > > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.j ava: > 285) > ... > Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder > buildDocument > SEVERE: Exception while processing: album document : > SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}] > org.apache.solr.handler.dataimport.DataImportHandlerException: > Exception in invoking url null Processing Document # 11 >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour > ce.jav > a:115) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour > ce.jav > a:47) > > > > > > -- Regards, Shalin Shekhar Mangar.
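The "no protocol: null/account.rss" message in the trace above is the signature of a url template whose variable resolved to null: the literal string "null" gets concatenated into the url, and java.net.URL then rejects it. The failure mode is easy to reproduce in isolation (a standalone sketch, not DIH code; the template shape is illustrative):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class NullUrlDemo {
    // Mimics a url template like "${outer.link}/account.rss" after the
    // variable resolved to null: "null" is concatenated in as text.
    static String buildUrl(String base) {
        return base + "/account.rss";
    }

    public static void main(String[] args) {
        try {
            new URL(buildUrl(null)); // "null/account.rss" has no protocol
        } catch (MalformedURLException e) {
            // Message resembles the one in the trace: no protocol: null/account.rss
            System.out.println(e.getMessage());
        }
    }
}
```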
RE: customizing results in StandardQueryHandler
Ah! This will let you post-process result sets with an XSL script: http://wiki.apache.org/solr/XsltResponseWriter -Original Message- From: Manepalli, Kalyan [mailto:[EMAIL PROTECTED] Sent: Friday, October 24, 2008 11:44 AM To: solr-user@lucene.apache.org Subject: RE: customizing results in StandardQueryHandler Ryan, Actually, what I need is: I always query for a set of fields say (f1, f2, f3 .. f6). Now once I get the results, based on some logic, I need to generate the XML which is customized and contains only fields say (f2, f3, and some new data). So the fl will always be (f1 ... f6) Thanks, Kalyan Manepalli -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Friday, October 24, 2008 1:25 PM To: solr-user@lucene.apache.org Subject: Re: customizing results in StandardQueryHandler isn't this just: fl=f1,f3,f4 etc or am I missing something? On Oct 24, 2008, at 12:26 PM, Manepalli, Kalyan wrote: > Hi, > In my usecase, I query a set of fields. Then based on the results, I > want to output a customized set of fields. Can I do this without using > a search component? > E:g. I query for fields f1, f2, f3, f4. Now based on some conditions, > I want to output just f1, f3, f4 (the list of final fields may vary). > > How do I rewrite the resultant xml optimally? > Any thoughts on this will be helpful > > Thanks, > Kalyan
RE: scaling / sharding questions
I cannot facet on one huge index; it runs out of ram when it attempts to allocate a giant array. If I store several shards in one JVM, there is no problem. Are there any performance benefits to a large index v.s. several small indexes? Lance -Original Message- From: Marcus Herou [mailto:[EMAIL PROTECTED] Sent: Sunday, June 15, 2008 10:24 PM To: solr-user@lucene.apache.org Subject: Re: scaling / sharding questions Yep got that. Thanks. /M On Sun, Jun 15, 2008 at 8:42 PM, Otis Gospodnetic < [EMAIL PROTECTED]> wrote: > With Lance's MD5 schema you'd do this: > > 1 shard: 0-f* > 2 shards: 0-8*, 9-f* > 3 shards: 0-5*, 6-a*, b-f* > 4 shards: 0-3*, 4-7*, 8-b*, c-f* > ... > 16 shards: 0*, 1*, 2*... d*, e*, f* > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > - Original Message > > From: Marcus Herou <[EMAIL PROTECTED]> > > To: solr-user@lucene.apache.org > > Cc: [EMAIL PROTECTED] > > Sent: Saturday, June 14, 2008 5:53:35 AM > > Subject: Re: scaling / sharding questions > > > > Hi. > > > > We as well use md5 as the uid. > > > > I guess by saying each 1/16th is because the md5 is hex, right? (0-f). > > Thinking about md5 sharding. > > 1 shard: 0-f > > 2 shards: 0-7:8-f > > 3 shards: problem! > > 4 shards: 0-3 > > > > This technique would require that you double the amount of shards > > each > time > > you split right ? > > > > Split by delete sounds really smart, damn that I did'nt think of > > that :) > > > > Anyway over time the technique of moving the whole index to a new > > shard > and > > then delete would probably be more than challenging. > > > > > > > > > > I will never ever store the data in Lucene mainly because of bad exp > > and since I want to create modules which are fast, scalable and > > flexible and storing the data alongside with the index do not match > > that for me at > least. 
> > > > So yes I will have the need to do a "foreach id in ids get document" > > approach in the searcher code, but at least I can optimize the > > retrieval > of > > docs myself and let Lucene do what it's good at: indexing and > > searching > not > > storage. > > > > I am more and more thinking in terms of having different levels of > searching > > instead of searcing in all shards at the same time. > > > > Let's say you start with 4 shards where you each document is > > replicated 4 times based on publishdate. Since all shards have the > > same data you can > lb > > the query to any of the 4 shards. > > > > One day you find that 4 shards is not enough because of search > performance > > so you add 4 new shards. Now you only index these 4 new shards with > > the > new > > documents making the old ones readonly. > > > > The searcher would then prioritize the new shards and only if the > > query returns less than X results you start querying the old shards. > > > > This have a nice side effect of having the most relevant/recent > > entries > in > > the index which is searched the most. Since the old shards will be > > mostly idle you can as well convert 2 of the old shards to "new" > > shards reducing the need for buying new servers. > > > > What I'm trying to say is that you will end up with an architecture > > which have many nodes on top which each have few documents and fewer > > and fewer nodes as you go down the architecture but where each node > > store more documents since the search speed get's less and less relevant. > > > > Something like this: > > > > - Primary: 10M docs per shard, make sure 95% of the results > comes > > from here. > > - Standby: 100M docs per shard - merges of 10 primary indices. > > zz - Archive: 1000M docs per shard - merges of 10 standby indices. > > > > Search top-down. > > The numbers are just speculative. 
The drawback with this > > architecture is that you get no indexing benefit at all if the > > architecture drawn above > is > > the same as which you use for indexing. I think personally you > > should use > X > > indexers which then merge indices (MapReduce) for max performance > > and lay them out as described above. > > > > I think Google do something like this. > > > > > > Kindly > > > > //Marcus > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Jun 14, 2008 at 2:27 AM, Lance Norskog wrote: > > > > > Yes, I've done this split-by-delete several times. The halved > > > index > still > > > uses as much disk space until you optimize it. > > > > > > As to splitting policy: we use an MD5 signature as our unique ID. > > > This > has > > > the lovely property that we can wildcard. 'contentid:f*' denotes > > > 1/16 > of > > > the whole index. This 1/16 is a very random sample of the whole index. > We > > > use this for several things. If we use this for shards, we have a > > > query that matches a shard's contents. > > > > > > The Solr/Lucene syntax does not support modular arithmetic,and so > > > it > will > > > not let you query a subset
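The hex-prefix scheme discussed above is cheap to compute at index time. A standalone sketch (method names and the sample id are illustrative) that hashes a unique id and routes it to a shard by the leading hex digit of its MD5:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Shard {
    // Hex-encoded MD5 of the unique id, as in the thread's contentid field.
    static String md5Hex(String id) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(id.getBytes())) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    // Route by leading hex digit: 'contentid:f*' style wildcards then select
    // exactly 1/16 of the index. Buckets are only equal-sized when numShards
    // divides 16 -- hence the thread's "3 shards: problem!".
    static int shardOf(String id, int numShards) {
        int digit = Character.digit(md5Hex(id).charAt(0), 16);
        return digit * numShards / 16;
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("doc-42") + " -> shard " + shardOf("doc-42", 4));
    }
}
```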
RE: Strategy for presenting fresh data
You can also use a shared file system mounted on a common SAN. (This is a high-end server configuration.) -Original Message- From: James Brady [mailto:[EMAIL PROTECTED] Sent: Thursday, June 12, 2008 9:59 AM To: solr-user@lucene.apache.org Subject: Re: Strategy for presenting fresh data >> >> In the meantime, I had imagined that, although clumsy, federated >> search could be used for this purpose - posting the new documents to >> a group of servers ('latest updates servers') with v limited amount >> of documents with v. fast "reload / refresh" times, and sending them >> again (on a work queue, possibly), to the 'core servers'. Regularly >> cleaning the 'latest updates servers' >> of the >> already posted documents to 'core servers' would keep them lean... >> of course, >> this approach sucks compared to a proper solution like what James is >> suggesting >> :) >> Otis - is there an issue I should be looking at for more information on this? Yes, in principle, sending updates both to a fresh, forgetful and fast index and a larger, slower index is what I'm thinking of doing. The only difference is that I'm talking about having the fresh index be implemented as a RAMDirectory in the same JVM as the large index. This means that I can avoid the slowness of cross-disk or cross- machine replication, I can avoid having to index all documents in two places and I cut out the extra moving part of federated search. On the other hand, I am going to have to write my own piece to handle the index flushes and federate searches to the fast and large indices. Thanks for your input! James
XSL scripting
This started out in the num-docs thread, but deserves its own. And a wiki page.

There is a more complex and general way to get the number of documents in the index. I run a query against Solr and postprocess the output with an XSL script.

Install this XSL script as home/conf/xslt/numfound.xsl:

http://www.w3.org/1999/XSL/Transform";>

Make sure 'curl' is installed, and add numfound.sh, a Unix shell script:

  SHARD=localhost:8080/solr
  QUERY="$1"
  LINK="http://$SHARD/select?indent=on&version=2.2&q=$QUERY&start=0&rows=0&fl=*&wt=xslt&tr=numfound.xsl"
  curl --silent "$LINK" -H "Content-Type:text" -X GET

Run it as:

  sh numfound.sh "*:*"

How to install the XSLT script is to be found on the Wiki. Star-colon-star is magic for 'all records'. XSL is appalling garbage.

Cheers!
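The stylesheet body was stripped by the mail archive, leaving only the xmlns URL. A minimal numfound.xsl consistent with the description (emit just the numFound attribute of Solr's result element) might look like the following — a reconstruction, not the original script:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Solr's standard XML response carries the count at /response/result/@numFound -->
  <xsl:template match="/">
    <xsl:value-of select="response/result/@numFound"/>
  </xsl:template>
</xsl:stylesheet>
```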
RE: Solr indexing configuration help
Solr 1.2 ignores the 'number of documents' attribute. It honors the "every 30 minutes" attribute. Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Sunday, June 01, 2008 6:47 AM To: solr-user@lucene.apache.org Subject: Re: Solr indexing configuration help On Sun, Jun 1, 2008 at 4:43 AM, Gaku Mak <[EMAIL PROTECTED]> wrote: > I have tried Yonik's suggestions with the following: > 1) all autowarming are off > 2) commented out firstsearch and newsearcher event handlers > 3) increased autocommit interval to 600 docs and 30 minutes > (previously 50 docs and 5 minutes) Glad it looks like your memory issues are solved, but I really wouldn't use "docs" at all for an autocommit criteria it will just slow down your full index builds. -Yonik > In addition, I updated the java option with the following: > -d64 -server -Xms2048M -Xmx3072M -XX:-HeapDumpOnOutOfMemoryError > -XX:+UseSerialGC > > Results: > I'm currently at 100,000 documents now with about 9.0GB index on a > quad machine with 4GB ram. The stress test is to add 20 documents > every 30 seconds now. > > It seems like the serial GC works better than the other two > alternatives (-XX:+UseParallelGC or -XX:+UseConcMarkSweepGC) for some > reason. I have not seen any OOM since the changes mentioned above > (yet). If others have better experience with other GC and know how to > configure it properly, please let me know because using serial GC just doesn't sound right on a quad machine. > > Additional questions: > Does anyone know how solr/lucene use heap in terms of their > generations (young vs tenured) on the indexing environment? If we > have this answer, we would be able to better configure the > young/tenured ratio in the heap. Any help is appreciated! Thanks! > > Now, I'm looking into configuring the slave machines. Well, that's a > separate question. 
> > > > Yonik Seeley wrote: >> >> Some things to try: >> - turn off autowarming on the master >> - turn off autocommit, unless you really need it, or change it to be >> less agressive: autocommitting every 50 docs is bad if you are >> rapidly adding documents. >> - set maxWarmingSearchers to 1 to prevent the buildup of searchers >> >> -Yonik >> >> On Fri, May 30, 2008 at 3:39 PM, Gaku Mak <[EMAIL PROTECTED]> wrote: >>> >>> I started running the test on 2 other machines with similar specs >>> but more RAM (4G). One of them now has about 60k docs and still >>> running fine. On the other machine, solr died at about 43k docs. A >>> short while before solr died, I saw that there were 5 searchers at >>> the same time. Do any of you know why would solr create 5 searchers, >>> and if that could cause solr to die? Is there any way to prevent >>> this? Also is there a way to totally disable the searcher and >>> whether that is a way to optimize the solr master? >>> >>> I copied the following from the SOLR Statistics page in case it has >>> interested info: >>> >>> name:[EMAIL PROTECTED] main >>> class: org.apache.solr.search.SolrIndexSearcher >>> version:1.0 >>> description:index searcher >>> stats: caching : true >>> numDocs : 42754 >>> maxDoc : 42754 >>> readerImpl : MultiSegmentReader >>> readerDir : >>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so >>> lr/data/index >>> indexVersion : 1211702500453 >>> openedAt : Fri May 30 10:04:15 PDT 2008 registeredAt : Fri May 30 >>> 10:05:05 PDT 2008 >>> >>> name: [EMAIL PROTECTED] main >>> class: org.apache.solr.search.SolrIndexSearcher >>> version:1.0 >>> description:index searcher >>> stats: caching : true >>> numDocs : 42754 >>> maxDoc : 42754 >>> readerImpl : MultiSegmentReader >>> readerDir : >>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so >>> lr/data/index >>> indexVersion : 1211702500453 >>> openedAt : Fri May 30 10:03:24 PDT 2008 registeredAt : Fri May 30 >>> 10:03:41 PDT 2008 
>>> >>> name: [EMAIL PROTECTED] main >>> class: org.apache.solr.search.SolrIndexSearcher >>> version:1.0 >>> description:index searcher >>> stats: caching : true >>> numDocs : 42675 >>> maxDoc : 42675 >>> readerImpl : MultiSegmentReader >>> readerDir : >>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so >>> lr/data/index >>> indexVersion : 1211702500450 >>> openedAt : Fri May 30 10:00:53 PDT 2008 registeredAt : Fri May 30 >>> 10:01:05 PDT 2008 >>> >>> name: [EMAIL PROTECTED] main >>> class: org.apache.solr.search.SolrIndexSearcher >>> version:1.0 >>> description:index searcher >>> stats: caching : true >>> numDocs : 42697 >>> maxDoc : 42697 >>> readerImpl : MultiSegmentReader >>> readerDir : >>> org.apache.lucene.store.FSDirectory@/var/lib/solr/peoplesolr_0002/so >>> lr/data/index >>> indexVersion : 1211702500451 >>> openedAt : Fri May 30 10:02:20 PDT 2008 registeredAt : Fri May 30 >>> 10:02:22 PDT 2008 >>> >>> name: [EMAIL PROTECTED] main >>> class: org.ap
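For reference, the autocommit knobs discussed in this thread live in the update handler section of solrconfig.xml. A fragment matching the "600 docs and 30 minutes" settings above (and, per Yonik's advice, the docs criterion can simply be dropped):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>600</maxDocs>      <!-- consider removing this criterion entirely -->
    <maxTime>1800000</maxTime>  <!-- 30 minutes, in milliseconds -->
  </autoCommit>
</updateHandler>
```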
RE: MultiCore on Wiki
I think I meant: this writeup implies to me that two cores could share the same "default" index. I don't see how this would work, or be useful. Thanks, Lance Norskog -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 30, 2008 9:30 PM To: solr-user@lucene.apache.org Subject: Re: MultiCore on Wiki On Apr 30, 2008, at 11:52 PM, Lance Norskog wrote: > The MultiCore writeup on the Wiki > (http://wiki.apache.org/solr/MultiCore > ) > says: > > ... > Configuration->core->dataDir > The data directory for a given core. (optional) > > How can a core not have its own dataDir? What happens if this is not > set? > It defaults to the "normal" location, that is whatever is specified in solrconfig.xml or "data" relative to the solr.home for that directory. (I'm not looking at the code now, so try it out and see...) ryan
RE: indexing text containing xml tags
We wrap everything in CDATA tags. Works great.

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 18, 2008 10:41 PM
To: [EMAIL PROTECTED]
Cc: solr-user@lucene.apache.org
Subject: Re: indexing text containing xml tags

CC'ing the solr-user mailing list because that is the right list for usage questions.

You'll need to XML-encode your title field. Basically you need to replace '<' with &lt; etc., then you will be able to index them.

On Sat, Apr 19, 2008 at 10:54 AM, Saurabh Kataria <[EMAIL PROTECTED]> wrote:
>
> Hi everyone,
>
> I am having a problem while indexing my document. A very typical field
> of my document looks like:
>
> pKa Values of the Opened Form of a Thieno-1,2,4-triazolo-1,4-diazepine
> in Water
>
> Solr has a problem indexing this because of the XML tags. I was
> wondering if there is any way that I can index this field "title"
> without stripping off my tags. If anyone could help me out, that would
> be great.
>
> Thanks,
> SK.

--
Regards,
Shalin Shekhar Mangar.
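Either approach (entity-encoding or CDATA) yields well-formed XML for posting. A minimal sketch of both in plain Java (helper names are illustrative):

```java
public class XmlEscape {
    // Entity-encode the characters that break XML content.
    // '&' must be replaced first so it doesn't re-escape the entities.
    static String encode(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Or wrap in CDATA (assumes the text itself contains no "]]>").
    static String cdata(String s) {
        return "<![CDATA[" + s + "]]>";
    }

    public static void main(String[] args) {
        String title = "pKa Values of the <Opened> Form";
        System.out.println(encode(title));
        System.out.println(cdata(title));
    }
}
```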
RE: capping term frequency?
Doing this well is harder. Giving a spam score to each page and boosting by a function on this score is probably a stronger tool. Can't remember where I found the paper below; it gives a solid spam-score algorithm built from several easy-to-code text analyses, plus a scoring function. This assumes you pre-process.

  Detecting Spam Web Pages through Content Analysis
  WWW 2006, May 23-26, 2006, Edinburgh, Scotland. ACM 1-59593-323-9/06/0005.

Also "Z. Gyongyi and H. Garcia-Molina" have some interesting papers.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 11, 2008 1:12 PM
To: solr-user@lucene.apache.org
Subject: Re: capping term frequency?

Hi,

Probably by writing your own Similarity (Lucene codebase) and implementing the following method with capping:

  /** Implemented as sqrt(freq). */
  public float tf(float freq) {
      return (float) Math.sqrt(freq);
  }

Then put that custom Similarity in a jar in Solr's lib and specify your Similarity FQCN at the bottom of solrconfig.xml.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: peter360 <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, April 11, 2008 2:16:53 PM
Subject: capping term frequency?

Hi,

How do I cap the term frequency when computing relevancy scores in Solr? The problem is if a keyword repeats many times in the same document, I don't want it to hijack the relevancy score. Can I tell Solr to cap the term frequency at a certain threshold? Thanks.

--
View this message in context: http://www.nabble.com/capping-term-frequency--tp16628189p16628189.html
Sent from the Solr - User mailing list archive at Nabble.com.
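The capping itself is one line on top of the default sqrt curve. A standalone sketch of the scoring change (not an actual Lucene Similarity subclass, which would need the Lucene jar on the classpath; the cap value is illustrative):

```java
public class CappedTf {
    static final float CAP = 5f; // illustrative threshold

    // Default Lucene behavior, as in Otis's snippet: sqrt(freq).
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Capped variant: repetitions beyond CAP stop raising the score,
    // so a keyword stuffed 100 times scores no higher than 5 occurrences.
    static float cappedTf(float freq) {
        return (float) Math.sqrt(Math.min(freq, CAP));
    }

    public static void main(String[] args) {
        System.out.println("tf(100) = " + tf(100f) + ", cappedTf(100) = " + cappedTf(100f));
    }
}
```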
RE: Facet Query
Ok. I have a query that returns a set A. Doing a facet on field F gives me: All values of F in the index given as count(*) And these values can include 0. I add a facet query that returns B. The facet operation now returns count(*) on only the values of F that are found in query B. Query B is only used as a set, none of the counts in query B are used. Is this it? Thanks, Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Friday, April 11, 2008 1:36 PM To: solr-user@lucene.apache.org Cc: Norskog, Lance Subject: Re: Facet Query On Fri, Apr 11, 2008 at 4:32 PM, Lance Norskog <[EMAIL PROTECTED]> wrote: > What do facet queries do that is different from the regular query? > What is a use case where I would use a facet.query in addition to the regular query? It returns the number of documents that match the query AND the facet.query. -Yonik
RE: indexing slow, IO-bound?
Also, Linux has optional file systems that might be better for this; we plan to try them. ReiserFS and XFS have good reputations. (Reiser himself, that's a different story :(

Cheers,
Lance

-----Original Message-----
From: Mike Klaas [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 07, 2008 12:04 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing slow, IO-bound?

On 5-Apr-08, at 7:09 AM, Britske wrote:
> Indexing of these documents takes a long time. Because of the size of
> the documents (because of the indexed fields) I am currently batching
> 50 documents at once, which takes about 2 seconds. Without adding the
> 1 indexed fields to the document, indexing flies at about 15 ms
> for these 50 documents. Indexing is done using SolrJ.
>
> This is on an Intel Core 2 6400 @ 2.13GHz and 2 GB RAM.
>
> To speed this up I let 2 threads do the indexing in parallel. What
> happens is that Solr just takes double the time (about 4 seconds) to
> complete these two jobs of 50 docs each in parallel. I figured because
> of the multi-core setup indexing should improve, which it doesn't.

Multiple processors really only help indexing speeds when there is heavy analysis.

> Does this perhaps indicate that the setup is IO-bound? What would be
> your best guess (given the fact that the schema has a big amount of
> indexed fields) to try next to improve indexing performance?

Use Lucene 2.3 with Solr 1.2, or simply try out Solr trunk. The indexing has been reworked to be considerably faster (it also makes better use of multiple processors by spawning a background merging thread).

-Mike
RE: Merging Solr index
Thanks! I have learned Solr as a power user and written a couple of simple filters; I'm not a Lucene heavy. Where is this in Lucene? Is it the default? I don't remember Lucene having the notion of a unique id (primary key).

In this merge code, with the latest Lucene 2.3, will the duplicates in solr/data1 override the records in solr/data0? Or the other way around? How do I add the new Lucene implementation?

  try {
      // Open the destination index; create == false appends to it.
      IndexWriter writer = new IndexWriter(new File("solr/data0/index"),
              new StandardAnalyzer(), false);
      // The source index to merge in.
      Directory[] dirs = new Directory[] {
              FSDirectory.getDirectory(new File("solr/data1/index")) };
      System.out.println(writer);
      writer.addIndexes(dirs);
      writer.close();
  } catch (Exception e) {
      e.printStackTrace();
  }

Thanks,
Lance Norskog

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Saturday, April 05, 2008 2:37 PM
To: solr-user@lucene.apache.org
Cc: Norskog, Lance
Subject: Re: Merging Solr index

On Fri, Apr 4, 2008 at 6:26 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> http://wiki.apache.org/solr/MergingSolrIndexes recommends using the
> Lucene contributed app IndexMergeTool to merge two Solr indexes. What
> happens if both indexes have records with the same unique key? Will
> they both go into the new index?

Yes.

> Is the implementation of unique IDs in the Solr java or in Lucene?

Both. It was originally just in Solr, but Lucene now has an implementation. Neither implementation will prevent this, as both just remember documents (in memory) that were added and then periodically delete older documents with the same id.

-Yonik
Merging Solr index
Hi- http://wiki.apache.org/solr/MergingSolrIndexes recommends using the Lucene contributed app IndexMergeTool to merge two Solr indexes. What happens if both indexes have records with the same unique key? Will they both go into the new index? Is the implementation of unique IDs in the Solr java or in Lucene? If it is in Solr, how would I hack up a Solr IndexMergeTool? Cheers, Lance Norskog
RE: Search exact terms
This is confusing advice to a beginner. A string field will not find a word in the middle of a sentence. To get normal searches without this confusion, copy the 'text' type and make a variant without the stemmer. The problem is that you are using an English-language stemmer for what appears to be Dutch. There is a Dutch stemmer; it might be better for your needs if the content is all Dutch. To make an exact-search field which still has helpful searching properties, make another variant of text that breaks up words but does not stem. You might also want to add the ISOLatin1 filter, which maps all European characters to US-ASCII equivalents. This is also very helpful for multi-language searching. Lance -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 02, 2008 7:06 AM To: solr-user@lucene.apache.org Subject: Re: Search exact terms search is based on the fields you index and how you index them. If you index using the "text" field -- with stemming etc, you will have to search with the same criteria. If you want exact search, consider the "string" type. If you want both, you can use the <copyField> directive to copy the same content into multiple fields so it is searchable multiple ways ryan On Apr 2, 2008, at 4:46 AM, Tim Mahy wrote: > Hi all, > > is there a Solr-wide setting with which I can achieve the > following : > > if I now search for q=onderwij, I also receive documents with results > of "onderwijs" etc. This is of course the behavior that is described, > but if I search on "onderwij", I still get the "onderwijs" > hits. I use for this field the type "text" from the schema.xml that is > supplied with the default Solr. > > Is there a global setting on Solr to always search exact? > > Greetings, > > Tim > > > > > > Info Support - http://www.infosupport.com
RE: sort by index id descending?
... another "magic" field name like "score" ... This could be done with a separate "magic" punctuation like $score, $mean (the mean score), etc., so $docid would work. Cheers, Lance -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 18, 2008 9:01 PM To: solr-user Subject: Re: sort by index id descending? : Is there any way to sort by index id - descending? (by order of indexed) Not that i can think of. Lucene already has support for it, so it would probably be a fairly simple patch if someone wanted to try to implement it, we just need some syntax to make the parameter parsing construct the right Sort object -- although I'm loath to add another "magic" field name like "score" since "docid" or "id" or anything else we can think of could easily conflict with a field name in someone's schema. if we add something like this I'd want to add configuration to solrconfig.xml to determine what the "magic" field names for sorting by internal id and score should be. -Hoss
RE: Finding an empty field
It was a surprise to discover that dateorigin_sort:"" is a syntax error, but dateorigin_sort:["" TO *] is legit. Does this mean there's a bug in the Lucene syntax parser? Anyway, with a little more research I discovered this query: http://64.71.164.205:8080/solr/select/?q=*:*&version=2.2&start=0&rows=0&indent=on&facet=true&facet.field=dateorigin_sort&facet.mincount=0&facet.sort=false This query says, "Select all records in the index. For each indexed value of dateorigin_sort, count the number of records with that value." It yields the following output snippet: 0 1 1 Umm... it has an indexed empty value that does not correspond to a record? Is it an unanchored data item in the index? Would optimizing make this index data go away? Thanks, Lance -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Friday, March 14, 2008 4:46 PM To: solr-user@lucene.apache.org Subject: RE: Finding an empty field : dateorigin_sort:"" gives a syntax error. I'm using Solr 1.2. Should : this work in Solr 1.3? Is it legal in a newer Lucene parser? Hmm.. not sure. did you try the range query suggestion? ... : well, technically range queries "work" they just don't "work" on numeric : ranges ... they'd be lexicographical ranges on the string value, so... : : dateorigin_sort:[* TO " "] : : ...could probably help you find anything that is lexicographically lower : than a string representation of an integer (assuming dateorigin_sort:"" : doesn't work) -Hoss
RE: Finding an empty field
dateorigin_sort:"" gives a syntax error. I'm using Solr 1.2. Should this work in Solr 1.3? Is it legal in a newer Lucene parser? message Query parsing error: Cannot parse 'dateorigin_sort:""': Lexical error at line 1, column 19. Encountered: <EOF> after : "\"\"" description The request sent by the client was syntactically incorrect (Query parsing error: Cannot parse 'dateorigin_sort:""': Lexical error at line 1, column 19. Encountered: <EOF> after : "\"\""). Thanks, Lance -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Friday, March 14, 2008 11:38 AM To: solr-user@lucene.apache.org Subject: Re: Finding an empty field : Somehow the index has acquired one record out of millions in which an : integer value has been populated by an empty string. I would like to isolate : this record and remove it. This field exists solely to make sorting faster, : and since it has an empty record, sorting blows up. : : Is it possible to find this record? Is there any way to differentiate : between this record and all of the other records which have real numbers : populated? have you tried searching for... dateorigin_sort:"" ? : This query will isolate records which do not have the field populated. (It : works on all field types.) : -dateorigin_sort:[* TO *] : But, since this record is an integer (not an sint) no other range query : works. well, technically range queries "work" they just don't "work" on numeric ranges ... they'd be lexicographical ranges on the string value, so... dateorigin_sort:[* TO " "] ...could probably help you find anything that is lexicographically lower than a string representation of an integer (assuming dateorigin_sort:"" doesn't work) disclaimer: i haven't actually tested either of these on an index with a bogus integer like you describe ... but i'm pretty sure they should work given what i'm remembering about the code) -Hoss
RE: Tomcat and Solr - out of memory
On Tomcat, an OutOfMemory on a query leaves the server in an OK state, and future queries work. But a facet query that runs out of RAM does not free its undone state, and all future requests get OutOfMemory. Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stuart Sierra Sent: Tuesday, January 08, 2008 7:05 AM To: solr-user@lucene.apache.org Subject: Re: Tomcat and Solr - out of memory On Jan 7, 2008 12:08 PM, Jae Joo <[EMAIL PROTECTED]> wrote: > What happens if Solr application hit the max. memory of heap assigned? > > Will be die or just slow down? In my (limited) experience (with Jetty), Solr will not die but it will return HTTP 500 errors on all requests until it is restarted. -Stuart
RE: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)
As a reference: I have several million records with about 20 fields. One of them is 100-1K bytes, and the rest are 20-50 bytes. There is a reliable 5% performance difference between pulling just the unique key field and pulling all of the fields. -Original Message- From: Geert-Jan Brits [mailto:[EMAIL PROTECTED] Sent: Thursday, December 27, 2007 8:44 AM To: solr-user@lucene.apache.org Subject: Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver) yeah, that makes sense. so, in all, could scanning all the fields and loading the 10 fields add up to cost about the same or even more as performing the initial query? (Just making sure) I am wondering if the following change to the schema would help in this case: current setup: It's possible to have up to 2000 product-variants. each product-variant has: - 1 price field (stored / indexed) - 1 multivalued field which contains product-variant characteristics (stored / not indexed). This adds up to the 4000 fields described. Moreover there are some fields on the product level but these would contribute just a tiny bit to the overall scanning / loading costs (about 50 -stored and indexed- fields in total) possible new setup (only the changes): - index but not store the price-field. - store the price as just another one of the product-variant characteristics in the multivalued product-variant field. as a result this would bring back the maximum number of stored fields to about 2050 from 4050 and thereby about halve scanning / loading costs while leaving the current querying costs intact. Indexing costs would increase a bit. Would you expect the same performance gain?
Thanks, Geert-Jan 2007/12/27, Yonik Seeley <[EMAIL PROTECTED]>: > > On Dec 27, 2007 11:01 AM, Britske <[EMAIL PROTECTED]> wrote: > > after inspecting solrconfig.xml I see that I already have enabled > > lazy field loading by: > > <enableLazyFieldLoading>true</enableLazyFieldLoading> (I guess it > > was enabled by default) > > > > Since any query returns about 10 fields (which differ from query to > > query), would this mean that only these 10 of about 2000-4000 fields > > are retrieved / loaded? > > Yes, but that's not the whole story. > Lucene stores all of the fields back-to-back with no index (there is > no random access to particular stored fields)... so all of the fields > must be at least scanned. > > -Yonik >
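Yonik's point about stored fields can be sketched with a stdlib-only toy model (illustrative names, and not Lucene's real file format): fields for a document are written back-to-back with length prefixes and no per-field offset table, so even "lazy" loading of one field has to walk past every field before it.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Illustrative model only -- not Lucene's actual stored-fields format.
// Each field is written as a length prefix followed by its bytes. With no
// offset table, reading field i means skipping over fields 0..i-1.
public class StoredFieldsModel {
    static byte[] writeDoc(String[] fields) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        for (String f : fields) {
            byte[] bytes = f.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);  // length prefix
            out.write(bytes);            // field payload
        }
        return bos.toByteArray();
    }

    // "Lazily" load field i: earlier fields still have to be scanned past.
    static String readField(byte[] doc, int i) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(doc));
        for (int skipped = 0; skipped < i; skipped++) {
            in.skipBytes(in.readInt());  // skip an earlier field wholesale
        }
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = writeDoc(new String[] {"id-42", "a long description", "9.99"});
        System.out.println(readField(doc, 2));  // 9.99
    }
}
```

This is why cutting the stored-field count (as Geert-Jan proposes above) reduces scanning cost even when only 10 fields are returned.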
RE: An observation on the "Too Many Files Open" problem
In Java, files, database handles, and other external open resources are not automatically closed when the object is garbage-collected. You have to explicitly close the resource. (There is a feature called 'finalization' where you can implement this for your own classes, but it has turned out to be a badly designed feature.) -Original Message- From: Mark Baird [mailto:[EMAIL PROTECTED] Sent: Monday, December 24, 2007 10:25 AM To: Solr Mailing List Subject: An observation on the "Too Many Files Open" problem Running our Solr server (latest 1.2 Solr release) on a Linux machine, we ran into the "Too Many Open Files" issue quite a bit. We've since changed the ulimit max filehandle setting, as well as the Solr mergeFactor setting, and haven't been running into the problem anymore. However we are seeing some behavior from Solr that seems a little odd to me. When we are in the middle of our batch index process and we run the lsof command, we see a lot of open file handles hanging around that reference Solr index files that have been deleted by Solr and no longer exist on the system. The documents we are indexing are potentially very large, so due to various memory constraints we only send 300 docs to Solr at a time, with a commit between each set of 300 documents. Now one of the things that I read may cause old file handles to hang around was if you had an old IndexReader still open pointing to those old files. However whenever you issue a commit to the server it is supposed to close the old IndexReader and open a new one. So my question is, when the Reader is being closed due to a commit, what exactly is happening? Is it just being set to null and a new instance being created? I'm thinking the reader may be sitting around in memory for a while before the garbage collector finally gets to it, and in that time it is still holding those files open. Perhaps an explicit method call that closes any open file handles should occur before setting the reference to null?
After looking at the code, it looks like reader.close() is explicitly being called as long as the closeReader property in SolrIndexSearcher is set to true, but I'm not sure how to check whether that is always getting set to true or not. There is one constructor of SolrIndexSearcher that sets it to false. Any insight here would be appreciated. Are stale file handles something I should just expect from the JVM? I've never run into the "Too Many Files Open" exception before, so this is my first time looking at the lsof command. Perhaps I'm reading too much into the data it's showing me. Mark Baird
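Lance's point above, that garbage collection does not release external handles, can be sketched with a small stdlib example (the TrackedResource class is hypothetical, not Solr code). It uses the era-appropriate try/finally idiom; later Javas (7+) offer try-with-resources for the same purpose.

```java
import java.io.Closeable;

// Sketch: the JVM does not close external resources when an object becomes
// garbage; close() must be called explicitly. TrackedResource is an
// illustrative stand-in for a file or database handle.
public class ExplicitClose {
    static class TrackedResource implements Closeable {
        boolean closed = false;
        public void close() { closed = true; }  // release the handle here
    }

    // Close in finally so the handle is released even if the work throws.
    static TrackedResource useAndClose() {
        TrackedResource r = new TrackedResource();
        try {
            // ... work with the resource ...
        } finally {
            r.close();  // without this, the handle leaks until the GC (or never)
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(useAndClose().closed);  // true
    }
}
```

This is the same reason SolrIndexSearcher calls reader.close() explicitly rather than just dropping the reference and waiting for the collector.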
RE: Solr replication
You can probably find an rsync port for Windows in the GnuWin32 or Cygwin distributions. There is a bigger problem here. To quote myself in another recent mail: The replication scripts use two Unix file system tricks. 1) Files are not directly bound to filenames; instead there is a layer of indirection called an 'inode'. So, multiple file and directory names can point to the same physical file. The "." and ".." directory entries are implemented this way. 2) Physical files are bound to all open file descriptors even after there are no file names for the files. So, file data exists until all file names are gone AND all open files are gone. Windows does not (I think) support these features, even if it uses an indirection in its file system. The hardlink tricks are not available. If you want to replicate with snapshots, you will have to make a complete copy of your new files at the source, and copy those into the index directory at the target. You may have to stop Solr at the source and/or target during these operations. Lance -Original Message- From: Dilip.TS [mailto:[EMAIL PROTECTED] Sent: Monday, December 17, 2007 8:26 PM To: SOLR Subject: RE: Solr replication Hi, I understand that rsync is a Unix/Linux daemon which needs to be enabled/run to achieve Solr Collection Distribution. Do we have any similar support for Solr Collection Distribution in the Windows environment, or do we need to write equivalent commands (in the form of batch files) which will do the same steps as the shell scripts placed under the solr/bin folder? Thanks in advance. Regards, Dilip. -Original Message- From: Bill Au [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 18, 2007 4:00 AM To: [EMAIL PROTECTED] Subject: Re: Solr replication Rsync is a Unix/Linux command. I don't know if it's available on Windows. All the distribution scripts were developed and tested under Unix/Linux. They may or may not work on Windows.
I don't know much about Windows, so if you are running on Windows then I am the wrong person to be asking for help. You may want to use the mailing list to see if anyone is doing collection distribution on Windows. Solr is accessed through HTTP, so you just need to use HTTP (for example, IE) on a Windows system to access a Solr server. Bill On Dec 17, 2007 8:53 AM, Dilip.TS < [EMAIL PROTECTED]> wrote: Hi Bill, I have a basic question (as I'm not an expert in Unix). I understand that rsync is a daemon (similar to services in Windows). I'm not clear about what are the things/steps required to set up this rsyncd daemon? (Don't mind me asking this question again, since I'm not very clear about this.) Does it mean that the Solr servers (both master and slave) should be made to run on a Unix/Linux machine only? How is a client (using a Windows environment) able to access the Solr server running on a Unix/Linux platform? Any links/references would be of great help. Thanks in advance. Regards Dilip -Original Message- From: Bill Au [mailto: [EMAIL PROTECTED] Sent: Saturday, December 15, 2007 1:08 AM To: solr-user@lucene.apache.org; [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: Solr replication On Dec 14, 2007 7:00 AM, Dilip.TS <[EMAIL PROTECTED]> wrote: > Hi, > I have the following requirement for SOLR Collection Distribution using > Embedded Solr with the Jetty server: > > I have different data folders for multiple instances of SOLR within the > same > application. > I'm using the same SOLR_HOME with a single bin and conf folder. > > My query is: > 1) Is it possible to have the same SOLR_HOME for multiple Solr instances > and > still be able to > achieve Solr distribution? > (As I understand it, we need to have a different rsync port for each > Solr instance) Yes, Solr distribution will work for multiple Solr instances even if they all use the same SOLR_HOME.
All the distribution scripts have a command line argument for specifying the data directory. > > 2)Can i get some more information about how to start this rsyncd daemon > and > which is the best way of doing it i.e. to start during system reboot or > doing it manually? Please note that the rsyncd -CollectionDistributionScripts#head-1e6cdce516ecf1eb31bffceaccf2abeb72bd ce81 So it is best to configure the master server to run the rsyncd-start script at system boot time. If the rsync daemon has for some reasons been disabled, it will not be started automatically at system reboot even if it is configured to do so. If rsyncd is started manually, then one will have to remember to start it every time the master server is rebooted. > > 3)Let me know if my understanding is correct. We require 1 Master Server
RE: retrieve lucene "doc id"
Exactly. We have done some projects where we extract records en masse. With this technique we can make a query that will fetch exactly 3000 +-50 records, and walk through every 50 records using the query as a filter. Works pretty well. Lance -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 18, 2007 11:07 AM To: solr-user@lucene.apache.org Subject: Re: retrieve lucene "doc id" Hi Lance, You said: We use the standard (some RFC) text representation of 32 hex characters. This has the advantage that F* pulls 1/16 of the total index, with a completely randomized distribution, F** 1/256, etc. This is very handy for data analysis and document extraction. Could you elaborate on the last sentence? Maybe give an example of what you have in mind? Are you thinking that this, because of uniform distribution, lets you easily get a subset of documents of predictable size and thus have an apriori knowledge of how large of a data set you'll get and work with? Or something else? Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: "Norskog, Lance" <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Monday, December 17, 2007 2:43:55 PM Subject: RE: retrieve lucene "doc id" We are using MD5 to generate our IDs. MD5s are 128 bits creating a very unique and very randomized number for the content. Nobody has ever reported two different data sets that create the same MD5. We use the standard (some RFC) text representation of 32 hex characters. This has the advantage that F* pulls 1/16 of the total index, with a completely randomized distribution, F** 1/256, etc. This is very handy for data analysis and document extraction. MD5 creates 128 bits, but if your index is small enough that you are willing to risk it, you could pick 64 bits and park them in a Java long. 
-Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Monday, December 17, 2007 8:15 AM To: solr-user@lucene.apache.org Subject: Re: retrieve lucene "doc id" Yonik Seeley wrote: > On Dec 17, 2007 1:40 AM, Ben Incani <[EMAIL PROTECTED]> wrote: >> I have converted to using the Solr search interface and I am trying >> to retrieve documents from a list of search results (where previously >> I had used the doc id directly from the lucene query results) and the >> solr id I have got currently indexed is unfortunately configured not be unique! > > Ouch... I'd try to make a unique Id then! > Or barring that, just try to make the query match exactly the docs you > want back (don't do the 2 phase thing). > In 1.3-dev, you can use UUIDField to have solr generate a UUID for each doc. ryan
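The MD5 id scheme described above can be sketched with the JDK alone (the Md5Id class name is illustrative): a 128-bit MD5 rendered as 32 hex characters. Because the digest is effectively uniform, a prefix wildcard selects a predictable fraction of documents — ids starting with one hex digit are ~1/16 of the index, two digits ~1/256, and so on.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of an MD5-based unique id: 128 bits rendered as 32 hex characters.
// Md5Id is an illustrative class, not Solr code.
public class Md5Id {
    static String hexId(String content) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
        // %032x left-pads so leading zero bytes are not dropped.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String id = hexId("abc");
        System.out.println(id);  // 900150983cd24fb0d6963f7d28e17f72
        // Sampling ~1/16 of a collection = keep ids whose first digit matches:
        System.out.println(id.startsWith("9"));  // true
    }
}
```

Walking the whole index in fixed-size slices then becomes a matter of filtering on successively longer hex prefixes, which is the "F* pulls 1/16" trick in the message above.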
RE: Issues with postOptimize
Also, the script itself has to be executable (chmod +x). Lance -Original Message- From: climbingrose [mailto:[EMAIL PROTECTED] Sent: Monday, December 17, 2007 4:38 PM To: solr-user@lucene.apache.org Subject: Re: Issues with postOptimize Make sure that the user running Solr has permission to execute snapshooter. Also, try ./snapshooter instead of snapshooter. Good luck. On Dec 18, 2007 10:57 AM, Sunny Bassan <[EMAIL PROTECTED]> wrote: > I've set up solrconfig.xml to create a snapshot of an index after > doing an optimize, but the snapshot cannot be created because of > permission issues. I've set permissions on the bin, data and log > directories to read/write/execute for all users. Even with these > settings I cannot seem to be able to run snapshooter on the postOptimize event. Any ideas? > Could it be a Java permissions issue? Thanks. > > Sunny > > Config settings: > > snapshooter > /search/replication_test/0/index/solr/bin > true > > Error: > > Dec 17, 2007 7:45:19 AM org.apache.solr.core.RunExecutableListener exec > FINE: About to exec snapshooter > Dec 17, 2007 7:45:19 AM org.apache.solr.core.SolrException log > SEVERE: java.io.IOException: Cannot run program "snapshooter" (in > directory "/search/replication_test/0/index/solr/bin"): > java.io.IOException: error=13, Permission denied
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
>   at java.lang.Runtime.exec(Runtime.java:593)
>   at org.apache.solr.core.RunExecutableListener.exec(RunExecutableListener.java:70)
>   at org.apache.solr.core.RunExecutableListener.postCommit(RunExecutableListener.java:97)
>   at org.apache.solr.update.UpdateHandler.callPostOptimizeCallbacks(UpdateHandler.java:105)
>   at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:516)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:214)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
>   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>   at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
>   at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>   at java.lang.Thread.run(Thread.java:619)
> Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
>   at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
>   at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
>   ... 23 more
> 
> -- Regards, Cuong Hoang
RE: Replication hooks - changing the index while the slave is running ...
It works via two Unix file system tricks. 1) Files are not directly bound to filenames; instead there is a layer of indirection called an 'inode'. So, multiple file and directory names can point to the same physical file. The "." and ".." directory entries are implemented this way. 2) Physical files are bound to all open file descriptors even after there are no file names for the files. So, file data exists until all file names are gone AND all open files are gone. Lance -Original Message- From: Tracy Flynn [mailto:[EMAIL PROTECTED] Sent: Saturday, December 15, 2007 7:36 AM To: solr-user@lucene.apache.org Subject: Re: Replication hooks - changing the index while the slave is running ... That helps. Thanks for the prompt reply. On Dec 15, 2007, at 10:15 AM, Yonik Seeley wrote: > On Dec 14, 2007 7:36 PM, Tracy Flynn > <[EMAIL PROTECTED]> wrote: >> 1) The existing index(es) being used by the Solr slave instance are >> physically deleted >> 2) The new index snapshots are renamed/moved from their temporary >> installation location to the default index location >> 3) The slave is sent a 'commit' to force a new IndexReader to start >> to read the new index. >> >> What happens to search requests against the existing/old index during >> step 1) and between steps 1 and 2? > > Search requests will still work on the old searcher/index. > >> Where do they get information if >> they need to go to disk for results that are not cached? Do they a) >> hang b) produce no results c) error in some other way? > > A lucene IndexReader keeps all the files open that aren't loaded into > memory... and external deletion has no effect on the ability to keep > reading these open files (they aren't really deleted yet). > > -Yonik
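Trick (1) above can be demonstrated directly on a POSIX filesystem (assumptions: a Unix-like OS, and java.nio.file, which is newer than the Solr 1.2-era scripts — snapshooter achieved the same effect with hardlink copies; the class and file names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// POSIX demo of the inode trick: a hard link is a second name for the same
// inode, so deleting the original name does not delete the data. This is
// what lets a snapshot share index files with the live index for free.
public class HardLinkDemo {
    static String snapshotAndDeleteOriginal() throws Exception {
        Path dir = Files.createTempDirectory("inode-demo");
        Path original = dir.resolve("segments.live");
        Files.write(original, "index data".getBytes(StandardCharsets.UTF_8));

        Path snapshot = dir.resolve("segments.snapshot");
        Files.createLink(snapshot, original);  // second name, same inode
        Files.delete(original);                // removes one name only

        // The bytes survive and are reachable through the remaining name.
        return new String(Files.readAllBytes(snapshot), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(snapshotAndDeleteOriginal());  // index data
    }
}
```

Trick (2) is the same idea one level down: an open file descriptor also keeps the inode alive, which is why Yonik's IndexReader keeps working on "deleted" files.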
RE: retrieve lucene "doc id"
We are using MD5 to generate our IDs. MD5s are 128 bits creating a very unique and very randomized number for the content. Nobody has ever reported two different data sets that create the same MD5. We use the standard (some RFC) text representation of 32 hex characters. This has the advantage that F* pulls 1/16 of the total index, with a completely randomized distribution, F** 1/256, etc. This is very handy for data analysis and document extraction. MD5 creates 128 bits, but if your index is small enough that you are willing to risk it, you could pick 64 bits and park them in a Java long. -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Monday, December 17, 2007 8:15 AM To: solr-user@lucene.apache.org Subject: Re: retrieve lucene "doc id" Yonik Seeley wrote: > On Dec 17, 2007 1:40 AM, Ben Incani <[EMAIL PROTECTED]> wrote: >> I have converted to using the Solr search interface and I am trying >> to retrieve documents from a list of search results (where previously >> I had used the doc id directly from the lucene query results) and the >> solr id I have got currently indexed is unfortunately configured not be unique! > > Ouch... I'd try to make a unique Id then! > Or barring that, just try to make the query match exactly the docs you > want back (don't do the 2 phase thing). > In 1.3-dev, you can use UUIDField to have solr generate a UUID for each doc. ryan
RE: Solr 1.3 expected release date
... SOLR-303 (Distributed Search over HTTP)... Woo-hoo! -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 12, 2007 12:09 PM To: solr-user@lucene.apache.org Subject: Re: Solr 1.3 expected release date Owens, Martin wrote: > What date or year do we believe Solr 1.3 will be released? > > Regards, Martin Owens 2008 for sure. It will be after Lucene 2.3, and that is a month (or more?) away. My honest guess is late Jan to mid Feb. I think the last *major* change going into 1.3 is SOLR-303 (Distributed Search over HTTP) -- this will require some reworking of new features like SearchComponents and solrj. After that, changes will mostly be for stability and clarity. I don't really want to promote using nightly builds, but if you need 1.3 features, the current ones are stable. The interfaces may change, but it should not crash or anything like that. ryan
RE: Facets - What's a better term for non technical people?
In SQL terms they are roughly a 'SELECT DISTINCT' with counts -- i.e. a GROUP BY with COUNT(*), except on only one field at a time. -Original Message- From: Charles Hornberger [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 11, 2007 9:49 AM To: solr-user@lucene.apache.org Subject: Re: Facets - What's a better term for non technical people? FAST calls them "navigators" (which I think is a terrible term - YMMV of course :-)) I tend to think that "filters" -- or perhaps "dynamic filters" -- captures the essential function. On Dec 11, 2007 2:38 AM, "DAVIGNON Andre - CETE NP/DIODé/PANDOC" <[EMAIL PROTECTED]> wrote: > Hi, > > > So, has anyone got a good example of the language they might use > > over, say, a set of radio buttons and fields on a web form, to > > indicate that selecting one or more of these would return facets. 'Show > > grouping by' > > or 'List the sets that the results fall into' or something similar. > > Here's what I found some time ago: > http://www.searchtools.com/info/faceted-metadata.html > > It has been quite useful to me. > > André Davignon > >
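The SQL analogy above can be shown in miniature (the FacetTally class is illustrative, not Solr code): for one field, a facet is the count of matching documents per indexed value, i.e. "SELECT color, COUNT(*) ... GROUP BY color" rather than a bare SELECT DISTINCT.

```java
import java.util.*;

// What a single-field facet computes, in miniature: a per-value tally over
// the matching documents' field values. Illustrative only.
public class FacetTally {
    static Map<String, Integer> facet(List<String> fieldValues) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : fieldValues) {
            counts.merge(v, 1, Integer::sum);  // tally each indexed value
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> colors = List.of("red", "blue", "red", "green", "red");
        System.out.println(facet(colors));  // {blue=1, green=1, red=3}
    }
}
```

The counts, not just the distinct values, are what make facets useful as the "filters" discussed in this thread: each value comes with the size of the result set you would get by clicking it.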
RE: SOLR X FAST
FAST is a little less flexible (no dynamic fields) and not programmable at the Lucene level. We recently switched from FAST to Solr because of cost reasons. They did not know how to license us; they are used to, say, IBM running FAST on hundreds of servers. We are a startup with very specific needs. It's turned out to be worthwhile because we only want to do one thing really well and we can customize Solr for it. Lance -Original Message- From: Nuno Leitao [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 11, 2007 5:51 PM To: solr-user@lucene.apache.org Subject: Re: SOLR X FAST FAST uses two pipelines - an ingestion pipeline (for document feeding) and a query pipeline which are fully programmable (i.e., you can customize it fully). At ingestion time you typically prepare documents for indexing (tokenize, character normalize, lemmatize, clean up text, perform entity extraction for facets, perform static boosting for certain documents, etc.), while at query time you can expand synonyms, and do other general query side tasks (not unlike Solr). Horizontal scalability means the ability to cluster your search engine across a large number of servers, so you can scale up on the number of documents, queries, crawls, etc. There are FAST deployments out there which run on dozens, in some cases hundreds of nodes serving multiple terabyte size indexes and achieving hundreds of queries per seconds. Yet again, if your requirements are relatively simple then Lucene might do the job just fine. Hope this helps. --Nuno. On 12 Dec 2007, at 01:33, Ravish Bhagdev wrote: > Could you please elaborate on what you mean by ingestion pipeline and > horizontal scalability? I apologize if this is a stupid question > everyone else on the forum is familiar with. 
> > Thanks, > Ravi > > On Dec 12, 2007 1:09 AM, Nuno Leitao <[EMAIL PROTECTED]> wrote: >> Depends, if you are looking for a small sized index (gigabytes rather >> than dozens or hundreds of gigabytes or terabytes) with relatively >> simple requirements (a few facets, simple tokenization, English only >> linguistics, etc.) Solr is likely to be appropriate for most cases. >> >> FAST however gives you great horizontal scalability, out of the box >> linguistics for many languages (including CJK), contextual and scope >> searching, a web, file and database crawler, programmable ingestion >> pipeline, etc. >> >> Regards. >> >> --Nuno >> >> >> On 11 Dec 2007, at 22:09, William Silva wrote: >> >>> Hi, >>> How is the best way to compare SOLR and FAST Search ? >>> Thanks, >>> William. >> >>
RE: Cache use
There are query caches and document field caches. A query cache holds the list of records that match a query; a document cache actually contains the stored fields. Fetching from the query cache still has to assemble the results from the indexed data. If the RAM-based index is paging, that is one answer. Note that Lucene stores the different fields of the same document, and the inverted index itself, in different areas of the index files. In my case, with very small records of maybe 20 fields, there was a 5% difference between fetching one field and all fields. This could be very different with your index. Lance -Original Message- From: sfox [mailto:[EMAIL PROTECTED] Sent: Thursday, December 06, 2007 1:24 PM To: solr-user@lucene.apache.org Subject: Re: Cache use One possible explanation is that the OS's native file system caching is being successful at keeping these files mostly in RAM most of the time. And so the performance benefits of 'forcing' the files into RAM by using tmpfs aren't significant. So the slowness of the queries is the result of being CPU bound, rather than IO bound. The cache within Solr is faster because it is saving and returning the information for which the CPU-bound work has already been done. Just one possible explanation. Sean Fox Matthew Phillips wrote: > No one has a suggestion? I must be missing something because as I > understand it from Dennis' email, all of queries are very quick > (cached type response times) whereas mine are not. I can clearly see > time differences between queries that are cached (things that have > been auto > warmed) and queries that are not. This seems odd as my whole index is > loaded on a tmpfs memory based file system. Thanks for the help. > > Matt > > On Dec 4, 2007, at 3:55 PM, Matthew Phillips wrote: > >> Thanks for the suggestion, Dennis. I decided to implement this as you >> described on my collection of about 400,000 documents, but I did not >> receive the results I expected. 
>> >> Prior to putting the indexes on a tmpfs, I did a bit of benchmarking >> and found that it usually takes a little under two seconds for each >> facet query. After moving my indexes from disk to a tmpfs file >> system, I seem to get about the same result from facet queries: about >> two seconds. >> >> Does anyone have any insight into this? Doesn't it seem odd that my >> response times are about the same? Thanks for the help. >> >> Matt Phillips >> >> Dennis Kubes wrote: >>> One way to do this if you are running on linux is to create a tempfs >>> (which is ram) and then mount the filesystem in the ram. Then your >>> index acts normally to the application but is essentially served >>> from Ram. This is how we server the Nutch lucene indexes on our web >>> search engine (www.visvo.com) which is ~100M pages. Below is how >>> you can achieve this, assuming your indexes are in /path/to/indexes: >>> mv /path/to/indexes /path/to/indexes.dist mkdir /path/to/indexes cd >>> /path/to mount -t tmpfs -o size=2684354560 none /path/to/indexes >>> rsync --progress -aptv indexes.dist/* indexes/ chown -R user:group >>> indexes This would of course be limited by the amount of RAM you >>> have on the machine. But with this approach most searches are >>> sub-second. >>> Dennis Kubes >>> Evgeniy Strokin wrote: Hello,... we have 110M records index under Solr. Some queries takes a while, but we need sub-second results. I guess the only solution is cache (something else?)... We use standard LRUCache. In docs it says (as far as I understood) that it loads view of index in to memory and next time works with memory instead of hard drive. So, my question: hypothetically, we can have all index in memory if we'd have enough memory size, right? In this case the result should come up very fast. We have very rear updates. So I think this could be a solution. How should I configure the cache to achieve such approach? Thanks for any advise. Gene >
RE: out of heap space, every day
"String[nTerms()]": Does this mean that you compare the first term, then the second, etc.? Otherwise I don't understand how to compare multiple terms in two records. Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, December 04, 2007 8:06 AM To: solr-user@lucene.apache.org Subject: Re: out of heap space, every day On Dec 4, 2007 10:59 AM, Brian Whitman <[EMAIL PROTECTED]> wrote: > > > > For faceting and sorting, yes. For normal search, no. > > > > Interesting you mention that, because one of the other changes since > last week besides the index growing is that we added a sort to an sint > field on the queries. > > Is it reasonable that a sint sort would require over 2.5GB of heap on > a 8M index? Is there any empirical data on how much RAM that will need? int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. One can decrease this memory usage by using an "integer" instead of an "sint" field if you don't need range queries. The memory usage would then drop to a straight int[maxDoc()] (4 bytes per document). -Yonik
RE: out of heap space, every day
Thanks! I've seen a few formulae like this go by over the months. Can someone please make a wiki page for memory and processing estimation with locality properties? Or is there a Lucene page we can use? Lance -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, December 04, 2007 8:06 AM To: solr-user@lucene.apache.org Subject: Re: out of heap space, every day On Dec 4, 2007 10:59 AM, Brian Whitman <[EMAIL PROTECTED]> wrote: > > > > For faceting and sorting, yes. For normal search, no. > > > > Interesting you mention that, because one of the other changes since > last week besides the index growing is that we added a sort to an sint > field on the queries. > > Is it reasonable that a sint sort would require over 2.5GB of heap on > a 8M index? Is there any empirical data on how much RAM that will need? int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. One can decrease this memory usage by using an "integer" instead of an "sint" field if you don't need range queries. The memory usage would then drop to a straight int[maxDoc()] (4 bytes per document). -Yonik
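Until such a wiki page exists, Yonik's formula can be turned into a back-of-envelope calculation. The sketch below plugs in the 8M-document figure from the thread; the unique-term count, average term length, and per-String overhead are illustrative assumptions, not measurements.

```python
# Estimate Lucene FieldCache memory for sorting on a string ("sint") field,
# following Yonik's formula:
#   int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms
# then doubled to allow for a warming searcher.

def fieldcache_bytes(max_doc, n_terms, avg_term_bytes, str_overhead=40):
    ints = 4 * max_doc                                  # int[maxDoc()]
    refs = 8 * n_terms                                  # String[nTerms()] references
    terms = n_terms * (str_overhead + avg_term_bytes)   # the terms themselves
    return ints + refs + terms

max_doc = 8_000_000     # ~8M docs, as in the thread
n_terms = 1_000_000     # assumed number of unique sort terms
one_searcher = fieldcache_bytes(max_doc, n_terms, avg_term_bytes=8)

print(one_searcher // 2**20, "MB for one searcher")
print(2 * one_searcher // 2**20, "MB while a warming searcher also holds the cache")

# A plain int field needs only int[maxDoc()]:
print(4 * max_doc // 2**20, "MB for a plain int sort")
```

With these assumptions a single searcher needs well under 100 MB, so a 2.5 GB heap suggests either far more unique terms or other memory pressure.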
RE: How to delete records that don't contain a field?
Oops, I should explain. *:* means all records. This trick puts a positive query in front of your negative query, and that allows it to work. Lance -Original Message- From: Rob Casson [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 04, 2007 7:44 AM To: solr-user@lucene.apache.org Subject: Re: How to delete records that don't contain a field? i'm using this: *:* -[* TO *] which is what lance suggested..works just fine. fyi: https://issues.apache.org/jira/browse/SOLR-381 On Dec 3, 2007 8:09 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote: > Wouldn't this be: *:* AND "negative query" > > > -Original Message- > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik > Seeley > Sent: Monday, December 03, 2007 2:23 PM > To: solr-user@lucene.apache.org > Subject: Re: How to delete records that don't contain a field? > > On Dec 3, 2007 5:22 PM, Jeff Leedy <[EMAIL PROTECTED]> wrote: > > > I was wondering if there was a way to post a delete query using curl > > to delete all records that do not contain a certain field--something > > like > > this: > > > > curl http://localhost:8080/solr/update --data-binary > > '-_title:[* TO *]' -H > > 'Content-type:text/xml; charset=utf-8' > > > > The minus syntax seems to return the correct list of ids (that is, > > all > > > records that do not contain the "_title" field) when I use the Solr > > administrative console to do the above query, so I'm wondering if > > Solr > > > just doesn't support this type of delete. > > > Not yet... it makes sense to support this in the future though. > > -Yonik >
RE: How to delete records that don't contain a field?
Wouldn't this be: *:* AND "negative query" -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Monday, December 03, 2007 2:23 PM To: solr-user@lucene.apache.org Subject: Re: How to delete records that don't contain a field? On Dec 3, 2007 5:22 PM, Jeff Leedy <[EMAIL PROTECTED]> wrote: > I was wondering if there was a way to post a delete query using curl > to delete all records that do not contain a certain field--something > like > this: > > curl http://localhost:8080/solr/update --data-binary > '-_title:[* TO *]' -H > 'Content-type:text/xml; charset=utf-8' > > The minus syntax seems to return the correct list of ids (that is, all > records that do not contain the "_title" field) when I use the Solr > administrative console to do the above query, so I'm wondering if Solr > just doesn't support this type of delete. Not yet... it makes sense to support this in the future though. -Yonik
RE: LowerCaseFilterFactory and spellchecker
What would also help is a query to select the records fed to the spellcheck dictionary builder. We would like separate spelling indexes: one built from all English records, one from Spanish records, etc. We would also like to slice and dice the records along other dimensions and keep a separate spelling DB for each partition; that is, besides the English and Spanish dictionaries, we would also build subject-specific dictionaries like News and Sports. These are separate orthogonal partitions of our index. The usual practice for this is to create separate fields in the records, where one field is populated only for English records, one only for Spanish records, etc. In our situation this is not practical, for space reasons and other proprietary reasons. Lance -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Thursday, November 29, 2007 6:01 PM To: solr-user@lucene.apache.org Subject: Re: LowerCaseFilterFactory and spellchecker On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote: > > I'm not very familiar with the SpellCheckerRequestHandler, but i don't > think you are doing anything wrong. > > a quick skim of the code indicates that the "q" param isn't being > analyzed by that handler, so the raw input string is pased to the > SpellChecker.suggestSimilar method. This may or may not have been > intentional. > > I personally can't think of > any reason why it wouldn't make sense to get the query analyzer for > the termSourceField and use it to analyze the q param before getting > suggestions. It does make some sense, but I'm not sure that it should be blindly analyzed without adding logic to handle certain cases (like the QueryParser does). What happens if the analyzer produces two tokens? The spellchecker has to deal with this appropriately. Spell checkers should be able to "reverse analyze" the suggestions as well, so "Pyhton" gets corrected to "Python" and not "python". Similarly, "ad-hco" should probably suggest "ad-hoc" and not "adhoc". -Mike
Schema class configuration syntax
Hi- What is the element in an element that will load this class: org.apache.lucene.analysis.cn.ChineseFilter This did not work: This is in Solr 1.2. Thanks, Lance Norskog
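(The XML snippet in the message above was stripped by the mail archive, so the exact attempt is lost.) For what it's worth: Solr 1.2 ships no TokenFilterFactory wrapper for the Lucene ChineseFilter, so it cannot be listed as a `<filter>` directly, but the schema does allow naming a complete Lucene Analyzer class on the analyzer element. A hedged sketch, assuming the lucene-analyzers contrib jar is on the classpath (the field type name here is made up; ChineseAnalyzer is the contrib analyzer that combines ChineseTokenizer and ChineseFilter):

```xml
<!-- Sketch only: a whole Lucene Analyzer can be plugged in by class name,
     even though an individual TokenFilter without a factory cannot. -->
<fieldtype name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
</fieldtype>
```

To use the filter inside a mixed analysis chain, a small custom factory class would be needed until Solr 1.3 adds one.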
RE: LowerCaseFilterFactory and spellchecker
Oops, sorry, I didn't think that through. The query sent to the spellchecker is not run through the field's query analyzer, so you have to do your own lower-case transformation when you build the query. That is a simple thing to work around. But I'm working with international alphabets, and I would like 'protege' and 'protégé' (with both e's accented) to match. The ISOLatin1 filter does this during indexing & querying, but I have to rip the code out and use it in my app to preprocess words for spell-checks. Lance -Original Message- From: Rob Casson [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 28, 2007 5:16 PM To: solr-user@lucene.apache.org Subject: Re: LowerCaseFilterFactory and spellchecker lance, thanks for the quick reply ... looks like 'thorne' is getting added to the dictionary, as it comes up as a suggestion for 'Thorne'. i could certainly just lowercase in my client, but just confirming that i'm not just screwing it up in the first place :) thanks again, rc On Nov 28, 2007 8:11 PM, Norskog, Lance <[EMAIL PROTECTED]> wrote: > There are a few parameters for limiting what words are added to the > dictionary. You might be trimming out 'thorne'. See this page: > > http://wiki.apache.org/solr/SpellCheckerRequestHandler > > > -Original Message- > From: Rob Casson [mailto:[EMAIL PROTECTED] > Sent: Wednesday, November 28, 2007 4:25 PM > To: solr-user@lucene.apache.org > Subject: LowerCaseFilterFactory and spellchecker > > think i'm just doing something wrong... > > was experimenting with the spellcheck handler with the nightly > checkout from 11-28; seems my spellchecking is case-sensitive, even > tho i think i'm adding the LowerCaseFilterFactory to both the index > and query analyzers. > > here's a brief rundown of my testing steps. 
> > from schema.xml: > > positionIncrementGap="100"> > > > > class="solr.RemoveDuplicatesTokenFilterFactory"/> > > > > > > class="solr.RemoveDuplicatesTokenFilterFactory"/> > > > > > multiValued="true"/> > multiValued="true"/> > > > > > > from solrconfig.xml: > > class="solr.SpellCheckerRequestHandler" startup="lazy"> > > 1 > 0.5 > > spell > spelling > > > > > adding the doc: > > curl http://localhost:8983/solr/update -H "Content-Type: text/xml" > --data-binary ' name="title">Thorne' > curl http://localhost:8983/solr/update -H "Content-Type: text/xml" > --data-binary '' > > > > building the spellchecker: > > http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuil > d > > > > querying the spellchecker: > > results from > http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker > > > > 0 > 1 > > Thorne > false > > thorne > > > > results from > http://localhost:8983/solr/select/?q=thorne&qt=spellchecker > > > > 0 > 2 > > thorne > true > > > > > any pointers as to what i'm doing wrong, misinterpreting? i suspect i'm > just doing something bone-headed in the analyzer sections... > > thanks as always, > > rob casson > miami university libraries >
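The client-side preprocessing Lance describes (lower-casing plus accent folding before querying the spellchecker) can be approximated without ripping out the Lucene ISOLatin1 filter code. This is an equivalent sketch using Unicode NFD decomposition, not the actual Lucene implementation:

```python
import unicodedata

def fold_for_spellcheck(term: str) -> str:
    """Approximate Lucene's ISOLatin1AccentFilter + LowerCaseFilter:
    decompose to NFD, drop the combining accent marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.lower()

# Preprocess spellcheck queries the same way the dictionary field was built:
print(fold_for_spellcheck("protégé"))  # -> protege
print(fold_for_spellcheck("Thorne"))   # -> thorne
```

With this in the client, 'Thorne' and 'protégé' hit the same dictionary entries as their folded forms.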
RE: LowerCaseFilterFactory and spellchecker
There are a few parameters for limiting what words are added to the dictionary. You might be trimming out 'thorne'. See this page: http://wiki.apache.org/solr/SpellCheckerRequestHandler -Original Message- From: Rob Casson [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 28, 2007 4:25 PM To: solr-user@lucene.apache.org Subject: LowerCaseFilterFactory and spellchecker think i'm just doing something wrong... was experimenting with the spellcheck handler with the nightly checkout from 11-28; seems my spellchecking is case-sensitive, even tho i think i'm adding the LowerCaseFilterFactory to both the index and query analyzers. here's a brief rundown of my testing steps. from schema.xml: from solrconfig.xml: 1 0.5 spell spelling adding the doc: curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary 'Thorne' curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '' building the spellchecker: http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker&cmd=rebuild querying the spellchecker: results from http://localhost:8983/solr/select/?q=Thorne&qt=spellchecker 0 1 Thorne false thorne results from http://localhost:8983/solr/select/?q=thorne&qt=spellchecker 0 2 thorne true any pointers as to what i'm doing wrong, misinterpreting? i suspect i'm just doing something bone-headed in the analyzer sections... thanks as always, rob casson miami university libraries
RE: LSA Implementation
WordNet itself is English-only. There are various ontology projects for it. http://www.globalwordnet.org/ is a separate world language database project. I found it at the bottom of the WordNet wikipedia page. Thanks for starting me on the search! Lance -Original Message- From: Eswar K [mailto:[EMAIL PROTECTED] Sent: Monday, November 26, 2007 6:50 PM To: solr-user@lucene.apache.org Subject: Re: LSA Implementation The languages also include CJK :) among others. - Eswar On Nov 27, 2007 8:16 AM, Norskog, Lance <[EMAIL PROTECTED]> wrote: > The WordNet project at Princeton (USA) is a large database of synonyms. > If you're only working in English this might be useful instead of > running your own analyses. > > http://en.wikipedia.org/wiki/WordNet > http://wordnet.princeton.edu/ > > Lance > > -Original Message- > From: Eswar K [mailto:[EMAIL PROTECTED] > Sent: Monday, November 26, 2007 6:34 PM > To: solr-user@lucene.apache.org > Subject: Re: LSA Implementation > > In addition to recording which keywords a document contains, the > method examines the document collection as a whole, to see which other > documents contain some of those same words. this algo should consider > documents that have many words in common to be semantically close, and > ones with few words in common to be semantically distant. This simple > method correlates surprisingly well with how a human being, looking at > content, might classify a document collection. Although the algorithm > doesn't understand anything about what the words *mean*, the patterns > it notices can make it seem astonishingly intelligent. > > When you search an such an index, the search engine looks at > similarity values it has calculated for every content word, and > returns the documents that it thinks best fit the query. 
Because two > documents may be semantically very close even if they do not share a > particular keyword, > > Where a plain keyword search will fail if there is no exact match, > this algo will often return relevant documents that don't contain the > keyword at all. > > - Eswar > > On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > > > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote: > > > > > We essentially are looking at having an implementation for doing > > > search which can return documents having conceptually similar > > > words without necessarily having the original word searched for. > > > > Very challenging. Say someone searches for "LSA" and hits an > > archived > > > version of the mail you sent to this list. "LSA" is a reasonably > > discriminating term. But so is "Eswar". > > > > If you knew that the original term was "LSA", then you might look > > for documents near it in term vector space. But if you don't know > > the original term, only the content of the document, how do you know > > whether you should look for docs near "lsa" or "eswar"? > > > > Marvin Humphrey > > Rectangular Research > > http://www.rectangular.com/ > > > > > > >
RE: LSA Implementation
The WordNet project at Princeton (USA) is a large database of synonyms. If you're only working in English this might be useful instead of running your own analyses. http://en.wikipedia.org/wiki/WordNet http://wordnet.princeton.edu/ Lance -Original Message- From: Eswar K [mailto:[EMAIL PROTECTED] Sent: Monday, November 26, 2007 6:34 PM To: solr-user@lucene.apache.org Subject: Re: LSA Implementation In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. this algo should consider documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the algorithm doesn't understand anything about what the words *mean*, the patterns it notices can make it seem astonishingly intelligent. When you search an such an index, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, Where a plain keyword search will fail if there is no exact match, this algo will often return relevant documents that don't contain the keyword at all. - Eswar On Nov 27, 2007 7:51 AM, Marvin Humphrey <[EMAIL PROTECTED]> wrote: > > On Nov 26, 2007, at 6:06 PM, Eswar K wrote: > > > We essentially are looking at having an implementation for doing > > search which can return documents having conceptually similar words > > without necessarily having the original word searched for. > > Very challenging. Say someone searches for "LSA" and hits an archived > version of the mail you sent to this list. "LSA" is a reasonably > discriminating term. But so is "Eswar". 
> > If you knew that the original term was "LSA", then you might look for > documents near it in term vector space. But if you don't know the > original term, only the content of the document, how do you know > whether you should look for docs near "lsa" or "eswar"? > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > >
DirectUpdateHandler and DirectUpdateHandler2
Hi- We have a situation where we are submitting the same document several times, and have not handled this the right way yet. So, DirectUpdateHandler2 overwrites the existing record. If we used DirectUpdateHandler, we could use the feature where we tell it to not overwrite existing records. This option is in the DUH2 arguments, but is not implemented in DUH2 for speed reasons. Are there any features in DUH2 that are not in DUH? I mean semantic differences, not just speedups. Thanks, Lance Norskog
RE: Performance problems for OR-queries
https://issues.apache.org/jira/browse/lucene-997 is a patch to limit the time used for a query. Google clearly estimates the total # of results, and over-estimates. Lance -Original Message- From: Mike Klaas [mailto:[EMAIL PROTECTED] Sent: Thursday, November 22, 2007 1:37 PM To: solr-user@lucene.apache.org Subject: Re: Performance problems for OR-queries On 22-Nov-07, at 6:02 AM, Jörg Kiegeland wrote: > >>> 1. Does Solr support this kind of index access with better >>> performance ? >>> Is there anything special to define in schema.xml? >>> >> >> No... Solr uses Lucene at it's core, and all matching documents for a >> query are scored. >> > So it is not possible to have a "google" like performance with Solr, > i.e. to search for a set of keywords and only the 10 best documents > are listed, without touching the other millions of (web) documents > matching less keywords. > I infact would not know how to program such an index, however google > has done it somehow.. I can be fairly certain that google does not execute queries that match millions of documents on a single machine. The default query operator is (mostly) AND, so the possible match sets is much smaller. Also, I imagine they have relatively few documents per machine. >>> 2. Can one switch off this ordering and just return any 100 >>> documents fullfilling the query (though getting best-matching >>> documents would be a nice feature if it would be fast)? >>> >> >> a feature like this could be developed... but what is the usecase for >> this? What are you tring to accomplish where either relevancy or >> complete matching doesn't matter? There may be an easier workaround >> for your specific case. >> > This is not an actual Use-Case for my project, however I just wanted > to know if it would be possible. > > Because of the performance results, we designed a new type of query. 
I > would like to know how fast it would be before I implement the > following query: > > I have N keywords and execute a query of the form > > keyword1 AND keyword2 AND .. AND keywordN > > there may be again some millions of matching documents and I want to > get the first 100 documents. > To have a ordering criteria, each Solr document has a field named > "REV" which has a natural number. The returned 100 documents shall be > those with the lowest numbers in the "REV" field. > > My questions now are: > > (1) Will the query perform in O(100) or in O(all possible matches)? O(all possible matches) > (2) If the answer to (1) is O(all possible matches), what will be the > performance if I dont order for the "REV" field? Will Solr order it > after the point of time where a document was created/ modified? What I > have to do to get O(100) complexity finally? Ordering by natural document order in the index is sufficient to achieve O(100), but you'll have to insert code in Solr to stop after 100 docs (another alternative is to stop processing after a given amount of time). Also, using O() in the case isn't quite accurate: there are costs that vary based on the number of docs in the index too. -Mike
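Mike's point about stopping after 100 documents in natural index order can be sketched outside Lucene. The model below is illustrative only: plain sorted docid lists stand in for posting lists, and the early exit stands in for code that would live inside a Lucene HitCollector (a real index would also use skip lists rather than advancing one docid at a time).

```python
def first_n_matches(postings, limit):
    """Intersect sorted docid lists (an AND query) in ascending docid
    order, stopping as soon as `limit` matches are collected."""
    iters = [iter(p) for p in postings]
    currents = [next(it, None) for it in iters]
    out = []
    while len(out) < limit and all(c is not None for c in currents):
        if len(set(currents)) == 1:           # all lists agree: a match
            out.append(currents[0])
            currents = [next(it, None) for it in iters]
        else:
            # advance the laggard; docids below the frontier cannot match
            i = currents.index(min(currents))
            currents[i] = next(iters[i], None)
    return out

docs_kw1 = [1, 3, 4, 7, 9, 12]
docs_kw2 = [2, 3, 7, 8, 12]
print(first_n_matches([docs_kw1, docs_kw2], limit=2))  # -> [3, 7]
```

The work done is proportional to how far into the posting lists the 100th match lies, not to the total number of matches, which is the O(100)-ish behavior discussed above; it only returns docs in index order, not by a "REV" field.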
RE: CJK Analyzers for Solr
I notice this is in the future tense. Is the CJKTokenizer available yet? From what I can see, the CJK code should be a Filter instead anyway. Also, the ChineseFilter and CJKTokenizer do two different things. CJKTokenizer turns C1C2C3C4 into 'C1C2 C2C3 C3C4'. ChineseFilter (from 2001) turns C1C2 into 'C1 C2'. I hope someone who speaks Mandarin or Cantonese understands what this should do. Lance -Original Message- From: Eswar K [mailto:[EMAIL PROTECTED] Sent: Monday, November 26, 2007 10:28 AM To: solr-user@lucene.apache.org Subject: Re: CJK Analyzers for Solr Hoss, Thanks a lot. Will look into it. Regards, Eswar On Nov 26, 2007 11:55 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > : Does Solr come with Language analyzers for CJK? If not, can you > please > : direct me to some good CJK analyzers? > > Lucene has a CJKTokenizer and CJKAnalyzer in the contrib/analyzers jar. > they can be used in Solr. both have been included in Solr for a while > now, so you can specify CJKAnalyzer in your schema with Solr 1.2, but > starting with Solr 1.3 a Factory for the Tokenizer will also be > included so it can be used in a more complex analysis chain defined in the schema. > > > > -Hoss > >
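The two tokenization schemes Lance contrasts are easy to model. A minimal sketch of the token streams (not the Lucene code itself):

```python
def cjk_bigrams(text):
    """Overlapping bigrams, the CJKTokenizer behaviour:
    C1C2C3C4 -> ['C1C2', 'C2C3', 'C3C4']."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def cjk_unigrams(text):
    """One token per character, the older ChineseFilter behaviour:
    C1C2 -> ['C1', 'C2']."""
    return list(text)

print(cjk_bigrams("一二三四"))  # -> ['一二', '二三', '三四']
print(cjk_unigrams("一二"))     # -> ['一', '二']
```

Overlapping bigrams keep adjacent-character context (most Chinese words are two characters), at the cost of a larger term dictionary; unigrams index every character independently and rely on phrase queries at search time.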
RE: Weird memory error.
AppPerfect has a free-for-noncommercial-use version of their tools. I've used them before and was very impressed. http://www.appperfect.com/products/devtest.html#versions -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, November 20, 2007 9:12 AM To: solr-user@lucene.apache.org Subject: Re: Weird memory error. On Nov 20, 2007 11:29 AM, Brian Carmalt <[EMAIL PROTECTED]> wrote: > Can you recommend one? I am not familar with how to profile under Java. Netbeans has one for free: http://www.netbeans.org/products/profiler/ -Yonik
RE: Solr cluster topology.
http://wiki.apache.org/solr/CollectionDistribution http://wiki.apache.org/solr/SolrCollectionDistributionScripts http://wiki.apache.org/solr/SolrCollectionDistributionStatusStats http://wiki.apache.org/solr/SolrOperationsTools http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline http://wiki.apache.org/solr/CollectionRebuilding http://wiki.apache.org/solr/SolrAdminGUI -Original Message- From: Matthew Runo [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 20, 2007 10:54 AM To: solr-user@lucene.apache.org Subject: Re: Solr cluster topology. Yes. The clients will always be a minute or two behind the master. I like the way some people are doing it - make them all masters! Just post your updates to each of them - you loose a bit of performance perhaps, but it doesn't matter if a server bombs out or you have to upgrade them, since they're all exactly the same. --Matthew On Nov 20, 2007, at 7:43 AM, Alexander Wallace wrote: > Hi All! > > I just started reading about Solr a couple of days ago (not full time > of course) and it looks like a pretty impressive set of > technologies... I have still a few questions I have not clearly found: > > Q: On a cluster, as I understand it, one and only one machine is a > master, and N servers could be slaves...The clients, do they all > talk to the master for indexing and to a load balancer for > searching? Is one particular machine configured to know it is the > master? Or is it only the settings for replicating the index that > matter? Or does one post reindex petitions to any of the slaves > and they will forward it to the master? > > How can we have failover in the master? > > It is a correct assumption that slaves could always be a bit out of > sync with the master, correct? A matter of minutes perhaps... > > Thanks in advance for your responses! > >
RE: snappuller rsync parameter error? - "solr" hardcoded
Be careful: rsync treats 'directory' and 'directory/' differently (with the trailing slash it copies the directory's contents rather than the directory itself). I ran afoul of this. -Original Message- From: Walter Underwood [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 14, 2007 8:49 AM To: solr-user@lucene.apache.org Subject: Re: snappuller rsync parameter error? - "solr" hardcoded I'm not an rsync expert, but I believe that /solr/ is a virtual directory defined in the rsyncd config. It is mapped to the real directory. wunder On 11/14/07 8:43 AM, "Jae Joo" <[EMAIL PROTECTED]> wrote: > In the snappuller, the "solr" is hardcoded. Should it be > "${master_data_dir}? > > # rsync over files that have changed > rsync -Wa${verbose}${compress} --delete ${sizeonly} \ ${stats} > rsync://${master_host}:${rsyncd_port}/solr/${name}/ > ${data_dir}/${name}-wip > > Thanks, > > Jae
RE: Delte all docs in a SOLR index?
A safer way is to stop Solr and remove the index directory. There is less chance of corruption, and it will be faster. -Original Message- From: David Neubert [mailto:[EMAIL PROTECTED] Sent: Friday, November 09, 2007 10:56 AM To: solr-user@lucene.apache.org Subject: Re: Delte all docs in a SOLR index? Thanks! - Original Message From: Chris Hostetter <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Friday, November 9, 2007 1:51:03 PM Subject: Re: Delte all docs in a SOLR index? : Sorry for another basic question -- but what is the best safe way to : delete all docs in a SOLR index. I thought this was a FAQ, but it's hidden in another question (rebuilding if schema changes) i'll pull it out into a top level question... *:* : I am in my first few days using SOLR and Lucene, am iterating the schema : often, starting and stoping with test docs, etc. I like to know a very : quick way to clean out the index and start over repeatedly -- can't seem : to find it on the wiki -- maybe its Friday :) Huh .. that's actually the FAQ that does talk about deleting all docs :) "How can I rebuild my index from scratch if I change my schema?" http://wiki.apache.org/solr/FAQ#head-9aafb5d8dff5308e8ea4fcf4b71f19f029c4bb99 -Hoss __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
RE: Score of exact matches
What is the performance profile of this against merely searching against one field? My situation is millions of small records with an average of 200 bytes/text field. Lance -Original Message- From: Walter Underwood [mailto:[EMAIL PROTECTED] Sent: Monday, November 05, 2007 9:38 PM To: solr-user@lucene.apache.org Subject: Re: Score of exact matches This is fairly straightforward and works well with the DisMax handler. Index the text into three different fields with three different sets of analyzers. Use something like this in the request handler: 0.01 exact^16 noaccent^4 stemmed exact^16 noaccent^4 stemmed You will probably need to adjust the weights for your content, though I expect these are a good starting place. Per-field analyzers are very easy to use in Solr and are extremely powerful. I wish we'd thought of that in Ultraseek. wunder == Search Guy, Netflix Formerly: Architect, Ultraseek On 11/5/07 9:05 PM, "Papalagi Pakeha" <[EMAIL PROTECTED]> wrote: > Hi all, > > I use Solr 1.2 on a job advertising site. I started from the default > setup that runs all documents and queries through > EnglishPorterFilterFactory. As a result for example an ad with > "accounts" in its title is matched when someone runs a query for > "accountant" because both are stemmed to the "account" word and then > they match. > > Is it somehow possible to give a higher score to exact matches and > sort them before matches from stemmed terms? > > Close to this is a problem with accents - I can remove accents from > both documents and from queries and then run the query on non-accented > terms. But I'd like to give higher score to documents where the search > term matches exactly (i.e. including accents and possibly letter > capitalization, etc) and sort them before more fuzzy searches. 
> > To me it looks like I have to run multiple sub-queries for each query, > one for exact match, one for accents removed and one for stemmed words > and then combine the results and compute the final score for each > match. Is that possible? > > Thanks! > > PaPa
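Walter's three-field approach can be sketched in schema.xml with copyField, as a hedged example; the field names `exact`, `noaccent`, `stemmed` come from his config above, but the source field and type names here are assumptions:

```xml
<!-- One stored source field copied into three differently-analyzed
     search fields. Type names (text_exact etc.) are illustrative. -->
<field name="title"    type="string"        indexed="false" stored="true"/>
<field name="exact"    type="text_exact"    indexed="true"  stored="false"/>
<field name="noaccent" type="text_noaccent" indexed="true"  stored="false"/>
<field name="stemmed"  type="text_stemmed"  indexed="true"  stored="false"/>

<copyField source="title" dest="exact"/>
<copyField source="title" dest="noaccent"/>
<copyField source="title" dest="stemmed"/>
```

With DisMax weighting `exact^16 noaccent^4 stemmed`, a document matching on the exact field naturally scores above one matching only via accent-stripping or stemming, which gives the sort order PaPa asked for without multiple sub-queries.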
RE: specify index location
Snapshots are in that directory, and the spellchecker has its own indexes under there. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Monday, November 05, 2007 8:57 PM To: solr-user@lucene.apache.org Subject: Re: specify index location On 11/5/07, evol__ <[EMAIL PROTECTED]> wrote: > Just a remark: > Might be a good idea to change this to ./data/index > to reflect the location that is expected in there. ./data is the generic Solr data directory; "index" stores the main index under the data directory. -Yonik
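Putting Yonik's and Lance's remarks together, the layout looks roughly like this (a sketch; the spellchecker directory name and snapshot naming depend on configuration):

```
<solr.home>/data/
├── index/          main Lucene index (the default ./data/index)
├── spell/          spellchecker's own index, if configured
└── snapshot.*      snapshots written by the replication scripts
```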
RE: My filters are not used
This search returns up to 8000 records. Does this require a query cache of 8000 records? When is the query cache filled? This also answers a second question: the filter design is intended for small search sets. I'm interested in selecting maybe 1/10 of a few million records as a search limiter. Is it possible to create a similar feature that caches low-level data areas for a query? Let's say that if the query selects 1/10 of the document space, this means that only 40% of the total memory area contains data for that 1/10. Is there a cheap way to record this data? Would it be a feature like filters which records a much lower-level data structure, like disk blocks? Thanks, Lance Norskog -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, October 24, 2007 8:24 PM To: solr-user@lucene.apache.org Subject: Re: My filters are not used On 10/24/07, Norskog, Lance <[EMAIL PROTECTED]> wrote: > I am creating a filter that is never used. Here is the query sequence: > > q=*:*&fq=contentid:00*&start=0&rows=200 > > q=*:*&fq=contentid:00*&start=200&rows=200 > > q=*:*&fq=contentid:00*&start=400&rows=200 > > q=*:*&fq=contentid:00*&start=600&rows=200 > > q=*:*&fq=contentid:00*&start=700&rows=200 > > Accd' to the statistics here is my filter cache usage: > > lookups : 1 [...] > > I'm completely confused. I thought this should be 1 insert, 4 lookups, > 4 hits, and a hitratio of 100%. Solr has a query cache too... the query cache is checked, there's a hit, and the query process is short-circuited. -Yonik
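Yonik's short-circuit explanation can be sketched as a two-level lookup; this is a simplified model (Solr's real queryResultCache keys on windows of results, which is why even different `start` offsets can hit), not Solr's implementation:

```python
# Simplified model of the lookup order described above: the query-result
# cache is consulted first; on a hit the filter cache is never touched,
# which is why the filter cache statistics showed only one lookup.
query_cache = {}
filter_cache_lookups = 0

def search(q, fq, start, rows):
    global filter_cache_lookups
    key = (q, fq, start, rows)
    if key in query_cache:           # query-cache hit: short-circuit
        return query_cache[key]
    filter_cache_lookups += 1        # only now is the filter cache consulted
    result = f"docs for {key}"       # stand-in for real index access
    query_cache[key] = result
    return result

search("*:*", "contentid:00*", 0, 200)
search("*:*", "contentid:00*", 0, 200)   # repeat: served from the query cache
```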
New issue: request for limit parameter for search time, hits, and estimated ram usage
http://issues.apache.org/jira/browse/SOLR-392 Summary: It would be good for end-user applications if Solr allowed searches to cease before finishing, and still return partial results.
RE: Converting German special characters / umlaute
Isn't this what ISOLatin1AccentFilterFactory does? Turn Björk into Bjork? This should be much faster than PatternReplaceFilterFactory. -Original Message- From: Matthias Eireiner [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 1:47 PM To: solr-user@lucene.apache.org Subject: AW: Converting German special characters / umlaute Dear list, it has been some time, but here is what I did. I had a look at Thomas Traeger's tip to use the SnowballPorterFilterFactory, which does not actually do the job: it folds ä and ae both to a, whereas I want it the other way, with all special characters converted to their regular ASCII transcriptions (ä to ae). The tip of J.J. Larrea, to use the PatternReplaceFilterFactory, solved the problem. And as Chris Hostetter noted, stored fields always return the initial value, which made the second part of my question obsolete. Thanks a lot for your help! best Matthias -Original Message- From: Thomas Traeger [mailto:[EMAIL PROTECTED] Sent: Wednesday, 26 September 2007 23:44 To: solr-user@lucene.apache.org Subject: Re: Converting German special characters / umlaute Try the SnowballPorterFilterFactory described here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters You should use the German2 variant, which converts ä and ae to a, ö and oe to o, and so on. More details: http://snowball.tartarus.org/algorithms/german2/stemmer.html Every document in Solr can have any number of fields which might have the same source but have different field types and are therefore handled differently (stored as-is, analyzed in different ways...). Use copyField in your schema.xml to feed your data into multiple fields. During searching you decide which fields you like to search on (usually the analyzed ones) and which you retrieve when getting the document back. Tom Matthias Eireiner wrote: > Dear list, > > I have two questions regarding German special characters or umlaute. 
> > is there an analyzer which automatically converts all German special > characters to their dissected form, such as ü to ue and ä to > ae, etc.? > > I would also like the search to always run against the > dissected data, but when the results are returned, the initial, > unmodified data should be returned. > > Does the Lucene GermanAnalyzer do this job? I ran across it, but I could not > figure out from the documentation whether it does the job or not. > > thanks a lot in advance. > > Matthias >
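The PatternReplaceFilterFactory approach that solved Matthias's problem can be sketched as one filter per mapping inside a field type's analyzer. The field type name and the rest of the chain are assumptions; note that ISOLatin1AccentFilterFactory would fold ö to o, not to the German-convention oe, which is why the pattern filters are needed here:

```xml
<!-- Illustrative field type: index-time folding of German special
     characters to their ASCII transcriptions. Stored values are
     unchanged, since analysis only affects indexed tokens. -->
<fieldType name="text_de_folded" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="ü" replacement="ue" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="ä" replacement="ae" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="ö" replacement="oe" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="ß" replacement="ss" replace="all"/>
  </analyzer>
</fieldType>
```

Because the query side uses the same analyzer, both "Bjoerk" and "Björk" queries land on the same indexed token, while the stored field still returns the original text, as Hoss pointed out.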
RE: Solr and security
Solr does not do security itself. Servlet containers usually support various security options: account/password through HTTP authentication (very weak security) and certificates (very strong security) are what I would look at first. Lance -Original Message- From: Wagner,Harry [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 9:25 AM To: solr-user@lucene.apache.org Subject: RE: Solr and security One effective method is to block access to the port Solr runs on. Force application access to come through the HTTP server, and let it map to the application server (i.e., like mod_jk does for Apache & Tomcat). Simple, but effective. Cheers! harry -Original Message- From: Cool Coder [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 24, 2007 12:17 PM To: solr-user@lucene.apache.org Subject: Solr and security Hi Group, As far as I know, to use Solr we need to deploy it as a server and communicate with it using the HTTP protocol. How about its security? I.e., how can we ensure that it only accepts requests from a predefined set of users? Is there any way we can specify this in Solr, or does Solr depend only on the web server's security model? I am not sure whether my interpretation is right. Your suggestions/input? - BR
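The container-level HTTP authentication Lance mentions can be sketched in the webapp's web.xml using the standard servlet security elements; the role and realm names here are assumptions, and the container must separately be configured with matching users:

```xml
<!-- Illustrative servlet-spec BASIC auth for the Solr webapp.
     "solr-user" and "Solr" are placeholder names. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>Solr</web-resource-name>
    <url-pattern>/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>solr-user</role-name>
  </auth-constraint>
</security-constraint>
<login-config>
  <auth-method>BASIC</auth-method>
  <realm-name>Solr</realm-name>
</login-config>
```

As Lance notes, BASIC auth over plain HTTP is weak; combine it with network-level blocking of the Solr port, or use certificates, for anything sensitive.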
RE: history
We have another use case. We would like to count the number of times a document came up in any search, and the total number of times it was read. If these counters are not indexed, it seems like an update would be a simple integer poke into the index. Also, thanks for the spellcheck info. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Saturday, July 07, 2007 9:21 AM To: solr-user@lucene.apache.org Subject: Re: history On 7/7/07, Brian Whitman <[EMAIL PROTECTED]> wrote: > I have been trying to plan out a history function for Solr. When I > update a document with an existing unique key, I would like the older > version to stay around and get tagged with the date and some metadata > to indicate it's not "live." Any normal search would not touch history > documents. Interesting... One might be able to accomplish this with the update processors that Ryan & I have been batting around for the last few days, in conjunction with updateable documents, which is on-deck. The first idea that comes to mind is that during an update, you could change the id of the older document to be something like id_<version>, and reindex it with the addition of a live:false field. For normal queries, use a filter of -live:false. For all old versions of a document, use a prefix query id:mydocid_*; for all versions of a document, use the query id:mydocid*. So if you can hold off a little bit, you shouldn't need a custom query handler. This will be a good use case to ensure that our request processors and updateable documents are powerful enough. -Yonik
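Yonik's rename-and-tag scheme can be modeled outside Solr to see how the queries line up; this is a sketch with a plain dict standing in for the index, and the version counter is an assumption (the original mail suggests a date or other metadata):

```python
# Model of the versioning idea above: on update, the old document's id is
# suffixed (id -> id_<version>) and tagged live:false; normal searches
# filter out non-live docs, and the id prefix finds the history.
index = {}  # id -> document dict, standing in for the Lucene index

def update(doc_id, fields):
    if doc_id in index:
        old = index.pop(doc_id)
        old["live"] = False
        index[f"{doc_id}_{old['version']}"] = old   # archive old version
        next_version = old["version"] + 1
    else:
        next_version = 1
    index[doc_id] = {"live": True, "version": next_version, **fields}

def live_docs():
    # models the -live:false filter on normal queries
    return {i: d for i, d in index.items() if d["live"]}

def history(doc_id):
    # models the id:mydocid_* prefix query
    return {i: d for i, d in index.items() if i.startswith(doc_id + "_")}
```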
RE: most popular/most commonly accessed records
Documents in Lucene are read-only. You need to track accesses separately. You can have a changeable score value in your documents, but you'll have to re-index them for each change. We use Google Analytics as a first cut. If you look at http://www.divvio.com (pimping my employer) and look at the page source, down at the bottom we trigger a message to GA. Unfortunately it has our license key, which I think we want to change :) The next step is to use such dynamic statistics in calculating boosts; it seems like we would want to combine relational DB accesses with Lucene indexing to calculate relevance and boosts. Lance -Original Message- From: Karen Loughran [mailto:[EMAIL PROTECTED] Sent: Friday, July 06, 2007 6:59 AM To: solr-user@lucene.apache.org Subject: most popular/most commonly accessed records Hi all, Is there a way through solr to find out about "most commonly accessed" solr documents? So for example, my client may wish to list the top 10 most popular videos, based on previous accesses to them in the solr server db. If there are any solr features to help with this, can someone point me to them? Had a browse through the user documentation, but can't see anything obvious. Many thanks Karen Loughran
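The "track accesses separately" advice above can be sketched as follows; the counts live outside the read-only index (in practice a database or analytics system) and can later be folded into document boosts at re-index time. All names here are illustrative:

```python
# Sketch: popularity counters kept outside the index, as suggested above.
from collections import Counter

result_counts = Counter()  # times a doc appeared in search results
read_counts = Counter()    # times a doc was actually opened/read

def record_results(doc_ids):
    result_counts.update(doc_ids)

def record_read(doc_id):
    read_counts[doc_id] += 1

def most_popular(n):
    # Karen's "top 10 most popular videos" query, by read count
    return [doc for doc, _ in read_counts.most_common(n)]
```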
Checking for empty fields
I understand that I cannot query on the 'null' value for a field, and so I should make null fields -1 instead. About dynamic fields: is there a way to query for the existence of a dynamic field? Thanks, Lance Norskog Divvio, Inc.
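On the existence question, a hedged sketch using standard Lucene range-query syntax (this thread does not confirm it, and whether it performs acceptably on a large index depends on the field's term count):

```
myfield:[* TO *]          documents where myfield has some value
*:* -myfield:[* TO *]     documents where myfield is missing
```

For a dynamic field this still requires knowing the concrete field name (e.g. `price_i:[* TO *]`); there is no wildcard over field names in the query syntax.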