Re: Does solr supports indexing of files other than UTF-8
On Thu, Jan 27, 2011 at 3:51 AM, prasad deshpande prasad.deshpand...@gmail.com wrote: The size of docs can be huge; suppose there is an 800MB PDF file to index: do I need to translate it to UTF-8 and then send the file for indexing? PDF is binary AFAIK... you shouldn't need to do any charset translation before sending it to Solr, or to any other extraction library. If you're using solr-cell then it's the Tika component that is responsible for pulling out the text in the right format. -Yonik http://lucidimagination.com
Re: Searching for negative numbers very slow
On Thu, Jan 27, 2011 at 6:32 PM, Simon Wistow si...@thegestalt.org wrote: If I do qt=dismax fq=uid:1 (or any other positive number) then queries are as quick as normal - in the 20ms range. However, any of fq=uid:\-1 or fq=uid:[* TO -1] or fq=uid:[-1 to -1] or fq=-uid:[0 TO *] then queries are incredibly slow - in the 9 *second* range. That's odd - there should be nothing special about negative numbers. Here are a couple of ideas: - if you have a really big index and querying by a negative number is much more rare, it could just be that part of the index wasn't cached by the OS and so the query needs to hit the disk. This can happen with any term and a really big index - nothing special for negatives here. - if -1 is a really common value, it can be slower. is fq=uid:\-2 or other negative numbers really slow also? -Yonik http://lucidimagination.com
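As an aside to the performance question, the leading minus does need protecting so the query parser doesn't read it as the NOT operator. A minimal client-side sketch (field name uid taken from the question; the equivalent forms shown are standard Lucene/Solr query syntax):

```python
from urllib.parse import urlencode

# Three equivalent ways to filter on uid == -1; a bare leading '-' would be
# parsed as the NOT operator, so it must be escaped, quoted, or put in a range.
filters = [r"uid:\-1", 'uid:"-1"', "uid:[-1 TO -1]"]

# URL-encode the first form as it would appear in a request:
qs = urlencode({"q": "*:*", "fq": filters[0]})
print(qs)  # q=%2A%3A%2A&fq=uid%3A%5C-1
```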
Re: edismax vs dismax
On Fri, Jan 28, 2011 at 3:00 PM, Thumuluri, Sai sai.thumul...@verizonwireless.com wrote: I recently upgraded to Solr 1.4.1 from Solr 1.3 and with the upgrade used the edismax query parser. Here is my solrconfig.xml. When I search for "mw verification and payment information" I get no results with defType set to edismax. It's probably a bit of natural language query parsing in edismax... "and" is treated as AND (the lucene operator) in the appropriate context (i.e. we won't treat it that way if it's at the start or end of the query, etc.), and "or" is treated as OR in the appropriate context. The lowercaseOperators parameter can control this, so try setting lowercaseOperators=false -Yonik http://lucidimagination.com If I switch the defType to dismax, I get the results I am looking for. Can anyone explain why this would be the case? I thought edismax is dismax and more. Thank you. For 1.4.1:

<requestHandler name="partitioned" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">body^1.0 title^10.0 name^3.0 taxonomy_names^2.0 tags_h1^5.0 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0</str>
    <str name="pf">body^10.0</str>
    <int name="ps">4</int>
    <str name="mm">2&lt;-25%</str>
    <str name="q.alt">*:*</str>
    <str name="hl">true</str>
    <str name="hl.fl">body</str>
    <int name="hl.snippets">3</int>
    <str name="hl.mergeContiguous">true</str>
    <!-- instructs Solr to return the field itself if no query terms are found -->
    <str name="f.body.hl.alternateField">body</str>
    <str name="f.body.hl.maxAlternateFieldLength">256</str>
    <!-- <str name="f.body.hl.fragmenter">regex</str> --> <!-- defined below -->

Sai Thumuluri
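A quick way to test the suggestion above from a client is to resend the same query with lowercaseOperators=false; the host URL and qf fields below are placeholders, only the parameter name comes from the thread:

```python
from urllib.parse import urlencode

params = urlencode({
    "defType": "edismax",
    "q": "mw verification and payment information",
    "qf": "body title",            # placeholder fields, not from the thread
    "lowercaseOperators": "false"  # treat lowercase 'and'/'or' as plain terms
})
url = "http://localhost:8983/solr/select?" + params  # hypothetical host
print("lowercaseOperators=false" in params)  # True
```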
Re: Local param tag voodoo ?
On Thu, Jan 20, 2011 at 4:59 AM, Xavier SCHEPLER xavier.schep...@sciences-po.fr wrote: Ok, I tried to use nested queries this way: wt=json&indent=true&fl=qFR&q=sarkozy _query_:{!tag=test}chirac&facet=true&facet.field={!ex=test}studyDescriptionId It resulted in this error: facet_counts:{ facet_queries:{}, exception:java.lang.NullPointerException\n\tat There's currently no way to exclude part of a query... the things you tag must be a top-level q or fq query. But this has uncovered a bug - we don't handle the case when everything is excluded (all q and fq). -Yonik http://www.lucidimagination.com
Re: utf-8 tomcat and solr problem
On Thu, Jan 6, 2011 at 2:23 AM, Julian Hille julian.hi...@netimpact.de wrote: Hi, if I search for a German umlaut like ä or ö I get weird conversions from latin to utf in the query response. The encoding of the result is OK, but not the part you queried for: there my ä is wrongly encoded. It seems like it had been interpreted from latin to utf-8. Solr is set to use UTF-8 and Tomcat has URIEncoding="UTF-8" on the connector, but that didn't change anything. You can verify that the container is configured correctly via example/exampledocs/test_utf8.sh Another trick I sometimes use is the python response format (wt=python) since that uses escapes for anything outside of ASCII, and then it's easy to see the actual unicode value that's being returned in a response. -Yonik http://www.lucidimagination.com
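The "interpreted from latin to utf-8" symptom is easy to reproduce outside Solr. This Python sketch shows what goes wrong when UTF-8 bytes are decoded as Latin-1, and why wt=python-style ASCII escaping makes the real value visible:

```python
# 'ä' is one byte in Latin-1 but two bytes in UTF-8.
utf8_bytes = "ä".encode("utf-8")          # b'\xc3\xa4'
mojibake = utf8_bytes.decode("latin-1")   # UTF-8 bytes misread as Latin-1
print(mojibake)                           # Ã¤  <- the classic symptom

# wt=python-style escaping shows the actual code point unambiguously:
print(ascii("ä"))                         # '\xe4'
```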
Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)
On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch java...@gmail.com wrote: Solr/lucene newbie here .. We would like searches against a solr/lucene index to immediately be able to view data that was added. I stress small amount of new data given that any significant amount would require excessive latency. There has been significant ongoing work in lucene-core for NRT (near real time). We need to overhaul Solr's DirectUpdateHandler2 to take advantage of all this work. Mark Miller took a first crack at it (sharing a single IndexWriter, letting lucene handle the concurrency issues, etc) but if there's a JIRA issue, I'm having trouble finding it. Looking around, i'm wondering if the direction would be a MultiSearcher living on top of our standard directory-based IndexReader as well as a custom Searchable that handles the newest documents - and then combines the two results? If you look at trunk, MultiSearcher has already gone away. -Yonik http://www.lucidimagination.com
Re: Will Result Grouping return documents that don't contain the specified group.field?
On Thu, Jan 6, 2011 at 5:55 PM, Andy angelf...@yahoo.com wrote: So by default Solr will not return documents that don't contain the specified group.field? Solr will. Documents without a value for that field should be grouped under the null value. -Yonik http://www.lucidimagination.com
Re: Replication: the web application [/solr] .. likely to create a memory leak
On Tue, Jan 4, 2011 at 9:34 AM, Robert Muir rcm...@gmail.com wrote: [junit] WARNING: test class left thread running: Thread[MultiThreadedHttpConnectionManager cleanup,5,main] I suppose we should move MultiThreadedHttpConnectionManager to CoreContainer. -Yonik http://www.lucidimagination.com
Re: SpatialTierQueryParserPlugin Loading Error
On Tue, Dec 28, 2010 at 8:54 PM, Adam Estrada estrada.a...@gmail.com wrote: I would gladly update this page if I could just get it working. http://wiki.apache.org/solr/SpatialSearch Everything on that wiki page should work w/o patches on trunk. I just ran through all of the examples, and everything seemed to be working fine. -Yonik http://www.lucidimagination.com
Re: Map failed at getSearcher
On Fri, Dec 24, 2010 at 10:23 AM, Robert Muir rcm...@gmail.com wrote: hmm, i think you are actually running out of virtual address space, even on 64-bit! I don't know if there are any x86 processors that allow 64 bits of address space yet. AFAIK, they are mostly 48 bit. http://msdn.microsoft.com/en-us/library/aa366778(v=VS.85).aspx#memory_limits Apparently windows limits you to 8TB virtual address space (ridiculous), so i think you should try one of the following: * continue using mmap directory, but specify MMapDirectoryFactory yourself, and specify the maxChunkSize parameter. The default maxChunkSize is Integer.MAX_VALUE, but with a smaller one you might be able to work around fragmentation problems. Hmmm, maybe we should default to a smaller value? Perhaps something like 1G wouldn't impact performance, but could help avoid OOM due to fragmentation? -Yonik http://www.lucidimagination.com
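For reference, specifying the factory yourself in solrconfig.xml might look like the sketch below; the 1GB value is just the figure floated above, not a recommendation, and the exact parameter syntax is an assumption:

```xml
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory">
  <!-- map the index in 1GB chunks instead of Integer.MAX_VALUE-sized ones -->
  <int name="maxChunkSize">1073741824</int>
</directoryFactory>
```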
Re: White space in facet values
On Wed, Dec 22, 2010 at 9:53 AM, Dyer, James james.d...@ingrambook.com wrote: The phrase solution works, as does escaping the space with a backslash: fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped like this (ampersands and parentheses come to mind)... One way to avoid escaping is to use the raw or term query parsers: fq={!raw f=Product}Electric Guitar In 4.0-dev, use {!term} since that will work with field types that need to transform the external representation into the internal one (like numeric fields need to do). http://wiki.apache.org/solr/SolrQuerySyntax -Yonik http://www.lucidimagination.com I assume you already have this indexed as string, not text... James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Andy [mailto:angelf...@yahoo.com] Sent: Wednesday, December 22, 2010 1:11 AM To: solr-user@lucene.apache.org Subject: White space in facet values How do I handle facet values that contain whitespace? Say I have a field Product that I want to facet on. A value for Product could be Electric Guitar. How should I handle the white space in Electric Guitar during indexing? What about when I apply the constraint fq=Product:Electric Guitar?
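If you would rather escape on the client side than use {!raw}/{!term}, a small helper along these lines covers the common special characters. This is a sketch, not an official Solr client API; the character set is the usual Lucene query-syntax list plus the space:

```python
import re

# Lucene/Solr query-syntax characters that need a backslash when they should
# be treated literally (space included so multi-word values survive unquoted).
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\&| ])')

def escape_query_value(value: str) -> str:
    return _SPECIAL.sub(r"\\\1", value)

print(escape_query_value("Electric Guitar"))  # Electric\ Guitar
# The term/raw query parsers avoid escaping entirely:
fq = "{!term f=Product}Electric Guitar"
```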
Re: Faceting memory requirements
On Tue, Dec 21, 2010 at 4:02 PM, Rok Rejc rokrej...@gmail.com wrote: Dear all, I have created an index with approx. 1.1 billion documents (around 500GB) running on Solr 1.4.1 (64-bit JVM). I want to enable faceted navigation on an int field which contains around 250 unique values. According to the wiki there are two methods: facet.method=fc which uses the field cache. This method should use MaxDoc*4 bytes of memory, which is around 4.1GB. facet.method=fc uses the fieldcache, but it uses the StringIndex for all field types currently, so you need to add in space for the string representation of all the unique values. But this is only 250, so given the large number of docs, your estimate should still be close. facet.method=enum which creates a bitset for each unique value. This method should use NumberOfUniqueValues * SizeOfBitSet which is around 32GB. A more efficient representation is used for a set when the set size is less than maxDoc/64. This set type uses an int per doc in the set, so it should use roughly the same amount of memory as a numeric fieldcache entry. Are my calculations correct?
My memory settings in Tomcat (windows) are: Initial memory pool: 4096 MB Maximum memory pool: 8192 MB (total 12GB in my test machine) I have tried to run a query (...facet=true&facet.field=PublisherId&facet.method=fc) but I am still getting OOM: HTTP Status 500 - Java heap space java.lang.OutOfMemoryError: Java heap space at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:703) at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224) at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:692) at org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:350) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:255) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:283) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:166) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72) at ... Any idea what I am doing wrong, or have I miscalculated the memory requirements? Perhaps you are already sorting by another field or faceting on another field that is causing a lot of memory to already be used, and this pushes it over the edge? Or perhaps the JVM simply can't find a contiguous area of memory this large? Line 703 is this: final int[] retArray = new int[reader.maxDoc()]; so it's failing to create the first array. Although the line after it is even more troublesome: String[] mterms = new String[reader.maxDoc()+1]; Although you only need an array of 250 to contain all the unique terms, the FieldCacheImpl starts out with maxDoc. I think trunk will be far better in this regard. You should also try facet.method=enum too. -Yonik http://www.lucidimagination.com
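The arithmetic behind those numbers, for reference. The maxDoc figure comes from the question; the 8-byte object-reference size is an assumption about a 64-bit JVM without compressed oops:

```python
max_doc = 1_100_000_000  # ~1.1 billion docs, from the question

# facet.method=fc: the StringIndex keeps one int ord per document...
ord_array_gb = max_doc * 4 / 2**30
print(round(ord_array_gb, 1))  # 4.1

# ...but FieldCacheImpl also allocates String[maxDoc+1] up front, so the
# reference array alone (assumed 8 bytes/ref) adds roughly:
mterms_refs_gb = (max_doc + 1) * 8 / 2**30
print(round(mterms_refs_gb, 1))  # 8.2
```

Which would explain why an 8GB heap is not enough for facet.method=fc on this index even though only 250 terms exist.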
Re: Why does Solr commit block indexing?
On Fri, Dec 17, 2010 at 8:05 AM, Grant Ingersoll gsing...@apache.org wrote: I'm not sure if there is an issue open, but I know I've talked w/ Yonik about this and a few other changes to the DirectUpdateHandler2 in the past. It does indeed need to be fixed. It stems from the APIs that were available at the time in Lucene 1.4. IIRC, Mark worked up a patch that avoided ever closing the reader I think, and delegated more of the concurrency control to Lucene (since it can handle it these days). I think maybe there was just a problem with rollback or something... -Yonik http://www.lucidimagination.com -Grant On Dec 17, 2010, at 7:04 AM, Renaud Delbru wrote: Hi Michael, thanks for your answer. Is the Solr team aware of the problem? Is there an issue opened about this, or ongoing work on it? Regards, -- Renaud Delbru On 16/12/10 16:45, Michael McCandless wrote: Unfortunately, (I think?) Solr currently commits by closing the IndexWriter, which must wait for any running merges to complete, and then opening a new one. This is really rather silly because IndexWriter has had its own commit method (which does not block ongoing indexing nor merging) for quite some time now. I'm not sure why we haven't switched over already... there must be some trickiness involved. Mike On Thu, Dec 16, 2010 at 9:39 AM, Renaud Delbru renaud.del...@deri.org wrote: Hi, See log at [1]. We are using the latest snapshot of lucene_branch3.1. We have configured Solr to use the ConcurrentMergeScheduler: <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/> When a commit() runs, it blocks indexing (all incoming update requests are blocked until the commit operation is finished) ... at the end of the log we notice a 4 minute gap during which none of the Solr clients trying to add data receive any attention. This is a bit annoying as it leads to timeout exceptions on the client side.
Here, the commit time is only 4 minutes, but it can be larger if there are merges of large segments. I thought Solr was able to handle commits and updates at the same time: the commit operation should be done in the background, and the server should still continue to receive update requests (maybe at a slower rate than normal). But it looks like this is not the case. Is this normal behaviour? [1] http://pastebin.com/KPkusyVb Regards -- Renaud Delbru -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search
Re: WARNING: re-index all trunk indices!
On Fri, Dec 17, 2010 at 11:18 AM, Michael McCandless luc...@mikemccandless.com wrote: If you are using Lucene's trunk (nightly build) release, read on... I just committed a change (for LUCENE-2811) that changes the index format on trunk, thus breaking (w/ likely strange exceptions on reading the segments_N file) any trunk indices created in the past week or so. For reference, the exception I got trying to start Solr with an older index on Windows is below. -Yonik http://www.lucidimagination.com SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095) at org.apache.solr.core.SolrCore.init(SolrCore.java:587) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.mortbay.start.Main.invokeMain(Main.java:194) at org.mortbay.start.Main.start(Main.java:534) at org.mortbay.start.Main.start(Main.java:441) at org.mortbay.start.Main.main(Main.java:119) Caused by: java.io.IOException: read past EOF at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:242) at org.apache.lucene.store.ChecksumIndexInput.readBytes(ChecksumIndexInput.java:48) at org.apache.lucene.store.DataInput.readString(DataInput.java:121) at org.apache.lucene.store.DataInput.readStringStringMap(DataInput.java:148) at org.apache.lucene.index.SegmentInfo.init(SegmentInfo.java:192) at org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:57) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:220) at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:90) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623) at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86) at org.apache.lucene.index.IndexReader.open(IndexReader.java:437) at org.apache.lucene.index.IndexReader.open(IndexReader.java:316) at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38) at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084) ... 31 more
Re: bulk commits
On Thu, Dec 16, 2010 at 3:06 PM, Dennis Gearon gear...@sbcglobal.net wrote: That easy, huh? Heck, this gets better and better. BTW, how about escaping? The CSV escaping? It's configurable to allow for loading different CSV dialects. http://wiki.apache.org/solr/UpdateCSV By default it uses double quote encapsulation, like Excel would. The bottom of the wiki page shows how to configure tab separators and backslash escaping like MySQL produces by default. -Yonik http://www.lucidimagination.com Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Adam Estrada estrada.adam.gro...@gmail.com To: Dennis Gearon gear...@sbcglobal.net; solr-user@lucene.apache.org Sent: Thu, December 16, 2010 10:58:47 AM Subject: Re: bulk commits This is how I import a lot of data from a CSV file. There are close to 100k records in there. Note that you can either pre-define the column names using the fieldnames param like I did here *or* include header=true, which will automatically pick up the column header if your file has it.

curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"

This seems to load everything into some kind of temporary location before it's actually committed. If something goes wrong there is a rollback feature that will undo anything that happened before the commit. As far as batching a bunch of files, I copied and pasted the following into Cygwin and it worked just fine.
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,lat,lng,countrycode,population,elevation,gtopo30,timezone,modificationdate,cat&stream.file=C:\tmp\cities1000.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xab.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xac.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xad.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xae.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xaf.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xag.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xah.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:8983/solr/update/csv?commit=true&separator=%2C&fieldnames=id,name,asciiname,latitude,longitude,featureclass,featurecode,countrycode,admin1code,admin2code,admin3code,admin4code,population,elevation,gtopo30,timezone,modificationdate&stream.file=C:\tmp\xai.csv&overwrite=true&stream.contentType=text/plain;charset=utf-8"
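Those near-identical commands can also be generated rather than pasted. A small sketch (file names and field list copied from the commands above, forward slashes used for the path):

```python
fields = ("id,name,asciiname,latitude,longitude,featureclass,featurecode,"
          "countrycode,admin1code,admin2code,admin3code,admin4code,"
          "population,elevation,gtopo30,timezone,modificationdate")

commands = [
    "curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%2C"
    f"&fieldnames={fields}&stream.file=C:/tmp/{name}.csv&overwrite=true"
    "&stream.contentType=text/plain;charset=utf-8'"
    for name in ["xab", "xac", "xad", "xae", "xaf", "xag", "xah", "xai"]
]
print(len(commands))  # 8
```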
Re: Memory use during merges (OOM)
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless luc...@mikemccandless.com wrote: If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist) it'd be best if possible to change the app to not call .updateDocument if you know the Term doesn't exist. FWIW, if you're going to add a batch of documents you know aren't already in the index, you can use the overwrite=false parameter for that Solr update request. -Yonik http://www.lucidimagination.com
Re: Faceted Search Slows Down as index gets larger
Another thing you can try is trunk. This specific case has been improved by an order of magnitude recently. The case that has been sped up is initial population of the filterCache, or when the filterCache can't hold all of the unique values, or when faceting is configured to not use the filterCache much of the time via facet.enum.cache.minDf. -Yonik http://www.lucidimagination.com On Thu, Dec 16, 2010 at 6:39 PM, Furkan Kuru furkank...@gmail.com wrote: I am sorry for raising up this thread after 6 months, but we still have problems with faceted search on full-text fields. We try to get the most frequent words in a text field created within 1 hour. The faceted search takes too much time; even though the number of matching documents (created_at within 1 HOUR) is constant (10-20K), as the total number of documents increases (now 20M) the query gets slower. Solr throws exceptions and does not respond. We have to restart and delete old docs. (3G RAM) The index is around 2.2 GB, and we store the data in Solr as well. The documents are small.

$response = $solr->search('created_at:[NOW-'.$hours.'HOUR TO NOW]', 0, 1, array(
    'facet' => 'true',
    'facet.field' => $field,
    'facet.mincount' => 1,
    'facet.method' => 'enum',
    'facet.enum.cache.minDf' => 100
));

Yonik had suggested distributed search. But I am not sure if we have set every configuration correctly, for example the Solr caches, if they are related to faceted searching. We use the default values:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

Any help is appreciated. On Sun, Jun 6, 2010 at 8:54 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Sun, Jun 6, 2010 at 1:12 PM, Furkan Kuru furkank...@gmail.com wrote: We try to provide real-time search. So the index is changing almost every minute. We commit for every 100 documents received. The facet search is executed every 5 mins.
OK, that's the problem - pretty much every facet search is rebuilding the facet cache, which takes most of the time (and facet.fc is more expensive than facet.enum in this regard). One strategy is to use distributed search... have some big cores that don't change often, and then small cores for the new stuff that changes rapidly. -Yonik http://www.lucidimagination.com -- Furkan Kuru
Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?
On Mon, Dec 13, 2010 at 8:47 PM, John Russell jjruss...@gmail.com wrote: Wow, you read my mind. We are committing very frequently. We are trying to get as close to realtime access to the stuff we put in as possible. Our current commit time is... ahem every 4 seconds. Is that insane? Not necessarily insane, but challenging ;-) I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml. When that is exceeded, a commit will fail (this just means a new searcher won't be opened on that commit... the docs will be visible with the next commit that does succeed.) -Yonik http://www.lucidimagination.com
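In solrconfig.xml the suggested setting is a sketch like this:

```xml
<!-- fail (rather than pile up) commits that arrive while a searcher is
     still warming; the next successful commit picks up all pending docs -->
<maxWarmingSearchers>1</maxWarmingSearchers>
```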
Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?
On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Yonik, how will maxWarmingSearchers in this scenario effect replication? If a slave is pulling down new indexes so quickly that the warming searchers would ordinarily pile up, but maxWarmingSearchers is set to 1 what happens? Like any other commits, this will limit the number of searchers warming in the background to 1. If a commit is called, and that tries to open a new searcher while another is already warming, it will fail. The next commit that does succeed will have all the updates though. Today, this maxWarmingSearchers check is done after the writer has closed and before a new searcher is opened... so calling commit too often won't affect searching, but it will currently affect indexing speed (since the IndexWriter is constantly being closed/flushed). -Yonik http://www.lucidimagination.com
Re: Userdefined Field type - Faceting
Perhaps try overriding indexedToReadable() also? -Yonik http://www.lucidimagination.com On Mon, Dec 13, 2010 at 10:00 PM, Viswa S svis...@hotmail.com wrote: Hello, We implemented an IP-Addr field type which internally stored the ips as hex-ed string (e.g. 192.2.103.29 will be stored as c002671d). My toExternal and toInternal methods for appropriate conversion seems to be working well for query results, but however when faceting on this field it returns the raw strings. in other words the query response would have 192.2.103.29, but facet on the field would return int name=c002671d1/int Why are these methods not used by the faceting component to convert the resulting values? Thanks Viswa
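The round trip that indexedToReadable() would have to perform can be sketched outside Solr; the example values are taken from the question, and this is just an illustration of the mapping, not the Solr FieldType API:

```python
def ip_to_hex(ip: str) -> str:
    # internal (indexed) form: each octet as two hex digits
    return "".join(f"{int(octet):02x}" for octet in ip.split("."))

def hex_to_ip(hexed: str) -> str:
    # external (readable) form, i.e. what indexedToReadable should return
    return ".".join(str(int(hexed[i:i + 2], 16)) for i in range(0, 8, 2))

print(ip_to_hex("192.2.103.29"))  # c002671d
print(hex_to_ip("c002671d"))      # 192.2.103.29
```

Faceting walks the indexed terms directly, which is why the component must call the field type's indexed-to-readable conversion rather than the stored-value path used for query results.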
Re: Shards + dismax - scoring process?
On Sat, Dec 11, 2010 at 2:18 AM, bbarani bbar...@gmail.com wrote: Also, if I try to sort the query result from shards.. will sorting happen on the consolidated data or on each individual core's data? Both - to find the top 10 docs by any sort, the top 10 docs from each shard are collected and then sorted to find the top 10 out of those. I am just trying to figure out the best possible way to implement distributed search without affecting search relevancy. The IDF part of the relevancy score is the only place where distributed search scoring won't match up with non-distributed scoring, because the document frequency used for a term is local to each core instead of global. If you distribute your documents fairly randomly to the different shards, this won't matter. There is a patch in the works to add global IDF, but I think that even when it's committed, it will default to off because of the higher cost associated with it. -Yonik http://www.lucidimagination.com
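The merge step described above can be sketched in a few lines; the doc ids and scores below are invented for illustration:

```python
import heapq

# Each shard returns its own top-k (doc id, score) pairs, already sorted.
shard1 = [("d3", 9.2), ("d7", 7.5), ("d1", 6.0)]
shard2 = [("d9", 8.8), ("d2", 7.9), ("d5", 3.1)]

# The coordinator merges the per-shard lists to get the global top-k.
top3 = heapq.nlargest(3, shard1 + shard2, key=lambda pair: pair[1])
print([doc for doc, _ in top3])  # ['d3', 'd9', 'd2']
```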
Re: Map size must not be negative with spatial results + php serialized
On Wed, Dec 8, 2010 at 9:45 AM, Markus Jelsma markus.jel...@openindex.io wrote: I know, but since it's an Apache component throwing the exception, I figured someone just might know more about this. That's fine - it could be a Solr bug too. IMO, solr-user traffic just needs to be Solr-related and hopefully useful to other users. -Yonik http://www.lucidimagination.com
Webcast: Better Search Results Faster with Apache Solr and LucidWorks Enterprise
We're holding a free webinar about relevancy enhancements in our commercial version of Solr. Details below. -Yonik http://www.lucidimagination.com - Join us for a free technical webcast Better Search Results Faster with Apache Solr and LucidWorks Enterprise Thursday, December 16, 2010 11:00 AM PST / 2:00 PM EST / 20:00 CET Click here to sign up http://www.eventsvc.com/lucidimagination/121610?trk=AP In the key dimensions of search relevancy and query-targeted results, users have become accustomed to internet-search style facilities like page-rank, user-driven feedback, auto-suggest and more. Even with the power of Apache Lucene/Solr, building such features into your own search application is easier said than done. Now, with LucidWorks Enterprise, the search solution development platform built on the Solr/Lucene open source technology, developing killer search apps with these features and more is faster, simpler, and more powerful than ever before! Join Andrzej Bialecki, Lucene/Solr Committer and inventor of the Luke index utility, for a hands-on technical workshop that details how LucidWorks Enterprise puts powerful search and relevancy at your fingertips -- at a fraction of the time and effort required to program them yourself with native Apache Solr. Andrzej will discuss and present how you can use LucidWorks Enterprise for: * Click Scoring to automatically configure relevance for most popular results * Simplified implementation of auto-complete and did-you-mean functionality * Unsupervised feedback to automatically provide relevance improvement on every query Click here to sign up http://www.eventsvc.com/lucidimagination/121610?trk=AP -- About the presenter: Andrzej Bialecki is a committer of the Apache Lucene/Solr project, a Lucene PMC member, and chairman of the Apache Nutch project. He is also the author of Luke, the Lucene Index Toolbox. 
Andrzej participates in many commercial projects that use Lucene/Solr, Nutch and Hadoop to implement enterprise and vertical search. -- Presented by Lucid Imagination, the commercial entity exclusively dedicated to Apache Lucene/Solr open source search technology. LucidWorks Enterprise, our search solution development platform, helps you build better search applications more quickly and productively. We also offer solutions including SLA-based support, professional training, best practices consulting, free developer downloads, and free documentation. Follow us on Twitter: twitter.com/LucidImagineer. -- Apache Lucene and Apache Solr are trademarks of the Apache Software Foundation.
Re: How to handle multivalued hierarchical facets?
Hoss had a great webinar on faceting that also covered how you could do hierarchical. http://www.lucidimagination.com/solutions/webcasts/faceting See taxonomy facets, about 28 minutes in. -Yonik http://www.lucidimagination.com On Wed, Dec 8, 2010 at 5:28 PM, Andy angelf...@yahoo.com wrote: I have facets that are hierarchical. For example, Location can be represented as this hierarchy: Country > State > City. If each document can only have a single value for each of these facets, then I can just use separate fields for each facet. But if multiple values are allowed, then that approach would not work. For example if a document has 2 Location values: US > CA > San Francisco and US > MA > Boston. If I just put the values CA and MA in the field State, and San Francisco and Boston in City, faceting would not work. Someone could select CA and the value Boston would be displayed for the field City. How do I handle this use case? Thanks
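One common answer (the "taxonomy facet" trick covered in that webinar) is to index each full path as depth-prefixed tokens in one multivalued field, so that a selection like CA always carries its full ancestry. A minimal sketch of the token encoding — the separator and field layout here are illustrative assumptions, not a Solr API:

```python
def path_tokens(path):
    """Encode one hierarchy path (e.g. Country > State > City) as
    depth-prefixed tokens for a multivalued facet field."""
    return ["%d/%s" % (depth, "/".join(path[: depth + 1]))
            for depth in range(len(path))]

# A document with two Location values gets the tokens for both paths:
tokens = (path_tokens(["US", "CA", "San Francisco"])
          + path_tokens(["US", "MA", "Boston"]))
```

Faceting with facet.prefix=1/US then returns only states under US, and selecting US > CA drills down with facet.prefix=2/US/CA, so Boston can never appear under CA.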
Re: Changing a solr schema from non-stored to stored on the fly
On Wed, Dec 8, 2010 at 6:07 PM, Kaktu Chakarabati jimmoe...@gmail.com wrote: Can I do this? i.e change that value in schema, and then incrementally re-index documents to populate it? would that work? what would be returned if at all for documents that werent re-indexed post-schema change? Yes, this should work fine. A document that was added with an unstored field will act exactly like a document with that field missing. -Yonik http://www.lucidimagination.com
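For reference, the change being discussed is just flipping the stored attribute in schema.xml (the field name here is only an example):

```xml
<!-- before: contents indexed but not retrievable -->
<field name="body" type="text" indexed="true" stored="false"/>
<!-- after: documents (re-)indexed from now on also store the raw value -->
<field name="body" type="text" indexed="true" stored="true"/>
```

Documents indexed before the change simply return no value for the field until they are re-added.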
Re: Field Collapsing - sort by group count, get total groups
On Tue, Dec 7, 2010 at 7:03 AM, ssetem sse...@googlemail.com wrote: I wondered if it is possible to sort groups by the total within the group, and to bring back the total number of groups? That is planned, but not currently implemented. You can use faceting to get both totals and sort by highest total though. Total number of groups is a different problem - we don't return it because we don't know. It will take a different algorithm (that's more memory intensive) to find out the total number of groups. If the number is unlikely to be too large, you could just return all groups (or use faceting to do that more efficiently). -Yonik http://www.lucidimagination.com
Re: Field Collapsing - sort by group count, get total groups
On Tue, Dec 7, 2010 at 9:07 AM, ssetem sse...@googlemail.com wrote: Thanks for the reply. How would I get the total number of possible facets (non-zero)? I've searched around but have had no luck. Only current way would be to request them all. Just like field collapsing, this is a number we don't (generally) have. There are optimizations like short-circuiting on the docfreq that would need to be disabled to generate that count. -Yonik http://www.lucidimagination.com
Re: autocommit commented out -- what is the default?
On Sat, Dec 4, 2010 at 10:36 AM, Brian Whitman br...@echonest.com wrote: Hi, if you comment out the block in solrconfig.xml <!-- <autoCommit> <maxDocs>1</maxDocs> <maxTime>60</maxTime> </autoCommit> --> Does this mean that (a) commits never happen automatically or (b) some default autocommit is applied? Commented out means they never happen automatically (i.e., no default). In general commitWithin is a better strategy to use... bulk updates can use a large value (or no value w/ explicit commit at end) for better indexing performance, while other updates can use a smaller value depending on how soon the update needs to be visible. -Yonik http://www.lucidimagination.com
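Unlike autoCommit, commitWithin is set per update request rather than in solrconfig.xml; for example (the value and field names are illustrative):

```xml
<add commitWithin="10000">  <!-- make this update visible within 10 seconds -->
  <doc>
    <field name="id">doc1</field>
  </doc>
</add>
```

Bulk loads can omit the attribute and issue an explicit commit at the end.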
Re: ramBufferSizeMB not reflected in segment sizes in index
On Wed, Dec 1, 2010 at 3:01 PM, Shawn Heisey s...@elyograg.org wrote: I have seen this. In Solr 1.4.1, the .fdt, .fdx, and the .tv* files do not segment, but all the other files do. I can't remember whether it behaves the same under 3.1, or whether it also creates these files in each segment. Yep, that's the shared doc store (where stored fields go.. the non-inverted part of the index), and it works like that in 3.x and trunk too. It's nice because when you merge segments, you don't have to re-copy the docs (provided you're within a single indexing session). There have been discussions about removing it in trunk though... we'll see. -Yonik http://www.lucidimagination.com
Re: ArrayIndexOutOfBoundsException for query with rows=0 and sort param
On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke martin.grot...@googlemail.com wrote: Still I'm wondering, why this issue does not occur with the plain example solr setup with 2 indexed docs. Any explanation? It's an old option you have in your solrconfig.xml that causes a different code path to be followed in Solr: <!-- An optimization that attempts to use a filter to satisfy a search. If the requested sort does not include score, then the filterCache will be checked for a filter matching the query. If found, the filter will be used as the source of document ids, and then the sort will be applied to that. --> <useFilterForSortedQuery>true</useFilterForSortedQuery> Most apps would be better off commenting that out or setting it to false. It only makes sense when a high number of queries will be duplicated, but with different sorts. But: why is your app doing this? Ie, if numHits (rows) is 0, the only useful thing you can get is totalHits? Actually I don't know this (yet). Normally our search logic should optimize this and ignore a requested sorting with rows=0, but there seems to be a case that circumvents this - still figuring out. Still I think we should fix it in Lucene -- it's a nuisance to push such corner case checks up into the apps. I'll open an issue... Just for the record, this is https://issues.apache.org/jira/browse/LUCENE-2785 One question: as leaving out sorting leads to better performance, this should also be true for rows=0. Or is lucene/solr already that clever that it makes this optimization (ignoring sort) automatically? Solr has always special-cased this case and avoided sorting altogether (for the normal code path)... but overlooked it when useFilterForSortedQuery=true. -Yonik http://www.lucidimagination.com
Re: entire farm fails at the same time with OOM issues
On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen rober...@buy.com wrote: My question is this. Why in the world would all of my slaves, after running fine for some days, suddenly all at the exact same minute experience OOM heap errors and go dead? If there is no change in query traffic when this happens, then it's due to what the index looks like. My guess is a large index merge happened, which means that when the searchers re-open on the new index, it requires more memory than normal (much less can be shared with the previous index). I'd try bumping the heap a little bit, and then optimizing once a day during off-peak hours. If you still get OOM errors, bump the heap a little more. -Yonik http://www.lucidimagination.com
Re: Preventing index segment corruption when windows crashes
On Mon, Nov 29, 2010 at 10:46 AM, Peter Sturge peter.stu...@gmail.com wrote: If a Solr index is running at the time of a system halt, this can often corrupt a segments file, requiring the index to be -fix'ed by rewriting the offending file. Really? That shouldn't be possible (if you mean the index is truly corrupt - i.e. you can't open it). -Yonik http://www.lucidimagination.com
Re: solr admin
On Mon, Nov 29, 2010 at 8:02 PM, Ahmet Arslan iori...@yahoo.com wrote: in Solr admin (http://localhost:8180/services/admin/) I can specify something like: +category_id:200 +xxx:300 but how can I specify a sort option? sort:category_id+asc There is an [FULL INTERFACE] /admin/form.jsp link but it does not have sort option. It seems that you need to append it to your search url. Heh - yeah... that's an old interface, from the times when sort was specified along with the query. Can someone provide a patch to add a way to specify the sort? -Yonik http://www.lucidimagination.com
Re: geospatial
On Wed, Nov 24, 2010 at 2:41 PM, Dennis Gearon gear...@sbcglobal.net wrote: What is the recommended Solr version and/or plugin combination to get geospatial search up and running the quickest and easiest? It depends on what capabilities you need. The current state of what is committed to trunk is reflected here: http://wiki.apache.org/solr/SpatialSearch -Yonik http://www.lucidimagination.com
Re: Problem with synonyms
On Sat, Nov 20, 2010 at 5:59 AM, sivaprasad sivaprasa...@echidnainc.com wrote: Even after expanding the synonyms also i am unable to get same results. What you are trying to do should work with index-time synonym expansion. Just make sure to remove the synonym filter at query time (or use a synonym filter w/o multi-word synonyms). What's the original text in the document you are trying to match? -Yonik http://www.lucidimagination.com
Re: Problem with synonyms
On Mon, Nov 22, 2010 at 10:29 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Sat, Nov 20, 2010 at 5:59 AM, sivaprasad sivaprasa...@echidnainc.com wrote: Even after expanding the synonyms also i am unable to get same results. What you are trying to do should work with index-time synonym expansion. Just make sure to remove the synonym filter at query time (or use a synonym filter w/o multi-word synonyms). Actually, to be more precise, the current query-time restriction is that you can't produce synonyms of different lengths. Hence you could normalize High Definition TV to hdtv at both query time and index time. Optionally you can expand to both High Definition TV and hdtv at index time (in which case you would normally turn off query time synonym processing). -Yonik http://www.lucidimagination.com
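In synonyms.txt terms, the two approaches look like this (the entries are illustrative):

```
# normalize: map the multi-word form to hdtv, at both index and query time
high definition tv => hdtv

# expand: index all variants (use with expand="true" on the index-time
# filter, and turn off the query-time synonym filter)
hdtv, high definition tv
```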
Re: Must require quote with single word token query?
On Tue, Nov 16, 2010 at 10:28 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote: I have one question related to single word token with dismax query. In order to be found I need to add the quote around the search query all the time. This is quite hard for me to do since it is part of full text search. Here is my solr query and field type definition (Solr 1.4): <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.TrimFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> </analyzer> </fieldType> <field name="keyphrase" type="text_keyword" indexed="true" stored="false" multiValued="true"/> With this query q=smart%20mobile&qf=keyphrase&debugQuery=on&defType=dismax, solr returns nothing. However, with quotes on the search query q="smart mobile"&qf=keyphrase&debugQuery=on&defType=dismax, the result is found. Is it a must to use quote for a single word token field? Yes, you must currently quote tokens if they contain whitespace - otherwise the query parser first breaks on whitespace before doing analysis on each part separately. Using dismax is an odd choice if you are only querying on keyphrase though. You might look at the field query parser - it is a basic single-field single-value parser with no operators (hence no need to escape any special characters). q={!field f=keyphrase}smart%20mobile or you can decompose it using param dereferencing (sometimes easier to construct) q={!field f=keyphrase v=$qq}&qq=smart%20mobile -Yonik http://www.lucidimagination.com
Re: Must require quote with single word token query?
On Fri, Nov 19, 2010 at 9:41 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote: Wow, i never knew this syntax before. What's that called? I dubbed it local params since it adds local info to a parameter (think extra metadata, like XML attributes on an element). http://wiki.apache.org/solr/LocalParams It's used mostly to invoke different query parsers, but it's also used to add extra metadata to faceting commands (and is required for stuff like multi-select faceting): http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams -Yonik http://www.lucidimagination.com
-- Sent from my mobile device Chhorn Chamnap http://chamnapchhorn.blogspot.com/
result grouping / field collapsing changes
We've recently added randomized testing for result grouping that resulted in finding and fixing a number of bugs. If you've been using this feature, you should move to the latest trunk version. I've also added a section at the bottom of the wiki page to list current limitations. http://wiki.apache.org/solr/FieldCollapsing -Yonik http://www.lucidimagination.com
Re: hash uniqueKey generation?
On Tue, Nov 16, 2010 at 5:31 AM, Dennis Gearon gear...@sbcglobal.net wrote: hashing is not 100% guaranteed to produce unique values. But if you go to enough bits with a good hash function, you can get the odds lower than the odds of something else changing the value like cosmic rays flipping a bit on you. -Yonik http://www.lucidimagination.com
Re: hash uniqueKey generation?
On Tue, Nov 16, 2010 at 9:05 PM, Dennis Gearon gear...@sbcglobal.net wrote: Read up on WikiPedia, but I believe that no Hash Function is much good above 50% of the address space it generates. 50% is way too high - collisions will happen before that. But given that something like MD5 has 128 bits, that's 3.4e38, so even a small fraction of that address space will work. The probabilities follow the birthday problem: http://en.wikipedia.org/wiki/Birthday_problem Using a 128 bit hash, you can hash 26B docs with a hash collision probability of 1e-18 (and yes, that is lower than the probability of something else going wrong). It also says: For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more. -Yonik http://www.lucidimagination.com
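The quoted figures are easy to check with the standard birthday-problem approximation p ≈ 1 − exp(−n² / 2^(b+1)) for n items hashed into a b-bit space:

```python
import math

def collision_probability(n_docs, hash_bits):
    """Birthday-problem approximation of the chance that any two of
    n_docs hash values collide in a hash_bits-bit space."""
    # -expm1(x) computes 1 - exp(x) without underflow for tiny probabilities
    return -math.expm1(-(n_docs ** 2) / 2 ** (hash_bits + 1))

p_26b = collision_probability(26_000_000_000, 128)    # ~1e-18, as stated above
p_820b = collision_probability(820_000_000_000, 128)  # ~1e-15, the disk-error range
```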
Re: Solr Negative query
On Mon, Nov 15, 2010 at 12:42 AM, Viswa S svis...@hotmail.com wrote: Apologies for starting a new thread again, my mailing list subscription didn't finalize till later than Yonik's response. Using Field1:Val1 AND (*:* NOT Field2:Val2) works, thanks. Does my original query Field1:Value1 AND (NOT Field2:Val2) fall into the "need the *:* trick if all of the clauses of a boolean query are negative" case? Yes - the parens create a new boolean query, and all of its clauses are negative. The top level boolean query has that as a required clause, hence it won't match anything because that sub-query won't match anything. But, your original example without the parens should have worked. -Yonik http://www.lucidimagination.com
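The behavior can be sketched with a toy model of Lucene's boolean clauses (illustrative only, not Lucene's actual code): positive clauses generate candidate documents, negative clauses only remove them, so a sub-query with nothing but negative clauses has no candidates to subtract from.

```python
def boolean_matches(docs, required=(), prohibited=()):
    """Toy model: required clauses generate candidates, prohibited
    clauses only remove them. All-negative => empty candidate set."""
    if not required:
        return set()  # pure-negative query: no clause produces candidates
    hits = set(range(len(docs)))
    for clause in required:
        hits &= {i for i, d in enumerate(docs) if clause(d)}
    for clause in prohibited:
        hits -= {i for i, d in enumerate(docs) if clause(d)}
    return hits

docs = [{"Field1": "Val1", "Field2": "Val2"},
        {"Field1": "Val1", "Field2": "x"}]
match_all = lambda d: True                    # plays the role of *:*
is_val2 = lambda d: d["Field2"] == "Val2"

inner_bad = boolean_matches(docs, prohibited=[is_val2])                 # (NOT Field2:Val2)
inner_ok = boolean_matches(docs, required=[match_all], prohibited=[is_val2])  # (*:* NOT Field2:Val2)
```

With the *:* clause the sub-query matches doc 1; without it the sub-query, and therefore the whole required-clause query, matches nothing.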
Re: Solr Negative query
On Sun, Nov 14, 2010 at 4:17 AM, Leonardo Menezes leonardo.menez...@googlemail.com wrote: try Field1:Val1 AND (*:* NOT Field2:Val2), that should work ok That should be equivalent to Field1:Val1 -Field2:Val2 You only need the *:* trick if all of the clauses of a boolean query are negative. -Yonik http://www.lucidimagination.com
Re: facetting when using field collapsing
On Wed, Nov 10, 2010 at 9:12 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote: The above wiki page seems to be out of date. Reading the comments in https://issues.apache.org/jira/browse/SOLR-236 it seems like group should be replaced with collapse. The Wiki page is not expansive, but I've tried to make it easy for people to get started, and make everything there correct. If you can point out what is incorrect, we can fix! With regards to faceting, it works, but is unaffected by grouping (i.e. facet counts will be the same as a non-grouped response). -Yonik http://www.lucidimagination.com
Re: facetting when using field collapsing
On Sat, Nov 13, 2010 at 10:46 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote: On 13.11.2010, at 10:30, Yonik Seeley wrote: On Wed, Nov 10, 2010 at 9:12 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote: The above wiki page seems to be out of date. Reading the comments in https://issues.apache.org/jira/browse/SOLR-236 it seems like group should be replaced with collapse. The Wiki page is not expansive, but I've tried to make it easy for people to get started, and make everything there correct. If you can point out what is incorrect, we can fix! With regards to faceting, it works, but is unaffected by grouping (i.e. facet counts will be the same as a non-grouped response). The wiki page uses group, but in the ticket all examples always speak of collapse. Which syntax is correct? It's group - try out the examples on the wiki page. JIRA tickets are for development, not documentation. Other than that the ticket also speaks of a few parameters not mentioned, specifically if facetting should happen before or after group/collapse: collapse.facet=before|after This currently doesn't exist in the committed code, hence the param is not documented. Grouping/collapsing currently has no effect on faceting (i.e. set group=false and you will get a non grouped result with the exact same facet counts). -Yonik http://www.lucidimagination.com
Re: IndexableBinaryStringTools (was FieldCache)
On Sat, Nov 13, 2010 at 1:50 PM, Steven A Rowe sar...@syr.edu wrote: Looks to me like the returned value is in a Solr-internal form of XML character escaping: \u0000 is represented as &#0; and \u0008 is represented as &#8;. (The escaping code is in solr/src/java/org/apache/solr/common/util/XML.java.) Yep, there is no legal way to represent some unicode code points in XML. You can get the value back in its original binary form by unescaping the /&#[0-9]+;/ format. Here is a test illustrating this fix that I added to SolrExampleTests, then ran from SolrExampleEmbeddedTest: The problem here is that one might then unescape what was meant to be a literal &#8; One could come up with a full escaping mechanism over XML I suppose... but I'm not sure it would be worth it. -Yonik http://www.lucidimagination.com
FAST ESP - Solr migration webinar
We're holding a free webinar on migration from FAST to Solr. Details below. -Yonik http://www.lucidimagination.com = Solr To The Rescue: Successful Migration From FAST ESP to Open Source Search Based on Apache Solr Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT) Hosted by SearchDataManagement.com For anyone concerned about the future of their FAST ESP applications since the purchase of Fast Search & Transfer by Microsoft in 2008, this webinar will provide valuable insights on making the switch to Solr. A three-person roundtable will discuss factors driving the need for FAST ESP alternatives, differences between FAST and Solr, a typical migration project lifecycle methodology, complementary open source tools, best practices, customer examples, and recommended next steps. The speakers for this webinar are: Helge Legernes, Founding Partner & CTO of Findwise Michael McIntosh, VP Search Solutions for TNR Global Eric Gaumer, Chief Architect for ESR Technology. For more information and to register, please go to: http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2 =
Re: solr 4.0 - pagination
On Sun, Nov 7, 2010 at 10:55 AM, Papp Richard ccode...@gmail.com wrote: this is fantastic, but can you tell any time it will be ready ? It already is ;-) Grab the latest trunk or the latest nightly build. -Yonik http://www.lucidimagination.com
Re: solr 4.0 - pagination
On Sun, Nov 7, 2010 at 2:45 PM, Papp Richard ccode...@gmail.com wrote: Hi Yonik, I've just tried the latest stable version from nightly build: apache-solr-4.0-2010-11-05_08-06-28.war I have some concerns however: I have 3 documents; 2 in the first group, 1 in the 2nd group. 1. I got for matches 3 - which is good, but I still don't know how many groups I have. (using start = 0, rows = 10) 2. as far as I see the start / rows is working now, but the matches is returned incorrectly = it said matches = 3 instead of = 1, when I used start = 1, rows = 1 matches is the number of documents before grouping, so start/rows or group.offset/group.limit will not affect this number. so can you help me, how to compute how many pages I'll have, because the matches can't be used for this? Solr doesn't even know given the current algorithm, hence it can't return that info. The issue is that to calculate the total number of groups, we would need to keep each group in memory (which could cause a big blowup if there are tons of groups). The current algorithm only keeps the top 10 groups (assuming rows=10) in memory at any one time, hence it has no idea what the total number of groups is. -Yonik http://www.lucidimagination.com
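A rough sketch of why the count is unknowable under this approach (my own simplification, not the actual Solr collector): only the best `limit` groups are retained while streaming over matches, so any group that falls out of the top set leaves no trace behind.

```python
def collect_top_groups(docs, limit=10):
    """Keep only the best-scoring `limit` groups while streaming over
    (group, score) pairs. Evicted groups are forgotten entirely, so the
    total number of distinct groups cannot be reported afterwards."""
    best = {}  # group value -> best score seen so far
    for group, score in docs:
        if group not in best or score > best[group]:
            best[group] = score
        if len(best) > limit:
            del best[min(best, key=best.get)]  # evict current worst group
    return sorted(best, key=best.get, reverse=True)

groups = collect_top_groups(
    [("a", 1.0), ("b", 0.5), ("a", 2.0), ("c", 0.1)], limit=2)
```

After collection only groups "a" and "b" survive; nothing records that a third group ever existed, which is exactly why matches counts documents, not groups.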
Re: Negative or zero value for fieldNorm
On Thu, Nov 4, 2010 at 8:04 AM, Markus Jelsma markus.jel...@openindex.io wrote: The question remains, why does the title field return a fieldNorm=0 for many queries? Because the index-time boost was set to 0 when the doc was indexed. I can't say how that happened... look to your indexing code. And a subquestion, does the luke request handler return boost values for documents? I know i get boost values for fields but i haven't seen boost values for documents. The doc boost is just multiplied into each field boost and doesn't have a separate representation in the index. -Yonik http://www.lucidimagination.com
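The arithmetic behind that answer, as a sketch of Lucene's default similarity at the time (the stored norm is additionally quantized to a single byte, which this ignores):

```python
import math

def field_norm(index_time_boost, num_terms):
    """fieldNorm = lengthNorm * index-time boost, where the default
    lengthNorm is 1 / sqrt(number of terms in the field)."""
    return index_time_boost / math.sqrt(num_terms)

zero = field_norm(0.0, 25)    # a 0 boost forces the norm, and the score, to 0
normal = field_norm(1.0, 16)  # default boost of 1 leaves just the length norm
```

Any document indexed with a 0 boost therefore scores 0 on that field no matter how well its terms match.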
Re: Negative or zero value for fieldNorm
On Thu, Nov 4, 2010 at 9:51 AM, Markus Jelsma markus.jel...@openindex.io wrote: I've done some testing with the example docs and it behaves similar when there is a zero doc boost. Luke, however, does not show me the index-time boosts. Remember that the norm is a product of the length norm and the index time boost... it's recorded as a single number in the index. Both document and field boosts are not visible in Luke's output. I've changed doc boost and field boosts for the mp500.xml document but all i ever see returned is boost=1.0. Is this correct? Perhaps you still have omitNorms=true for the field you are querying? -Yonik http://www.lucidimagination.com
Re: Negative or zero value for fieldNorm
Regarding Negative or zero value for fieldNorm, I don't see any negative fieldNorms here... just very small positive ones? Anyway the fieldNorm is the product of the lengthNorm and the index-time boost of the field (which is itself the product of the index time boost on the document and the index time boost of all instances of that field). Index time boosts default to 1 though, so they have no effect unless something has explicitly set a boost. -Yonik http://www.lucidimagination.com On Wed, Nov 3, 2010 at 2:30 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi all, I've got some puzzling issue here. During tests i noticed a document at the bottom of the results where it should not be. I query using DisMax on title and content field and have a boost on title using qf. Out of 30 results, only two documents also have the term in the title. Using debugQuery and fl=*,score i quickly noticed large negative maxScore of the complete resultset and a portion of the resultset where scores sum up to zero because of a product with 0 (fieldNorm). 
See below for debug output for a result with score = 0:

0.0 = (MATCH) sum of:
  0.0 = (MATCH) max of:
    0.0 = (MATCH) weight(content:kunstgrasveld in 7), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      0.0 = (MATCH) fieldWeight(content:kunstgrasveld in 7), product of:
        2.236068 = tf(termFreq(content:kunstgrasveld)=5)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.0 = fieldNorm(field=content, doc=7)
    0.0 = (MATCH) fieldWeight(title:kunstgrasveld in 7), product of:
      1.0 = tf(termFreq(title:kunstgrasveld)=1)
      8.791729 = idf(docFreq=3, maxDocs=9682)
      0.0 = fieldNorm(field=title, doc=7)

And one with a negative score:

3.0716116E-4 = (MATCH) sum of:
  3.0716116E-4 = (MATCH) max of:
    3.0716116E-4 = (MATCH) weight(content:kunstgrasveld in 1462), product of:
      0.75658196 = queryWeight(content:kunstgrasveld), product of:
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        0.113743275 = queryNorm
      4.059853E-4 = (MATCH) fieldWeight(content:kunstgrasveld in 1462), product of:
        1.0 = tf(termFreq(content:kunstgrasveld)=1)
        6.6516633 = idf(docFreq=33, maxDocs=9682)
        6.1035156E-5 = fieldNorm(field=content, doc=1462)

There are no funky issues with term analysis for the text fieldType, in fact, the term passes through unchanged. I don't do omitNorms, i store termVectors etc. Because fieldNorm = fieldBoost / sqrt(numTermsForField) i suspect my input from Nutch is messed up. A fieldNorm can never be <= 0 for a normal positive boost and field boosts should not be zero or negative (correct me if i'm wrong). But, since i can't yet figure out what field boosts Nutch sends to me i thought i'd drop by on this mailing list first. There are quite a few query terms that return with zero or negative scores and many that behave as i expect. I find it also a bit hard to comprehend why the docs with negative score rank higher in the result set than documents with zero score. Sorting defaults to score DESC, but this is perhaps another issue.
Anyway, the test runs on a Solr 1.4.1 instance with Java 6 under the hood. Help or directions are appreciated =) Cheers, -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536600 / 06-50258350
Re: blacklist docs by uniqueKey
On Wed, Nov 3, 2010 at 3:05 PM, Erick Erickson erickerick...@gmail.com wrote: How dynamic is this list? Is it feasible to add a field to your docs like blacklisteddocs, and at editorial's discretion add values to that field like app1, app2? At that point you can just filter them out via a filter query... Right, or a combination of the two approaches. For a realtime approach, add the newest filters (say any filters added that day) to a filter query, and roll those into a nightly reindex. -Yonik http://www.lucidimagination.com Best Erick On Wed, Nov 3, 2010 at 2:40 PM, Ravi Kiran ravi.bhas...@gmail.com wrote: Hello, I have a single core servicing 3 different applications, one of the applications doesn't want some specific docs to show up (driven by Editorial decision). Over a period of time the amount of blacklisted docs could grow, hence I do not want to restrict them in a query as the query could get extremely large. Is there a configuration option where we can blacklist ids (uniqueKey) from showing up in results. Is there anything similar to ElevationComponent that demotes docs? This could be ideal. I tried to look up and see if there was a boosting option in elevation component so that I could negatively boost certain docs but could not find any. Can anybody kindly point me in the right direction. Thanks Ravi Kiran Bhaskar
Re: Possible memory leaks with frequent replication
On Tue, Nov 2, 2010 at 12:32 PM, Simon Wistow si...@thegestalt.org wrote: On Mon, Nov 01, 2010 at 05:42:51PM -0700, Lance Norskog said: You should query against the indexer. I'm impressed that you got 5s replication to work reliably. That's our current solution - I was just wondering if there was anything I was missing. You could also try dialing down maxWarmingSearchers to 1 - that should prevent multiple searchers warming at the same time and may be the source of your running out of memory. -Yonik http://www.lucidimagination.com
Re: big terms in UnInvertedField
2010/11/1 Koji Sekiguchi k...@r.email.ne.jp: With solr example, using facet.field=text creates UnInvertedField for the text field in fieldValueCache. After that, I saw stats page and I was surprised that counters in *filterCache* were up: Is this caused by big terms in UnInvertedField? Yes. big terms (defined as matching more than 5% of the index) are not uninverted since it's more efficient (both CPU and memory) to use the filterCache and calculate intersections. If so, when using both facet for multiValued field and facet for single valued field/facet query, it is difficult to estimate the size of filterCache. Yep. At least fieldValueCache (for UnInvertedField) tells you the number of big terms in each field you are faceting on though. -Yonik http://www.lucidimagination.com
Re: Facet count of zero
On Mon, Nov 1, 2010 at 12:55 PM, Tod listac...@gmail.com wrote: I'm trying to exclude certain facet results from a facet query. It seems to work but rather than being excluded from the facet list its returned with a count of zero. If you don't want to see 0 counts, use facet.mincount=1 http://wiki.apache.org/solr/SimpleFacetParameters -Yonik http://www.lucidimagination.com Ex: q=(-foo:bar)&facet=true&facet.field=foo&facet.sort=idx&wt=json&indent=true This returns bar with a count of zero. All the other foo's show up with valid counts. Can I do this? Is my syntax incorrect? Thanks - Tod
Re: solr 4.0 - pagination
On Sat, Oct 30, 2010 at 12:22 PM, Papp Richard ccode...@gmail.com wrote: I'm using Solr 4.0 with grouping (field collapsing), but unfortunately I can't solve the pagination. It's not implemented yet, but I'm working on that right now. -Yonik http://www.lucidimagination.com
Re: eDismax result differs from Dismax
On Fri, Oct 29, 2010 at 9:30 AM, Ryan Walker r...@recruitmilitary.com wrote: We are launching a new version of our job board helping returning veterans find a civilian job, and we chose Solr and Sunspot[1] to power our search. We really didn't consider the power users in the HR world who are trained to use boolean search, for example: Engineer AND (Electrical OR Mechanical) Sunspot supports the Dismax request handler, which unfortunately does not handle the query above properly. So we read about eDismax and that it was baked into Solr 1.5. At the same time, Sunspot has switched from LocalSolr integration to storing a geohash in a full-text searchable field. We're having some problems with some complex queries that Sunspot generates: INFO: [] webapp=/solr path=/select params={fl=+score&start=0&q=query:{!dismax+qf%3D'title_text+description_text'}Ruby+on+Rails+Developer+(location_details_s:dngythdb25fu^1.0+OR+location_details_s:dngythdb25f^0.0625+OR+location_details_s:dngythdb25*^0.00391+OR+location_details_s:dngythdb2*^0.000244+OR+location_details_s:dngythdb*^0.153+OR+location_details_s:dngythd*^0.00954+OR+location_details_s:dngyth*^0.000596+OR+location_details_s:dngyt*^0.373+OR+location_details_s:dngy*^0.0233+OR+location_details_s:dng*^0.00146)&wt=ruby&fq=type:Job&defType=edismax&rows=20} hits=1 status=0 QTime=13 Under Dismax no results are returned for this query, however, as you can see above with eDismax a result is returned -- the only difference between the two queries is 'defType=edismax' vs 'defType=dismax' That's to be expected. Dismax doesn't even support fielded queries (where you specify the fieldname in the query itself) so this clause is treated all as text: (location_details_s:dngythdb25fu^1.0 and dismax QP will be looking for tokens like location_details_s dngythdb25fu (assuming tokenization would split on the non-alphanumeric chars) in your text fields. -Yonik http://www.lucidimagination.com
Re: Custom Sorting in Solr
On Fri, Oct 29, 2010 at 3:39 PM, Ezequiel Calderara ezech...@gmail.com wrote: Hi all guys! I'm in a weird situation here. We have indexed a set of documents which are ordered using a linked list (each document has the reference of the previous and the next). Is there a way, when sorting in the solr search, to use the linked list to sort? It seems like you should be able to encode this linked list as an integer instead, and sort by that? If there are multiple linked lists in the index, it seems like you could even use the high bits of the int to designate which list the doc belongs to, and the low order bits as the order in that list. -Yonik http://www.lucidimagination.com
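The integer-encoding idea above can be sketched as follows. This is only an illustration of the bit-packing (the class and method names are hypothetical, not a Solr API), assuming at most 2^8 lists and 2^24 positions per list:

```java
// Pack a list id into the high bits and the position within that list
// into the low bits; sorting on the resulting int then orders docs
// within each list and keeps different lists apart.
public class ListSortKey {
    static final int POS_BITS = 24;  // assumed budget: 2^24 positions per list

    static int encode(int listId, int position) {
        return (listId << POS_BITS) | position;
    }

    static int listOf(int key)     { return key >>> POS_BITS; }
    static int positionOf(int key) { return key & ((1 << POS_BITS) - 1); }

    public static void main(String[] args) {
        int a = encode(1, 0);  // first doc of list 1
        int b = encode(1, 5);  // sixth doc of list 1
        int c = encode(2, 0);  // first doc of list 2
        // sorting by the key orders docs within a list, and lists never interleave
        System.out.println((a < b) + " " + (b < c));
        System.out.println(listOf(c) + " " + positionOf(b));
    }
}
```

At index time each document would get its precomputed key in a single-valued int field, and queries would simply sort on that field.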
Re: documentCache clarification
On Fri, Oct 29, 2010 at 3:49 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : This is a limitation in the SolrCache API. : The key into the cache does not contain rows, so the cache returns the : first 10 docs and increments its hit count. Then the cache user : (SolrIndexSearcher) looks at the entry and determines it can't use it. Wow, I never realized that. Why don't we just include the start & rows (modulo the window size) in the cache key? The implementation of equals() would be rather difficult... actually impossible w/o abusing the semantics. It would also be impossible w/o the Map implementation guaranteeing what object was on the LHS vs the RHS when equals was called. Unless I'm missing something obvious? -Yonik http://www.lucidimagination.com
Re: documentCache clarification
On Fri, Oct 29, 2010 at 4:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : Why don't we just include the start & rows (modulo the window size) in : the cache key? : : The implementation of equals() would be rather difficult... actually : impossible w/o abusing the semantics. : It would also be impossible w/o the Map implementation guaranteeing : what object was on the LHS vs the RHS when equals was called. : : Unless I'm missing something obvious? You've totally confused me. What I'm saying is that SolrIndexSearcher should consult the window size before consulting the cache -- the start param should be rounded down to the nearest multiple of the window size, and start+rows (ie: end) should be rounded up to one less than the nearest multiple of the window size, and then that should be looked up in the cache. That's already done. In the example, do q=*:*&rows=12 q=*:*&rows=16 and you should see a queryResultCache hit since queryResultWindowSize is 20 and both requests round up to that. *but* if you do this (with an index with more than 20 docs in it) q=*:*&rows=25 Currently that query will round up to 40, but since nResults (start+rows) isn't in the key, it will still get a cache hit but then not be usable. Now, if your proposal is to put nResults into the key, we then have a worse problem. Assume we're starting over with a clean cache. q=*:*&rows=25 // cached under a key including nResults=40 q=*:*&rows=15 // looked up under a key including nResults=20... not found! (but that's why people are supposed to pick a window size greater than the largest number of rows typically requested) Hmmm, I don't think so. If that were the case, there would be no need for two parameters (no need for queryResultWindowSize) since we would always just pick queryResultMaxDocsCached. -Yonik http://www.lucidimagination.com
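The rounding being discussed can be shown with a toy version of the arithmetic. This is a sketch of the behavior described in the thread, not the actual SolrIndexSearcher code:

```java
// The requested [start, start+rows) range is widened to multiples of
// queryResultWindowSize before the cache is consulted, so nearby page
// requests map to the same cached superset of docs.
public class WindowRounding {
    // round start down to a multiple of the window size
    static int supersetStart(int start, int window) {
        return (start / window) * window;
    }

    // round start+rows up to a multiple of the window size
    static int supersetEnd(int start, int rows, int window) {
        int end = start + rows;
        return ((end + window - 1) / window) * window;
    }

    public static void main(String[] args) {
        int window = 20;
        // rows=12 and rows=16 both round up to 20: same cached superset
        System.out.println(supersetEnd(0, 12, window));
        System.out.println(supersetEnd(0, 16, window));
        // rows=25 rounds up to 40: a different, larger superset
        System.out.println(supersetEnd(0, 25, window));
    }
}
```

With window=20, requests for rows=12 and rows=16 both map to the 0..20 superset, which is why the second request is a usable cache hit.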
Re: SolrCore.getSearcher() and postCommit()
On Fri, Oct 29, 2010 at 5:36 PM, Grant Ingersoll gsing...@apache.org wrote: Is it OK to call and increment a Searcher ref (i.e. SolrCore.getSearcher()) in a SolrEventListener.postCommit() hook as long as I decrement it when I am done? I need to get a handle on an IndexReader so I can dump out a portion of the index to an external process. Yes, just be aware that the searcher you will get will not contain the recently committed documents. If you want that, look at the newSearcher hook instead. -Yonik http://www.lucidimagination.com
Re: How to index long words with StandardTokenizerFactory?
On Sun, Oct 24, 2010 at 10:47 AM, Sergey Bartunov sbos@gmail.com wrote: I did it just as you recommended. Solr indexes files around 15kb, but no more. The same effect was with patched constants Lucene also has max token sizes it can index. IIRC, lengths used to be stored inline with the char data, and a single char was used for the length. The bigger question: Is this a problem for you (do you actually have a use case)? -Yonik http://www.lucidimagination.com
Re: How to index long words with StandardTokenizerFactory?
On Sun, Oct 24, 2010 at 11:29 AM, Sergey Bartunov sbos@gmail.com wrote: It's a kind of research. There is no particular practical use case as far as I know. Do you know how to set all these max token lengths? It's a practical limit given how things are coded, not an arbitrary one. Given the lack of use cases, it would be a mistake to complicate the code or make it less performant trying to support a larger limit. -Yonik http://www.lucidimagination.com
Re: How to index long words with StandardTokenizerFactory?
On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov sbos@gmail.com wrote: I'm trying to force solr to index words whose length is more than 255 If the field is not a text field, Solr's default analyzer is used, which currently limits the token to 256 bytes. Out of curiosity, what's your use case that you really need a single 34KB token? -Yonik http://www.lucidimagination.com
Re: Date faceting +1MONTH problem
On Fri, Sep 17, 2010 at 9:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote: the default query parser doesn't support range queries with mixed upper/lower bound inclusion. This has just been added to trunk. Things like [0 TO 100} now work. -Yonik http://www.lucidimagination.com
Re: Date faceting +1MONTH problem
On Fri, Oct 22, 2010 at 6:02 PM, Shawn Heisey s...@elyograg.org wrote: On 10/22/2010 3:01 PM, Yonik Seeley wrote: On Fri, Sep 17, 2010 at 9:51 PM, Chris Hostetter hossman_luc...@fucit.org wrote: the default query parser doesn't support range queries with mixed upper/lower bound inclusion. This has just been added to trunk. Things like [0 TO 100} now work. Awesome! Is it easily ported back to branch_3x? Between the refactoring work on the QP, and the back compat concerns, it's not trivial. -Yonik http://www.lucidimagination.com
Re: why solr is slower than lucene so much?
2010/10/21 kafka0102 kafka0...@163.com: I found the problem's cause. It's the DocSetCollector. my filter query result's size is about 300, so the DocSetCollector.getDocSet() is OpenBitSet. And 300 OpenBitSet.fastSet(doc) ops are too slow. As I said in my other response to you, that's a perfect reason why you want Solr to cache that for you (unless the filter will be different each time). -Yonik http://www.lucidimagination.com
Re: why solr search is slower than lucene so much?
Careful comparing apples to oranges ;-) For one, your lucene code doesn't retrieve stored fields. Did you try the solr request more than once (with a different q, but the same filters)? Also, by default, Solr independently caches the filters. This can be a higher up-front cost, but a win when filters are reused. If you want something closer to your lucene code, you could add all the filters to the main query and not use fq. -Yonik http://www.lucidimagination.com On Wed, Oct 20, 2010 at 7:07 AM, kafka0102 kafka0...@163.com wrote: HI. my solr search has some performance problems recently. my query is like that: q=xx&fq=fid:1&fq=atm:[int_time1 TO int_time2], fid's type is: <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/> atm's type is: <fieldType name="sint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/> my index's size is about 500M and record num is 3984274. when I use solr's SolrIndexSearcher.search(QueryResult qr, QueryCommand cmd), it costs about 70ms.
When I changed to use Lucene's API, like below: final SolrQueryRequest req = rb.req; final SolrIndexSearcher searcher = req.getSearcher(); final SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand(); final ExecuteTimeStatics timeStatics = ExecuteTimeStatics.getExecuteTimeStatics(); final ExecuteTimeUnit staticUnit = timeStatics.addExecuteTimeUnit("test2"); staticUnit.start(); final List<Query> query = cmd.getFilterList(); final BooleanQuery booleanFilter = new BooleanQuery(); for (final Query q : query) { booleanFilter.add(new BooleanClause(q, Occur.MUST)); } booleanFilter.add(new BooleanClause(cmd.getQuery(), Occur.MUST)); logger.info("q:" + query); final Sort sort = cmd.getSort(); final TopFieldDocs docs = searcher.search(booleanFilter, null, 20, sort); final StringBuilder sbBuilder = new StringBuilder(); for (final ScoreDoc doc : docs.scoreDocs) { sbBuilder.append(doc.doc + ","); } logger.info("hits:" + docs.totalHits + ",result:" + sbBuilder.toString()); staticUnit.end(); it cost only about 20ms. I'm so confused. For solr's config, I disabled the caches. For the test, I first called Lucene's, then Solr's. Maybe I should look at Solr's source more carefully. But for now, does anyone know the reason?
Re: filter query from external list of Solr unique IDs
On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom tburt...@umich.edu wrote: At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids. Yeah, I've thought about a special query parser and query to deal with this (relatively) efficiently, both from a query perspective and a memory perspective. Should be pretty quick to throw together: - comma separated list of terms (unique ids are a special case of this) - in the query, store as a single byte array for efficiency - sort the ids if they aren't already sorted - do lookups with a term enumerator and skip weighting or anything else like that - configurable caching... may, or may not want to cache this big query That's only part of the stuff you mention, but seems like it would be useful to a number of people. -Yonik http://www.lucidimagination.com
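The sorted-ids idea in the reply above can be sketched simply. This is only an illustration of the memory/lookup trade-off being proposed (one contiguous sorted array, O(log n) membership tests), not the query parser that would eventually be written:

```java
import java.util.Arrays;

// Keep the external ids as one sorted primitive array and test
// membership with binary search -- compact in memory, and cheap to
// probe once per candidate document.
public class IdSetFilter {
    private final long[] sortedIds;

    IdSetFilter(long[] ids) {
        long[] copy = ids.clone();
        Arrays.sort(copy);  // sort the ids if they aren't already sorted
        this.sortedIds = copy;
    }

    boolean contains(long id) {
        return Arrays.binarySearch(sortedIds, id) >= 0;
    }

    public static void main(String[] args) {
        IdSetFilter f = new IdSetFilter(new long[] {42, 7, 19});
        System.out.println(f.contains(19));
        System.out.println(f.contains(20));
    }
}
```

In the real feature the probe side would be a term enumerator over the unique-id field rather than individual lookups, but the data layout is the same.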
Re: facet.field :java.lang.NullPointerException
This is https://issues.apache.org/jira/browse/SOLR-2142 I'll look into it soon. -Yonik http://www.lucidimagination.com On Fri, Oct 15, 2010 at 3:12 PM, Pradeep Singh pksing...@gmail.com wrote: Faceting blows up when the field has no data. And this seems to be random. Sometimes it will work even with no data, other times not. Sometimes the error goes away if the field is set to multiValued=true (even though it's one value every time), other times it doesn't. In all cases setting facet.method to enum takes care of the problem. If this param is not set, the default leads to null pointer exception. 09:18:52,218 SEVERE [SolrCore] Exception during facet.field of xyz:java.lang.NullPointerException at java.lang.System.arraycopy(Native Method) at org.apache.lucene.util.PagedBytes.copy(PagedBytes.java:247) at org.apache.solr.request.TermIndex$1.setTerm(UnInvertedField.java:1164) at org.apache.solr.request.NumberedTermsEnum.init(UnInvertedField.java:960) at org.apache.solr.request.TermIndex$1.init(UnInvertedField.java:1151) at org.apache.solr.request.TermIndex.getEnumerator(UnInvertedField.java:1151) at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:204) at org.apache.solr.request.UnInvertedField.init(UnInvertedField.java:188) at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:911) at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:298) at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:354) at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:190) at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:210) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1323) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:337) at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240) at
Re: Which version of Solr to use?
On Thu, Oct 14, 2010 at 1:58 PM, Lukas Kahwe Smith m...@pooteeweet.org wrote: the current confusing list of branches is a result of the merge of the lucene and solr svn repositories. what baffles me is that so far the countless pleas for at least a rough roadmap or even just an explanation for why so many branches are needed There is one branch users need to be concerned about: branch_3x All 3.x releases will be made from that branch. trunk (which is technically not a branch) is 4.0 -Yonik http://www.lucidimagination.com
Re: Which version of Solr to use?
On Thu, Oct 14, 2010 at 1:50 PM, Jonathan Rochkind rochk...@jhu.edu wrote: I'm kind of confused about Solr development plans in general, highlighted by this thread. I think 1.4.1 is the latest officially stable release, yes? Why is there both a 1.5 and a 3.x, anyway? Not to mention a 4.x? Which of these will end up being a stable release? Both? From which will come the next stable release? 1.5 is pre lucene/solr merge, and is very unlikely to ever be released. 3.1 is the next lucene/solr point release (3x branch in svn) 4.0 is the next major release (trunk in svn) -Yonik http://www.lucidimagination.com
Re: Which version of Solr to use?
On Thu, Oct 14, 2010 at 2:39 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Thanks Yonik! So I gather that the 1.5 branch has essentially been abandoned, we can pretend it doesn't exist at all, it's been entirely superseded by the 3.x branch, with the changes made just for the purposes of synchronizing versions with lucene. Right. Everything marked as 1.5 in the past is in 3.1-dev and 4.0-dev. 1.5 was always just a place-holder for the next release, which could have been 2.0 if we had upgraded Lucene and changed enough stuff in Solr. So even before the Lucene/Solr merge, a 1.5 release was never really guaranteed. -Yonik http://www.lucidimagination.com
Re: Which version of Solr to use?
On Thu, Oct 14, 2010 at 2:55 PM, Mike Squire mike.squ...@gmail.com wrote: As pointed out before it would be useful to have some kind of documented road map for development, and some kind of indication of how close certain versions are to release. Such things have proven to be very unreliable in the past, due to the volunteer nature of open source. It would also require everyone agreeing up-front - which rarely happens ;-) Specifically for 3.1, everyone seems to want to do a release, and we have plenty of new features to support that. I expect it's close, but the work still needs to be done. Anyway, our new split branch_3x / trunk development model *should* allow for more frequent releases in the future, once we get things rolling. Side note: I would submit that those projects that release every few weeks add no additional value over our (currently) infrequent releases. Due to our high quality test suites and peer reviewed patches, I'd bet the stability of our nightly snapshots over some of those other projects any day! -Yonik http://www.lucidimagination.com
Re: Faceting and first letter of fields
On Thu, Oct 14, 2010 at 3:42 PM, Jonathan Rochkind rochk...@jhu.edu wrote: I believe that should work fine in Solr 1.4.1. Creating a field with just the first letter of the author is definitely the right (possibly only) way to allow faceting on the first letter of the author's name. I have very voluminous facets (few facet values, many docs in each value) like that in my app too, works fine. I get confused over the different faceting methods available in 1.4.1, and exactly when each is called for. If you see initial problems, you could try switching the facet.method and see what happens. Right - for faceting on first letter, you should probably use facet.method=enum since there will only be 26 values (assuming english/western languages). In the future, I'm hoping we can come up with a smarter way to pick the facet.method if it's not supplied. The new flex API in 4.0-dev should help out here. -Yonik http://www.lucidimagination.com
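The facet.method=enum strategy recommended above is worth seeing in miniature. This is a toy illustration of why it suits low-cardinality fields (Solr uses its own DocSet implementations and filter cache, not java.util.BitSet):

```java
import java.util.BitSet;

// With only ~26 first-letter values, it is cheap to take each term's
// set of matching docs, intersect it with the base query's docs, and
// report the size of the intersection as that term's facet count.
public class EnumFaceting {
    static int facetCount(BitSet termDocs, BitSet baseDocs) {
        BitSet intersection = (BitSet) termDocs.clone();
        intersection.and(baseDocs);          // docs matching term AND query
        return intersection.cardinality();   // the facet count for this term
    }

    public static void main(String[] args) {
        BitSet base = new BitSet();          // docs matching the base query
        base.set(0); base.set(2); base.set(3);
        BitSet startsWithA = new BitSet();   // docs whose author starts with 'a'
        startsWithA.set(2); startsWithA.set(3); startsWithA.set(5);
        System.out.println(facetCount(startsWithA, base));
    }
}
```

The cost is one intersection per distinct term, which is why the enum method is a poor fit for a field with millions of unique values but ideal for 26.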
Re: LuceneRevolution - NoSQL: A comparison
On Tue, Oct 12, 2010 at 12:11 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote: I'm pretty sure the 2nd phase to fetch doc-summaries goes directly to same server as first phase. But what if you stick a LB in between? A related point - the load balancing implementation that's part of SolrCloud (and looks like it will be committed to trunk soon), does keep track of what server it used for the first phase and uses that for subsequent phases. -Yonik http://www.lucidimagination.com
Re: Spatial search in Solr 1.5
On Wed, Oct 13, 2010 at 7:28 AM, PeterKerk vettepa...@hotmail.com wrote: Hi, Thanks for the quick reply :) I downloaded the latest version from the trunk. Got it up and running, and got the error below: Hopefully the QuickStart on the wiki all worked for you, but you only got the error when customizing your own config? Anyway, it looks like you haven't defined a _latLon dynamic field type for the lat / lon components. Here's what is in the example schema: <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/> <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/> <field name="store" type="location" indexed="true" stored="true"/> -Yonik http://www.lucidimagination.com URL: http://localhost:8983/solr/db/select/?wt=xml&indent=on&facet=true&fl=id,title,lat,lng,city&facet.field=province_raw&q=*:*&fq={!geofilt%20pt=45.15,-93.85%20sfield=geolocation%20d=5} HTTP ERROR 400 Problem accessing /solr/db/select/. Reason: undefined field geolocation_0_latLon Powered by Jetty:// My field definition is: I added this in schema.xml: <field name="geolocation" type="latLon" indexed="true" stored="true"/> <fieldType name="latLon" class="solr.LatLonType" subFieldSuffix="_latLon"/> data-config.xml: <entity name="location_geolocations" query="select (lat+','+lng) as geoloc FROM locations WHERE id='${location.id}'"> <field name="geolocation" column="geoloc" /> </entity> I looked in the schema.xml of the latest download, but it turns out there's nothing defined in that schema.xml about the latLon type either. Any suggestions what I'm doing wrong? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Spatial-search-in-Solr-1-5-tp489948p1693797.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Spatial search in Solr 1.5
On Wed, Oct 13, 2010 at 9:42 AM, PeterKerk vettepa...@hotmail.com wrote: I'm now thinking I downloaded the wrong solr zip, I tried this one: https://hudson.apache.org/hudson/job/Solr-trunk/lastSuccessfulBuild/artifact/trunk/solr/dist/apache-solr-4.0-2010-10-12_08-05-48.zip In that example schema (\apache-solr-4.0-2010-10-12_08-05-48\example\example-DIH\solr\db\conf\schema.xml) nothing is mentioned about a fieldtype of class solr.LatLonType. Ah, right - DIH has a separate schema. Blech. -Yonik http://www.lucidimagination.com
Re: Spatial search in Solr 1.5
On Wed, Oct 13, 2010 at 10:06 AM, PeterKerk vettepa...@hotmail.com wrote: haha ;) But so I DO have the right solr version? Anyways...I have added the lines you mentioned, what else can I do? The fact that the geolocation field does not show up in the results means that it's not getting added (i.e. something is probably wrong with your DIH config). -Yonik http://www.lucidimagination.com
Re: Spatial search in Solr 1.5
You may want to check the docs, which were recently updated to reflect the state of trunk: http://wiki.apache.org/solr/SpatialSearch -Yonik http://www.lucidimagination.com On Tue, Oct 12, 2010 at 7:49 PM, PeterKerk vettepa...@hotmail.com wrote: Hey Grant, Just came across this post of yours. Run a query: http://localhost:8983/solr/select/?q=_val_:recip(dist(2, store, vector(34.0232,-81.0664)),1,1,0)&fl=*,score // Note, I just updated this, it used to be point instead of vector and that was wrong. What does your suggested query actually do? I really need great circle calculation. Don't care if it's from the trunk, as long as I can have it in my projects asap :) Thanks ahead! -- View this message in context: http://lucene.472066.n3.nabble.com/Spatial-search-in-Solr-1-5-tp489948p1691361.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Spatial search in Solr 1.5
On Tue, Oct 12, 2010 at 8:07 PM, PeterKerk vettepa...@hotmail.com wrote: Ok, so does this actually say: for now you have to do calculations based on bounding box instead of great circle? I tried to make the documentation a little simpler... there's - geofilt... filters within a radius of d km (i.e. great circle distance) - bbox... filters using a bounding box - geodist... function query that yields the distance (again, great circle distance) If you point out the part of the docs you found confusing, I can try and improve it. Did you try and step through the quick start? Those links actually work! And the fact that on top of the page it says Solr4.0, does that imply I can't use this right now? Or where could I find the latest trunk for this? The wiki says If you haven't already, get a recent nightly build of Solr4.0... and links to the Solr4.0 page, which points to http://wiki.apache.org/solr/FrontPage#solr_development for nightly builds. -Yonik http://www.lucidimagination.com
Re: LuceneRevolution - NoSQL: A comparison
On Mon, Oct 11, 2010 at 8:32 PM, Peter Keegan peterlkee...@gmail.com wrote: I listened with great interest to Grant's presentation of the NoSQL comparisons/alternatives to Solr/Lucene. It sounds like the jury is still out on much of this. Here's a use case that might favor using a NoSQL alternative for storing 'stored fields' outside of Lucene. When Solr does a distributed search across shards, it does this in 2 phases (correct me if I'm wrong): 1. 1st query to get the docIds and facet counts 2. 2nd query to retrieve the stored fields of the top hits The problem here is that the index could change between (1) and (2), so it's not an atomic transaction. Yep. As I discussed with Peter at Lucene Revolution, if this feature is important to people, I think the easiest way to solve it would be via leases. During a query, a client could request a lease for a certain amount of time on whatever index version is used to generate the response. Solr would then return the index version to the client along with the response, and keep the index open for that amount of time. The client could make consistent additional requests (such as the 2nd phase of a distributed request) by requesting the same version of the index. -Yonik
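The lease idea floated above is a proposal, not an existing Solr API, but its core bookkeeping is simple enough to sketch. All names here are hypothetical; the point is only that a searcher version stays pinned until its lease expires, so a follow-up request (e.g. the second phase of a distributed query) can ask for the exact version the first phase used:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal lease registry sketch: grant() pins an index version for a
// duration; isHeld() lets a later request check whether that version
// is still available for a consistent follow-up query.
public class SearcherLeases {
    static final class Lease {
        final long expiresAtMillis;
        Lease(long expiresAtMillis) { this.expiresAtMillis = expiresAtMillis; }
    }

    private final Map<Long, Lease> leases = new ConcurrentHashMap<>();

    // called when a response is generated: keep this index version open
    void grant(long indexVersion, long durationMillis) {
        leases.put(indexVersion, new Lease(System.currentTimeMillis() + durationMillis));
    }

    // a second-phase request asks for the version the first phase used
    boolean isHeld(long indexVersion) {
        Lease l = leases.get(indexVersion);
        return l != null && l.expiresAtMillis > System.currentTimeMillis();
    }

    public static void main(String[] args) {
        SearcherLeases reg = new SearcherLeases();
        reg.grant(42L, 10_000);              // lease index version 42 for 10s
        System.out.println(reg.isHeld(42L));
        System.out.println(reg.isHeld(43L));
    }
}
```

A real implementation would also hold a reference count on the underlying searcher and release it when the lease lapses; that part is omitted here.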
Re: Upgrade to Solr 1.4, very slow at start up when loading all cores
On Fri, Oct 1, 2010 at 5:42 PM, Renee Sun renee_...@mcafee.com wrote: Hi Yonik, I attached the solrconfig.xml to you in a previous post, and we do have firstSearcher and newSearcher hook-ups. I commented them out, all 130 cores loaded up in 1 minute, same as in solr 1.3. total memory took about 1GB. Whereas in 1.3, with hook-ups, it took about 6.5GB for the same amount of data. For others' reference: here is the warming query (it's the same for newSearcher too): <listener event="firstSearcher" class="solr.QuerySenderListener"> <arr name="queries"> <lst> <str name="q">type:message</str> <str name="start">0</str> <str name="rows">10</str> <str name="sort">message_date desc</str> </lst> </arr> </listener> The sort field message_date is what will be taking up the memory. Starting with Lucene 2.9 (which is used in Solr 1.4), searching and sorting is per-segment. This is generally beneficial, but in this case I believe it is causing the extra memory usage because the same date value that would have been shared across all documents in the fieldcache is now repeated in each segment it is used in. One potential fix (that requires you to reindex) is to use the date fieldType as defined in the new 1.4 schema: <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/> This will use 8 bytes per document in your index, rather than 4 bytes per doc + an array of unique string-date values per index. Trunk (4.0-dev) is also much more efficient at storing string-based fields in the FieldCache - but that will only help you if you're comfortable with using development versions. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Local Solr, Spatial Search, and LatLonType clarification
On Thu, Sep 30, 2010 at 1:09 PM, webdev1977 webdev1...@gmail.com wrote: 1. I noticed that it said that the type of LatLongType can not be mulitvalued. Does that mean that I can not have multiple lat/lon values for one document. That means that if you want to have multiple points per document, each point must be in a different field. This often makes sense anyway, when the points have different semantics - i.e. work and home locations. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Local Solr, Spatial Search, and LatLonType clarification
On Thu, Sep 30, 2010 at 1:40 PM, webdev1977 webdev1...@gmail.com wrote: Or.. do you mean each field must have a unique name, but both be of type latLon (solr.LatLonType)? <work>x,y</work> <home>x,y</home> Yes. If the statement directly above is true (I hope that it is not), how does one dynamically create fields when adding geotags? Dynamic field types. You can configure it such that anything ending with _latlon is of type LatLonType. Perhaps we should do this in the example schema. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Local Solr, Spatial Search, and LatLonType clarification
On Thu, Sep 30, 2010 at 1:48 PM, Yonik Seeley yo...@lucidimagination.com wrote: Dynamic field types. You can configure it such that anything ending with _latlon is of type LatLonType. Perhaps we should do this in the example schema. Looks like we already have it: <dynamicField name="*_p" type="location" indexed="true" stored="true"/> So you should be able to add stuff like home_p and work_p w/o defining them ahead of time. Anything ending in _p is of type location. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Upgrade to Solr 1.4, very slow at start up when loading all cores
On Thu, Sep 30, 2010 at 10:41 AM, Renee Sun renee_...@mcafee.com wrote: Hi - I posted this problem but got no response, I guess I need to post this in the Solr-User forum. Hopefully you will help me on this. We were running Solr 1.3 for a long time, with 130 cores. We just upgraded to Solr 1.4, and now when we start Solr, it takes about 45 minutes. The catalina.log shows Solr is very slowly loading all the cores. Have you tried 1.4.1 yet? Could you open a JIRA issue for this and give whatever info you can? Info like: - do you have any warming queries configured? - do the cores have documents already, and if so, how many per core? - are you using the same schema & solrconfig, or did you upgrade? - have you tried finding out what is taking up all the memory (or all the CPU time)? -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Queries, Functions, and Params
On Tue, Sep 28, 2010 at 6:08 PM, Robert Thayer robert.tha...@bankserv.com wrote: On the http://wiki.apache.org/solr/FunctionQuery page, the following query function is listed: q={!func}add($v1,$v2)&v1=sqrt(popularity)&v2=100.0 When run against the default solr instance, the server returns the error (400): undefined field $v1. Any way to remedy this? Using version: 3.1-2010-09-28_05-53-44 The wiki page indicates this is a 4.0 feature - so you need a recent 4.0-dev build to try it out. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Conditional Function Queries
On Tue, Sep 28, 2010 at 11:33 AM, Jan Høydahl / Cominvent jan@cominvent.com wrote: Have anyone written any conditional functions yet for use in Function Queries? Nope - but it makes sense and has been on my list of things to do for a long time. -Y http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: bi-grams for common terms - any analyzers do that?
On Sat, Sep 25, 2010 at 8:21 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Huh, okay, I didn't know that #2 happened at all. Can you explain or point me to documentation to explain when it happens? I'm afraid I'm having trouble understanding if the analyzer returns more than one position back from a queryparser token (whitespace). Not entirely sure what that means. Can you give an example? It's always happened, up until recently when it's been made configurable. An example is IndexReader being split into two tokens by WordDelimiterFilter and searched as index reader (i.e. the two terms must be directly next to each other for the document to match). If the new autoGeneratePhraseQueries is off, position doesn't matter, and the query will be treated as index OR reader. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
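The IndexReader example above can be made concrete with a toy demonstration. This is not Lucene's actual analysis chain or query parser, just an illustration of the two query shapes autoGeneratePhraseQueries chooses between when one input token splits into several:

```java
// Split a token on lower->upper case transitions (as a word-delimiter
// style filter would), then render either a phrase query (positions
// matter) or an OR of the terms (positions ignored).
public class PhraseOrOr {
    static String[] splitCamelCase(String s) {
        return s.split("(?<=[a-z])(?=[A-Z])");
    }

    static String toQuery(String token, boolean autoGeneratePhrase) {
        String[] parts = splitCamelCase(token);
        if (parts.length == 1) return parts[0];
        if (autoGeneratePhrase) {
            return "\"" + String.join(" ", parts) + "\"";    // terms must be adjacent
        }
        return "(" + String.join(" OR ", parts) + ")";       // any term matches
    }

    public static void main(String[] args) {
        System.out.println(toQuery("IndexReader", true));
        System.out.println(toQuery("IndexReader", false));
    }
}
```

With the option on, "IndexReader" behaves like the phrase "Index Reader"; with it off, a document containing either term alone can match.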
Re: matches in result grouping
2010/9/23 Koji Sekiguchi k...@r.email.ne.jp: (10/09/23 18:14), Koji Sekiguchi wrote: I'm using the recently committed field collapsing / result grouping feature in trunk. I'm confused by the matches parameter in the result at the second sample output of the Wiki: http://wiki.apache.org/solr/FieldCollapsing#Quick_Start I cannot understand why there are two matches:5 entries in the result. Can anyone explain it? Probably multiple GroupCollectors are generated for each group.field, group.func and group.query, and matches can be counted per collector. Correct. The matches is the doc count before any grouping (and for group.query that means before the restriction given by group.query is applied). It won't always be the same though - for example we might implement filter excludes like we do with faceting, etc. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Range query not working
On Thu, Sep 23, 2010 at 4:30 PM, PeterKerk vettepa...@hotmail.com wrote: I have this in my query: q=*:*&facet.query=location_rating_total:[3 TO 100] And this document: <result name="response" numFound="6" start="0" maxScore="1.0"> <doc> <float name="score">1.0</float> <str name="id">1</str> <int name="location_rating_total">2</int> </doc> But still my total results equals 6 (total population) and not 0 as I would expect. Why? facet.query will give you the number of docs matching location_rating_total:[3 TO 100]; it does not restrict the result list. If you want that, you want a filter. Try q=*:*&fq=location_rating_total:[3 TO 100] -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Range query not working
On Thu, Sep 23, 2010 at 5:44 PM, Jonathan Rochkind rochk...@jhu.edu wrote: The field type in a standard schema.xml that's defined as integer is NOT sortable. Right - before 1.4. There is no integer field type in 1.4 and beyond in the example schema. You can not sort on this and get what you want. (What's the point of it even existing then, if it pretty much does the same thing as a string field?) You can sort on it... you just can't do range queries on it, because the term order isn't correct for numerics. It's there only for support of legacy lucene indexes that indexed numerics as plain strings. They are now named pint for plain integer in 1.4 and above. Perhaps we should retain support for that, but remove them from the example schema and only document them somewhere (under supporting lucene indexes built by other software or something?) -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: multiple spatial values
On Tue, Sep 21, 2010 at 12:12 PM, dan sutton danbsut...@gmail.com wrote: I was looking at the LatLonType and how it might represent multiple lon/lat values ... it looks to me like the lat would go in {latlongfield}_0_LatLon and the long in {latlongfield}_1_LatLon ... how then, if we have multiple lat/long points for a doc, do we choose the correct points when filtering, for example? e.g. if thinking in cartesian coords and we have P1(3,4), P2(6,7) ... x is stored with 3,6 and y with 4,7 ... then how does it ensure we're not erroneously picking (3,7) or (6,4) whilst filtering with the spatial query? That's why it's a single-valued field only for now... don't we have to store both values together? What am I missing here? The problem is that we don't have a way to query both values together, so we must index them separately. The basic LatLonType uses numeric queries on the lat and lon fields separately. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Version stability [was: svn branch issues]
I think we aim for a stable trunk (4.0-dev) too, as we always have (in the functional sense... i.e. operate correctly, don't crash, etc). The stability is more a reference to API stability - the Java APIs are much more likely to change on trunk. Solr's *external* APIs are much less likely to change for core services. For example, I don't see us ever changing the rows parameter or the XML update format in a non-back-compat way. Companies can (and do) go to production on trunk versions of Solr after thorough testing in their scenario (as they should do with *any* new version of Solr that isn't strictly bugfix). -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8 On Fri, Sep 17, 2010 at 10:16 AM, Mark Miller markrmil...@gmail.com wrote: The 3.x line should be pretty stable. Hopefully we will do a release soon. A conversation was again started about more frequent releases recently, and hopefully that will lead to a 3.x release near term. In any case, 3.x is the stable branch - 4.x is where the more crazy stuff happens. If you are used to the terms, 4.x is the unstable branch, though some freak out if you call it that, for fear you'll think it's 'really unstable'. In reality, it just means likely less stable than the stable branch (3.x), as we target 3.x for stability and 4.x for stickier or non-back-compat changes. Eventually 4.x will be stable and 5.x unstable, with possible maintenance support for previous stable lines as well. - Mark lucidimagination.com On 9/17/10 9:58 AM, Mark Allan wrote: OK, 1.5 won't be released, so we'll avoid that. I've now got my code additions compiling against a version of 3.x so we'll stick with that rather than solr_trunk for the time being. Does anyone have any sense of when 3.x might be considered stable enough for a release? We're hoping to go to service with something built on Solr in Jan 2011 and would like to avoid development phase software, but if needs must... 
Thanks Mark On 9 Sep 2010, at 12:10 pm, Markus Jelsma wrote: Well, it's under heavy development but the 3.x branch is more likely to become released than 1.5.x, which is highly unlikely to be ever released. On Thursday 09 September 2010 13:04:38 Mark Allan wrote: Thanks. Are you suggesting I use branch_3x and is that considered stable? Cheers Mark On 9 Sep 2010, at 10:47 am, Markus Jelsma wrote: http://svn.apache.org/repos/asf/lucene/dev/branches/
Re: Version stability [was: svn branch issues]
On Fri, Sep 17, 2010 at 10:46 AM, Mark Miller markrmil...@gmail.com wrote: I agree it's mainly API wise, but there are other issues - largely due to Lucene right now - consider the bugs that have been dug up this year on the 4.x line because flex has been such a large rewrite deep in Lucene. We wouldn't do flex on the 3.x stable line and it's taken a while for everything to shake out in 4.x (and it's prob still swaying). Right. That big difference also has implications for the 3.x line too though - possible backports of new features like field collapsing or per-segment faceting that involve the flex API would involve a good amount of re-writing (along with the introduction of new bugs). I'd put my money on 4.0-dev being actually *more* stable for these new features. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: doc into doc
On Fri, Sep 17, 2010 at 4:12 PM, facholi rfach...@gmail.com wrote: Hi, I would like a json result like that: { id:2342, name:Abracadabra, metadatas: [ {type:tag, name:tutorial}, {type:value, name:2323.434/434}, ] } Do you mean JSON with the tags not quoted (that's not legal JSON), or do you mean the metadata part? Anyway, I assume you're not asking about how to get a JSON response in general? If so, search for json here: http://lucene.apache.org/solr/tutorial.html If you're looking for something else, you'll need to be more specific. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Null Pointer Exception while indexing
On Wed, Sep 15, 2010 at 2:01 PM, andrewdps mstpa...@gmail.com wrote: I still get the same error when I try to index the mrc file... If you get the exact same error, then you are still using GCJ. When you type java it's probably going to GCJ because of your path (i.e. change it or directly specify the path to the new JVM you just installed). -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: SOLR interface with PHP using javabin?
On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com onlinespend...@gmail.com wrote: I am planning on creating a website that has some SOLR search capabilities for the users, and was also planning on using PHP for the server-side scripting. My goal is to find the most efficient way to submit search queries from the website, interface with SOLR, and display the results back on the website. If I use PHP, it seems that all the solutions use some form of character-based stream for the interface. It would seem that using a binary representation, such as javabin, would be more efficient. If using javabin, or some similar efficient binary stream to interface SOLR with PHP, is not possible, what do people recommend as the most efficient solution that provides the best performance, even if that means not using PHP and going with some other alternative? I'd recommend going with JSON - it will be quite a bit smaller than XML, and the parsers are generally quite efficient. -Yonik http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
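A wt=json response is trivial to consume from PHP (json_decode) or any other language; a minimal sketch in Python, parsing a trimmed sample of the standard Solr response envelope (the document values are invented):

```python
import json

# A trimmed example of Solr's wt=json output (field values invented)
body = """
{
  "responseHeader": {"status": 0, "QTime": 2},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{"id": "2342", "name": "Abracadabra"}]
  }
}
"""

data = json.loads(body)
docs = data["response"]["docs"]
print(data["response"]["numFound"], docs[0]["name"])
```

The same traversal - responseHeader for status, response.numFound for the hit count, response.docs for the documents - applies regardless of which client language does the parsing.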