Re: Experience with indexing billions of documents?
Bradford Stephens: Hey there, we've actually been tackling this problem at Drawn to Scale. We'd really like to get our hands on LuceHBase to see how it scales. Our faceting still needs to be done in-memory, which is kinda tricky, but it's worth exploring.

Hi Bradford, thank you for your interest. Just yesterday I found out that somebody else apparently did exactly the same thing as I did and ported Lucandra to HBase: http://github.com/akkumar/hbasene I'll have a look at this project and will most likely abandon luceHBase in favor of the other one, since it's more advanced. Best regards, Thomas Koch, http://www.koch.ro
Re: deploying nightly updates to slaves
Lukas Kahwe Smith: On 07.04.2010, at 14:24, Lukas Kahwe Smith wrote: For Solr the idea is also to just copy the index files into a new directory and then use http://wiki.apache.org/solr/CoreAdmin#RELOAD after updating the config file (I assume it's not possible to hot swap like with MySQL). Since I want to keep a local backup of the index, I guess it might be better to first call UNLOAD and then CREATE, after having moved the current index data to a backup dir and having moved the new index data into position. Now UNLOAD has the feature of continuing to serve existing requests. In my case I actually lock the slaves, so there shouldn't be any requests, and if there are, they don't matter anyway. I don't want to shut down the Solr server, so as not to accidentally trip off the monitoring. But I also want to make sure I don't corrupt the index (then again, I am only reading anyway). What worries me is that if for some reason there is still a request open and I don't poll via the STATUS action to make sure the core is UNLOADed, I might corrupt the index. regards, Lukas Kahwe Smith m...@pooteeweet.org

Hello Lukas, it sounds as if you could just use Solr replication out of the box. Replication only happens when a commit occurs on the master, or on some other trigger, so you don't waste time on unnecessary replications during the day. Is there by any chance the possibility that you'd rather store your data in HBase than in MySQL? I'm working on a project right now to store Solr/Lucene indices directly in HBase too. I'll be at the Webtuesday tomorrow. Maybe I could give an introduction to Hadoop/HBase at a future Webtuesday? Best regards, Thomas Koch, http://www.koch.ro
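The UNLOAD/backup/CREATE dance described above could look roughly like this. A minimal sketch only: host, core name, and paths are made up, and the STATUS-polling loop is the part that guards against the open-request worry (the exact STATUS response format depends on your Solr version, so the grep is an approximation):

```shell
# Hypothetical sketch of the index swap; adjust host, core name and paths.
SOLR=http://localhost:8983/solr
CORE=core0

# 1. take the core out of service (existing requests keep being served)
curl -s "$SOLR/admin/cores?action=UNLOAD&core=$CORE"

# 2. poll STATUS until the core no longer appears, so no request is
#    still holding the index files open
while curl -s "$SOLR/admin/cores?action=STATUS&core=$CORE" | grep -q "$CORE"; do
  sleep 1
done

# 3. move the current index to a backup dir, the new index into place
mv /var/lib/solr/data /var/lib/solr/data.bak
mv /var/lib/solr/data.new /var/lib/solr/data

# 4. register the core again on top of the new index files
curl -s "$SOLR/admin/cores?action=CREATE&name=$CORE&instanceDir=/usr/share/solr&dataDir=/var/lib/solr/data"
```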
Re: Experience with indexing billions of documents?
Hi, could I interest you in this project? http://github.com/thkoch2001/lucehbase The aim is to store the index directly in HBase, a database system modelled after Google's Bigtable that stores data in the terabyte or petabyte range. Best regards, Thomas Koch

Lance Norskog: The 2B limitation is within one shard, due to the use of a signed 32-bit integer. There is no such limit in sharding: Distributed Search uses the stored unique document id rather than the internal docid.

On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote: A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote: My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.

We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of which book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages and have concerns over whether Solr will scale to this level. Does anyone have experience using Solr with 1-6 billion Solr documents? The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene-internal document id and would therefore be a per-index/per-shard limit. Is this correct? Tom Burton-West.
Thomas Koch, http://www.koch.ro
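Since the limit discussed above is per shard, the minimum shard count for a given corpus is a simple ceiling division. A back-of-envelope sketch using the 6 billion pages mentioned in the thread:

```shell
# Lucene's internal docid is a signed 32-bit integer, so one shard can
# hold at most 2^31 - 1 documents.
MAX_PER_SHARD=$(( (1 << 31) - 1 ))   # 2147483647
TOTAL=6000000000                     # upper estimate from the thread

# ceiling division: smallest shard count that stays under the limit
MIN_SHARDS=$(( (TOTAL + MAX_PER_SHARD - 1) / MAX_PER_SHARD ))
echo "$MIN_SHARDS"
```

In practice you would shard far below that bound for performance reasons; the 32-shard/406M-docs-per-shard setup mentioned above runs at roughly a fifth of the hard limit.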
[ANN] Eclipse GIT plugin beta version released
GIT is one of the most popular distributed version control systems. In the hope that more Java developers may want to explore the world of easy branching, merging and patch management, I'd like to let you know that a beta version of the upcoming Eclipse GIT plugin is available: http://www.infoq.com/news/2010/03/egit-released http://aniszczyk.org/2010/03/22/the-start-of-an-adventure-egitjgit-0-7-1/ Maybe, one day, some apache / hadoop projects will use GIT... :-) (Yes, I know git.apache.org.) Best regards, Thomas Koch, http://www.koch.ro
continuously creating index packages for katta with solr
Hi, I'd like to use SOLR to create indices for deployment with katta. I'd like to install a SOLR server on each crawler. The crawling script then sends the content directly to the local SOLR server. Every 5-10 minutes I'd like to take the current SOLR index, add it to katta and let SOLR start over with an empty index. Does anybody have an idea how this could be achieved? Thanks a lot, Thomas Koch, http://www.koch.ro
Overwriting cores with the same core name
Hi, I'm currently evaluating the following solution: My crawler sends all docs to a SOLR core named WHATEVER. Every 5 minutes a new SOLR core with the same name WHATEVER is created, but with a new datadir. The datadir contains a timestamp in its name. Now I can check for datadirs that are older than the newest one, and all of these can be picked up for submission to katta. Two questions remain:
- When the old core is closed, will there be an implicit commit?
- How can I be sure that no more work is in progress on an old core's datadir?
Thanks, Thomas Koch, http://www.koch.ro
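The rotation step could be sketched with the CoreAdmin CREATE action roughly as below. Host and paths are invented, and whether closing the replaced core triggers an implicit commit is exactly the open question above, so don't rely on this without testing:

```shell
# Hypothetical sketch: re-create core WHATEVER on a fresh, timestamped datadir.
TS=$(date +%Y%m%d%H%M%S)
NEWDIR=/var/lib/solr/data-$TS
mkdir -p "$NEWDIR"

# CoreAdmin CREATE with an explicit dataDir; the new core takes over the
# name WHATEVER, and the previous core's datadir stays on disk so it can
# be picked up for katta once it is known to be idle.
curl "http://localhost:8080/solr/admin/cores?action=CREATE&name=WHATEVER&instanceDir=/usr/share/solr&dataDir=$NEWDIR"
```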
highlighting and external storage
Hi, I'm working on a news crawler with continuous indexing. Thus indexes are merged frequently, and older documents aren't as important as recent ones. Therefore I'd like to store the fulltext of documents in an external storage (HBase?) so that merging of indexes isn't as IO-intensive. This would give me the additional benefit that I could selectively delete the fulltext of older articles when running out of disc space, while keeping the url of the document in the index. Do you know whether something like this would be possible? Best regards, Thomas Koch, http://www.koch.ro
Multiple default search fields or catchall field?
Hi, I'm indexing feeds and the websites referenced by the feeds. So I have these text fields:
- title - from the feed entry's title
- description - from the feed entry's description
- text - the website's text
When the user doesn't specify a default search field, all three fields should be used for the search. And I need highlighting. However, it should still be possible to search only in title or description.
- Do I need a catchall text field with content copied from all text fields?
- Do I need to store the content in the catchall field as well as in the individual fields to get highlighting in every case?
- Isn't it a big waste of hard disc space to store the content twice?
Thanks for any help, Thomas Koch, http://www.koch.ro
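A catchall via copyField might look roughly like this in schema.xml. A sketch under assumptions: the field type and stored/indexed flags are mine, and the point is that copyField only feeds the indexing side, so the catchall can stay unstored while highlighting reads the stored individual fields (i.e., the text need not be stored twice):

```xml
<!-- hypothetical schema.xml fragment; the catchall is indexed but not
     stored, so each original field keeps the only stored copy -->
<field name="title"        type="text" indexed="true" stored="true"/>
<field name="description"  type="text" indexed="true" stored="true"/>
<field name="text"         type="text" indexed="true" stored="true"/>
<field name="catchalltext" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="title"       dest="catchalltext"/>
<copyField source="description" dest="catchalltext"/>
<copyField source="text"        dest="catchalltext"/>

<defaultSearchField>catchalltext</defaultSearchField>
```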
Limit of a one-server-SOLR-installation
Hi, I'm running a read-only index with SOLR 1.3 on a server with 8GB RAM and the heap set to 6GB. The index contains 17 million documents and occupies 63GB of disc space with compression turned on. Replication frequency from the SOLR master is 5 minutes. The index should be able to support around 10 concurrent searches. Now we are starting to hit RAM-related errors like:
- java.lang.OutOfMemoryError: Java heap space or
- java.lang.OutOfMemoryError: GC overhead limit exceeded
which over time make the SOLR instance unresponsive. Before asking for advice on how to optimize my setup, I'd kindly ask for your experiences with setups of this size. Is it possible to run such a large index on only one server? Can I support even larger indexes if I tweak my configuration? Where's the limit beyond which I need to split the index into multiple shards? When do I need to start considering a setup like/with Katta? Thanks for your insights, Thomas Koch, http://www.koch.ro
Re: Limit of a one-server-SOLR-installation
Hi Gasol Wu, thanks for your reply. I tried to make the config and syslog shorter and more readable.

solrconfig.xml (shortened):

  <config>
    <indexDefaults>
      <useCompoundFile>false</useCompoundFile>
      <mergeFactor>15</mergeFactor>
      <maxBufferedDocs>1500</maxBufferedDocs>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>1</maxFieldLength>
      <writeLockTimeout>1000</writeLockTimeout>
      <commitLockTimeout>1</commitLockTimeout>
    </indexDefaults>
    <mainIndex>
      <useCompoundFile>false</useCompoundFile>
      <mergeFactor>10</mergeFactor>
      <maxBufferedDocs>1000</maxBufferedDocs>
      <maxMergeDocs>2147483647</maxMergeDocs>
      <maxFieldLength>1</maxFieldLength>
    </mainIndex>
    <updateHandler class="solr.DirectUpdateHandler2" />
    <query>
      <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
      <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
      <enableLazyFieldLoading>true</enableLazyFieldLoading>
      <queryResultWindowSize>10</queryResultWindowSize>
      <HashDocSet maxSize="3000" loadFactor="0.75"/>
      <boolTofilterOptimizer enabled="true" cacheSize="32" threshold=".05"/>
      <useColdSearcher>false</useColdSearcher>
      <maxWarmingSearchers>4</maxWarmingSearchers>
    </query>
    <requestDispatcher handleSelect="true">
      <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    </requestDispatcher>
    <requestHandler name="standard" class="solr.StandardRequestHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
      </lst>
    </requestHandler>
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <float name="tie">0.01</float>
        <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4</str>
        <str name="pf">text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9</str>
        <str name="bf">ord(poplarity)^0.5 recip(rord(price),1,1000,1000)^0.3</str>
        <str name="fl">id,name,price,score</str>
        <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
        <int name="ps">100</int>
        <str name="q.alt">*:*</str>
      </lst>
    </requestHandler>
    <requestHandler name="partitioned" class="solr.DisMaxRequestHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
        <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
        <str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>
      </lst>
      <lst name="appends">
        <str name="fq">inStock:true</str>
      </lst>
      <lst name="invariants">
        <str name="facet.field">cat</str>
        <str name="facet.field">manu_exact</str>
        <str name="facet.query">price:[* TO 500]</str>
        <str name="facet.query">price:[500 TO *]</str>
      </lst>
    </requestHandler>
    <requestHandler name="instock" class="solr.DisMaxRequestHandler">
      <str name="fq">inStock:true</str>
      <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4</str>
      <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    </requestHandler>
    <queryResponseWriter name="xslt" class="org.apache.solr.request.XSLTResponseWriter">
      <int name="xsltCacheLifetimeSeconds">5</int>
    </queryResponseWriter>
  </config>

syslog (shortened and formatted):

  o.a.coyote.http11.Http11Protocol init INFO: Initializing Coyote HTTP/1.1 on http-8080
  o.a.catalina.startup.Catalina load INFO: Initialization processed in 416 ms
  o.a.catalina.core.StandardService start INFO: Starting service Catalina
  o.a.catalina.core.StandardEngine start INFO: Starting Servlet Engine: Apache Tomcat/6.0.20
  o.a.s.servlet.SolrDispatchFilter init INFO: SolrDispatchFilter.init()
  o.a.s.core.SolrResourceLoader locateInstanceDir INFO: Using JNDI solr.home: /usr/share/solr
  o.a.s.core.CoreContainer$Initializer initialize INFO: looking for solr.xml: /usr/share/solr/solr.xml
  o.a.s.core.SolrResourceLoader init INFO: Solr home set to '/usr/share/solr/'
  o.a.s.core.SolrResourceLoader createClassLoader INFO: Reusing parent classloader
  o.a.s.core.SolrResourceLoader locateInstanceDir INFO: Using JNDI solr.home: /usr/share/solr
  o.a.s.core.SolrResourceLoader init INFO: Solr home set to '/usr/share/solr/'
  o.a.s.core.SolrResourceLoader createClassLoader INFO: Reusing parent classloader
  o.a.s.core.SolrConfig init INFO: Loaded SolrConfig: solrconfig.xml
  o.a.s.core.SolrCore init INFO: Opening new SolrCore at /usr/share/solr/, dataDir=/var/lib/solr/data/
  o.a.s.schema.IndexSchema readSchema INFO: Reading Solr Schema
  o.a.s.schema.IndexSchema readSchema INFO: Schema name=memoarticle
  o.a.s.schema.IndexSchema readSchema INFO: default search field is catchalltext
  o.a.s.schema.IndexSchema readSchema INFO: query parser default operator is AND
  o.a.s.schema.IndexSchema readSchema INFO: unique key field: id
  o.a.s.core.SolrCore init INFO: JMX monitoring not detected
eternal optimize interrupted
Hi, last evening we started an optimize of our Solr index of 45GB. This morning the optimize was still running, discs spinning like crazy, and the index directory had grown to 83GB. We stopped and restarted Tomcat, since Solr was unresponsive and we needed to query the index. Now I don't know what to do. How can I find out what fraction of the index has been optimized, and how many more nights it will take to finish? Best regards, Thomas Koch, http://www.koch.ro