Re: Experience with indexing billions of documents?

2010-04-14 Thread Thomas Koch
Bradford Stephens:
 Hey there,
 
 We've actually been tackling this problem at Drawn to Scale. We'd really
 like to get our hands on LuceHBase to see how it scales. Our faceting still
 needs to be done in-memory, which is kinda tricky, but it's worth
 exploring.
Hi Bradford,

thank you for your interest. Just yesterday I found out that somebody else 
apparently did exactly the same thing I did and ported Lucandra to HBase:

http://github.com/akkumar/hbasene

I'll have a look at this project and most likely abandon luceHBase in favor of 
the other, since it's more advanced.

Best regards,

Thomas Koch, http://www.koch.ro


Re: deploying nightly updates to slaves

2010-04-12 Thread Thomas Koch
Lukas Kahwe Smith:
 On 07.04.2010, at 14:24, Lukas Kahwe Smith wrote:
  For Solr the idea is also just copy the index files into a new directory
  and then use http://wiki.apache.org/solr/CoreAdmin#RELOAD after updating
  the config file (I assume its not possible to hot swap like with MySQL).
 
 Since I want to keep a local backup of the index, I guess it might be
  better to first call UNLOAD and then CREATE after having moved the
  current index data to a back dir and having moved the new index data into
  position. Now UNLOAD has the feature of continuing to serve existing
  requests. In my case I actually lock the slaves, so there shouldn't be any
  requests and if so, they do not matter anyways.
 
 I do not want to shutdown the solr server in order to not accidentally
  tick-off the monitoring. But I also want to make sure I do not corrupt the
  index (then again I am only reading anyways). But I am worried if for some
  reason there is still some request open and I do not poll via STATUS
  action to make sure the core is UNLOADed, that I might corrupt the index.
 
 regards,
 Lukas Kahwe Smith
 m...@pooteeweet.org

Hello Lukas,

it sounds as if you could just use SOLR replication out of the box. 
Replication only happens when a commit occurs on the master (or on some other 
configured trigger), so you don't waste time on unnecessary replications 
during the day.
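
A minimal sketch of what I mean, assuming Solr 1.4's Java-based replication 
(host name, port and poll interval are only placeholders):

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="confFiles">schema.xml,solrconfig.xml</str>
    </lst>
  </requestHandler>

On each slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/replication</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>

If you'd rather control the timing yourself, you can leave out pollInterval 
and trigger the slave explicitly with the handler's fetchindex command.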

Is there by any chance a possibility that you'd rather store your data in 
HBase than in MySQL? I'm currently working on a project to store SOLR/Lucene 
indices directly in HBase as well.

I'll be at the webtuesday tomorrow. Maybe I could give an introduction to 
Hadoop/HBase on a next webtuesday?

Best regards,

Thomas Koch, http://www.koch.ro


Re: Experience with indexing billions of documents?

2010-04-12 Thread Thomas Koch
Hi,

could I interest you in this project?
http://github.com/thkoch2001/lucehbase

The aim is to store the index directly in HBase, a database system modelled 
after Google's Bigtable and designed to store data in the range of terabytes 
or petabytes.

Best regards, Thomas Koch

Lance Norskog:
 The 2B limitation is within one shard, due to using a signed 32-bit
 integer. There is no limit in that regard in sharding- Distributed
 Search uses the stored unique document id rather than the internal
 docid.
 
 On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens richcari...@gmail.com wrote:
  A colleague of mine is using native Lucene + some home-grown
  patches/optimizations to index over 13B small documents in a 32-shard
  environment, which is around 406M docs per shard.
 
  If there's a 2B doc id limitation in Lucene then I assume he's patched it
  himself.
 
  On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
  My guess is that you will need to take advantage of Solr 1.5's upcoming
  cloud/cluster renovations and use multiple indexes to comfortably
  achieve those numbers. Hypothetically, in that case, you won't be limited
  by single index docid limitations of Lucene.
 
   We are currently indexing 5 million books in Solr, scaling up over the
   next few years to 20 million.  However we are using the entire book as
   a Solr document.  We are evaluating the possibility of indexing
   individual pages as there are some use cases where users want the most
   relevant pages regardless of what book they occur in.  However, we
   estimate that we
   are talking about somewhere between 1 and 6 billion pages and have
   concerns over whether Solr will scale to this level.
  
   Does anyone have experience using Solr with 1-6 billion Solr
   documents?
  
   The lucene file format document
   (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
   mentions a limit of about 2 billion document ids.   I assume this is
   the lucene internal document id and would therefore be a per index/per
   shard limit.  Is this correct?
  
  
   Tom Burton-West.
 

Thomas Koch, http://www.koch.ro


[ANN] Eclipse GIT plugin beta version released

2010-03-31 Thread Thomas Koch
GIT is one of the most popular distributed version control systems. 
In the hope that more Java developers may want to explore the world of easy 
branching, merging and patch management, I'd like to inform you that a beta 
version of the upcoming Eclipse GIT plugin is available:

http://www.infoq.com/news/2010/03/egit-released
http://aniszczyk.org/2010/03/22/the-start-of-an-adventure-egitjgit-0-7-1/

Maybe, one day, some apache / hadoop projects will use GIT... :-)

(Yes, I know git.apache.org.)

Best regards,

Thomas Koch, http://www.koch.ro


continuously creating index packages for katta with solr

2010-02-11 Thread Thomas Koch
Hi,

I'd like to use SOLR to create indices for deployment with katta. The plan is 
to install a SOLR server on each crawler; the crawling script then sends the 
content directly to the local SOLR server. Every 5-10 minutes I'd like to take 
the current SOLR index, add it to katta and let SOLR start again with an empty 
index.
Does anybody have an idea how this could be achieved?

Thanks a lot,

Thomas Koch, http://www.koch.ro


Overwriting cores with the same core name

2010-02-11 Thread Thomas Koch
Hi,

I'm currently evaluating the following solution: my crawler sends all docs to 
a SOLR core named WHATEVER. Every 5 minutes a new SOLR core with the same name 
WHATEVER is created, but with a new datadir. The datadir contains a timestamp 
in its name.
Now I can check for datadirs that are older than the newest one; all of these 
can be picked up for submission to katta.
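
In solr.xml terms the core definition would end up looking roughly like this 
after a rotation (a sketch only; the core name and path are placeholders, and 
whether the Solr version in use supports a per-core dataDir like this still 
needs to be verified):

  <cores adminPath="/admin/cores">
    <!-- assumption: per-core dataDir is supported; the timestamp sits in the
         directory name, and a CREATE with the same name swaps in the new core -->
    <core name="WHATEVER" instanceDir="whatever/"
          dataDir="/var/lib/solr/data/whatever-20100211-1430"/>
  </cores>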

Now there remain two questions:

- When the old core is closed, will there be an implicit commit?
- How can I be sure that no more work is in progress on an old core's datadir?

Thanks,

Thomas Koch, http://www.koch.ro


highlighting and external storage

2009-12-22 Thread Thomas Koch
Hi,

I'm working on a news crawler with continuous indexing. Thus indexes are 
merged frequently and older documents aren't as important as recent ones.

Therefore I'd like to store the fulltext of documents in external storage 
(HBase?) so that merging of indexes isn't as IO-intensive. This would give me 
the additional benefit that I could selectively delete the fulltext of older 
articles when running out of disc space while keeping the url of the document 
in the index.
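
On the Solr side this would roughly mean keeping only the url as a stored 
field while the article text stays indexed but unstored (a sketch; field names 
are only illustrative). I assume highlighting would then have to happen on the 
application side against text fetched from HBase, since, as far as I 
understand, Solr's highlighter needs a stored field:

  <field name="url"  type="string" indexed="true" stored="true"/>
  <!-- searchable, but the fulltext itself would live in HBase, keyed by url -->
  <field name="body" type="text"   indexed="true" stored="false"/>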

Do you know whether something like this would be possible?

Best regards,

Thomas Koch, http://www.koch.ro


Multiple default search fields or catchall field?

2009-12-08 Thread Thomas Koch
Hi,

I'm indexing feeds and the websites referenced by the feeds. So I have these 
text fields:
title - from the feed entry's title
description - from the feed entry's description
text - the website's text

When the user doesn't restrict the search to a specific field, all three 
fields should be searched, and I need highlighting. However, it should still 
be possible to search only in title or description.

- Do I need a catchall text field with content copied from all text fields 
(see the sketch below)?
- Do I need to store the content in the catchall field as well as in the 
individual fields to get highlighting in every case?
- Isn't it a big waste of disc space to store the content twice?
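
What I mean by a catchall field is roughly the following in schema.xml (a 
sketch; field and type names are only illustrative, and whether the catchall 
needs stored="true" is exactly my question):

  <field name="title"        type="text" indexed="true" stored="true"/>
  <field name="description"  type="text" indexed="true" stored="true"/>
  <field name="text"         type="text" indexed="true" stored="true"/>
  <!-- catchall: indexed for cross-field search; stored="false" is the open
       question with regard to highlighting -->
  <field name="catchalltext" type="text" indexed="true" stored="false"
         multiValued="true"/>

  <copyField source="title"       dest="catchalltext"/>
  <copyField source="description" dest="catchalltext"/>
  <copyField source="text"        dest="catchalltext"/>

  <defaultSearchField>catchalltext</defaultSearchField>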

Thanks for any help,

Thomas Koch, http://www.koch.ro


Limit of a one-server-SOLR-installation

2009-10-05 Thread Thomas Koch
Hi,

I'm running a read-only index with SOLR 1.3 on a server with 8GB RAM and the 
heap set to 6GB. The index contains 17 million documents and occupies 63GB of 
disc space with compression turned on. Replication frequency from the SOLR 
master is 5 minutes. The index should be able to support around 10 concurrent 
searches.

Now we start hitting RAM related errors like:

- java.lang.OutOfMemoryError: Java heap space or
- java.lang.OutOfMemoryError: GC overhead limit exceeded

which over time make the SOLR instance unresponsive.

Before asking for advice on how to optimize my setup, I'd kindly ask for your 
experiences with setups of this size. Is it possible to run such a large index 
on only one server? Can I support even larger indexes if I tweak my 
configuration? Where is the limit beyond which I need to split the index into 
multiple shards? When do I need to start considering a setup like/with Katta?

Thanks for your insights,

Thomas Koch, http://www.koch.ro


Re: Limit of a one-server-SOLR-installation

2009-10-05 Thread Thomas Koch
Hi Gasol Wu,

thanks for your reply. I tried to make the config and syslog shorter and more 
readable.

solrconfig.xml (shortened):

<config>
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>15</mergeFactor>
    <maxBufferedDocs>1500</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2" />

  <query>
    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="0"/>

    <queryResultCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="0"/>

    <documentCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="0"/>

    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>10</queryResultWindowSize>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>
    <boolTofilterOptimizer enabled="true" cacheSize="32" threshold=".05"/>
    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>4</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false"
                    multipartUploadLimitInKB="2048" />
  </requestDispatcher>

  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
  </requestHandler>

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
        text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
      </str>
      <str name="pf">
        text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
      </str>
      <str name="bf">
        ord(poplarity)^0.5 recip(rord(price),1,1000,1000)^0.3
      </str>
      <str name="fl">
        id,name,price,score
      </str>
      <str name="mm">
        2&lt;-1 5&lt;-2 6&lt;90%
      </str>
      <int name="ps">100</int>
      <str name="q.alt">*:*</str>
    </lst>
  </requestHandler>

  <requestHandler name="partitioned" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
      <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
      <str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>
    </lst>
    <lst name="appends">
      <str name="fq">inStock:true</str>
    </lst>
    <lst name="invariants">
      <str name="facet.field">cat</str>
      <str name="facet.field">manu_exact</str>
      <str name="facet.query">price:[* TO 500]</str>
      <str name="facet.query">price:[500 TO *]</str>
    </lst>
  </requestHandler>

  <requestHandler name="instock" class="solr.DisMaxRequestHandler">
    <str name="fq">
      inStock:true
    </str>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
  </requestHandler>

  <queryResponseWriter name="xslt"
      class="org.apache.solr.request.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">5</int>
  </queryResponseWriter>
</config>


syslog (shortened and formatted):

o.a.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8080
o.a.catalina.startup.Catalina load
INFO: Initialization processed in 416 ms
o.a.catalina.core.StandardService start
INFO: Starting service Catalina
o.a.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.20
o.a.s.servlet.SolrDispatchFilter init
INFO: SolrDispatchFilter.init()
o.a.s.core.SolrResourceLoader locateInstanceDir
INFO: Using JNDI solr.home: /usr/share/solr
o.a.s.core.CoreContainer$Initializer initialize
INFO: looking for solr.xml: /usr/share/solr/solr.xml
o.a.s.core.SolrResourceLoader init
INFO: Solr home set to '/usr/share/solr/'
o.a.s.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
o.a.s.core.SolrResourceLoader locateInstanceDir
INFO: Using JNDI solr.home: /usr/share/solr
o.a.s.core.SolrResourceLoader init
INFO: Solr home set to '/usr/share/solr/'
o.a.s.core.SolrResourceLoader createClassLoader
INFO: Reusing parent classloader
o.a.s.core.SolrConfig init
INFO: Loaded SolrConfig: solrconfig.xml
o.a.s.core.SolrCore init
INFO: Opening new SolrCore at /usr/share/solr/, 
dataDir=/var/lib/solr/data/
o.a.s.schema.IndexSchema readSchema
INFO: Reading Solr Schema
o.a.s.schema.IndexSchema readSchema
INFO: Schema name=memoarticle
o.a.s.schema.IndexSchema readSchema
INFO: default search field is catchalltext
o.a.s.schema.IndexSchema readSchema
INFO: query parser default operator is AND
o.a.s.schema.IndexSchema readSchema
INFO: unique key field: id
o.a.s.core.SolrCore init
INFO: JMX monitoring not detected 

eternal optimize interrupted

2009-08-04 Thread Thomas Koch
Hi, 

last evening we started an optimize over our solr index of 45GB. This morning 
the optimize was still running, discs spinning like crazy, and the index 
directory had grown to 83GB.
We stopped and restarted tomcat since solr was unresponsive and we needed to 
query the index.
Now I don't know what to do. How can I find out what fraction of the index has 
been optimized, and how many nights it will take to finish?

Best regards,

Thomas Koch, http://www.koch.ro