Re: Can StandardTokenizerFactory work well for Chinese and English (Bilingual)?
For what it's worth, we've had good luck using the ICUTokenizer and associated filters. A native Chinese speaker here at the office gave us an enthusiastic thumbs-up on our Chinese search results. Your mileage may vary, of course.

On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson wrote:
> In a word, no. The CJK languages in general don't necessarily tokenize on whitespace, so a tokenizer that uses whitespace as its default delimiter simply won't work.
>
> Have you tried it? It seems a simple test would get you an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo wrote:
> > Hi,
> >
> > Would like to check: will StandardTokenizerFactory work well for indexing both English and Chinese (bilingual) documents, or do we need tokenizers that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> > Regards,
> > Edwin
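A minimal sketch of the analysis chain described above, using the Lucene ICU module (lucene-analyzers-icu must be on the classpath; Lucene 4.x-era API). The class name and chain details are illustrative, not a drop-in Solr fieldType:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

    public class BilingualAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // ICUTokenizer segments per script, so Chinese text is not
            // split on whitespace the way English is.
            ICUTokenizer tokenizer = new ICUTokenizer(reader);
            // Case, width, and accent folding across scripts.
            TokenStream stream = new ICUFoldingFilter(tokenizer);
            return new TokenStreamComponents(tokenizer, stream);
        }
    }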
Re: Implementing custom analyzer for multi-language stemming
Yes, each token could have a LanguageAttribute on it, just like ScriptAttributes. I didn't *think* a span would be necessary. I would also add a multivalued "lang" field to the document. Searching English documents for "die" might look like: q=die&lang=eng. The lang param could tell the RequestHandler to add a filter query fq=lang:eng to constrain the search to the English corpus, as well as recruit an English analyzer when tokenizing the "die" query term. Since I can't control text length, I would just let the language detection tool do its best and not sweat it.

On Wed, Aug 6, 2014 at 12:11 AM, TK kuros...@sonic.net wrote:
> On 8/5/14, 8:36 AM, Rich Cariens wrote:
> > Of course this is extremely primitive and basic, but I think it would be possible to write a CharFilter or TokenFilter that inspects the entire TokenStream to guess the language(s), perhaps even noting where languages change. Language and position information could be tracked, the TokenStream rewound, and then Tokens emitted with LanguageAttributes for downstream Token stemmers to deal with.
>
> I'm curious how you are planning to handle the LanguageAttribute. Would each token have this attribute denoting a span of tokens with a language? But then how would you search English documents that include the term "die" while skipping all the German documents, which are most likely to have "die"?
>
> Automatic language detection works OK for long text of the regular kind of contents, but it doesn't work well with short text. What strategy would you use to deal with short text?
>
> --
> TK
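A minimal SolrJ sketch of the query-side approach described above, assuming a multivalued "lang" field on each document and a Solr 4.x-era client; the URL and field names are illustrative:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class LangScopedSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            SolrQuery query = new SolrQuery("die"); // the user's query term
            query.addFilterQuery("lang:eng");       // constrain to the English corpus
            QueryResponse rsp = solr.query(query);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }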
Re: Implementing custom analyzer for multi-language stemming
I've started a GitHub project to try out some cross-lingual analysis ideas (https://github.com/whateverdood/cross-lingual-search). I haven't played over there for about 3 months, but plan on restarting work there shortly. In a nutshell, the interesting component (SimplePolyGlotStemmingTokenFilter) relies on ICU4J ScriptAttributes: each token is inspected for its script, i.e. Latin or Arabic, and then a ScriptStemmer recruits the appropriate stemmer to handle the token.

Of course this is extremely primitive and basic, but I think it would be possible to write a CharFilter or TokenFilter that inspects the entire TokenStream to guess the language(s), perhaps even noting where languages change. Language and position information could be tracked, the TokenStream rewound, and then Tokens emitted with LanguageAttributes for downstream Token stemmers to deal with. Or is that a crazy idea?

On Tue, Aug 5, 2014 at 12:10 AM, TK kuros...@sonic.net wrote:
> On 7/30/14, 10:47 AM, Eugene wrote:
> > Hello, fellow Solr and Lucene users and developers! In our project we receive text from users in different languages. We detect language automatically and use Google Translate APIs a lot (so having an arbitrary number of languages in our system doesn't concern us). However, we need to be able to search using stemming. Having nearly a hundred fields (several fields for each language, with language-specific stemmers) listed in our search query is not an option. So we need a way to have a single index which has stemmed tokens for different languages.
>
> Do you mean to have a Tokenizer that switches among supported languages depending on the "lang" field? This is something I thought about when I started working on Solr/Lucene, and I soon realized it is not possible because of the way Lucene is designed: the Tokenizer in an analyzer chain cannot peek at another field's value, and there is no way to control which field is processed first. If that's not what you are trying to achieve, could you tell us what it is?
>
> If you have text in different languages in a single field, and someone searches for a word common to many languages, such as "sports" (or "Lucene" for that matter), Solr will return documents in different languages, most of which the user doesn't understand. Would that be useful? If you have a special use case, would you like to share it?
>
> --
> Kuro
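A minimal sketch (not the actual SimplePolyGlotStemmingTokenFilter) of the dispatch idea described above: read each token's ICU ScriptAttribute, as populated by ICUTokenizer, and recruit a script-appropriate stemmer. The stemmer hooks are hypothetical placeholders:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import com.ibm.icu.lang.UScript;

    public final class ScriptStemmingFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);

        public ScriptStemmingFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Recruit a stemmer appropriate to the script the tokenizer detected.
            switch (scriptAtt.getCode()) {
                case UScript.LATIN:  stemLatin(termAtt);  break;
                case UScript.ARABIC: stemArabic(termAtt); break;
                default: break; // pass other scripts through unstemmed
            }
            return true;
        }

        // Hypothetical hooks - a real filter would delegate to concrete
        // stemmers (e.g. a Porter stemmer for Latin-script tokens).
        private void stemLatin(CharTermAttribute term) { }
        private void stemArabic(CharTermAttribute term) { }
    }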
Re: MMapDirectory failed to map a 23G compound index segment
My colleague and I thought the same thing - that this is an O/S configuration issue. /proc/sys/vm/max_map_count = 65536.

I honestly don't know how many segments were in the index. Our merge factor is 10 and there were around 4.4 million docs indexed. The OOME was raised when the MMapDirectory was opened, so I don't think we're reopening the reader several times. Our MMapDirectory is set to use the unmapHack.

We've since switched back to non-compound index files and are having no trouble at all.

On Tue, Sep 20, 2011 at 3:32 PM, Michael McCandless luc...@mikemccandless.com wrote:
> Since you hit OOME during mmap, I think this is an OS issue, not a JVM issue. I.e., the JVM isn't running out of memory.
>
> How many segments were in the unoptimized index? It's possible the OS rejected the mmap because of process limits. Run "cat /proc/sys/vm/max_map_count" to see how many mmaps are allowed.
>
> Or: is it possible you reopened the reader several times against the index (i.e., after committing from Solr)? If so, I think 2.9.x never unmaps the mapped areas, and so this would accumulate against the system limit.
>
> > My memory of this is a little rusty but isn't mmap also limited by mem + swap on the box? What does 'free -g' report?
>
> I don't think this should be the case; you are using a 64-bit OS/JVM, so in theory (except for OS system-wide / per-process limits imposed) you should be able to mmap up to the full 64-bit address space. Your virtual memory is unlimited (from the ulimit output), so that's good.
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens richcari...@gmail.com wrote:
> > Ahoy ahoy!
> >
> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file.
> >
> > [configuration details, ulimit output, and "Map failed" stack trace snipped; see the original message below]
Re: MMapDirectory failed to map a 23G compound index segment
Thanks. It's definitely repeatable and I may spend some time plumbing this further. I'll let the list know if I find anything.

The problem went away once I optimized the index down to a single segment using a simple IndexWriter driver. This was a bit strange, since the resulting index contained similarly large (23G) files. The JVM didn't seem to have any trouble MMap'ing those.

No, I don't need (or necessarily want) to use compound index file formats. That was actually a goof on my part which I've since corrected :).

On Fri, Sep 9, 2011 at 9:42 PM, Lance Norskog goks...@gmail.com wrote:
> I remember now: by memory-mapping one block of address space that big, the garbage collector has problems working around it. If the OOM is repeatable, you could try watching the app with jconsole and watch the memory spaces.
>
> Lance
>
> On Thu, Sep 8, 2011 at 8:58 PM, Lance Norskog goks...@gmail.com wrote:
> > Do you need to use the compound format?
> >
> > On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens richcari...@gmail.com wrote:
> > > [earlier messages in the thread quoted here - context, free -g output, ulimit output, and stack trace snipped; see the original message below]
>
> --
> Lance Norskog
> goks...@gmail.com
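A minimal sketch of the kind of simple IndexWriter driver mentioned above, using the Lucene 2.9-era API to merge an index down to a single segment; the index path is illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class OptimizeDriver {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
            IndexWriter writer = new IndexWriter(dir,
                    new StandardAnalyzer(Version.LUCENE_29),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            writer.optimize(); // merge the index down to a single segment
            writer.close();
        }
    }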
Re: MMapDirectory failed to map a 23G compound index segment
Thanks for the response. free -g reports:

                 total       used       free     shared    buffers     cached
    Mem:           141         95         46          0          0         93
    -/+ buffers/cache:          2        139
    Swap:            3          0          3

2011/9/7 François Schiettecatte fschietteca...@gmail.com
> My memory of this is a little rusty but isn't mmap also limited by mem + swap on the box? What does 'free -g' report?
>
> François
>
> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
> > Ahoy ahoy!
> >
> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file.
> >
> > [configuration details, ulimit output, and "Map failed" stack trace snipped; see the original message below]
Re: MMapDirectory failed to map a 23G compound index segment
FWIW, I optimized the index down to a single segment and now I have no trouble opening an MMapDirectory on that index, even though the 23G cfx segment file remains.

On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens richcari...@gmail.com wrote:
> Thanks for the response. free -g reports:
>
> [free -g output, earlier quotes, ulimit output, and stack trace snipped; see the messages above and the original message below]
Re: MMapDirectory failed to map a 23G compound index segment
I should add some more context:

1. the problem index included several cfs segment files that were around 4.7G, and
2. I'm running four SOLR instances on the same box, all of which have similar problem indices.

A colleague thought perhaps I was bumping up against my 256,000 open-files ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file handle/descriptor?

On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens richcari...@gmail.com wrote:
> FWIW, I optimized the index down to a single segment and now I have no trouble opening an MMapDirectory on that index, even though the 23G cfx segment file remains.
>
> On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens richcari...@gmail.com wrote:
> > [free -g output, earlier quotes, ulimit output, and stack trace snipped; see the messages above and the original message below]
MMapDirectory failed to map a 23G compound index segment
Ahoy ahoy!

I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file. The stack trace looks pretty much like every other trace I've found when searching for "OOM map failed" [1]. My configuration follows:

    Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969 https://issues.apache.org/jira/browse/SOLR-1969)
    CentOS 4.9 (Final)
    Linux 2.6.9-100.ELsmp x86_64 yada yada yada
    Java SE (build 1.6.0_21-b06)
    Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)

ulimits:

    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 1024
    max locked memory       (kbytes, -l) 32
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 256000
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    stack size              (kbytes, -s) 10240
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 1064959
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

Any suggestions? Thanks in advance,
Rich

[1]

    ...
    java.io.IOException: Map failed
        at sun.nio.ch.FileChannelImpl.map(Unknown Source)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
        at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
        at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
        at org.apache.lucene.index.SegmentReader.get(Unknown Source)
        at org.apache.lucene.index.SegmentReader.get(Unknown Source)
        at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
        at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
        at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
        at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
        at org.apache.lucene.index.IndexReader.open(Unknown Source)
    ...
    Caused by: java.lang.OutOfMemoryError: Map failed
        at sun.nio.ch.FileChannelImpl.map0(Native Method)
    ...
SSD experience
Ahoy ahoy!

Does anyone have any experiences or stories they can share with the list about how SSDs impacted search performance, for better or worse? I found a Lucene SSD performance benchmark doc (http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf) but the wiki engine is refusing to let me view the attachment (I get "You are not allowed to do AttachFile on this page.").

Thanks in advance!
Re: SSD experience
Thanks folks!

On Mon, Aug 22, 2011 at 11:13 AM, Erick Erickson erickerick...@gmail.com wrote:
> That link appears to be foo'd, and I can't find the original doc. But others (mostly on the users' list, historically) have seen very significant performance improvements with SSDs, *IF* the entire index doesn't fit in memory.
>
> If your index does fit entirely in memory, there will probably be some improvement when fetching stored fields, especially if the stored fields are large. But I'm not sure the cost is worth the incremental speed in this case. Of course, if you can get your IT folks to spring for SSDs, go for it :)
>
> Best,
> Erick
>
> On Mon, Aug 22, 2011 at 11:02 AM, Daniel Skiles daniel.ski...@docfinity.com wrote:
> > I haven't tried it with Solr yet, but with straight Lucene about two years ago we saw about a 40% boost in performance on our tests with no changes except the disk.
> >
> > On Mon, Aug 22, 2011 at 10:54 AM, Rich Cariens richcari...@gmail.com wrote:
> > > Ahoy ahoy! Does anyone have any experiences or stories they can share with the list about how SSDs impacted search performance, for better or worse? I found a Lucene SSD performance benchmark doc (http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf) but the wiki engine is refusing to let me view the attachment (I get "You are not allowed to do AttachFile on this page."). Thanks in advance!
Re: how to enable MMapDirectory in solr 1.4?
We patched our 1.4.1 build with SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969, making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts.

On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James james.d...@ingrambook.com wrote:
> If you want to try MMapDirectory with Solr 1.4, then copy the class org.apache.solr.core.MMapDirectoryFactory from 3.x or Trunk, and either add it to the .war file (you can just add it under src/java and re-package the war), or put it in its own .jar file in the lib directory under solr_home. Then, in solrconfig.xml, add this entry under the root config element:
>
>     <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory" />
>
> I'm not sure if MMapDirectory will perform better for you with Linux over NIOFSDir. I'm pretty sure in Trunk/4.0 it's the default for Windows and maybe Solaris. In Windows, there is a definite advantage to using MMapDirectory on a 64-bit system.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -----Original Message-----
> From: Li Li [mailto:fancye...@gmail.com]
> Sent: Monday, August 08, 2011 4:09 AM
> To: solr-user@lucene.apache.org
> Subject: how to enable MMapDirectory in solr 1.4?
>
> hi all,
> I read the Apache Solr 3.1 release notes today and found that MMapDirectory is now the default implementation on 64-bit systems. I am now using solr 1.4 with a 64-bit JVM on Linux. How can I use MMapDirectory? Will it improve performance?
Re: document storage
We've decided to store the original document in both Solr and external repositories. This is to support the following:

1. highlighting - We need to mark up the entire document with hit terms. However, if this were the only reason to store the text I'd seriously consider calling out to the external repository via a custom highlighter.
2. hot documents - We need to index user-generated data like activity streams, folksonomy tags, annotations, and comments. When our indexer is made aware of those events we decorate the existing SolrDocument with new fields and re-index it.
3. in-place index rebuild - Our search service is still evolving, so we periodically change our schema and indexing code. We believe it's more efficient, not to mention faster, to rebuild the index if we've got all the data.

Hope that helps!

On Fri, May 13, 2011 at 3:10 PM, Mike Sokolov soko...@ifactory.com wrote:
> Would anyone care to comment on the merits of storing indexed full-text documents in Solr versus storing them externally? It seems there are three options for us:
>
> 1) store documents both in Solr and externally - this is what we are doing now, and gives us all sorts of flexibility, but doesn't seem like the most scalable option, at least in terms of storage space and I/O required when updating/inserting documents.
>
> 2) store documents externally: For the moment, the only thing that requires us to store documents in Solr is the need to highlight them, both in search result snippets and in full document views. We are considering hunting for or writing a Highlighter extension that could pull in the document text from an external source (e.g. filesystem).
>
> 3) store documents only in Solr. We'd just retrieve document text as a Solr field value rather than reading from the filesystem. Somehow this strikes me as the wrong thing to do, but it could work; I'm not sure why. A lot of unnecessary merging I/O activity, perhaps. Makes it hard to grep the documents or use other filesystem tools, I suppose.
>
> Which one of these sounds best to you? Under which circumstances? Are there other possibilities?
>
> Thanks!
>
> --
> Michael Sokolov
> Engineering Director
> www.ifactory.com
Re: Guidance for event-driven indexing
Thanks Jan. For the JMSUpdateHandler option, how does one plug in a custom UpdateHandler? I want to make sure I'm not missing any important pieces of the Solr processing pipeline.

Best,
Rich

On Tue, Feb 15, 2011 at 4:36 AM, Jan Høydahl jan@cominvent.com wrote:
> Solr is multi-threaded, so you are free to send as many parallel update requests as needed to utilize your HW. Each request will get its own thread. Simply configure StreamingUpdateSolrServer from your client. If there is some lengthy work to be done, it needs to be done in SOME thread, and I guess you just have to choose where :) A JMSUpdateHandler sounds heavyweight, but does not need to be, and might be the logically best place for such a feature imo.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14. feb. 2011, at 17.42, Rich Cariens wrote:
> > [earlier messages in the thread quoted here - snipped; see the full exchange below]
Re: Guidance for event-driven indexing
Thanks Jan, I was referring to a custom UpdateHandler, not a RequestHandler. You know, the one that the Solr wiki discourages :).

Best,
Rich

On Tue, Feb 15, 2011 at 8:37 AM, Jan Høydahl jan@cominvent.com wrote:
> Hi,
>
> You would wire your JmsUpdateRequestHandler into solrconfig.xml as normal, and if you want to apply an UpdateChain, that would look like this:
>
>     <requestHandler name="/update/jms" class="solr.JmsUpdateRequestHandler">
>       <lst name="defaults">
>         <str name="update.processor">myPipeline</str>
>       </lst>
>     </requestHandler>
>
> See http://wiki.apache.org/solr/SolrRequestHandler for details.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 15. feb. 2011, at 14.30, Rich Cariens wrote:
> > [earlier messages in the thread quoted here - snipped; see the full exchange below]
Re: Guidance for event-driven indexing
Thanks Jan,

I don't think I want to tie up a thread on two boxes waiting for an UpdateRequestProcessor to finish. I'd prefer to offload it all to the target shards. And a special JMSUpdateHandler feels like overkill. I *think* I'm really just looking for a simple API that allows me to add a SolrInputDocument to the index in-process. Perhaps I just need to use the EmbeddedSolrServer in the Solrj packages? I'm worried that this will break all the nice stuff one gets with the standard SOLR webapp (stats, admin, etc).

Best,
Rich

On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl jan@cominvent.com wrote:
> Hi,
>
> One option would be to keep the JMS listener as today but move the downloading and transforming part to a SolrUpdateRequestProcessor on each shard. The benefit is that you ship only a tiny little SolrInputDocument over the wire with a reference to the doc to download, and do the heavy lifting on the Solr side. If each JMS topic/channel corresponds to a particular shard, you could move the whole thing to Solr. If so, a new JMSUpdateHandler could perhaps be a way to go?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14. feb. 2011, at 16.53, Rich Cariens wrote:
> > Hello,
> >
> > I've built a system that receives JMS events containing links to docs that I must download and index. Right now the JMS receiving, downloading, and transformation into SolrInputDocs happens in a separate JVM that then uses Solrj javabin HTTP POSTs to distribute these docs across many index shards.
> >
> > For various reasons I won't go into here, I'd like to relocate/deploy most of my processing (JMS receiving, downloading, and transformation into SolrInputDocs) into the SOLR webapps running on each distributed shard host. I might be wrong, but I don't think the request-driven idiom of the DataImportHandler is a good fit for me, as I'm not kicking off full or delta imports. If that's true, what's the correct way to hook my components into SOLR's update facilities? Should I try to get a reference to a configured DirectUpdateHandler?
> >
> > I don't know if this information helps, but I'll put it out there anyway: I'm using Spring 3 components to receive JMS events, wired up via webapp context hooks. My plan would be to add all that to my SOLR shard webapp.
> >
> > Best,
> > Rich
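A minimal sketch of the EmbeddedSolrServer approach floated above, using the Solr 1.4-era SolrJ API; the solr home path, core name, and field names are illustrative:

    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.core.CoreContainer;

    public class InProcessIndexer {
        public static void main(String[] args) throws Exception {
            System.setProperty("solr.solr.home", "/path/to/solr/home");
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            CoreContainer container = initializer.initialize();
            EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("text", "downloaded document body goes here");
            solr.add(doc);   // no HTTP hop - the add happens in-process
            solr.commit();
            container.shutdown();
        }
    }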
Re: Full text hit term highlighting
Thanks Lance. I'm storing the original document and indexing all its extracted content, but I need to be able to highlight the text within its original markup. I'm going to give Uwe's suggestion (http://bit.ly/hCSdYZ) a go.

On Sat, Dec 4, 2010 at 7:18 PM, Lance Norskog goks...@gmail.com wrote:
> Set the fragment length to 0. This means "highlight the entire text body". If, that is, you have stored the text body. Otherwise, you have to get the term vectors somehow and highlight the text yourself.
>
> I investigated this problem awhile back for PDFs. You can add a starting page and an OR list of search terms to the URL that loads a PDF into the in-browser version of the Adobe PDF reader. This allows you to load the PDF at the first occurrence of any of the search terms, with the terms highlighted. The search button takes you to the next of any of the terms.
>
> On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens richcari...@gmail.com wrote:
> > Anyone ever use Solr to present a view of a document with hit-terms highlighted within? Kind of like Google's "cached" (http://bit.ly/hgudWq) copies?
>
> --
> Lance Norskog
> goks...@gmail.com
Re: Full text hit term highlighting
Uwe goes on to say:

> This works, as long as you don't need query highlighting, because the offsets from the first field addition cannot be used for highlighting inside the text with markup. *In this case, you have to write your own analyzer that removes the markup in the tokenizer, but preserves the original offsets.* Examples of this are e.g. the Wikipedia contrib in Lucene, which has a hand-crafted analyzer that can handle MediaWiki markup syntax.

On Sun, Dec 5, 2010 at 3:35 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
> That suggestion says "This works, as long as you don't need query highlighting." Have you found a way around that, or have you decided not to use highlighting after all? Or am I missing something?
>
> ________________________________________
> From: Rich Cariens [richcari...@gmail.com]
> Sent: Sunday, December 05, 2010 10:58 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Full text hit term highlighting
>
> [earlier messages in the thread quoted here - snipped; see the exchange above]
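A minimal sketch of the kind of analyzer Uwe describes, assuming the Lucene 4.x-era API (package locations vary by version): a CharFilter strips the markup ahead of the tokenizer while correcting offsets, so token offsets still point into the original marked-up text and can drive highlighting within it:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class MarkupPreservingAnalyzer extends Analyzer {
        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            // HTMLStripCharFilter removes tags but corrects offsets, so
            // downstream tokens map back to the marked-up source.
            return new HTMLStripCharFilter(reader);
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return new TokenStreamComponents(new StandardTokenizer(Version.LUCENE_40, reader));
        }
    }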
Full text hit term highlighting
Anyone ever use Solr to present a view of a document with hit-terms highlighted within? Kind of like Google's "cached" (http://bit.ly/hgudWq) copies?
Re: Optimize Index
For what it's worth, the Solr class instructor at the Lucene Revolution conference recommended *against* optimizing, and instead suggested just letting the merge factor do its job.

On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey s...@elyograg.org wrote:
> On 11/4/2010 7:22 AM, stockiii wrote:
> > how can i start an optimize by using DIH, but NOT after an delta- or full-import?
>
> I'm not aware of a way to do this with DIH, though there might be something I'm not aware of. You can do it with an HTTP POST. Here's how to do it with curl:
>
>     /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
>         -H "Content-Type: text/xml" \
>         --data-binary '<optimize waitFlush="true" waitSearcher="true"/>'
>
> Shawn
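For completeness, a minimal SolrJ equivalent of the curl command quoted above (Solr 1.4-era API; the URL is illustrative):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class OptimizeViaSolrj {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            solr.optimize(true, true); // waitFlush, waitSearcher
        }
    }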
Re: StreamingUpdateSolrServer hangs
I experienced the hang described with the Solr 1.4.0 build. Yonik - I also thought the streaming updater was blocking on commits, but updates never resumed. To be honest I was in a bit of a rush to meet a deadline, so after spending a day or so tinkering I bailed out and just wrote a component by hand. I have not tried to reproduce this using the current trunk. I was using the 32-bit Sun JRE on a Red Hat EL 5 HP server.

I'm not sure if the following enriches this thread, but I'll include it anyways: write a document generator and start adding a ton of 'em to a Solr server instance using the streaming updater. You *should* experience the hang.

HTH,
Rich

On Fri, Apr 16, 2010 at 1:34 PM, Sascha Szott sz...@zib.de wrote:
> Hi Yonik,
>
> thanks for your fast reply.
>
> Yonik Seeley wrote:
> > Thanks for the report Sascha. So after the hang, it never recovers? Some amount of hanging could be visible if there was a commit on the Solr server or something else to cause the solr requests to block for a while... but it should return to normal on its own...
>
> In my case the whole application hangs and never recovers (CPU utilization goes down to near 0%). Interestingly, the problem reproducibly occurs only if SUSS is created with *more than 2* threads.
>
> > Looking at the stack trace, it looks like threads are blocked waiting to get an http connection.
>
> I forgot to mention that my index app has exclusive access to the Solr instance. Therefore, concurrent searches against the same Solr instance while indexing are excluded.
>
> > I'm traveling all next week, but I'll open a JIRA issue for this now.
>
> Thank you!
>
> > Anything that would help us reproduce this is much appreciated.
>
> Are there any others who have experienced the same problem?
>
> -Sascha
>
> > On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szott sz...@zib.de wrote:
> > > Hi Yonik,
> > >
> > > Yonik Seeley wrote:
> > > > Stephen, were you running stock Solr 1.4, or did you apply any of the SolrJ patches? I'm trying to figure out if anyone still has any problems, or if this was fixed with SOLR-1711.
> > >
> > > I'm using the latest trunk version (rev. 934846) and constantly running into the same problem. I'm using StreamingUpdateSolrServer with 3 threads and a queue size of 20 (not really knowing if this configuration is optimal). My multi-threaded application indexes 200k data items (bibliographic metadata in Dublin Core format) and constantly hangs after running for some time.
> > > Below you can find the thread dump of one of my index threads (after the app hangs, all dumps are the same):
> > >
> > >     "thread 19" prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on condition [0x42d05000]
> > >        java.lang.Thread.State: WAITING (parking)
> > >         at sun.misc.Unsafe.park(Native Method)
> > >         - parking to wait for <0x7fe8cdcb7598> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> > >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
> > >         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
> > >         at org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
> > >         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
> > >         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
> > >         at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
> > >         at de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
> > >         at de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)
> > >         at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
> > >         at de.kobv.ked.rss.RssThread.run(RssThread.java:58)
> > >
> > > and of the three SUSS threads:
> > >
> > >     "pool-1-thread-3" prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in Object.wait() [0x409ac000]
> > >        java.lang.Thread.State: WAITING (on object monitor)
> > >         at java.lang.Object.wait(Native Method)
> > >         - waiting on <0x7fe8cdcb6f10> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
> > >         at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
> > >         - locked <0x7fe8cdcb6f10> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
> > >         at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
> > >         at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
> > >         at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
> > >         at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
> > >         at
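A minimal sketch of the reproduction recipe suggested above: generate synthetic documents and pour them through StreamingUpdateSolrServer (Solr 1.4-era API). The URL, queue size, thread count, and document count are illustrative:

    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class HangReproducer {
        public static void main(String[] args) throws Exception {
            // queueSize=20, threadCount=3 - the configuration Sascha reported
            StreamingUpdateSolrServer solr =
                    new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 3);
            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("text", "synthetic document body " + i);
                solr.add(doc);
            }
            solr.commit();
        }
    }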
Index transaction log or equivalent?
Are there any best practices or built-in support for keeping track of what's been indexed in a Solr application, so as to support a full rebuild? I'm not indexing from a single source, but from many, sometimes arbitrary, sources including:

1. A document repository that fires events (containing a URL) when new documents are added to the repo;
2. A book-marking service that fires events containing URLs when users of that service bookmark a URL;
3. More services that raise events that make Solr update docs indexed via (1) or (2) with additional metadata (think user comments, tagging, etc).

I'm looking at ~200M documents for the initial launch, with around 30K new docs every day, and many thousands of metadata events every day. Do any of you Solr gurus have any suggestions or guidance you can share with me?

Thanks in advance,
Rich
Re: Index transaction log or equivalent?
Thanks Mark. That's sort of what I was thinking of doing.

On Thu, Apr 8, 2010 at 10:33 AM, Mark Miller markrmil...@gmail.com wrote:
> On 04/08/2010 09:23 AM, Rich Cariens wrote:
> > [original question quoted here - snipped; see the original message above]
>
> Pump everything through an UpdateProcessor that writes out SolrXML as docs go by?
>
> --
> - Mark
> http://www.lucidimagination.com
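A minimal sketch of Mark's suggestion, assuming the Solr 1.4-era UpdateRequestProcessor API: render each incoming document as SolrXML and append it to a log that can be replayed for a full rebuild. The log-writing helper is a hypothetical placeholder:

    import java.io.IOException;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class SolrXmlLoggingProcessor extends UpdateRequestProcessor {
        public SolrXmlLoggingProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            // ClientUtils.toXML renders the SolrInputDocument as SolrXML.
            String xml = ClientUtils.toXML(cmd.solrDoc);
            appendToTransactionLog(xml); // hypothetical append-only writer
            super.processAdd(cmd);       // pass the doc down the chain
        }

        private void appendToTransactionLog(String xml) {
            // e.g. append to a dated file, or publish to a message queue
        }
    }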
Re: an OR filter query
Why not just make your mature:false filter query a default value instead of always appended? I.e.:

    <lst name="defaults">
      <str name="fq">mature:false</str>
    </lst>

That way, if someone wants mature items in their results, the search client explicitly sets fq=mature:* or whatever. Would that work?

On Sun, Apr 4, 2010 at 3:27 PM, Blargy zman...@hotmail.com wrote:
> Is there any way to use a filter query as an OR clause? For example, I have product listings and I want to be able to filter out mature items by default. To do this I added:
>
>     <lst name="appends">
>       <str name="fq">mature:false</str>
>     </lst>
>
> But then I can never return any mature items, because appending fq=mature:true will obviously return 0 results - no item can be both mature and non-mature. I can get around this using defaults:
>
>     <lst name="defaults">
>       <str name="fq">mature:false</str>
>     </lst>
>
> But this is a little hacky because any time I want to include mature items with non-mature items I need to explicitly set fq as a blank string. So is there any better way to do this? Thanks
>
> --
> View this message in context: http://n3.nabble.com/an-OR-filter-query-tp696579p696579.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Experience with indexing billions of documents?
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc-id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:
> My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of what book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages, and have concerns over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents? The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene internal document id and would therefore be a per-index/per-shard limit. Is this correct?
> >
> > Tom Burton-West