Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Rich Cariens
For what it's worth, we've had good luck using the ICUTokenizer and
associated filters. A native Chinese speaker here at the office gave us an
enthusiastic thumbs up on our Chinese search results. Your mileage may vary
of course.
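If you want to experiment with that chain outside of Solr's schema.xml, a
minimal Lucene-level sketch might look like the following (assumes the
lucene-analyzers-icu module is on the classpath; the class name is made up).
In schema.xml the equivalent chain is ICUTokenizerFactory followed by
ICUFoldingFilterFactory.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Script-aware analysis for mixed Chinese/English text.
public class BilingualAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    // ICUTokenizer segments per Unicode script, so Han text is not split on whitespace alone
    Tokenizer source = new ICUTokenizer();
    // ICUFoldingFilter adds case folding, width folding, and diacritic removal
    TokenStream sink = new ICUFoldingFilter(source);
    return new TokenStreamComponents(source, sink);
  }
}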

On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson 
wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace so using a tokenizer
> that uses whitespace as its default tokenizer simply won't
> work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo  >
> wrote:
>
> > Hi,
> >
> > Would like to check, will StandardTokenizerFactory work well for
> indexing
> > both English and Chinese (Bilingual) documents, or do we need tokenizers
> > that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> >
> > Regards,
> > Edwin
> >
>


Re: Implementing custom analyzer for multi-language stemming

2014-08-06 Thread Rich Cariens
Yes, each token could have a LanguageAttribute on it, just like
ScriptAttributes. I didn't *think* a span would be necessary.

I would also add a multivalued "lang" field to the document. Searching
English documents for "die" might look like: q=die&lang=eng. The lang
param could tell the RequestHandler to add a filter query fq=lang:eng to
constrain the search to the English corpus, as well as recruit an English
analyzer when tokenizing the "die" query term.

Since I can't control text length, I would just let the language detection
tool do its best and not sweat it.
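To make the query side concrete, here is a rough SolrJ sketch of that flow
(the field name "lang" and the query term are purely illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class LangFilteredSearch {
  public static void main(String[] args) {
    // Search English documents for "die"; the filter query keeps the German corpus out
    SolrQuery query = new SolrQuery("die");
    query.addFilterQuery("lang:eng");
    // A request handler could append this fq itself whenever it sees lang=eng
    System.out.println(query);
  }
}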


On Wed, Aug 6, 2014 at 12:11 AM, TK kuros...@sonic.net wrote:


 On 8/5/14, 8:36 AM, Rich Cariens wrote:

 Of course this is extremely primitive and basic, but I think it would be
 possible to write a CharFilter or TokenFilter that inspects the entire
 TokenStream to guess the language(s), perhaps even noting where languages
 change. Language and position information could be tracked, the
 TokenStream
 rewound and then Tokens emitted with LanguageAttributes for downstream
 Token stemmers to deal with.

  I'm curious how you are planning to handle the LanguageAttribute.
 Would each token have this attribute denoting a span of Tokens
 with a language? But then how would you search
 English documents that include the term "die" while skipping
 all the German documents, which are most likely to contain "die"?

 Automatic language detection works OK for long text with
 regular kinds of content.  But it doesn't work well with short
 text. What strategy would you use to deal with short text?

 --
 TK




Re: Implementing custom analyzer for multi-language stemming

2014-08-05 Thread Rich Cariens
I've started a GitHub project to try out some cross-lingual analysis ideas (
https://github.com/whateverdood/cross-lingual-search). I haven't played
over there for about 3 months, but plan on restarting work there shortly.
In a nutshell, the interesting component
(SimplePolyGlotStemmingTokenFilter) relies on ICU4J ScriptAttributes:
each token is inspected for its script, e.g. Latin or Arabic, and then
a ScriptStemmer recruits the appropriate stemmer to handle the token.

Of course this is extremely primitive and basic, but I think it would be
possible to write a CharFilter or TokenFilter that inspects the entire
TokenStream to guess the language(s), perhaps even noting where languages
change. Language and position information could be tracked, the TokenStream
rewound and then Tokens emitted with LanguageAttributes for downstream
Token stemmers to deal with.

Or is that a crazy idea?
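To make it a bit more concrete, here is a stripped-down sketch of a
script-switching filter. Only Latin is wired up, with Snowball's EnglishStemmer
standing in for a real per-script stemmer registry, and it assumes ICUTokenizer
upstream so the ScriptAttribute is actually populated. This is not the
SimplePolyGlotStemmingTokenFilter from the repo, just the shape of the idea:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

import com.ibm.icu.lang.UScript;

public final class PolyglotStemmingFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
  private final EnglishStemmer latinStemmer = new EnglishStemmer();

  public PolyglotStemmingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Recruit a stemmer based on the script ICUTokenizer detected for this token
    if (scriptAtt.getCode() == UScript.LATIN) {
      latinStemmer.setCurrent(termAtt.toString());
      if (latinStemmer.stem()) {
        termAtt.setEmpty().append(latinStemmer.getCurrent());
      }
    }
    // Other scripts (Arabic, Cyrillic, ...) would dispatch to their own stemmers here
    return true;
  }
}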


On Tue, Aug 5, 2014 at 12:10 AM, TK kuros...@sonic.net wrote:

 On 7/30/14, 10:47 AM, Eugene wrote:

  Hello, fellow Solr and Lucene users and developers!

  In our project we receive text from users in different languages. We
 detect language automatically and use Google Translate APIs a lot (so
 having arbitrary number of languages in our system doesn't concern us).
 However we need to be able to search using stemming. Having nearly hundred
 of fields (several fields for each language with language-specific
 stemmers) listed in our search query is not an option. So we need a way to
 have a single index which has stemmed tokens for different languages.


 Do you mean to have a Tokenizer that switches among supported languages
 depending on the lang field? This is something I thought about when I
 started working on Solr/Lucene and soon I realized it is not possible
 because
 of the way Lucene is designed: the Tokenizer in an analyzer chain cannot
 peek
 at another field's value, and there is no way to control which field is processed
 first.

 If that's not what you are trying to achieve, could you tell us what
 it is? If you have different language text in a single field, and if
 someone searches for a word common to many languages,
 such as "sports" (or "Lucene" for that matter), Solr will return
 the documents of different languages, most of which the user
 doesn't understand. Would that be useful? If you have
 a special use case, would you like to share it?

 --
 Kuro



Re: MMapDirectory failed to map a 23G compound index segment

2011-09-30 Thread Rich Cariens
My colleague and I thought the same thing - that this is an O/S
configuration issue.

/proc/sys/vm/max_map_count = 65536

I honestly don't know how many segments were in the index. Our merge factor
is 10 and there were around 4.4 million docs indexed. The OOME was raised
when the MMapDirectory was opened, so I don't think we're reopening the
reader several times. Our MMapDirectory is set to use the unmapHack.

We've since switched back to non-compound index files and are having no
trouble at all.
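For reference, the unmapHack setup I mentioned looks roughly like this against
the Lucene 2.9/3.x API (the index path is illustrative):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.MMapDirectory;

public class OpenMappedIndex {
  public static void main(String[] args) throws Exception {
    MMapDirectory dir = new MMapDirectory(new File("/var/solr/data/index"));
    if (MMapDirectory.UNMAP_SUPPORTED) {
      // Unmap buffers when inputs close instead of waiting for GC, so stale
      // mappings don't pile up against /proc/sys/vm/max_map_count
      dir.setUseUnmap(true);
    }
    IndexReader reader = IndexReader.open(dir, true); // read-only reader
    System.out.println("maxDoc=" + reader.maxDoc());
    reader.close();
    dir.close();
  }
}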

On Tue, Sep 20, 2011 at 3:32 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 Since you hit OOME during mmap, I think this is an OS issue not a JVM
 issue.  Ie, the JVM isn't running out of memory.

 How many segments were in the unoptimized index?  It's possible the OS
 rejected the mmap because of process limits.  Run cat
 /proc/sys/vm/max_map_count to see how many mmaps are allowed.

 Or: is it possible you reopened the reader several times against the
 index (ie, after committing from Solr)?  If so, I think 2.9.x never
 unmaps the mapped areas, and so this would accumulate against the
 system limit.

  My memory of this is a little rusty but isn't mmap also limited by mem +
 swap on the box? What does 'free -g' report?

 I don't think this should be the case; you are using a 64 bit OS/JVM
 so in theory (except for OS system wide / per-process limits imposed)
 you should be able to mmap up to the full 64 bit address space.

 Your virtual memory is unlimited (from ulimit output), so that's good.

 Mike McCandless

 http://blog.mikemccandless.com

 On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens richcari...@gmail.com
 wrote:
  Ahoy ahoy!
 
  I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
 compound
  index segment file. The stack trace looks pretty much like every other
 trace
  I've found when searching for OOM  map failed[1]. My configuration
  follows:
 
  Solr 1.4.1/Lucene 2.9.3 (plus
  SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
  )
  CentOS 4.9 (Final)
  Linux 2.6.9-100.ELsmp x86_64 yada yada yada
  Java SE (build 1.6.0_21-b06)
  Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
  ulimits:
 core file size (blocks, -c) 0
 data seg size(kbytes, -d) unlimited
 file size (blocks, -f) unlimited
 pending signals(-i) 1024
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files(-n) 256000
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size(kbytes, -s) 10240
 cpu time(seconds, -t) unlimited
 max user processes (-u) 1064959
 virtual memory(kbytes, -v) unlimited
 file locks(-x) unlimited
 
  Any suggestions?
 
  Thanks in advance,
  Rich
 
  [1]
  ...
  java.io.IOException: Map failed
   at sun.nio.ch.FileChannelImpl.map(Unknown Source)
   at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
   at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
   at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
   at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown
 Source)
 
   at org.apache.lucene.index.SegmentReader.get(Unknown Source)
   at org.apache.lucene.index.SegmentReader.get(Unknown Source)
   at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
   at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown
 Source)
   at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
   at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
  Source)
   at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
   at org.apache.lucene.index.IndexReader.open(Unknown Source)
  ...
  Caused by: java.lang.OutOfMemoryError: Map failed
   at sun.nio.ch.FileChannelImpl.map0(Native Method)
  ...
 



Re: MMapDirectory failed to map a 23G compound index segment

2011-09-12 Thread Rich Cariens
Thanks. It's definitely repeatable and I may spend some time plumbing this
further. I'll let the list know if I find anything.

The problem went away once I optimized the index down to a single segment
using a simple IndexWriter driver. This was a bit strange since the
resulting index contained similarly large (> 23G) files. The JVM didn't seem
to have any trouble MMap'ing those.

No, I don't need (or necessarily want) to use compound index file formats.
That was actually a goof on my part which I've since corrected :).

On Fri, Sep 9, 2011 at 9:42 PM, Lance Norskog goks...@gmail.com wrote:

 I remember now: by memory-mapping one block of address space that big, the
 garbage collector has problems working around it. If the OOM is repeatable,
 you could try watching the app with jconsole and watch the memory spaces.

 Lance

 On Thu, Sep 8, 2011 at 8:58 PM, Lance Norskog goks...@gmail.com wrote:

  Do you need to use the compound format?
 
  On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens richcari...@gmail.com
 wrote:
 
  I should add some more context:
 
1. the problem index included several cfs segment files that were
 around
4.7G, and
2. I'm running four SOLR instances on the same box, all of which have
  similar problem indices.
 
  A colleague thought perhaps I was bumping up against my 256,000 open
 files
  ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
  handle/descriptor?
 
  On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens richcari...@gmail.com
  wrote:
 
   FWiW I optimized the index down to a single segment and now I have no
   trouble opening an MMapDirectory on that index, even though the 23G
 cfx
   segment file remains.
  
  
   On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens richcari...@gmail.com
  wrote:
  
   Thanks for the response. free -g reports:
  
                 total       used       free     shared    buffers     cached
    Mem:           141         95         46          0          0         93
    -/+ buffers/cache:           2        139
    Swap:            3          0          3
  
   2011/9/7 François Schiettecatte fschietteca...@gmail.com
  
   My memory of this is a little rusty but isn't mmap also limited by
 mem
  +
   swap on the box? What does 'free -g' report?
  
   François
  
   On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
  
Ahoy ahoy!
   
I've run into the dreaded OOM error with MMapDirectory on a 23G
 cfs
   compound
index segment file. The stack trace looks pretty much like every
  other
   trace
I've found when searching for OOM  map failed[1]. My
  configuration
follows:
   
Solr 1.4.1/Lucene 2.9.3 (plus
SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
)
CentOS 4.9 (Final)
Linux 2.6.9-100.ELsmp x86_64 yada yada yada
Java SE (build 1.6.0_21-b06)
Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
ulimits:
   core file size (blocks, -c) 0
   data seg size(kbytes, -d) unlimited
   file size (blocks, -f) unlimited
   pending signals(-i) 1024
   max locked memory (kbytes, -l) 32
   max memory size (kbytes, -m) unlimited
   open files(-n) 256000
   pipe size (512 bytes, -p) 8
   POSIX message queues (bytes, -q) 819200
   stack size(kbytes, -s) 10240
   cpu time(seconds, -t) unlimited
   max user processes (-u) 1064959
   virtual memory(kbytes, -v) unlimited
   file locks(-x) unlimited
   
Any suggestions?
   
Thanks in advance,
Rich
   
[1]
...
java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(Unknown Source)
at
  org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
Source)
at
  org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
Source)
at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
at
 org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown
   Source)
   
at org.apache.lucene.index.SegmentReader.get(Unknown Source)
at org.apache.lucene.index.SegmentReader.get(Unknown Source)
at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown
   Source)
at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown
 Source)
at
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
Source)
at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
at org.apache.lucene.index.IndexReader.open(Unknown Source)
...
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
...
  
  
  
  
 
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 
 


 --
 Lance Norskog
 goks...@gmail.com



Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
Thanks for the response. free -g reports:

             total       used       free     shared    buffers     cached
Mem:           141         95         46          0          0         93
-/+ buffers/cache:           2        139
Swap:            3          0          3

2011/9/7 François Schiettecatte fschietteca...@gmail.com

 My memory of this is a little rusty but isn't mmap also limited by mem +
 swap on the box? What does 'free -g' report?

 François

 On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:

  Ahoy ahoy!
 
  I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
 compound
  index segment file. The stack trace looks pretty much like every other
 trace
  I've found when searching for OOM  map failed[1]. My configuration
  follows:
 
  Solr 1.4.1/Lucene 2.9.3 (plus
  SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
  )
  CentOS 4.9 (Final)
  Linux 2.6.9-100.ELsmp x86_64 yada yada yada
  Java SE (build 1.6.0_21-b06)
  Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
  ulimits:
 core file size (blocks, -c) 0
 data seg size(kbytes, -d) unlimited
 file size (blocks, -f) unlimited
 pending signals(-i) 1024
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files(-n) 256000
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size(kbytes, -s) 10240
 cpu time(seconds, -t) unlimited
 max user processes (-u) 1064959
 virtual memory(kbytes, -v) unlimited
 file locks(-x) unlimited
 
  Any suggestions?
 
  Thanks in advance,
  Rich
 
  [1]
  ...
  java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map(Unknown Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
  at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown
 Source)
 
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
  Source)
  at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
  at org.apache.lucene.index.IndexReader.open(Unknown Source)
  ...
  Caused by: java.lang.OutOfMemoryError: Map failed
  at sun.nio.ch.FileChannelImpl.map0(Native Method)
  ...




Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
FWiW I optimized the index down to a single segment and now I have no
trouble opening an MMapDirectory on that index, even though the 23G cfx
segment file remains.

On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens richcari...@gmail.com wrote:

 Thanks for the response. free -g reports:

              total       used       free     shared    buffers     cached
 Mem:           141         95         46          0          0         93
 -/+ buffers/cache:           2        139
 Swap:            3          0          3

 2011/9/7 François Schiettecatte fschietteca...@gmail.com

 My memory of this is a little rusty but isn't mmap also limited by mem +
 swap on the box? What does 'free -g' report?

 François

 On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:

  Ahoy ahoy!
 
  I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
 compound
  index segment file. The stack trace looks pretty much like every other
 trace
  I've found when searching for OOM  map failed[1]. My configuration
  follows:
 
  Solr 1.4.1/Lucene 2.9.3 (plus
  SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
  )
  CentOS 4.9 (Final)
  Linux 2.6.9-100.ELsmp x86_64 yada yada yada
  Java SE (build 1.6.0_21-b06)
  Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
  ulimits:
 core file size (blocks, -c) 0
 data seg size(kbytes, -d) unlimited
 file size (blocks, -f) unlimited
 pending signals(-i) 1024
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files(-n) 256000
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size(kbytes, -s) 10240
 cpu time(seconds, -t) unlimited
 max user processes (-u) 1064959
 virtual memory(kbytes, -v) unlimited
 file locks(-x) unlimited
 
  Any suggestions?
 
  Thanks in advance,
  Rich
 
  [1]
  ...
  java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map(Unknown Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
  at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown
 Source)
 
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown
 Source)
  at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
  Source)
  at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
  at org.apache.lucene.index.IndexReader.open(Unknown Source)
  ...
  Caused by: java.lang.OutOfMemoryError: Map failed
  at sun.nio.ch.FileChannelImpl.map0(Native Method)
  ...





Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
I should add some more context:

   1. the problem index included several cfs segment files that were around
   4.7G, and
   2. I'm running four SOLR instances on the same box, all of which have
   similar problem indices.

A colleague thought perhaps I was bumping up against my 256,000 open files
ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
handle/descriptor?

On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens richcari...@gmail.com wrote:

 FWiW I optimized the index down to a single segment and now I have no
 trouble opening an MMapDirectory on that index, even though the 23G cfx
 segment file remains.


 On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens richcari...@gmail.comwrote:

 Thanks for the response. free -g reports:

              total       used       free     shared    buffers     cached
 Mem:           141         95         46          0          0         93
 -/+ buffers/cache:           2        139
 Swap:            3          0          3

 2011/9/7 François Schiettecatte fschietteca...@gmail.com

 My memory of this is a little rusty but isn't mmap also limited by mem +
 swap on the box? What does 'free -g' report?

 François

 On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:

  Ahoy ahoy!
 
  I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
 compound
  index segment file. The stack trace looks pretty much like every other
 trace
  I've found when searching for OOM  map failed[1]. My configuration
  follows:
 
  Solr 1.4.1/Lucene 2.9.3 (plus
  SOLR-1969https://issues.apache.org/jira/browse/SOLR-1969
  )
  CentOS 4.9 (Final)
  Linux 2.6.9-100.ELsmp x86_64 yada yada yada
  Java SE (build 1.6.0_21-b06)
  Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
  ulimits:
 core file size (blocks, -c) 0
 data seg size(kbytes, -d) unlimited
 file size (blocks, -f) unlimited
 pending signals(-i) 1024
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files(-n) 256000
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size(kbytes, -s) 10240
 cpu time(seconds, -t) unlimited
 max user processes (-u) 1064959
 virtual memory(kbytes, -v) unlimited
 file locks(-x) unlimited
 
  Any suggestions?
 
  Thanks in advance,
  Rich
 
  [1]
  ...
  java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map(Unknown Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown
  Source)
  at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
  at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown
 Source)
 
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
  at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
  at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown
 Source)
  at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
  Source)
  at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
  at org.apache.lucene.index.IndexReader.open(Unknown Source)
  ...
  Caused by: java.lang.OutOfMemoryError: Map failed
  at sun.nio.ch.FileChannelImpl.map0(Native Method)
  ...






MMapDirectory failed to map a 23G compound index segment

2011-09-07 Thread Rich Cariens
Ahoy ahoy!

I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound
index segment file. The stack trace looks pretty much like every other trace
I've found when searching for "OOM" and "map failed" [1]. My configuration
follows:

Solr 1.4.1/Lucene 2.9.3 (plus
SOLR-1969: https://issues.apache.org/jira/browse/SOLR-1969)
CentOS 4.9 (Final)
Linux 2.6.9-100.ELsmp x86_64 yada yada yada
Java SE (build 1.6.0_21-b06)
Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
ulimits:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1064959
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any suggestions?

Thanks in advance,
Rich

[1]
...
java.io.IOException: Map failed
 at sun.nio.ch.FileChannelImpl.map(Unknown Source)
 at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
 at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
 at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
 at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
 at org.apache.lucene.index.SegmentReader.get(Unknown Source)
 at org.apache.lucene.index.SegmentReader.get(Unknown Source)
 at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
 at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
 at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
 at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
 at org.apache.lucene.index.IndexReader.open(Unknown Source)
...
Caused by: java.lang.OutOfMemoryError: Map failed
 at sun.nio.ch.FileChannelImpl.map0(Native Method)
...


SSD experience

2011-08-22 Thread Rich Cariens
Ahoy ahoy!

Does anyone have any experiences or stories they can share with the list
about how SSDs impacted search performance for better or worse?

I found a Lucene SSD performance benchmark doc
(http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf)
but the wiki engine is refusing to let me view the attachment (I get "You
are not allowed to do AttachFile on this page.").

Thanks in advance!


Re: SSD experience

2011-08-22 Thread Rich Cariens
Thanks folks!

On Mon, Aug 22, 2011 at 11:13 AM, Erick Erickson erickerick...@gmail.comwrote:

 That link appears to be foo'd, and I can't find the original doc.

 But others (mostly on the user's list historically) have seen very
 significant
 performance improvements with SSDs, *IF* the entire index doesn't fit
 in memory.

 If your index does fit entirely in memory, there will probably be some
 improvement when fetching stored fields, especially if the stored fields
 are large. But I'm not sure the cost is worth the incremental speed
 in this case.. Of course if you can get your IT folks to spring for SSDs,
 go for it :)

 Best
 Erick

 On Mon, Aug 22, 2011 at 11:02 AM, Daniel Skiles
 daniel.ski...@docfinity.com wrote:
  I haven't tried it with Solr yet, but with straight Lucene about two
 years
  ago we saw about a 40% boost in performance on our tests with no changes
  except the disk.
 
  On Mon, Aug 22, 2011 at 10:54 AM, Rich Cariens richcari...@gmail.com
 wrote:
 
  Ahoy ahoy!
 
  Does anyone have any experiences or stories they can share with the list
  about how SSDs impacted search performance for better or worse?
 
  I found a Lucene SSD performance benchmark
  doc
 
 http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFiledo=viewtarget=combined-disk-ssd.pdf
  but
  the wiki engine is refusing to let me view the attachment (I get You
  are not allowed to do AttachFile on this page.).
 
  Thanks in advance!
 
 



Re: how to enable MMapDirectory in solr 1.4?

2011-08-08 Thread Rich Cariens
We patched our 1.4.1 build with
SOLR-1969 (https://issues.apache.org/jira/browse/SOLR-1969), making
MMapDirectory configurable, and realized a 64% search performance
boost on our Linux hosts.

On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James james.d...@ingrambook.comwrote:

 If you want to try MMapDirectory with Solr 1.4, then copy the class
 org.apache.solr.core.MMapDirectoryFactory from 3.x or Trunk, and either add
 it to the .war file (you can just add it under src/java and re-package the
 war), or you can put it in its own .jar file in the lib directory under
 solr_home.  Then, in solrconfig.xml, add this entry under the root
 <config> element:

 <directoryFactory class="org.apache.solr.core.MMapDirectoryFactory" />

 I'm not sure if MMapDirectory will perform better for you with Linux over
 NIOFSDir.  I'm pretty sure in Trunk/4.0 it's the default for Windows and
 maybe Solaris.  In Windows, there is a definite advantage for using
 MMapDirectory on a 64-bit system.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Li Li [mailto:fancye...@gmail.com]
 Sent: Monday, August 08, 2011 4:09 AM
 To: solr-user@lucene.apache.org
 Subject: how to enable MMapDirectory in solr 1.4?

 hi all,
I read Apache Solr 3.1 Released Note today and found that
 MMapDirectory is now the default implementation in 64 bit Systems.
I am now using solr 1.4 with 64-bit jvm in Linux. how can I use
 MMapDirectory? will it improve performance?



Re: document storage

2011-05-13 Thread Rich Cariens
We've decided to store the original document in both Solr and external
repositories. This is to support the following:

   1. highlighting - We need to mark-up the entire document with hit-terms.
   However if this was the only reason to store the text I'd seriously consider
   calling out to the external repository via a custom highlighter.
   2. hot documents - We need to index user-generated data like activity
   streams, folksonomy tags, annotations, and comments. When our indexer is
   made aware of those events we decorate the existing SolrDocument with new
    fields and re-index it (see the sketch after this list).
   3. in-place index rebuild - Our search service is still evolving so we
   periodically change our schema and indexing code. We believe it's more
   efficient, not to mention faster, to rebuild the index if we've got all the
   data.
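A rough SolrJ-era sketch of the decorate-and-re-index flow from item 2
(server URL, field names, and document id are all illustrative; error handling
omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class DecorateAndReindex {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Re-fetch the stored document that the event refers to
    SolrDocument stored = solr.query(new SolrQuery("id:doc-42")).getResults().get(0);
    // Decorate it with the new user-generated metadata and re-index it
    SolrInputDocument updated = ClientUtils.toSolrInputDocument(stored);
    updated.addField("tag", "user-supplied-tag");
    solr.add(updated);
    solr.commit();
  }
}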

Hope that helps!

On Fri, May 13, 2011 at 3:10 PM, Mike Sokolov soko...@ifactory.com wrote:

 Would anyone care to comment on the merits of storing indexed full-text
 documents in Solr versus storing them externally?

 It seems there are three options for us:

 1) store documents both in Solr and externally - this is what we are doing
 now, and gives us all sorts of flexibility, but doesn't seem like the most
 scalable option, at least in terms of storage space and I/O required when
 updating/inserting documents.

 2) store documents externally: For the moment, the only thing that requires
 us to store documents in Solr is the need to highlight them, both in search
 result snippets and in full document views. We are considering hunting for
 or writing a Highlighter extension that could pull in the document text from
 an external source (eg filesystem).

 3) store documents only in Solr.  We'd just retrieve document text as a
 Solr field value rather than reading from the filesystem.  Somehow this
 strikes me as the wrong thing to do, but it could work:  I'm not sure why.
  A lot of unnecessary merging I/O activity perhaps.  Makes it hard to grep
 the documents or use other filesystem tools, I suppose.

 Which one of these sounds best to you?  Under which circumstances? Are
 there other possibilities?

 Thanks!

 --

 Michael Sokolov
 Engineering Director
 www.ifactory.com




Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
Thanks Jan.

For the JMSUpdateHandler option, how does one plug in a custom UpdateHandler?
I want to make sure I'm not missing any important pieces of Solr processing
pipeline.

Best,
Rich

On Tue, Feb 15, 2011 at 4:36 AM, Jan Høydahl jan@cominvent.com wrote:

 Solr is multi threaded, so you are free to send as many parallel update
 requests needed to utilize your HW. Each request will get its own thread.
 Simply configure StreamingUpdateSolrServer from your client.

 If there is some lengthy work to be done, it needs to be done in SOME
 thread, and I guess you just have to choose where :)

 A JMSUpdateHandler sounds heavy weight, but does not need to be, and might
 be the logically best place for such a feature imo.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 14. feb. 2011, at 17.42, Rich Cariens wrote:

  Thanks Jan,
 
  I don't think I want to tie up a thread on two boxes waiting for an
  UpdateRequestProcessor to finish. I'd prefer to offload it all to the
 target
  shards. And a special JMSUpdateHandler feels like overkill. I *think* I'm
  really just looking for a simple API that allows me to add a
  SolrInputDocument to the index in-process.
 
  Perhaps I just need to use the EmbeddedSolrServer in the Solrj packages?
 I'm
  worried that this will break all the nice stuff one gets with the
 standard
  SOLR webapp (stats, admin, etc).
 
  Best,
  Rich
 
 
  On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl jan@cominvent.com
 wrote:
 
  Hi,
 
  One option would be to keep the JMS listener as today but move the
  downloading
  and transforming part to a SolrUpdateRequestProcessor on each shard. The
  benefit
  is that you ship only a tiny little SolrInputDocument over the wire with
 a
  reference to the doc to download, and do the heavy lifting on Solr side.
 
  If each JMS topic/channel corresponds to a particular shard, you could
  move the whole thing to Solr. If so, a new JMSUpdateHandler could
 perhaps
  be a way to go?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  On 14. feb. 2011, at 16.53, Rich Cariens wrote:
 
  Hello,
 
  I've built a system that receives JMS events containing links to docs
  that I
  must download and index. Right now the JMS receiving, downloading, and
  transformation into SolrInputDoc's happens in a separate JVM that then
  uses
  Solrj javabin HTTP POSTs to distribute these docs across many index
  shards.
 
  For various reasons I won't go into here, I'd like to relocate/deploy
  most
  of my processing (JMS receiving, downloading, and transformation into
  SolrInputDoc's) into the SOLR webapps running on each distributed shard
  host. I might be wrong, but I don't think the request-driven idiom of
 the
  DataImportHandler is not a good fit for me as I'm not kicking off full
 or
  delta imports. If that's true, what's the correct way to hook my
  components
  into SOLR's update facilities? Should I try to get a reference a
  configured
  DirectUpdateHandler?
 
  I don't know if this information helps, but I'll put it out there
  anyways:
  I'm using Spring 3 components to receive JMS events, wired up via
 webapp
  context hooks. My plan would be to add all that to my SOLR shard
 webapp.
 
  Best,
  Rich
 
 




Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
Thanks Jan,

I was referring to a custom UpdateHandler, not a RequestHandler. You know,
the one that the Solr wiki discourages :).

Best,
Rich

On Tue, Feb 15, 2011 at 8:37 AM, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 You would wire your JMSUpdateRequestHandler into solrconfig.xml as normal,
 and if you want to apply an UpdateChain, that would look like this:

   <requestHandler name="/update/jms" class="solr.JmsUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.processor">myPipeline</str>
     </lst>
   </requestHandler>

 See http://wiki.apache.org/solr/SolrRequestHandler for details

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 15. feb. 2011, at 14.30, Rich Cariens wrote:

  Thanks Jan.
 
  For the JMSUpdateHandler option, how does one plugin a custom
 UpdateHandler?
  I want to make sure I'm not missing any important pieces of Solr
 processing
  pipeline.
 
  Best,
  Rich
 
  On Tue, Feb 15, 2011 at 4:36 AM, Jan Høydahl jan@cominvent.com
 wrote:
 
  Solr is multi threaded, so you are free to send as many parallel update
  requests needed to utilize your HW. Each request will get its own
 thread.
  Simply configure StreamingUpdateSolrServer from your client.
 
  If there is some lengthy work to be done, it needs to be done in SOME
  thread, and I guess you just have to choose where :)
 
  A JMSUpdateHandler sounds heavy weight, but does not need to be, and
 might
  be the logically best place for such a feature imo.
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  On 14. feb. 2011, at 17.42, Rich Cariens wrote:
 
  Thanks Jan,
 
  I don't think I want to tie up a thread on two boxes waiting for an
  UpdateRequestProcessor to finish. I'd prefer to offload it all to the
  target
  shards. And a special JMSUpdateHandler feels like overkill. I *think*
 I'm
  really just looking for a simple API that allows me to add a
  SolrInputDocument to the index in-process.
 
  Perhaps I just need to use the EmbeddedSolrServer in the Solrj
 packages?
  I'm
  worried that this will break all the nice stuff one gets with the
  standard
  SOLR webapp (stats, admin, etc).
 
  Best,
  Rich
 
 
  On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl jan@cominvent.com
  wrote:
 
  Hi,
 
  One option would be to keep the JMS listener as today but move the
  downloading
  and transforming part to a SolrUpdateRequestProcessor on each shard.
 The
  benefit
  is that you ship only a tiny little SolrInputDocument over the wire
 with
  a
  reference to the doc to download, and do the heavy lifting on Solr
 side.
 
  If each JMS topic/channel corresponds to a particular shard, you could
  move the whole thing to Solr. If so, a new JMSUpdateHandler could
  perhaps
  be a way to go?
 
  --
  Jan Høydahl, search solution architect
  Cominvent AS - www.cominvent.com
 
  On 14. feb. 2011, at 16.53, Rich Cariens wrote:
 
  Hello,
 
  I've built a system that receives JMS events containing links to docs
  that I
  must download and index. Right now the JMS receiving, downloading,
 and
  transformation into SolrInputDoc's happens in a separate JVM that
 then
  uses
  Solrj javabin HTTP POSTs to distribute these docs across many index
  shards.
 
  For various reasons I won't go into here, I'd like to relocate/deploy
  most
  of my processing (JMS receiving, downloading, and transformation into
  SolrInputDoc's) into the SOLR webapps running on each distributed
 shard
  host. I might be wrong, but I don't think the request-driven idiom of
  the
  DataImportHandler is not a good fit for me as I'm not kicking off
 full
  or
  delta imports. If that's true, what's the correct way to hook my
  components
  into SOLR's update facilities? Should I try to get a reference a
  configured
  DirectUpdateHandler?
 
  I don't know if this information helps, but I'll put it out there
  anyways:
  I'm using Spring 3 components to receive JMS events, wired up via
  webapp
  context hooks. My plan would be to add all that to my SOLR shard
  webapp.
 
  Best,
  Rich
 
 
 
 




Re: Guidance for event-driven indexing

2011-02-14 Thread Rich Cariens
Thanks Jan,

I don't think I want to tie up a thread on two boxes waiting for an
UpdateRequestProcessor to finish. I'd prefer to offload it all to the target
shards. And a special JMSUpdateHandler feels like overkill. I *think* I'm
really just looking for a simple API that allows me to add a
SolrInputDocument to the index in-process.

Perhaps I just need to use the EmbeddedSolrServer in the Solrj packages? I'm
worried that this will break all the nice stuff one gets with the standard
SOLR webapp (stats, admin, etc).
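For what it's worth, the EmbeddedSolrServer route would look roughly like this
with the Solr 1.4-era API (solr home path, core name, and fields are
illustrative):

import java.io.File;

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class InProcessIndexer {
  public static void main(String[] args) throws Exception {
    System.setProperty("solr.solr.home", new File("/opt/solr/home").getAbsolutePath());
    CoreContainer container = new CoreContainer.Initializer().initialize();
    EmbeddedSolrServer server = new EmbeddedSolrServer(container, "core1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "jms-msg-1");
    doc.addField("text", "body fetched from the URL carried in the JMS event");
    server.add(doc);
    server.commit();

    container.shutdown();
  }
}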

Best,
Rich


On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl jan@cominvent.com wrote:

 Hi,

 One option would be to keep the JMS listener as today but move the
 downloading
 and transforming part to a SolrUpdateRequestProcessor on each shard. The
 benefit
 is that you ship only a tiny little SolrInputDocument over the wire with a
 reference to the doc to download, and do the heavy lifting on Solr side.

 If each JMS topic/channel corresponds to a particular shard, you could
 move the whole thing to Solr. If so, a new JMSUpdateHandler could perhaps
 be a way to go?

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com

 On 14. feb. 2011, at 16.53, Rich Cariens wrote:

  Hello,
 
  I've built a system that receives JMS events containing links to docs
 that I
  must download and index. Right now the JMS receiving, downloading, and
  transformation into SolrInputDoc's happens in a separate JVM that then
 uses
  Solrj javabin HTTP POSTs to distribute these docs across many index
 shards.
 
  For various reasons I won't go into here, I'd like to relocate/deploy
 most
  of my processing (JMS receiving, downloading, and transformation into
  SolrInputDoc's) into the SOLR webapps running on each distributed shard
  host. I might be wrong, but I don't think the request-driven idiom of the
  DataImportHandler is a good fit for me as I'm not kicking off full or
  delta imports. If that's true, what's the correct way to hook my
 components
  into SOLR's update facilities? Should I try to get a reference a
 configured
  DirectUpdateHandler?
 
  I don't know if this information helps, but I'll put it out there
 anyways:
  I'm using Spring 3 components to receive JMS events, wired up via webapp
  context hooks. My plan would be to add all that to my SOLR shard webapp.
 
  Best,
  Rich




Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
Thanks Lance.  I'm storing the original document and indexing all its
extracted content, but I need to be able to highlight the text within its
original markup.  I'm going to give Uwe's suggestion (http://bit.ly/hCSdYZ) a go.
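For anyone following along, Lance's whole-field highlighting suggestion
(quoted below) looks roughly like this from SolrJ, assuming the document body
is stored in a field called "body":

import org.apache.solr.client.solrj.SolrQuery;

public class FullBodyHighlight {
  public static void main(String[] args) {
    SolrQuery query = new SolrQuery("hit terms here");
    query.setHighlight(true);
    query.addHighlightField("body");
    query.setHighlightFragsize(0);  // 0 = no fragmenting; highlight the entire stored field
    query.setHighlightSnippets(1);
    System.out.println(query);
  }
}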

On Sat, Dec 4, 2010 at 7:18 PM, Lance Norskog goks...@gmail.com wrote:

 Set the fragment length to 0. This means highlight the entire text
 body. If, you have stored the text body.

 Otherwise, you have to get the term vectors somehow and highlight the
 text yourself.

 I investigated this problem awhile back for PDFs. You can add a
 starting page and an OR list of search terms to the URL that loads a
 PDF into the in-browser version of the Adobe PDF reader. This allows
 you to load the PDF at the first occurence of any of the search terms,
 with the terms highlighted. The search button takes you to the next of
 any of the terms.

 On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens richcari...@gmail.com
 wrote:
  Anyone ever use Solr to present a view of a document with hit-terms
  highlighted within?  Kind of like Google's cached http://bit.ly/hgudWq
 copies?
 



 --
 Lance Norskog
 goks...@gmail.com



Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
Uwe goes on to say:


 This works, as long as you don't need query highlighting, because the offsets 
 from the first field addition cannot be used for highlighting inside the text 
 with markup. *In this case, you have to write your own analyzer that removes 
 the markup in the tokenizer, but preserves the original offsets. *Examples of 
 this are e.g. The Wikipedia contrib in Lucene, which has an hand-crafted 
 analyzer that can handle Mediawiki Markup syntax.



On Sun, Dec 5, 2010 at 3:35 PM, Jonathan Rochkind rochk...@jhu.edu wrote:

  That suggestion says "This works, as long as you don't need query
  highlighting."  Have you found a way around that, or have you decided not to
 use highlighting after all?  Or am I missing something?
 
 From: Rich Cariens [richcari...@gmail.com]
 Sent: Sunday, December 05, 2010 10:58 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Full text hit term highlighting

 Thanks Lance.  I'm storing the original document and indexing all it's
 extracted content, but I need to be able to high-light the text within it's
 original markup.  I'm going to give Uwe's suggestion http://bit.ly/hCSdYZa
 go.

 On Sat, Dec 4, 2010 at 7:18 PM, Lance Norskog goks...@gmail.com wrote:

  Set the fragment length to 0. This means highlight the entire text
  body. If, you have stored the text body.
 
  Otherwise, you have to get the term vectors somehow and highlight the
  text yourself.
 
  I investigated this problem awhile back for PDFs. You can add a
  starting page and an OR list of search terms to the URL that loads a
  PDF into the in-browser version of the Adobe PDF reader. This allows
  you to load the PDF at the first occurence of any of the search terms,
  with the terms highlighted. The search button takes you to the next of
  any of the terms.
 
  On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens richcari...@gmail.com
  wrote:
   Anyone ever use Solr to present a view of a document with hit-terms
   highlighted within?  Kind of like Google's cached 
 http://bit.ly/hgudWq
  copies?
  
 
 
 
  --
  Lance Norskog
  goks...@gmail.com
 



Full text hit term highlighting

2010-12-04 Thread Rich Cariens
Anyone ever use Solr to present a view of a document with hit-terms
highlighted within?  Kind of like Google's cached copies (http://bit.ly/hgudWq)?


Re: Optimize Index

2010-11-04 Thread Rich Cariens
For what it's worth, the Solr class instructor at the Lucene Revolution
conference recommended *against* optimizing, and instead suggested to just
let the merge factor do its job.

On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/4/2010 7:22 AM, stockiii wrote:

 how can i start an optimize by using DIH, but NOT after an delta- or
 full-import ?


 I'm not aware of a way to do this with DIH, though there might be something
 I'm not aware of.  You can do it with an HTTP POST.  Here's how to do it
 with curl:

 /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
 -H "Content-Type: text/xml" \
 --data-binary '<optimize waitFlush="true" waitSearcher="true"/>'

 Shawn




Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Rich Cariens
I experienced the hang described with the Solr 1.4.0 build.

Yonik - I also thought the streaming updater was blocking on commits but
updates never resumed.

To be honest I was in a bit of a rush to meet a deadline so after spending a
day or so tinkering I bailed out and just wrote a component by hand.  I have
not tried to reproduce this using the current trunk.  I was using the 32-bit
Sun JRE on a Red Hat EL 5 HP server.

I'm not sure if the following enriches this thread, but I'll include it
anyway: write a document generator and start adding a ton of 'em to a Solr
server instance using the streaming updater.  You *should* experience the
hang.
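Something along these lines should do it (URL, queue size, thread count, and
document count are made up; tune until the add() calls block):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SussHangReproducer {
  public static void main(String[] args) throws Exception {
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 3);
    for (int i = 0; i < 10000000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text", "synthetic body for document " + i);
      server.add(doc); // with enough volume this is where the adds eventually hang
    }
    server.commit();
  }
}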

HTH,
Rich

On Fri, Apr 16, 2010 at 1:34 PM, Sascha Szott sz...@zib.de wrote:

 Hi Yonik,

 thanks for your fast reply.


 Yonik Seeley wrote:

 Thanks for the report Sascha.
 So after the hang, it never recovers?  Some amount of hanging could be
 visible if there was a commit on the Solr server or something else to
 cause the solr requests to block for a while... but it should return
 to normal on it's own...

 In my case the whole application hangs and never recovers (CPU utilization
 goes down to near 0%). Interestingly, the problem reproducibly occurs only
 if SUSS is created with *more than 2* threads.


  Looking at the stack trace, it looks like threads are blocked waiting
 to get an http connection.

 I forgot to mention that my index app has exclusive access to the Solr
 instance. Therefore, concurrent searches against the same Solr instance
 while indexing are excluded.


  I'm traveling all next week, but I'll open a JIRA issue for this now.

 Thank you!


  Anything that would help us reproduce this is much appreciated.

 Are there any other who have experienced the same problem?

 -Sascha



 On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szottsz...@zib.de  wrote:

 Hi Yonik,

 Yonik Seeley wrote:


 Stephen, were you running stock Solr 1.4, or did you apply any of the
 SolrJ patches?
 I'm trying to figure out if anyone still has any problems, or if this
 was fixed with SOLR-1711:


 I'm using the latest trunk version (rev. 934846) and constantly running
 into
 the same problem. I'm using StreamingUpdateSolrServer with 3 treads and a
 queue size of 20 (not really knowing if this configuration is optimal).
 My
 multi-threaded application indexes 200k data items (bibliographic
 metadata
 in Dublin Core format) and constantly hangs after running for some time.

 Below you can find the thread dump of one of my index threads (after the
 app
 hangs all dumps are the same)

 thread 19 prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on
 condition
 [0x42d05000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for0x7fe8cdcb7598  (a
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
at

 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
at

 java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
at

 org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
at

 org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
at

 de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
at

 de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
at

 de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)
at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
at de.kobv.ked.rss.RssThread.run(RssThread.java:58)



 and of the three SUSS threads:

 pool-1-thread-3 prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in
 Object.wait()
 [0x409ac000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on0x7fe8cdcb6f10  (a

 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at

 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked0x7fe8cdcb6f10  (a

 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at

 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at

 org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at

 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at

 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at

 

Index transaction log or equivalent?

2010-04-08 Thread Rich Cariens
Are there any best practices or built-in support for keeping track of what's
been indexed in a Solr application so as to support a full rebuild?  I'm not
indexing from a single source, but from many, sometimes arbitrary, sources
including:

   1. A document repository that fires events (containing a URL) when new
   documents are added to the repo;
   2. A book-marking service that fires events containing URLs when users of
   that service bookmark a URL;
   3. More services that raise events that make Solr update docs indexed via
   (1) or (2) with additional metadata (think user comments, tagging, etc).

I'm looking at ~200M documents for the initial launch, with around 30K new
docs every day, and many thousands of metadata events every day.

Do any of you Solr gurus have any suggestions or guidance you can share with
me?

Thanks in advance,
Rich


Re: Index transaction log or equivalent?

2010-04-08 Thread Rich Cariens
Thanks Mark.  That's sort of what I was thinking of doing.
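For the archives, a bare-bones sketch of what Mark is describing against the
Solr 1.4-era API (no factory class, error handling, or log rotation shown;
names are illustrative):

import java.io.IOException;
import java.io.Writer;

import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Appends every incoming document to a SolrXML log so the whole index can be
// rebuilt later by replaying the log through /update.
public class SolrXmlLoggingProcessor extends UpdateRequestProcessor {
  private final Writer log;

  public SolrXmlLoggingProcessor(Writer log, UpdateRequestProcessor next) {
    super(next);
    this.log = log;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    log.write(ClientUtils.toXML(cmd.solrDoc)); // serialize the doc as SolrXML
    log.write('\n');
    super.processAdd(cmd); // then let the normal update chain index it
  }
}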

On Thu, Apr 8, 2010 at 10:33 AM, Mark Miller markrmil...@gmail.com wrote:

 On 04/08/2010 09:23 AM, Rich Cariens wrote:

 Are there any best practices or built-in support for keeping track of
 what's
 been indexed in a Solr application so as to support a full rebuild?  I'm
 not
 indexing from a single source, but from many, sometimes arbitrary, sources
 including:

1. A document repository that fires events (containing a URL) when new

documents are added to the repo;
2. A book-marking service that fires events containing URLs when users
 of

that service bookmark a URL;
3. More services that raise events that make Solr update docs indexed
 via

(1) or (2) with additional metadata (think user comments, tagging,
 etc).

 I'm looking at ~200M documents for the initial launch, with around 30K new
 docs every day, and many thousands of metadata events every day.

 Do any of you Solr gurus have any suggestions or guidance you can share
 with
 me?

 Thanks in advance,
 Rich




 Pump everything through an UpdateProcessor that writes out SolrXML as docs
 go by?

 --
 - Mark

 http://www.lucidimagination.com






Re: an OR filter query

2010-04-04 Thread Rich Cariens
Why not just make your mature:false filter query a default value
instead of an always-appended one?  I.e.:

-snip-
<lst name="defaults">
  <str name="fq">mature:false</str>
</lst>
-snip-

That way if someone wants mature items in their results the search client
explicitly sets fq=mature:* or whatever.

Would that work?
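In SolrJ terms the client-side difference would look something like this
(query text is illustrative):

import org.apache.solr.client.solrj.SolrQuery;

public class MatureFilterExample {
  public static void main(String[] args) {
    // Normal case: no fq sent, so the handler's default fq=mature:false applies
    SolrQuery safeSearch = new SolrQuery("vintage posters");

    // Caller explicitly wants mature items too, so it overrides the default fq
    SolrQuery everything = new SolrQuery("vintage posters");
    everything.setFilterQueries("mature:*");

    System.out.println(safeSearch + "\n" + everything);
  }
}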

On Sun, Apr 4, 2010 at 3:27 PM, Blargy zman...@hotmail.com wrote:


 Is there anyway to use a filter query as an OR clause?

 For example I have product listings and I want to be able to filter out
 mature items by default. To do this I added:

 <lst name="appends">
   <str name="fq">mature:false</str>
 </lst>

 But then I can never return any mature items because appending
 fq=mature:true will obviously return 0 results because no item can both be
 mature and non-mature.

 I can get around this using defaults:

 <lst name="default">
   <str name="fq">mature:false</str>
 </lst>

 But this is a little hacky because anytime I want to include mature items
 with non-mature items I need to explicitly set fq as a blank string.

 So is there any better way to do this? Thanks
 --
 View this message in context:
 http://n3.nabble.com/an-OR-filter-query-tp696579p696579.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown
patches/optimizations to index over 13B small documents in a 32-shard
environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it
himself.

On Fri, Apr 2, 2010 at 1:17 PM, dar...@ontrenet.com wrote:

 My guess is that you will need to take advantage of Solr 1.5's upcoming
 cloud/cluster renovations and use multiple indexes to comfortably achieve
 those numbers. Hypthetically, in that case, you won't be limited by single
 index docid limitations of Lucene.

  We are currently indexing 5 million books in Solr, scaling up over the
  next few years to 20 million.  However we are using the entire book as a
  Solr document.  We are evaluating the possibility of indexing individual
  pages as there are some use cases where users want the most relevant
 pages
  regardless of what book they occur in.  However, we estimate that we are
  talking about somewhere between 1 and 6 billion pages and have concerns
  over whether Solr will scale to this level.
 
  Does anyone have experience using Solr with 1-6 billion Solr documents?
 
  The lucene file format document
  (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
  mentions a limit of about 2 billion document ids.   I assume this is the
  lucene internal document id and would therefore be a per index/per shard
  limit.  Is this correct?
 
 
  Tom Burton-West.