Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Rich Cariens
For what it's worth, we've had good luck using the ICUTokenizer and
associated filters. A native Chinese speaker here at the office gave us an
enthusiastic thumbs up on our Chinese search results. Your mileage may vary
of course.
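
For reference, a minimal sketch of an ICU-based field type along these lines
(this assumes the Solr analysis-extras ICU jars are on the classpath; the exact
filter chain is an illustrative guess, not the poster's actual config):

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>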

On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson 
wrote:

> In a word, no. The CJK languages in general don't
> necessarily tokenize on whitespace, so using a tokenizer
> that treats whitespace as its default delimiter simply won't
> work.
>
> Have you tried it? It seems a simple test would get you
> an answer faster.
>
> Best,
> Erick
>
> On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo  >
> wrote:
>
> > Hi,
> >
> > Would like to check: will StandardTokenizerFactory work well for
> indexing
> > both English and Chinese (bilingual) documents, or do we need tokenizers
> > that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
> >
> >
> > Regards,
> > Edwin
> >
>


Re: Implementing custom analyzer for multi-language stemming

2014-08-06 Thread Rich Cariens
Yes, each token could have a LanguageAttribute on it, just like
ScriptAttributes. I didn't *think* a span would be necessary.

I would also add a multivalued "lang" field to the document. Searching
English documents for "die" might look like: "q=die&lang=eng". The "lang"
param could tell the RequestHandler to add a filter query "fq=lang:eng" to
constrain the search to the English corpus, as well as recruit an English
analyzer when tokenizing the "die" query term.
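
A hedged sketch of that approach (the "lang" field and the expansion are the
poster's hypothetical design, not stock Solr behavior):

  <!-- schema.xml: multivalued language field (assumes the stock "string" type) -->
  <field name="lang" type="string" indexed="true" stored="true" multiValued="true"/>

  client request:    q=die&lang=eng
  handler expands:   q=die&fq=lang:eng   (with "die" analyzed by an English field type)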

Since I can't control text length, I would just let the language detection
tool do its best and not sweat it.


On Wed, Aug 6, 2014 at 12:11 AM, TK  wrote:

>
> On 8/5/14, 8:36 AM, Rich Cariens wrote:
>
>> Of course this is extremely primitive and basic, but I think it would be
>> possible to write a CharFilter or TokenFilter that inspects the entire
>> TokenStream to guess the language(s), perhaps even noting where languages
>> change. Language and position information could be tracked, the
>> TokenStream
>> rewound and then Tokens emitted with "LanguageAttributes" for downstream
>> Token stemmers to deal with.
>>
> I'm curious how you are planning to handle the LanguageAttribute.
> Would each token have this attribute denoting a span of Tokens
> with a language? But then how would you search
> English documents that include the term "die" while skipping
> all the German documents, which are most likely to have "die"?
>
> Automatic language detection works OK for long text with
> typical content, but it doesn't work well with short
> text. What strategy would you use to deal with short text?
>
> --
> TK
>
>


Re: Implementing custom analyzer for multi-language stemming

2014-08-05 Thread Rich Cariens
I've started a GitHub project to try out some cross-lingual analysis ideas (
https://github.com/whateverdood/cross-lingual-search). I haven't played
over there for about 3 months, but plan on restarting work there shortly.
In a nutshell, the interesting component
("SimplePolyGlotStemmingTokenFilter") relies on ICU4J ScriptAttributes:
each token is inspected for its script, e.g. "latin" or "arabic", and then
a "ScriptStemmer" recruits the appropriate stemmer to handle the token.
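
A hedged sketch of that idea (this is not the actual
SimplePolyGlotStemmingTokenFilter source; the class name, the Latin-to-English
mapping, and the Snowball stemmer are illustrative assumptions):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.tartarus.snowball.ext.EnglishStemmer;
  import com.ibm.icu.lang.UScript;

  public final class ScriptStemmingTokenFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final ScriptAttribute scriptAtt = addAttribute(ScriptAttribute.class);
    private final EnglishStemmer englishStemmer = new EnglishStemmer(); // one stemmer per script in practice

    public ScriptStemmingTokenFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      // dispatch on the script set by ICUTokenizer; Latin -> English stemmer is an assumption
      if (scriptAtt.getCode() == UScript.LATIN) {
        englishStemmer.setCurrent(termAtt.toString());
        englishStemmer.stem();
        termAtt.setEmpty().append(englishStemmer.getCurrent());
      }
      // other scripts (e.g. UScript.ARABIC) would dispatch to their own stemmers
      return true;
    }
  }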

Of course this is extremely primitive and basic, but I think it would be
possible to write a CharFilter or TokenFilter that inspects the entire
TokenStream to guess the language(s), perhaps even noting where languages
change. Language and position information could be tracked, the TokenStream
rewound and then Tokens emitted with "LanguageAttributes" for downstream
Token stemmers to deal with.

Or is that a crazy idea?


On Tue, Aug 5, 2014 at 12:10 AM, TK  wrote:

> On 7/30/14, 10:47 AM, Eugene wrote:
>
>>  Hello, fellow Solr and Lucene users and developers!
>>
>>  In our project we receive text from users in different languages. We
>> detect language automatically and use Google Translate APIs a lot (so
>> having an arbitrary number of languages in our system doesn't concern us).
>> However, we need to be able to search using stemming. Having nearly a hundred
>> fields (several fields for each language with language-specific
>> stemmers) listed in our search query is not an option. So we need a way to
>> have a single index which has stemmed tokens for different languages.
>>
>
> Do you mean to have a Tokenizer that switches among supported languages
> depending on the "lang" field? This is something I thought about when I
> started working on Solr/Lucene, and I soon realized it is not possible
> because of the way Lucene is designed: the Tokenizer in an analyzer chain
> cannot peek at another field's value, and there is no way to control which
> field is processed first.
>
> If that's not what you are trying to achieve, could you tell us what
> it is? If you have different language text in a single field, and if
> someone searches for a word common to many languages,
> such as "sports" (or "Lucene" for that matter), Solr will return
> the documents of different languages, most of which the user
> doesn't understand. Would that be useful? If you have
> a special use case, would you like to share it?
>
> --
> Kuro
>


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-30 Thread Rich Cariens
My colleague and I thought the same thing - that this is an O/S
configuration issue.

/proc/sys/vm/max_map_count = 65536

I honestly don't know how many segments were in the index. Our merge factor
is 10 and there were around 4.4 million docs indexed. The OOME was raised
when the MMapDirectory was opened, so I don't think we were reopening the
reader several times. Our MMapDirectory is set to use the "unmapHack".

We've since switched back to non-compound index files and are having no
trouble at all.
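
For anyone who wants to rule the map-count limit in or out, checking and
raising it looks roughly like this on Linux (the 262144 value is illustrative):

  cat /proc/sys/vm/max_map_count          # current per-process mmap limit (65536 here)
  sudo sysctl -w vm.max_map_count=262144  # raise it until the next reboot
  # add "vm.max_map_count = 262144" to /etc/sysctl.conf to make it permanent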

On Tue, Sep 20, 2011 at 3:32 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Since you hit OOME during mmap, I think this is an OS issue not a JVM
> issue.  Ie, the JVM isn't running out of memory.
>
> How many segments were in the unoptimized index?  It's possible the OS
> rejected the mmap because of process limits.  Run "cat
> /proc/sys/vm/max_map_count" to see how many mmaps are allowed.
>
> Or: is it possible you reopened the reader several times against the
> index (ie, after committing from Solr)?  If so, I think 2.9.x never
> unmaps the mapped areas, and so this would "accumulate" against the
> system limit.
>
> > My memory of this is a little rusty but isn't mmap also limited by mem +
> swap on the box? What does 'free -g' report?
>
> I don't think this should be the case; you are using a 64 bit OS/JVM
> so in theory (except for OS system wide / per-process limits imposed)
> you should be able to mmap up to the full 64 bit address space.
>
> Your virtual memory is unlimited (from "ulimit" output), so that's good.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Sep 7, 2011 at 12:25 PM, Rich Cariens 
> wrote:
> > Ahoy ahoy!
> >
> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
> compound
> > index segment file. The stack trace looks pretty much like every other
> trace
> > I've found when searching for OOM & "map failed"[1]. My configuration
> > follows:
> >
> > Solr 1.4.1/Lucene 2.9.3 (plus
> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
> > )
> > CentOS 4.9 (Final)
> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
> > Java SE (build 1.6.0_21-b06)
> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
> > ulimits:
> >core file size (blocks, -c) 0
> >data seg size(kbytes, -d) unlimited
> >file size (blocks, -f) unlimited
> >pending signals(-i) 1024
> >max locked memory (kbytes, -l) 32
> >max memory size (kbytes, -m) unlimited
> >open files(-n) 256000
> >pipe size (512 bytes, -p) 8
> >POSIX message queues (bytes, -q) 819200
> >stack size(kbytes, -s) 10240
> >cpu time(seconds, -t) unlimited
> >max user processes (-u) 1064959
> >virtual memory(kbytes, -v) unlimited
> >file locks(-x) unlimited
> >
> > Any suggestions?
> >
> > Thanks in advance,
> > Rich
> >
> > [1]
> > ...
> > java.io.IOException: Map failed
> >  at sun.nio.ch.FileChannelImpl.map(Unknown Source)
> >  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> > Source)
> >  at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> > Source)
> >  at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
> >  at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
> Source)
> >
> >  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> >  at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> >  at org.apache.lucene.index.DirectoryReader.(Unknown Source)
> >  at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown
> Source)
> >  at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
> >  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
> > Source)
> >  at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
> >  at org.apache.lucene.index.IndexReader.open(Unknown Source)
> > ...
> > Caused by: java.lang.OutOfMemoryError: Map failed
> >  at sun.nio.ch.FileChannelImpl.map0(Native Method)
> > ...
> >
>


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-12 Thread Rich Cariens
Thanks. It's definitely repeatable and I may spend some time plumbing this
further. I'll let the list know if I find anything.

The problem went away once I optimized the index down to a single segment
using a simple IndexWriter driver. This was a bit strange since the
resulting index contained similarly large (> 23G) files. The JVM didn't seem
to have any trouble MMap'ing those.

No, I don't need (or necessarily want) to use compound index file formats.
That was actually a goof on my part which I've since corrected :).
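
A hedged sketch of what such a driver can look like against the Lucene 2.9.x
API in use here (the path and analyzer are placeholders, not the exact code
that was run):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class OptimizeDriver {
    public static void main(String[] args) throws Exception {
      FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
      IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
          IndexWriter.MaxFieldLength.UNLIMITED);
      writer.optimize();   // merge everything down to a single segment
      writer.close();
      dir.close();
    }
  }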

On Fri, Sep 9, 2011 at 9:42 PM, Lance Norskog  wrote:

> I remember now: by memory-mapping one block of address space that big, the
> garbage collector has problems working around it. If the OOM is repeatable,
> you could try watching the app with jconsole and watch the memory spaces.
>
> Lance
>
> On Thu, Sep 8, 2011 at 8:58 PM, Lance Norskog  wrote:
>
> > Do you need to use the compound format?
> >
> > On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens  >wrote:
> >
> >> I should add some more context:
> >>
> >>   1. the problem index included several cfs segment files that were
> around
> >>   4.7G, and
> >>   2. I'm running four SOLR instances on the same box, all of which have
> >>   similiar problem indeces.
> >>
> >> A colleague thought perhaps I was bumping up against my 256,000 open
> files
> >> ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
> >> handle/descriptor?
> >>
> >> On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens 
> >> wrote:
> >>
> >> > FWiW I optimized the index down to a single segment and now I have no
> >> > trouble opening an MMapDirectory on that index, even though the 23G
> cfx
> >> > segment file remains.
> >> >
> >> >
> >> > On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens  >> >wrote:
> >> >
> >> >> Thanks for the response. "free -g" reports:
> >> >>
> >> >>              total       used       free     shared    buffers     cached
> >> >> Mem:           141         95         46          0          0         93
> >> >> -/+ buffers/cache:          2        139
> >> >> Swap:            3          0          3
> >> >>
> >> >> 2011/9/7 François Schiettecatte 
> >> >>
> >> >>> My memory of this is a little rusty but isn't mmap also limited by
> mem
> >> +
> >> >>> swap on the box? What does 'free -g' report?
> >> >>>
> >> >>> François
> >> >>>
> >> >>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
> >> >>>
> >> >>> > Ahoy ahoy!
> >> >>> >
> >> >>> > I've run into the dreaded OOM error with MMapDirectory on a 23G
> cfs
> >> >>> compound
> >> >>> > index segment file. The stack trace looks pretty much like every
> >> other
> >> >>> trace
> >> >>> > I've found when searching for OOM & "map failed"[1]. My
> >> configuration
> >> >>> > follows:
> >> >>> >
> >> >>> > Solr 1.4.1/Lucene 2.9.3 (plus
> >> >>> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
> >> >>> > )
> >> >>> > CentOS 4.9 (Final)
> >> >>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
> >> >>> > Java SE (build 1.6.0_21-b06)
> >> >>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
> >> >>> > ulimits:
> >> >>> >core file size (blocks, -c) 0
> >> >>> >data seg size(kbytes, -d) unlimited
> >> >>> >file size (blocks, -f) unlimited
> >> >>> >pending signals(-i) 1024
> >> >>> >max locked memory (kbytes, -l) 32
> >> >>> >max memory size (kbytes, -m) unlimited
> >> >>> >open files(-n) 256000
> >> >>> >pipe size (512 bytes, -p) 8
> >> >>> >POSIX message queues (bytes, -q) 819200
> >> >>> >stack size(kbytes, -s) 10240
> >> >>> >cpu time(seconds, -t) unlimited

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
I should add some more context:

   1. the problem index included several cfs segment files that were around
   4.7G, and
   2. I'm running four SOLR instances on the same box, all of which have
   similar problem indices.

A colleague thought perhaps I was bumping up against my 256,000 open files
ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file
handle/descriptor?

On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens  wrote:

> FWiW I optimized the index down to a single segment and now I have no
> trouble opening an MMapDirectory on that index, even though the 23G cfx
> segment file remains.
>
>
> On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens wrote:
>
>> Thanks for the response. "free -g" reports:
>>
>>              total       used       free     shared    buffers     cached
>> Mem:           141         95         46          0          0         93
>> -/+ buffers/cache:          2        139
>> Swap:            3          0          3
>>
>> 2011/9/7 François Schiettecatte 
>>
>>> My memory of this is a little rusty but isn't mmap also limited by mem +
>>> swap on the box? What does 'free -g' report?
>>>
>>> François
>>>
>>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
>>>
>>> > Ahoy ahoy!
>>> >
>>> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
>>> compound
>>> > index segment file. The stack trace looks pretty much like every other
>>> trace
>>> > I've found when searching for OOM & "map failed"[1]. My configuration
>>> > follows:
>>> >
>>> > Solr 1.4.1/Lucene 2.9.3 (plus
>>> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
>>> > )
>>> > CentOS 4.9 (Final)
>>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
>>> > Java SE (build 1.6.0_21-b06)
>>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
>>> > ulimits:
>>> >core file size (blocks, -c) 0
>>> >data seg size(kbytes, -d) unlimited
>>> >file size (blocks, -f) unlimited
>>> >pending signals(-i) 1024
>>> >max locked memory (kbytes, -l) 32
>>> >max memory size (kbytes, -m) unlimited
>>> >open files(-n) 256000
>>> >pipe size (512 bytes, -p) 8
>>> >POSIX message queues (bytes, -q) 819200
>>> >stack size(kbytes, -s) 10240
>>> >cpu time(seconds, -t) unlimited
>>> >max user processes (-u) 1064959
>>> >virtual memory(kbytes, -v) unlimited
>>> >file locks(-x) unlimited
>>> >
>>> > Any suggestions?
>>> >
>>> > Thanks in advance,
>>> > Rich
>>> >
>>> > [1]
>>> > ...
>>> > java.io.IOException: Map failed
>>> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
>>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>>> > Source)
>>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>>> > Source)
>>> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
>>> > at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
>>> Source)
>>> >
>>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>>> > at org.apache.lucene.index.DirectoryReader.(Unknown Source)
>>> > at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown
>>> Source)
>>> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
>>> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
>>> > Source)
>>> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
>>> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
>>> > ...
>>> > Caused by: java.lang.OutOfMemoryError: Map failed
>>> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
>>> > ...
>>>
>>>
>>
>


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
FWiW I optimized the index down to a single segment and now I have no
trouble opening an MMapDirectory on that index, even though the 23G cfx
segment file remains.

On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens  wrote:

> Thanks for the response. "free -g" reports:
>
>              total       used       free     shared    buffers     cached
> Mem:           141         95         46          0          0         93
> -/+ buffers/cache:          2        139
> Swap:            3          0          3
>
> 2011/9/7 François Schiettecatte 
>
>> My memory of this is a little rusty but isn't mmap also limited by mem +
>> swap on the box? What does 'free -g' report?
>>
>> François
>>
>> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
>>
>> > Ahoy ahoy!
>> >
>> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
>> compound
>> > index segment file. The stack trace looks pretty much like every other
>> trace
>> > I've found when searching for OOM & "map failed"[1]. My configuration
>> > follows:
>> >
>> > Solr 1.4.1/Lucene 2.9.3 (plus
>> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
>> > )
>> > CentOS 4.9 (Final)
>> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
>> > Java SE (build 1.6.0_21-b06)
>> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
>> > ulimits:
>> >core file size (blocks, -c) 0
>> >data seg size(kbytes, -d) unlimited
>> >file size (blocks, -f) unlimited
>> >pending signals(-i) 1024
>> >max locked memory (kbytes, -l) 32
>> >max memory size (kbytes, -m) unlimited
>> >open files(-n) 256000
>> >pipe size (512 bytes, -p) 8
>> >POSIX message queues (bytes, -q) 819200
>> >stack size(kbytes, -s) 10240
>> >cpu time(seconds, -t) unlimited
>> >max user processes (-u) 1064959
>> >virtual memory(kbytes, -v) unlimited
>> >file locks(-x) unlimited
>> >
>> > Any suggestions?
>> >
>> > Thanks in advance,
>> > Rich
>> >
>> > [1]
>> > ...
>> > java.io.IOException: Map failed
>> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>> > Source)
>> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
>> > Source)
>> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
>> > at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
>> Source)
>> >
>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
>> > at org.apache.lucene.index.DirectoryReader.(Unknown Source)
>> > at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown
>> Source)
>> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
>> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
>> > Source)
>> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
>> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
>> > ...
>> > Caused by: java.lang.OutOfMemoryError: Map failed
>> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
>> > ...
>>
>>
>


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
Thanks for the response. "free -g" reports:

             total       used       free     shared    buffers     cached
Mem:           141         95         46          0          0         93
-/+ buffers/cache:          2        139
Swap:            3          0          3

2011/9/7 François Schiettecatte 

> My memory of this is a little rusty but isn't mmap also limited by mem +
> swap on the box? What does 'free -g' report?
>
> François
>
> On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:
>
> > Ahoy ahoy!
> >
> > I've run into the dreaded OOM error with MMapDirectory on a 23G cfs
> compound
> > index segment file. The stack trace looks pretty much like every other
> trace
> > I've found when searching for OOM & "map failed"[1]. My configuration
> > follows:
> >
> > Solr 1.4.1/Lucene 2.9.3 (plus
> > SOLR-1969<https://issues.apache.org/jira/browse/SOLR-1969>
> > )
> > CentOS 4.9 (Final)
> > Linux 2.6.9-100.ELsmp x86_64 yada yada yada
> > Java SE (build 1.6.0_21-b06)
> > Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
> > ulimits:
> >core file size (blocks, -c) 0
> >data seg size(kbytes, -d) unlimited
> >file size (blocks, -f) unlimited
> >pending signals(-i) 1024
> >max locked memory (kbytes, -l) 32
> >max memory size (kbytes, -m) unlimited
> >open files(-n) 256000
> >pipe size (512 bytes, -p) 8
> >POSIX message queues (bytes, -q) 819200
> >stack size(kbytes, -s) 10240
> >cpu time(seconds, -t) unlimited
> >max user processes (-u) 1064959
> >virtual memory(kbytes, -v) unlimited
> >file locks(-x) unlimited
> >
> > Any suggestions?
> >
> > Thanks in advance,
> > Rich
> >
> > [1]
> > ...
> > java.io.IOException: Map failed
> > at sun.nio.ch.FileChannelImpl.map(Unknown Source)
> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> > Source)
> > at org.apache.lucene.store.MMapDirectory$MMapIndexInput.(Unknown
> > Source)
> > at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
> > at org.apache.lucene.index.SegmentReader$CoreReaders.(Unknown
> Source)
> >
> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> > at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> > at org.apache.lucene.index.DirectoryReader.(Unknown Source)
> > at org.apache.lucene.index.ReadOnlyDirectoryReader.(Unknown Source)
> > at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
> > at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
> > Source)
> > at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
> > at org.apache.lucene.index.IndexReader.open(Unknown Source)
> > ...
> > Caused by: java.lang.OutOfMemoryError: Map failed
> > at sun.nio.ch.FileChannelImpl.map0(Native Method)
> > ...
>
>


MMapDirectory failed to map a 23G compound index segment

2011-09-07 Thread Rich Cariens
Ahoy ahoy!

I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound
index segment file. The stack trace looks pretty much like every other trace
I've found when searching for OOM & "map failed"[1]. My configuration
follows:

Solr 1.4.1/Lucene 2.9.3 (plus
SOLR-1969
)
CentOS 4.9 (Final)
Linux 2.6.9-100.ELsmp x86_64 yada yada yada
Java SE (build 1.6.0_21-b06)
Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
ulimits:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1064959
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any suggestions?

Thanks in advance,
Rich

[1]
...
java.io.IOException: Map failed
 at sun.nio.ch.FileChannelImpl.map(Unknown Source)
 at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
 at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown Source)
 at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
 at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
 at org.apache.lucene.index.SegmentReader.get(Unknown Source)
 at org.apache.lucene.index.SegmentReader.get(Unknown Source)
 at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
 at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
 at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
 at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
 at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
 at org.apache.lucene.index.IndexReader.open(Unknown Source)
...
Caused by: java.lang.OutOfMemoryError: Map failed
 at sun.nio.ch.FileChannelImpl.map0(Native Method)
...


Re: SSD experience

2011-08-22 Thread Rich Cariens
Thanks folks!

On Mon, Aug 22, 2011 at 11:13 AM, Erick Erickson wrote:

> That link appears to be foo'd, and I can't find the original doc.
>
> But others (mostly on the user's list historically) have seen very
> significant
> performance improvements with SSDs, *IF* the entire index doesn't fit
> in memory.
>
> If your index does fit entirely in memory, there will probably be some
> improvement when fetching stored fields, especially if the stored fields
> are large. But I'm not sure the cost is worth the incremental speed
> in this case. Of course if you can get your IT folks to spring for SSDs,
> go for it :)
>
> Best
> Erick
>
> On Mon, Aug 22, 2011 at 11:02 AM, Daniel Skiles
>  wrote:
> > I haven't tried it with Solr yet, but with straight Lucene about two
> years
> > ago we saw about a 40% boost in performance on our tests with no changes
> > except the disk.
> >
> > On Mon, Aug 22, 2011 at 10:54 AM, Rich Cariens  >wrote:
> >
> >> Ahoy ahoy!
> >>
> >> Does anyone have any experiences or stories they can share with the list
> >> about how SSDs impacted search performance for better or worse?
> >>
> >> I found a Lucene SSD performance benchmark
> >> doc<
> >>
> http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf
> >> >but
> >> the wiki engine is refusing to let me view the attachment (I get "You
> >> are not allowed to do AttachFile on this page.").
> >>
> >> Thanks in advance!
> >>
> >
>


SSD experience

2011-08-22 Thread Rich Cariens
Ahoy ahoy!

Does anyone have any experiences or stories they can share with the list
about how SSDs impacted search performance for better or worse?

I found a Lucene SSD performance benchmark doc
(http://wiki.apache.org/lucene-java/SSD_performance?action=AttachFile&do=view&target=combined-disk-ssd.pdf), but
the wiki engine is refusing to let me view the attachment (I get "You
are not allowed to do AttachFile on this page.").

Thanks in advance!


Re: how to enable MMapDirectory in solr 1.4?

2011-08-08 Thread Rich Cariens
We patched our 1.4.1 build with
SOLR-1969 <https://issues.apache.org/jira/browse/SOLR-1969> (making
MMapDirectory configurable) and realized a 64% search performance
boost on our Linux hosts.
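
The solrconfig.xml entry mentioned in James's reply below did not survive the
list archive; it was presumably the MMapDirectoryFactory registration, roughly
as follows (the element form is an assumption):

  <directoryFactory name="DirectoryFactory" class="org.apache.solr.core.MMapDirectoryFactory"/>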

On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James wrote:

> If you want to try MMapDirectory with Solr 1.4, then copy the class
> org.apache.solr.core.MMapDirectoryFactory from 3.x or Trunk, and either add
> it to the .war file (you can just add it under "src/java" and re-package the
> war), or you can put it in its own .jar file in the "lib" directory under
> "solr_home".  Then, in solrconfig.xml, add this entry under the root
> "config" element:
>
> 
>
> I'm not sure if MMapDirectory will perform better for you with Linux over
> NIOFSDir.  I'm pretty sure in Trunk/4.0 it's the default for Windows and
> maybe Solaris.  In Windows, there is a definite advantage for using
> MMapDirectory on a 64-bit system.
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
>
> -Original Message-
> From: Li Li [mailto:fancye...@gmail.com]
> Sent: Monday, August 08, 2011 4:09 AM
> To: solr-user@lucene.apache.org
> Subject: how to enable MMapDirectory in solr 1.4?
>
> hi all,
>I read Apache Solr 3.1 Released Note today and found that
> MMapDirectory is now the default implementation in 64 bit Systems.
>I am now using solr 1.4 with 64-bit jvm in Linux. how can I use
> MMapDirectory? will it improve performance?
>


Re: document storage

2011-05-13 Thread Rich Cariens
We've decided to store the original document in both Solr and external
repositories. This is to support the following:

   1. highlighting - We need to mark-up the entire document with hit-terms.
   However if this was the only reason to store the text I'd seriously consider
   calling out to the external repository via a custom highlighter.
   2. "hot" documents - We need to index user-generated data like activity
   streams, folksonomy tags, annotations, and comments. When our indexer is
   made aware of those events we decorate the existing SolrDocument with new
   fields and re-index it.
   3. in-place index rebuild - Our search service is still evolving so we
   periodically change our schema and indexing code. We believe it's more
   efficient, not to mention faster, to rebuild the index if we've got all the
   data.

Hope that helps!

On Fri, May 13, 2011 at 3:10 PM, Mike Sokolov  wrote:

> Would anyone care to comment on the merits of storing indexed full-text
> documents in Solr versus storing them externally?
>
> It seems there are three options for us:
>
> 1) store documents both in Solr and externally - this is what we are doing
> now, and gives us all sorts of flexibility, but doesn't seem like the most
> scalable option, at least in terms of storage space and I/O required when
> updating/inserting documents.
>
> 2) store documents externally: For the moment, the only thing that requires
> us to store documents in Solr is the need to highlight them, both in search
> result snippets and in full document views. We are considering hunting for
> or writing a Highlighter extension that could pull in the document text from
> an external source (eg filesystem).
>
> 3) store documents only in Solr.  We'd just retrieve document text as a
> Solr field value rather than reading from the filesystem.  Somehow this
> strikes me as the wrong thing to do, but it could work:  I'm not sure why.
>  A lot of unnecessary merging I/O activity perhaps.  Makes it hard to grep
> the documents or use other filesystem tools, I suppose.
>
> Which one of these sounds best to you?  Under which circumstances? Are
> there other possibilities?
>
> Thanks!
>
> --
>
> Michael Sokolov
> Engineering Director
> www.ifactory.com
>
>


Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
Thanks Jan,

I was referring to a custom UpdateHandler, not a RequestHandler. You know,
the one that the Solr wiki discourages :).
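
The XML in Jan's reply below was likewise lost in archiving; for a Solr
1.4-era config the wiring presumably looked something like the following,
where the handler name and class are assumptions (only the "myPipeline"
chain name appears in the original):

  <requestHandler name="/update/jms" class="solr.XmlUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.processor">myPipeline</str>
    </lst>
  </requestHandler>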

Best,
Rich

On Tue, Feb 15, 2011 at 8:37 AM, Jan Høydahl  wrote:

> Hi,
>
> You would wire your JMSUpdateRequestHandler into solrconfig.xml as normal,
> and if you want to apply an UpdateChain, that would look like this:
>
>  
>
>  myPipeline
>
>  
>
> See http://wiki.apache.org/solr/SolrRequestHandler for details
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 15. feb. 2011, at 14.30, Rich Cariens wrote:
>
> > Thanks Jan.
> >
> > For the JMSUpdateHandler option, how does one plugin a custom
> UpdateHandler?
> > I want to make sure I'm not missing any important pieces of Solr
> processing
> > pipeline.
> >
> > Best,
> > Rich
> >
> > On Tue, Feb 15, 2011 at 4:36 AM, Jan Høydahl 
> wrote:
> >
> >> Solr is multi threaded, so you are free to send as many parallel update
> >> requests needed to utilize your HW. Each request will get its own
> thread.
> >> Simply configure StreamingUpdateSolrServer from your client.
> >>
> >> If there is some lengthy work to be done, it needs to be done in SOME
> >> thread, and I guess you just have to choose where :)
> >>
> >> A JMSUpdateHandler sounds heavy weight, but does not need to be, and
> might
> >> be the logically best place for such a feature imo.
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >>
> >> On 14. feb. 2011, at 17.42, Rich Cariens wrote:
> >>
> >>> Thanks Jan,
> >>>
> >>> I don't think I want to tie up a thread on two boxes waiting for an
> >>> UpdateRequestProcessor to finish. I'd prefer to offload it all to the
> >> target
> >>> shards. And a special JMSUpdateHandler feels like overkill. I *think*
> I'm
> >>> really just looking for a simple API that allows me to add a
> >>> SolrInputDocument to the index in-process.
> >>>
> >>> Perhaps I just need to use the EmbeddedSolrServer in the Solrj
> packages?
> >> I'm
> >>> worried that this will break all the nice stuff one gets with the
> >> standard
> >>> SOLR webapp (stats, admin, etc).
> >>>
> >>> Best,
> >>> Rich
> >>>
> >>>
> >>> On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl 
> >> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> One option would be to keep the JMS listener as today but move the
> >>>> downloading
> >>>> and transforming part to a SolrUpdateRequestProcessor on each shard.
> The
> >>>> benefit
> >>>> is that you ship only a tiny little SolrInputDocument over the wire
> with
> >> a
> >>>> reference to the doc to download, and do the heavy lifting on Solr
> side.
> >>>>
> >>>> If each JMS topic/channel corresponds to a particular shard, you could
> >>>> move the whole thing to Solr. If so, a new JMSUpdateHandler could
> >> perhaps
> >>>> be a way to go?
> >>>>
> >>>> --
> >>>> Jan Høydahl, search solution architect
> >>>> Cominvent AS - www.cominvent.com
> >>>>
> >>>> On 14. feb. 2011, at 16.53, Rich Cariens wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I've built a system that receives JMS events containing links to docs
> >>>> that I
> >>>>> must download and index. Right now the JMS receiving, downloading,
> and
> >>>>> transformation into SolrInputDoc's happens in a separate JVM that
> then
> >>>> uses
> >>>>> Solrj javabin HTTP POSTs to distribute these docs across many index
> >>>> shards.
> >>>>>
> >>>>> For various reasons I won't go into here, I'd like to relocate/deploy
> >>>> most
> >>>>> of my processing (JMS receiving, downloading, and transformation into
> >>>>> SolrInputDoc's) into the SOLR webapps running on each distributed
> shard
> >>>>> host. I might be wrong, but I don't think the request-driven idiom of
> >> the
> >>>>> DataImportHandler is not a good fit for me as I'm not kicking off
> full
> >> or
> >>>>> delta imports. If that's true, what's the correct way to hook my
> >>>> components
> >>>>> into SOLR's update facilities? Should I try to get a reference a
> >>>> configured
> >>>>> DirectUpdateHandler?
> >>>>>
> >>>>> I don't know if this information helps, but I'll put it out there
> >>>> anyways:
> >>>>> I'm using Spring 3 components to receive JMS events, wired up via
> >> webapp
> >>>>> context hooks. My plan would be to add all that to my SOLR shard
> >> webapp.
> >>>>>
> >>>>> Best,
> >>>>> Rich
> >>>>
> >>>>
> >>
> >>
>
>


Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
Thanks Jan.

For the JMSUpdateHandler option, how does one plugin a custom UpdateHandler?
I want to make sure I'm not missing any important pieces of Solr processing
pipeline.

Best,
Rich

On Tue, Feb 15, 2011 at 4:36 AM, Jan Høydahl  wrote:

> Solr is multi threaded, so you are free to send as many parallel update
> requests needed to utilize your HW. Each request will get its own thread.
> Simply configure StreamingUpdateSolrServer from your client.
>
> If there is some lengthy work to be done, it needs to be done in SOME
> thread, and I guess you just have to choose where :)
>
> A JMSUpdateHandler sounds heavy weight, but does not need to be, and might
> be the logically best place for such a feature imo.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14. feb. 2011, at 17.42, Rich Cariens wrote:
>
> > Thanks Jan,
> >
> > I don't think I want to tie up a thread on two boxes waiting for an
> > UpdateRequestProcessor to finish. I'd prefer to offload it all to the
> target
> > shards. And a special JMSUpdateHandler feels like overkill. I *think* I'm
> > really just looking for a simple API that allows me to add a
> > SolrInputDocument to the index in-process.
> >
> > Perhaps I just need to use the EmbeddedSolrServer in the Solrj packages?
> I'm
> > worried that this will break all the nice stuff one gets with the
> standard
> > SOLR webapp (stats, admin, etc).
> >
> > Best,
> > Rich
> >
> >
> > On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl 
> wrote:
> >
> >> Hi,
> >>
> >> One option would be to keep the JMS listener as today but move the
> >> downloading
> >> and transforming part to a SolrUpdateRequestProcessor on each shard. The
> >> benefit
> >> is that you ship only a tiny little SolrInputDocument over the wire with
> a
> >> reference to the doc to download, and do the heavy lifting on Solr side.
> >>
> >> If each JMS topic/channel corresponds to a particular shard, you could
> >> move the whole thing to Solr. If so, a new JMSUpdateHandler could
> perhaps
> >> be a way to go?
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >>
> >> On 14. feb. 2011, at 16.53, Rich Cariens wrote:
> >>
> >>> Hello,
> >>>
> >>> I've built a system that receives JMS events containing links to docs
> >> that I
> >>> must download and index. Right now the JMS receiving, downloading, and
> >>> transformation into SolrInputDoc's happens in a separate JVM that then
> >> uses
> >>> Solrj javabin HTTP POSTs to distribute these docs across many index
> >> shards.
> >>>
> >>> For various reasons I won't go into here, I'd like to relocate/deploy
> >> most
> >>> of my processing (JMS receiving, downloading, and transformation into
> >>> SolrInputDoc's) into the SOLR webapps running on each distributed shard
> >>> host. I might be wrong, but I don't think the request-driven idiom of
> the
> >>> DataImportHandler is not a good fit for me as I'm not kicking off full
> or
> >>> delta imports. If that's true, what's the correct way to hook my
> >> components
> >>> into SOLR's update facilities? Should I try to get a reference a
> >> configured
> >>> DirectUpdateHandler?
> >>>
> >>> I don't know if this information helps, but I'll put it out there
> >> anyways:
> >>> I'm using Spring 3 components to receive JMS events, wired up via
> webapp
> >>> context hooks. My plan would be to add all that to my SOLR shard
> webapp.
> >>>
> >>> Best,
> >>> Rich
> >>
> >>
>
>


Re: Guidance for event-driven indexing

2011-02-14 Thread Rich Cariens
Thanks Jan,

I don't think I want to tie up a thread on two boxes waiting for an
UpdateRequestProcessor to finish. I'd prefer to offload it all to the target
shards. And a special JMSUpdateHandler feels like overkill. I *think* I'm
really just looking for a simple API that allows me to add a
SolrInputDocument to the index in-process.

Perhaps I just need to use the EmbeddedSolrServer in the Solrj packages? I'm
worried that this will break all the nice stuff one gets with the standard
SOLR webapp (stats, admin, etc).
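
For reference, a minimal sketch of in-process indexing with EmbeddedSolrServer
(Solr 1.4/3.x-era SolrJ API; the Solr home path, core name, and field names
are placeholders):

  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.core.CoreContainer;

  public class InProcessIndexer {
    public static void main(String[] args) throws Exception {
      System.setProperty("solr.solr.home", "/path/to/solr/home");
      CoreContainer cores = new CoreContainer.Initializer().initialize();
      EmbeddedSolrServer server = new EmbeddedSolrServer(cores, ""); // "" = default core

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-1");
      doc.addField("text", "hello embedded world");
      server.add(doc);
      server.commit();

      cores.shutdown();
    }
  }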

Best,
Rich


On Mon, Feb 14, 2011 at 11:18 AM, Jan Høydahl  wrote:

> Hi,
>
> One option would be to keep the JMS listener as today but move the
> downloading
> and transforming part to a SolrUpdateRequestProcessor on each shard. The
> benefit
> is that you ship only a tiny little SolrInputDocument over the wire with a
> reference to the doc to download, and do the heavy lifting on Solr side.
>
> If each JMS topic/channel corresponds to a particular shard, you could
> move the whole thing to Solr. If so, a new JMSUpdateHandler could perhaps
> be a way to go?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 14. feb. 2011, at 16.53, Rich Cariens wrote:
>
> > Hello,
> >
> > I've built a system that receives JMS events containing links to docs
> that I
> > must download and index. Right now the JMS receiving, downloading, and
> > transformation into SolrInputDoc's happens in a separate JVM that then
> uses
> > Solrj javabin HTTP POSTs to distribute these docs across many index
> shards.
> >
> > For various reasons I won't go into here, I'd like to relocate/deploy
> most
> > of my processing (JMS receiving, downloading, and transformation into
> > SolrInputDoc's) into the SOLR webapps running on each distributed shard
> > host. I might be wrong, but I don't think the request-driven idiom of the
> > DataImportHandler is not a good fit for me as I'm not kicking off full or
> > delta imports. If that's true, what's the correct way to hook my
> components
> > into SOLR's update facilities? Should I try to get a reference a
> configured
> > DirectUpdateHandler?
> >
> > I don't know if this information helps, but I'll put it out there
> anyways:
> > I'm using Spring 3 components to receive JMS events, wired up via webapp
> > context hooks. My plan would be to add all that to my SOLR shard webapp.
> >
> > Best,
> > Rich
>
>


Guidance for event-driven indexing

2011-02-14 Thread Rich Cariens
Hello,

I've built a system that receives JMS events containing links to docs that I
must download and index. Right now the JMS receiving, downloading, and
transformation into SolrInputDoc's happens in a separate JVM that then uses
Solrj javabin HTTP POSTs to distribute these docs across many index shards.

For various reasons I won't go into here, I'd like to relocate/deploy most
of my processing (JMS receiving, downloading, and transformation into
SolrInputDoc's) into the SOLR webapps running on each distributed shard
host. I might be wrong, but I don't think the request-driven idiom of the
DataImportHandler is not a good fit for me as I'm not kicking off full or
delta imports. If that's true, what's the correct way to hook my components
into SOLR's update facilities? Should I try to get a reference a configured
DirectUpdateHandler?

I don't know if this information helps, but I'll put it out there anyways:
I'm using Spring 3 components to receive JMS events, wired up via webapp
context hooks. My plan would be to add all that to my SOLR shard webapp.

Best,
Rich


Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
Uwe goes on to say:


> This works, as long as you don't need query highlighting, because the offsets 
> from the first field addition cannot be used for highlighting inside the text 
> with markup. *In this case, you have to write your own analyzer that removes 
> the markup in the tokenizer, but preserves the original offsets. *Examples of 
> this are e.g. The Wikipedia contrib in Lucene, which has an hand-crafted 
> analyzer that can handle Mediawiki Markup syntax.
>
>
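
One concrete (hedged) way to apply Uwe's advice in Solr without a custom
analyzer, assuming the stock HTMLStripCharFilterFactory handles the markup in
question: it strips tags while correcting offsets, so highlight offsets line
up with the stored original markup.

  <fieldType name="text_markup" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>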

On Sun, Dec 5, 2010 at 3:35 PM, Jonathan Rochkind  wrote:

> That suggestion says "This works, as long as you don't need query
> highlighting."  Have you found a way around that, or have you decided not to
> use highlighting after all?  Or am I missing something?
> ____
> From: Rich Cariens [richcari...@gmail.com]
> Sent: Sunday, December 05, 2010 10:58 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Full text hit term highlighting
>
> Thanks Lance.  I'm storing the original document and indexing all its
> extracted content, but I need to be able to highlight the text within its
> original markup.  I'm going to give Uwe's suggestion <http://bit.ly/hCSdYZ> a
> go.
>
> On Sat, Dec 4, 2010 at 7:18 PM, Lance Norskog  wrote:
>
> > Set the fragment length to 0. This means highlight the entire text
> > body. If, you have stored the text body.
> >
> > Otherwise, you have to get the term vectors somehow and highlight the
> > text yourself.
> >
> > I investigated this problem awhile back for PDFs. You can add a
> > starting page and an OR list of search terms to the URL that loads a
> > PDF into the in-browser version of the Adobe PDF reader. This allows
> > you to load the PDF at the first occurence of any of the search terms,
> > with the terms highlighted. The search button takes you to the next of
> > any of the terms.
> >
> > On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens 
> > wrote:
> > > Anyone ever use Solr to present a view of a document with hit-terms
> > > highlighted within?  Kind of like Google's cached
> > > <http://bit.ly/hgudWq> copies?
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>


Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
Thanks Lance.  I'm storing the original document and indexing all its
extracted content, but I need to be able to highlight the text within its
original markup.  I'm going to give Uwe's suggestion <http://bit.ly/hCSdYZ> a go.

On Sat, Dec 4, 2010 at 7:18 PM, Lance Norskog  wrote:

> Set the fragment length to 0. This means highlight the entire text
> body. If, you have stored the text body.
>
> Otherwise, you have to get the term vectors somehow and highlight the
> text yourself.
>
> I investigated this problem awhile back for PDFs. You can add a
> starting page and an OR list of search terms to the URL that loads a
> PDF into the in-browser version of the Adobe PDF reader. This allows
> you to load the PDF at the first occurence of any of the search terms,
> with the terms highlighted. The search button takes you to the next of
> any of the terms.
>
> On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens 
> wrote:
> > Anyone ever use Solr to present a view of a document with hit-terms
> > highlighted within?  Kind of like Google's cached
> > <http://bit.ly/hgudWq> copies?
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Full text hit term highlighting

2010-12-04 Thread Rich Cariens
Anyone ever use Solr to present a view of a document with hit-terms
highlighted within?  Kind of like Google's cached copies?


Re: Optimize Index

2010-11-04 Thread Rich Cariens
For what it's worth, the Solr class instructor at the Lucene Revolution
conference recommended *against* optimizing, and instead suggested just
letting the merge factor do its job.

On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey  wrote:

> On 11/4/2010 7:22 AM, stockiii wrote:
>
>> how can i start an optimize by using DIH, but NOT after an delta- or
>> full-import ?
>>
>
> I'm not aware of a way to do this with DIH, though there might be something
> I'm not aware of.  You can do it with an HTTP POST.  Here's how to do it
> with curl:
>
> /usr/bin/curl "http://HOST:PORT/solr/CORE/update" \
> -H "Content-Type: text/xml" \
> --data-binary '<optimize/>'
>
> Shawn
>
>


Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Rich Cariens
I experienced the hang described with the Solr 1.4.0 build.

Yonik - I also thought the streaming updater was blocking on commits but
updates never resumed.

To be honest I was in a bit of a rush to meet a deadline so after spending a
day or so tinkering I bailed out and just wrote a component by hand.  I have
not tried to reproduce this using the current trunk.  I was using the 32-bit
Sun JRE on a Red Hat EL 5 HP server.

I'm not sure if the following enriches this thread, but I'll include it
anyways: write a document generator and start adding a ton of 'em to a Solr
server instance using the streaming updater.  You *should* experience the
hang.
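
A hedged sketch of that reproduction (the URL is a placeholder; the queue size
of 20 and 3 threads mirror the report further down; field names are
illustrative):

  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class HangReproducer {
    public static void main(String[] args) throws Exception {
      StreamingUpdateSolrServer server =
          new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 3);
      for (int i = 0; i < 10000000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("text", "some generated body text " + i);
        server.add(doc);  // with enough docs, adds eventually block (the reported hang)
      }
      server.commit();
    }
  }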

HTH,
Rich

On Fri, Apr 16, 2010 at 1:34 PM, Sascha Szott  wrote:

> Hi Yonik,
>
> thanks for your fast reply.
>
>
> Yonik Seeley wrote:
>
>> Thanks for the report Sascha.
>> So after the hang, it never recovers?  Some amount of hanging could be
>> visible if there was a commit on the Solr server or something else to
>> cause the solr requests to block for a while... but it should return
>> to normal on it's own...
>>
> In my case the whole application hangs and never recovers (CPU utilization
> goes down to near 0%). Interestingly, the problem reproducibly occurs only
> if SUSS is created with *more than 2* threads.
>
>
>  Looking at the stack trace, it looks like threads are blocked waiting
>> to get an http connection.
>>
> I forgot to mention that my index app has exclusive access to the Solr
> instance. Therefore, concurrent searches against the same Solr instance
> while indexing are excluded.
>
>
>  I'm traveling all next week, but I'll open a JIRA issue for this now.
>>
> Thank you!
>
>
>  Anything that would help us reproduce this is much appreciated.
>>
> Are there any other who have experienced the same problem?
>
> -Sascha
>
>
>
>> On Fri, Apr 16, 2010 at 8:57 AM, Sascha Szott  wrote:
>>
>>> Hi Yonik,
>>>
>>> Yonik Seeley wrote:
>>>

 Stephen, were you running stock Solr 1.4, or did you apply any of the
 SolrJ patches?
 I'm trying to figure out if anyone still has any problems, or if this
 was fixed with SOLR-1711:

>>>
>>> I'm using the latest trunk version (rev. 934846) and constantly running
>>> into
>>> the same problem. I'm using StreamingUpdateSolrServer with 3 treads and a
>>> queue size of 20 (not really knowing if this configuration is optimal).
>>> My
>>> multi-threaded application indexes 200k data items (bibliographic
>>> metadata
>>> in Dublin Core format) and constantly hangs after running for some time.
>>>
>>> Below you can find the thread dump of one of my index threads (after the
>>> app
>>> hangs all dumps are the same)
>>>
>>> "thread 19" prio=10 tid=0x7fe8c0415800 nid=0x277d waiting on
>>> condition
>>> [0x42d05000]
>>>   java.lang.Thread.State: WAITING (parking)
>>>at sun.misc.Unsafe.park(Native Method)
>>>- parking to wait for<0x7fe8cdcb7598>  (a
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>>>at
>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>at
>>>
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>>>at
>>>
>>> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
>>>at
>>>
>>> org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer.request(StreamingUpdateSolrServer.java:216)
>>>at
>>>
>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:64)
>>>at
>>>
>>> de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:29)
>>>at
>>>
>>> de.kobv.ked.index.SolrIndexWriter.addIndexDocument(SolrIndexWriter.java:10)
>>>at
>>>
>>> de.kobv.ked.index.AbstractIndexThread.addIndexDocument(AbstractIndexThread.java:59)
>>>at de.kobv.ked.rss.RssThread.indiziere(RssThread.java:30)
>>>at de.kobv.ked.rss.RssThread.run(RssThread.java:58)
>>>
>>>
>>>
>>> and of the three SUSS threads:
>>>
>>> "pool-1-thread-3" prio=10 tid=0x7fe8c7b7f000 nid=0x2780 in
>>> Object.wait()
>>> [0x409ac000]
>>>   java.lang.Thread.State: WAITING (on object monitor)
>>>at java.lang.Object.wait(Native Method)
>>>- waiting on<0x7fe8cdcb6f10>  (a
>>>
>>> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
>>>at
>>>
>>> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
>>>- locked<0x7fe8cdcb6f10>  (a
>>>
>>> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
>>>at
>>>
>>> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>>>at
>>>
>>> org.apache.commons.httpclient.HttpMethodDirector.e

Re: Index "transaction log" or equivalent?

2010-04-08 Thread Rich Cariens
Thanks Mark.  That's sort of what I was thinking of doing.
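
A hedged sketch of what Mark suggests below (in practice the processor would
be created by an UpdateRequestProcessorFactory registered in an update chain;
the log path is a placeholder):

  import java.io.FileWriter;
  import java.io.IOException;
  import org.apache.solr.client.solrj.util.ClientUtils;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class SolrXmlLoggingProcessor extends UpdateRequestProcessor {
    private final FileWriter out;

    public SolrXmlLoggingProcessor(UpdateRequestProcessor next) throws IOException {
      super(next);
      out = new FileWriter("/var/log/solr/indexed-docs.xml", true); // append-only "transaction log"
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      out.write(ClientUtils.toXML(cmd.getSolrInputDocument())); // write the doc as SolrXML
      out.write("\n");
      out.flush();
      super.processAdd(cmd); // continue the normal update chain
    }
  }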

On Thu, Apr 8, 2010 at 10:33 AM, Mark Miller  wrote:

> On 04/08/2010 09:23 AM, Rich Cariens wrote:
>
>> Are there any best practices or built-in support for keeping track of
>> what's
>> been indexed in a Solr application so as to support a full rebuild?  I'm
>> not
>> indexing from a single source, but from many, sometimes arbitrary, sources
>> including:
>>
>>1. A document repository that fires events (containing a URL) when new
>>
>>documents are added to the repo;
>>2. A book-marking service that fires events containing URLs when users
>> of
>>
>>that service bookmark a URL;
>>3. More services that raise events that make Solr update docs indexed
>> via
>>
>>(1) or (2) with additional metadata (think user comments, tagging,
>> etc).
>>
>> I'm looking at ~200M documents for the initial launch, with around 30K new
>> docs every day, and many thousands of metadata events every day.
>>
>> Do any of you Solr gurus have any suggestions or guidance you can share
>> with
>> me?
>>
>> Thanks in advance,
>> Rich
>>
>>
>>
>
> Pump everything through an UpdateProcessor that writes out SolrXML as docs
> go by?
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


Index "transaction log" or equivalent?

2010-04-08 Thread Rich Cariens
Are there any best practices or built-in support for keeping track of what's
been indexed in a Solr application so as to support a full rebuild?  I'm not
indexing from a single source, but from many, sometimes arbitrary, sources
including:

   1. A document repository that fires events (containing a URL) when new
   documents are added to the repo;
   2. A book-marking service that fires events containing URLs when users of
   that service bookmark a URL;
   3. More services that raise events that make Solr update docs indexed via
   (1) or (2) with additional metadata (think user comments, tagging, etc).

I'm looking at ~200M documents for the initial launch, with around 30K new
docs every day, and many thousands of metadata events every day.

Do any of you Solr gurus have any suggestions or guidance you can share with
me?

Thanks in advance,
Rich


Re: an OR filter query

2010-04-04 Thread Rich Cariens
Why not just make your "mature:false" filter query a default value
instead of an always-appended one?  I.e.:

-snip-

mature:false

-snip-

That way if someone wants mature items in their results the search client
explicitly sets "fq=mature:*" or whatever.

Would that work?
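
The XML in the snippets above and below was stripped by the list archive; the
"defaults" variant presumably looked roughly like this (standard
solrconfig.xml request-handler syntax, reconstructed as an assumption):

  <lst name="defaults">
    <str name="fq">mature:false</str>
  </lst>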

On Sun, Apr 4, 2010 at 3:27 PM, Blargy  wrote:

>
> Is there anyway to use a filter query as an OR clause?
>
> For example I have product listings and I want to be able to filter out
> mature items by default. To do this I added:
>
> 
>  mature:false
> 
>
> But then I can never return any mature items because appending
> fq=mature:true will obviously return 0 results because no item can both be
> mature and non-mature.
>
> I can get around this using defaults:
>
> 
>  mature:false
> 
>
> But this is a little hacky because anytime I want to include mature items
> with non-mature items I need to explicitly set fq as a blank string.
>
> So is there any better way to do this? Thanks
> --
> View this message in context:
> http://n3.nabble.com/an-OR-filter-query-tp696579p696579.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown
patches/optimizations to index over 13B small documents in a 32-shard
environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it
himself.

On Fri, Apr 2, 2010 at 1:17 PM,  wrote:

> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypthetically, in that case, you won't be limited by single
> index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the
> > next few years to 20 million.  However we are using the entire book as a
> > Solr document.  We are evaluating the possibility of indexing individual
> > pages as there are some use cases where users want the most relevant
> pages
> > regardless of what book they occur in.  However, we estimate that we are
> > talking about somewhere between 1 and 6 billion pages and have concerns
> > over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents?
> >
> > The lucene file format document
> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > mentions a limit of about 2 billion document ids.   I assume this is the
> > lucene internal document id and would therefore be a per index/per shard
> > limit.  Is this correct?
> >
> >
> > Tom Burton-West.
> >
> >
> >
> >
>
>