Re: Umlauts as Char

2011-02-07 Thread Stefan Bodewig
On 2011-02-08, Prescott Nasser wrote:

 in the void substitute function you'll see them:

 else if ( buffer.charAt( c ) == 'ü' ) {
   buffer.setCharAt( c, 'u' );
 }

 This does not constitute a character in .NET (as far as I can figure out),
 and thus it doesn't compile. The .java file says it is encoded in UTF-8. I
 was thinking maybe I could do the same thing in VS2010, but I'm not
 finding a way, and searching on this has been difficult.

IIRC VS will recognize UTF-8 encoded files if they start with a byte
order mark (BOM) but Java usually doesn't write one.  I think I once
found the setting for reading/writing UTF-8 in VS, will need to search
for it when at work.

If you have a JDK installed you can use its native2ascii tool to replace
non-ASCII characters with Unicode escape sequences that you can then use in
C# as well (see Nicholas' post).
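
For illustration, assuming the character above really is U+00FC (ü),
native2ascii would turn the comparison into something like this (the
'\u00FC' escape form is valid in both Java and C# character literals):

  else if ( buffer.charAt( c ) == '\u00FC' ) {
    buffer.setCharAt( c, 'u' );
  }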

If you have Ant installed (sorry, can't resist ;-) you can convert the
whole tree in one (untested) go with something like

<copy todir="will-hold-translated-files"
      encoding="utf8">
  <fileset dir="holds-original-files"/>
  <filterchain>
    <escapeunicode/>
  </filterchain>
</copy>

Stefan


RE: Umlauts as Char

2011-02-07 Thread Prescott Nasser

Stefan somewhat nailed it on the head. My concern was the Java characters - 
I can't even search Google or Bing for them. So I can take the source code's 
word that 'ü' is the u with dots over it (because it says replace umlauts in 
the source notes). But, I guess, is that really true? Or is it perhaps u with a 
caret over it instead?
 
I'm tempted to take the source at its word and just replace them with the 
umlaut versions (via Character Map - thanks Aaron), and then add a comment 
noting what it originally was in the Java source.
 
What are you guys' thoughts?
 
~P
 







 From: bode...@apache.org
 To: lucene-net-dev@lucene.apache.org
 Subject: Re: Umlauts as Char
 Date: Tue, 8 Feb 2011 06:01:27 +0100

 On 2011-02-08, Nicholas Paldino [.NET/C# MVP] wrote:

  You can simply use the Unicode escape sequence in code and in
  string/character literals, as specified by section 2.4.2 of the C# spec
  (http://msdn.microsoft.com/en-us/library/aa664670(v=vs.71).aspx):

 I think in Prescott's case part of the problem is that he doesn't know
 which character the sequence is supposed to be. In this case it is likely an
 ü.

  else if ( buffer.charAt( c ) == 'ü' ) {
  buffer.setCharAt( c, 'u' );
  }

  Would become:

  else if ( buffer.charAt( c ) == '\u00C3¼' ) {
  buffer.setCharAt( c, 'u' );
  }

 No. The two bytes are part of a two-byte UTF-8 sequence making up a
 single character.
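
 To see it concretely: U+00FC encodes in UTF-8 as the bytes 0xC3 0xBC, and
 reading those bytes back as Latin-1 yields two separate characters. A tiny
 sketch (plain Java, just to illustrate the round trip):

   Charset utf8 = Charset.forName("UTF-8");          // java.nio.charset.Charset
   Charset latin1 = Charset.forName("ISO-8859-1");
   byte[] bytes = "\u00FC".getBytes(utf8);           // { (byte) 0xC3, (byte) 0xBC }
   System.out.println(new String(bytes, latin1));    // prints two characters, not one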

 Stefan  

[jira] Updated: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term

2011-02-07 Thread Shinya Kasatani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shinya Kasatani updated LUCENE-2910:


Attachment: HighlighterFix.patch

A test case that describes the problem, along with a fix.


 Highlighter does not correctly highlight the phrase around 50th term
 

 Key: LUCENE-2910
 URL: https://issues.apache.org/jira/browse/LUCENE-2910
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Trivial
 Attachments: HighlighterFix.patch


 When you use the Highlighter combined with N-Gram tokenizers such as 
 CJKTokenizer and try to highlight the phrase that appears around 50th term in 
 the field, the highlighted phrase is shorter than expected.
 e.g. Highlighting "fooo" in the following text with bigram tokenizer:
 0-1-2-3-4-fooo---
 Expected: 0-1-2-3-4-<B>fooo</B>---
 Actual: 0-1-2-3-4-f<B>ooo</B>---

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term

2011-02-07 Thread Shinya Kasatani (JIRA)
Highlighter does not correctly highlight the phrase around 50th term


 Key: LUCENE-2910
 URL: https://issues.apache.org/jira/browse/LUCENE-2910
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Trivial
 Attachments: HighlighterFix.patch

When you use the Highlighter combined with N-Gram tokenizers such as 
CJKTokenizer and try to highlight the phrase that appears around 50th term in 
the field, the highlighted phrase is shorter than expected.

e.g. Highlighting "fooo" in the following text with bigram tokenizer:
0-1-2-3-4-fooo---

Expected: 0-1-2-3-4-<B>fooo</B>---
Actual: 0-1-2-3-4-f<B>ooo</B>---


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2910) Highlighter does not correctly highlight the phrase around 50th term

2011-02-07 Thread Shinya Kasatani (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shinya Kasatani updated LUCENE-2910:


Description: 
When you use the Highlighter combined with N-Gram tokenizers such as 
CJKTokenizer and try to highlight the phrase that appears around 50th term in 
the field, the highlighted phrase is shorter than expected.

{noformat}
e.g. Highlighting "fooo" in the following text with bigram tokenizer:
0-1-2-3-4-fooo---

Expected: 0-1-2-3-4-<B>fooo</B>---
Actual: 0-1-2-3-4-f<B>ooo</B>---
{noformat}

  was:
When you use the Highlighter combined with N-Gram tokenizers such as 
CJKTokenizer and try to highlight the phrase that appears around 50th term in 
the field, the highlighted phrase is shorter than expected.

e.g. Highlighting "fooo" in the following text with bigram tokenizer:
0-1-2-3-4-fooo---

Expected: 0-1-2-3-4-<B>fooo</B>---
Actual: 0-1-2-3-4-f<B>ooo</B>---



 Highlighter does not correctly highlight the phrase around 50th term
 

 Key: LUCENE-2910
 URL: https://issues.apache.org/jira/browse/LUCENE-2910
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Priority: Trivial
 Attachments: HighlighterFix.patch


 When you use the Highlighter combined with N-Gram tokenizers such as 
 CJKTokenizer and try to highlight the phrase that appears around 50th term in 
 the field, the highlighted phrase is shorter than expected.
 {noformat}
 e.g. Highlighting "fooo" in the following text with bigram tokenizer:
 0-1-2-3-4-fooo---
 Expected: 0-1-2-3-4-<B>fooo</B>---
 Actual: 0-1-2-3-4-f<B>ooo</B>---
 {noformat}

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs

2011-02-07 Thread Nick Pellow (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991315#comment-12991315
 ] 

Nick Pellow commented on LUCENE-2666:
-

Hi Michael, 

This issue was entirely a problem with our code, and I doubt Lucene could have 
done a better job.

The problem was that on upgrade of the index (done when fields have changed 
etc), we recreate the index in the same location using 
{{IndexWriter.create(directory, analyzer, true, MAX_FIELD_LENGTH)}}.

Some code was added just before this, however, that deleted every single file in 
the directory. This meant that some other thread performing a search could have 
seen a corrupt index, thus causing the AIOOBE. The developer was paranoid that 
IndexWriter.create was leaving old files lying around.

I'm glad we got to the bottom of this, and very much so that it was not a bug 
in Lucene!

Thanks again for helping us track this down.

Best Regards,
Nick Pellow


 ArrayIndexOutOfBoundsException when iterating over TermDocs
 ---

 Key: LUCENE-2666
 URL: https://issues.apache.org/jira/browse/LUCENE-2666
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.2
Reporter: Shay Banon
 Attachments: checkindex-out.txt


 A user got this very strange exception, and I managed to get the index that 
 it happens on. Basically, iterating over the TermDocs causes an AAOIB 
 exception. I easily reproduced it using the FieldCache which does exactly 
 that (the field in question is indexed as numeric). Here is the exception:
 Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
   at 
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
   at 
 org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
   at TestMe.main(TestMe.java:56)
 It happens on the following segment: _26t docCount: 914 delCount: 1 
 delFileName: _26t_1.del
 And as you can see, it smells like a corner case (it fails for document 
 number 912, the AIOOB happens from the deleted docs). The code to recreate it 
 is simple:
 FSDirectory dir = FSDirectory.open(new File("index"));
 IndexReader reader = IndexReader.open(dir, true);
 IndexReader[] subReaders = reader.getSequentialSubReaders();
 for (IndexReader subReader : subReaders) {
 Field field = 
 subReader.getClass().getSuperclass().getDeclaredField("si");
 field.setAccessible(true);
 SegmentInfo si = (SegmentInfo) field.get(subReader);
 System.out.println("--> " + si);
 if (si.getDocStoreSegment().contains("_26t")) {
 // this is the problematic one...
 System.out.println("problematic one...");
 FieldCache.DEFAULT.getLongs(subReader, "__documentdate", 
 FieldCache.NUMERIC_UTILS_LONG_PARSER);
 }
 }
 Here is the result of a check index on that segment:
   8 of 10: name=_26t docCount=914
 compound=true
 hasProx=true
 numFiles=2
 size (MB)=1.641
 diagnostics = {optimize=false, mergeFactor=10, 
 os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, 
 lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, 
 os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
 has deletions [delFileName=_26t_1.del]
 test: open reader.OK [1 deleted docs]
 test: fields..OK [32 fields]
 test: field norms.OK [32 fields]
 test: terms, freq, prox...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
   at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
   at TestMe.main(TestMe.java:47)
 test: stored fields...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
   at 
 org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
   at 

[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991316#comment-12991316
 ] 

Robert Muir commented on LUCENE-2909:
-

Is the bug really in NGramTokenFilter? 

This seems to be a larger problem that would affect all tokenfilters that break 
larger tokens
into smaller ones and recalculate offsets, right?

For example: EdgeNGramTokenFilter, ThaiWordFilter, SmartChineseAnalyzer's 
WordTokenFilter, etc?

I think WordDelimiterFilter has special code that might avoid the problem (line 
352), so it might
be ok.

Is there any better way we could solve this: for example maybe instead of the 
tokenizer calling
correctOffset() it gets called somewhere else? This seems to be what is causing 
the problem.


 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991319#comment-12991319
 ] 

Uwe Schindler commented on LUCENE-2909:
---

The problem has nothing to do with CharFilters. This problem always occurs if 
endOffset - startOffset != termAtt.length().

If you e.g. put a stemmer before ngramming that creates longer tokens (like 
Portuguese -ã -> -ão or German ß -> ss), you have the same problem. A solution 
might be to use some factor to correct these offsets: (endOffset - 
startOffset) / termAtt.length()
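
A rough sketch of what I mean (variable names invented, not actual
NGramTokenFilter code):

  // scale gram-relative positions back onto the original text span
  int origLen = endOffset - startOffset;                  // span in the original text
  double factor = (double) origLen / termAtt.length();    // 1.0 unless the term length changed
  int gramStart = startOffset + (int) (gramBegin * factor);
  int gramEnd = startOffset + (int) (gramFinish * factor);
  offsetAtt.setOffset(gramStart, gramEnd);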

 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991320#comment-12991320
 ] 

Robert Muir commented on LUCENE-2909:
-

You are right, some stemmers increase the size, so this assumption that end - 
start == termAtt.length() is a problem.

So, between this and LUCENE-2208, I think we need to add some more 
checks/asserts to BaseTokenStreamTestCase (at least to validate offset <= end, 
but maybe some other ideas?)
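
For example, something along these lines in the per-token loop (sketch
only, exact wording and placement open):

  int start = offsetAtt.startOffset();
  int end = offsetAtt.endOffset();
  assertTrue("startOffset must not be negative", start >= 0);
  assertTrue("endOffset must be >= startOffset", end >= start);
  if (finalOffset != null) {
    assertTrue("endOffset must not exceed the input length", end <= finalOffset.intValue());
  }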

If the highlighter hits this condition, it (rightfully) complains and throws an 
exception, among other problems. So I think we need to improve this situation 
everywhere.

 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text

2011-02-07 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2909:


Attachment: LUCENE-2909_assert.patch

Here's a check we can add to BaseTokenStreamTestCase for this condition.


 NGramTokenFilter may generate offsets that exceed the length of original text
 -

 Key: LUCENE-2909
 URL: https://issues.apache.org/jira/browse/LUCENE-2909
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4
Reporter: Shinya Kasatani
Assignee: Koji Sekiguchi
Priority: Minor
 Attachments: LUCENE-2909_assert.patch, TokenFilterOffset.patch


 When using NGramTokenFilter combined with CharFilters that lengthen the 
 original text (such as ß -> ss), the generated offsets exceed the length 
 of the original text.
 This causes InvalidTokenOffsetsException when you try to highlight the text 
 in Solr.
 While it is not possible to know the accurate offset of each character once 
 you tokenize the whole text with tokenizers like KeywordTokenizer, 
 NGramTokenFilter should at least avoid generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Threading of JIRA e-mails in gmail?

2011-02-07 Thread Dawid Weiss
Just a follow-up to this one: no reply from infra yet, but I simply
tried my config. on people.apache.org and it works like a charm, so
for Apache committers and gmail users this is probably a life-saver.
My config is described in a comment here:

https://issues.apache.org/jira/browse/INFRA-3403

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Threading of JIRA e-mails in gmail?

2011-02-07 Thread Doron Cohen
Thanks Dawid
It is not working for me yet, looking for the reason for that...
Doron

On Mon, Feb 7, 2011 at 12:48 PM, Dawid Weiss
dawid.we...@cs.put.poznan.plwrote:

 Just a follow-up to this one: no reply from infra yet, but I simply
 tried my config. on people.apache.org and it works like a charm, so
 for Apache committers and gmail users this is probably a life-saver.
 My config is described in a comment here:

 https://issues.apache.org/jira/browse/INFRA-3403

 Dawid

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Threading of JIRA e-mails in gmail?

2011-02-07 Thread Dawid Weiss
Looks like my action prompted a response from infra and it's not
encouraging -- they're supposedly switching off procmail support on
that server soon.

Track INFRA-3403 to see what will come out of this, I don't want to
spam this list. Eh.

Dawid

On Mon, Feb 7, 2011 at 1:14 PM, Doron Cohen cdor...@gmail.com wrote:
 Thanks Dawid
 It is not working for me yet, looking for the reason for that...
 Doron

 On Mon, Feb 7, 2011 at 12:48 PM, Dawid Weiss dawid.we...@cs.put.poznan.pl
 wrote:

 Just a follow-up to this one: no reply from infra yet, but I simply
 tried my config. on people.apache.org and it works like a charm, so
 for Apache committers and gmail users this is probably a life-saver.
 My config is described in a comment here:

 https://issues.apache.org/jira/browse/INFRA-3403

 Dawid

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Distributed Indexing

2011-02-07 Thread Upayavira
I'm saying that deterministic policies are a requirement that
*some* people will want. Others might want a random spread. Thus,
I'd have deterministic based on ID and random as the two initial
implementations.
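
For the deterministic case I mean something as simple as hashing the
document id (sketch only, names invented):

  // pick a shard purely from the unique id, so the same doc always lands
  // on the same shard
  int shard = (uniqueId.hashCode() & Integer.MAX_VALUE) % numShards;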

Upayavira
NB. In case folks haven't worked it out already, I have been
tasked to mentor this group of students in this work, and had the
fortune to be able to point them to a task I've already thought a
lot about myself, but had no time to do :-)

On Sun, 06 Feb 2011 21:57 +, William Mayor
m...@williammayor.co.uk wrote:

  Hi

Good call about the policies being deterministic, should've
thought of that earlier.

We've changed the patch to include this and I've removed the
random assignment one (for obvious reasons).

Take a look and let me know what's to do.
([1]https://issues.apache.org/jira/browse/SOLR-2341)

Cheers

William
On Thu, Feb 3, 2011 at 5:00 PM, Upayavira [2]u...@odoko.co.uk
wrote:


On Thu, 03 Feb 2011 15:12 +, Alex Cowell
[3]alxc...@gmail.com wrote:

  Hi all,
  Just a couple of questions that have arisen.
  1. For handling non-distributed update requests (shards param
  is not present or is invalid), our code currently
  * assumes the user would like the data indexed, so gets the
request handler assigned to /update
  * executes the request using core.execute() for the SolrCore
associated with the original request

  Is this what we want it to do and is using core.execute() from
  within a request handler a valid method of passing on the
  update request?


Take a look at how it is done in
handler.component.SearchHandler.handleRequestBody(). I'd say try
to follow as similar an approach as possible. E.g. it is the
SearchHandler that does much of the work, branching depending on
whether it found a shards parameter.
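
Roughly this shape, in other words (pseudo-sketch only, not working code):

  // in handleRequestBody(req, rsp)
  String shards = req.getParams().get("shards");
  if (shards == null) {
    // non-distributed: run the normal local update path
  } else {
    // distributed: split the request and forward it to each shard
  }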

  2. We have partially implemented an update processor which
  actually generates and sends the split update requests to each
  specified shard (as designated by the policy). As it stands,
  the code shares a lot in common with the HttpCommComponent
  class used for distributed search. Should we look at opening
  up the HttpCommComponent class so it could be used by our
  request handler as well or should we continue with our current
  implementation and worry about that later?


I agree that you are going to want to implement an
UpdateRequestProcessor. However, it would seem to me that, unlike
search, you're not going to want to bother with the existing
processor and associated component chain, you're going to want to
replace the processor with a distributed version.

As to the HttpCommComponent, I'd suggest you make your own
educated decision. How similar is the class? Could one serve both
needs effectively?

  3. Our update processor uses a
  MultiThreadedHttpConnectionManager to send parallel updates to
  shards, can anyone give some appropriate values to be used for
  the defaultMaxConnectionsPerHost and maxTotalConnections
  params? Won't the  values used for distributed search be a
  little high for distributed indexing?


You are right, these will likely be lower for distributed
indexing, however I'd suggest not worrying about it for now, as
it is easy to tweak later.

Upayavira

---
Enterprise Search Consultant at Sourcesense UK,
Making Sense of Open Source

References

1. https://issues.apache.org/jira/browse/SOLR-2341
2. mailto:u...@odoko.co.uk
3. mailto:alxc...@gmail.com



Re: Distributed Indexing

2011-02-07 Thread Upayavira
Surely you want to be implementing an UpdateRequestProcessor,
rather than a RequestHandler.

The ContentStreamHandlerBase, in the handleRequestBody method
gets an UpdateRequestProcessor and uses it to process the
request. What we need is that handleRequestBody method to, as you
have suggested, check on the shards parameter, and if necessary
call a different UpdateRequestProcessor (a
DistributedUpdateRequestProcessor).

I don't think we really need it to be configurable at this point.
The ContentStreamHandlerBase could just use a single hardwired
implementation. If folks want choice of
DistributedUpdateRequestProcessor, it can be added later.

For configuration, the DistributedUpdateRequestProcessor should
get its config from the parent RequestHandler. The configuration
I'm most interested in is the DistributionPolicy. And that can be
done with a distributionPolicyClass=solr.IDHashDistributionPolicy
request parameter, which could potentially be configured in
solrconfig.xml as an invariant, or provided in the request by the
user if necessary.
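
Wiring that up would be something like the following (sketch only; the
policy interface name is whatever we end up calling it):

  // in the distributed update processor
  String policyClass = params.get("distributionPolicyClass",
      "solr.IDHashDistributionPolicy");
  ShardDistributionPolicy policy = (ShardDistributionPolicy)
      core.getResourceLoader().newInstance(policyClass);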

So, I'd avoid another thing that needs to be configured unless
there are real benefits to it (which there don't seem to me to be
right now).

Upayavira

On Sun, 06 Feb 2011 23:08 +, Alex Cowell
alxc...@gmail.com wrote:

  Hey,
  We're making good progress, but our
  DistributedUpdateRequestHandler is having a bit of an identity
  crisis, so we thought we'd ask what other people's opinions
  are. The current situation is as follows:
  We've added a method to ContentStreamHandlerBase to check if
  an update request is distributed or not (based on the
  presence/validity of the 'shards' parameter). So a
  non-distributed request will proceed as normal but a
  distributed request would be passed on to the
  DistributedUpdateRequestHandler to deal with.
  The reason this choice is made in the ContentStreamHandlerBase
  is so that the DistributedUpdateRequestHandler can use the URL
  the request came in on to determine where to distribute update
  requests. Eg. an update request is sent to:
  [1]http://localhost:8983/solr/update/csv?shards=shard1,shard2.
  ..
  then the DistributedUpdateRequestHandler knows to send
  requests to:
  shard1/update/csv
  shard2/update/csv
  Alternatively, if the request wasn't distributed, it would
  simply be handled by whichever request handler /update/csv
  uses.
  Herein lies the problem. The DistributedUpdateRequestHandler
  is not really a request handler in the same way as the
  CSVRequestHandler or XmlUpdateRequestHandlers are. If
  anything, it's more like a plugin for the various existing
  update request handlers, to allow them to deal with
  distributed requests - a distributor if you will. It isn't
  designed to be able to receive and handle requests directly.
  We would like this DistributedUpdateRequestHandler to be
  defined in the solrconfig to allow flexibility for setting up
  multiple different DistributedUpdateRequestHandlers with
  different ShardDistributionPolicies etc.and also to allow us
  to get the appropriate instance from the core in the code.
  There seem to be two paths for doing this:
  1. Leave it as an implementation of SolrRequestHandler and
  hope the user doesn't directly send update requests to it (ie.
  a request to [2]http://localhost:8983/solr/<distrib update
  handler path> would most likely cripple something). So it
  would be defined in the solrconfig something like:
  <requestHandler name="distrib-update"
  class="solr.DistributedUpdateRequestHandler" />
  2. Create a new plugin type for the solrconfig, say
  updateRequestDistributor which would involve creating a new
  interface for the DistributedUpdateRequestHandler to
  implement, then registering it with the core. It would be
  defined in the solrconfig something like:
  <updateRequestDistributor name="distrib-update"
  class="solr.DistributedUpdateRequestHandler">
    <lst name="defaults">
      <str name="policy">solr.HashedDistributionPolicy</str>
    </lst>
  </updateRequestDistributor>
  This would mean that it couldn't directly receive requests,
  but that an instance could still easily be retrieved from the
  core to handle the distribution of update requests.
  Any thoughts on the above issue (or a more succinct,
  descriptive name for the class) are most welcome!
  Alex

References

1. http://localhost:8983/solr/update/csv?shards=shard1,shard2.
2. http://localhost:8983/solr/
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Maintain stopwords.txt and other files

2011-02-07 Thread Timo Schmidt
Hello together,

I am currently developing a search solution based on Apache Solr. I want
to offer the user the possibility to maintain synonyms and stopwords in a 
user-friendly tool, but so far I could
not find any way to write the stopwords.txt or synonyms.txt.

Are there any other solutions?

Currently I have some ideas how to handle it:

1.  Implement another SynonymFilterFactory to allow other datasources like 
databases. I already saw approaches for that but no solutions yet.
2.  Implement a fileWriter request handler to write the stopwords.txt

Are there other solutions which are maybe already implemented?

Thanks and best regards Timo


Timo Schmidt
Entwickler (Diplom Informatiker FH)


AOE media GmbH
Borsigstr. 3
65205 Wiesbaden
Germany 
Tel. +49 (0) 6122 70 70 7 - 234
Fax. +49 (0) 6122 70 70 7 -199



e-Mail: timo.schm...@aoemedia.de
Web: http://www.aoemedia.de/

Mandatory information per German Commercial Code §37a / Stock Corporation Act §35a
VAT ID no.: DE250247455
Commercial register: Wiesbaden B
Commercial register no.: 22567 


Registered office: Wiesbaden
Creditreform: 625.0209354
Managing director: Kian Toyouri Gould 


This e-mail message may contain confidential and/or 
privileged information. If you are not the intended recipient (or have received 
this e-mail in error) please notify the sender immediately and destroy this 
e-mail. 



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Tokenization and Fuzziness: How to Allow Multiple Strategies?

2011-02-07 Thread Tavi Nathanson

Hey everyone,

Tokenization seems inherently fuzzy and imprecise, yet Lucene does not
appear to provide an easy mechanism to account for this fuzziness.

Let's take an example, where the document I'm indexing is "v1.1.0 mr. jones
da...@gmail.com"

I may want to tokenize this as follows: ["v1.1.0", "mr", "jones",
"da...@gmail.com"]
...or I may want to tokenize this as follows: ["v1", "1.0", "mr", "jones",
"david", "gmail.com"]
...or I may want to tokenize it another way.

I would think that the best approach would be indexing using multiple
strategies, such as:

["v1.1.0", "v1", "1.0", "mr", "jones", "da...@gmail.com", "david",
"gmail.com"]

However, this would destroy phrase queries. And while Lucene lets you index
multiple tokens at the same position, I haven't found a way to deal with
cases where you want to index a set of tokens at one position: nor does that
even make sense. For instance, I can't index ["david", "gmail.com"] in the
same position as "da...@gmail.com".

So:

- Any thoughts, in general, about how you all approach this fuzziness? Do
you just choose one tokenization strategy and hope for the best?
- Might there be a way to use multiple strategies and *not* break phrase
queries that I'm overlooking?

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-Allow-Multiple-Strategies-tp2444956p2444956.html
Sent from the Solr - Dev mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Scoring: Precedent for a Better, Less Fragile Approach?

2011-02-07 Thread Tavi Nathanson

Hey everyone,

I have a question about Lucene/Solr scoring in general. It really feels like
a wobbly house of cards that falls down whenever I make the slightest tweak.
There are many factors at play in Lucene scoring: they're all fighting with
each other, and very often one will completely dominate everything else,
when that may not really be the intention.

** The question: might there be a way to enforce strict requirements that
certain factors are higher priority than other factors, and/or certain
factors shouldn't overtake other factors? Perhaps a set of rules where one
factor is considered before even examining another factor? Tuning boost
numbers around and hoping for the best seems imprecise and very fragile. **

To make this more concrete, an example:

We previously added the scores of multi-field matches together via an OR,
so: score(query "apple") = score(field1:apple) + score(field2:apple). I
changed that to be more in-line with DisMaxParser, namely a max: score(query
"apple") = max(score(field1:apple), score(field2:apple)). I also modified
coord such that coord would only consider actual unique terms ("apple" vs.
"orange"), rather than terms across multiple fields (field1:apple vs.
field2:apple).
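
To put invented numbers on it: if field1:apple scored 0.8 and field2:apple
scored 0.5, the old OR-style sum contributed 1.3 (plus a higher coord), while
the new max contributes only 0.8, so other factors like lengthNorm suddenly
carry much more relative weight.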

This seemed like a good idea, but it actually introduced a bug that was
previously hidden. Suddenly, documents matching apple in the title and
*nothing* in the body were being boosted over apple in the title and
apple in the body! I investigated, and it was due to lengthNorm:
previously, documents matching apple in both title and body were getting
higher scores thanks to to summing the field scores (vs. max) as well as a
higher coord score. Now that they were no longer getting these boosts, which
was beneficial in many respects, the playing field was leveled. And this
leveling of the playing field allowed lengthNorm to dominate everything
else.

Any help would be much appreciated. Thanks!

Tavi
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Scoring-Precedent-for-a-Better-Less-Fragile-Approach-tp2445112p2445112.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [REINDEX] Note: re-indexing required !

2011-02-07 Thread Earwin Burrfoot
Lucene maintains compatibility with earlier stable release index
versions, and to some extent transparently upgrades them.
But there is no guaranteed compatibility between different
in-development indexes.

E.g. 3.2 reads 3.1 indexes and upgrades them, but 3.2-dev-snapshot-10
(while happily handling 3.1) may fail reading 3.2-dev-snapshot-8
index, as they have the same version tag, yet different formats.

On Sun, Jan 23, 2011 at 19:18, Earl Hood e...@earlhood.com wrote:
 On Sat, Jan 22, 2011 at 11:14 PM, Shai Erera ser...@gmail.com wrote:
 Under LUCENE-2720 the index format of both trunk and 3x has changed. You
 should re-index any indexes created with either of these code streams.

 Does the 3x refer to the 3.x development branch?

 I.e. For those of us using the stable 3.x release of Lucene, will
 a future 3.x release require rebuilding indexes?

 --ewh

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Tokenization and Fuzziness: How to Allow Multiple Strategies?

2011-02-07 Thread Steven A Rowe
Hi Tavi,

solr-...@lucene.apache.org has been deprecated since the Lucene and Solr source 
trees merged last year.  Please use dev@lucene.apache.org instead.

However, your question is about *usage* of Lucene/Solr, rather than 
*development*, so you should be using solr-u...@lucene.apache.org or 
lucene-u...@lucene.apache.org.  Please repost your question to one of these 
lists.

Steve

 -Original Message-
 From: Tavi Nathanson [mailto:tavi.nathan...@gmail.com]
 Sent: Monday, February 07, 2011 12:12 PM
 To: solr-...@lucene.apache.org
 Subject: Tokenization and Fuzziness: How to Allow Multiple Strategies?
 
 
 Hey everyone,
 
 Tokenization seems inherently fuzzy and imprecise, yet Lucene does not
 appear to provide an easy mechanism to account for this fuzziness.
 
 Let's take an example, where the document I'm indexing is v1.1.0 mr.
 jones
 da...@gmail.com
 
 I may want to tokenize this as follows: [v1.1.0, mr, jones,
 da...@gmail.com]
 ...or I may want to tokenize this as follows: [v1, 1.0, mr, jones,
 david, gmail.com]
 ...or I may want to tokenize it another way.
 
 I would think that the best approach would be indexing using multiple
 strategies, such as:
 
 [v1.1.0, v1, 1.0, mr, jones, da...@gmail.com, david,
 gmail.com]
 
 However, this would destroy phrase queries. And while Lucene lets you
 index
 multiple tokens at the same position, I haven't found a way to deal with
 cases where you want to index a set of tokens at one position: nor does
 that
 even make sense. For instance, I can't index [david, gmail.com] in the
 same position as da...@gmail.com.
 
 So:
 
 - Any thoughts, in general, about how you all approach this fuzziness? Do
 you just choose one tokenization strategy and hope for the best?
 - Might there be a way to use multiple strategies and *not* break phrase
 queries that I'm overlooking?
 
 Thanks!
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tokenization-and-Fuzziness-How-to-
 Allow-Multiple-Strategies-tp2444956p2444956.html
 Sent from the Solr - Dev mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2666) ArrayIndexOutOfBoundsException when iterating over TermDocs

2011-02-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991545#comment-12991545
 ] 

Michael McCandless commented on LUCENE-2666:


Ahh, thanks for bringing closure Nick!  Although, I'm a little confused how 
removing files from the index while readers are using it, could lead to those 
exceptions...

Note that it's perfectly fine to pass create=true to IW, over an existing 
index, even while readers are using it; IW will gracefully remove the old files 
itself, even if open IRs are still using them. IW just makes a new commit point 
that drops all references to prior segments...
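
Ie, something like this is safe even while open readers still hold the old 
commit (3.0.x-style API, sketch only):

  // recreates the index in place; open IndexReaders keep seeing the old commit
  IndexWriter writer = new IndexWriter(dir, analyzer,
      true /* create */, IndexWriter.MaxFieldLength.UNLIMITED);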


 ArrayIndexOutOfBoundsException when iterating over TermDocs
 ---

 Key: LUCENE-2666
 URL: https://issues.apache.org/jira/browse/LUCENE-2666
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 3.0.2
Reporter: Shay Banon
 Attachments: checkindex-out.txt


 A user got this very strange exception, and I managed to get the index that 
 it happens on. Basically, iterating over the TermDocs causes an AAOIB 
 exception. I easily reproduced it using the FieldCache which does exactly 
 that (the field in question is indexed as numeric). Here is the exception:
 Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.search.FieldCacheImpl$LongCache.createValue(FieldCacheImpl.java:501)
   at 
 org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:183)
   at 
 org.apache.lucene.search.FieldCacheImpl.getLongs(FieldCacheImpl.java:470)
   at TestMe.main(TestMe.java:56)
 It happens on the following segment: _26t docCount: 914 delCount: 1 
 delFileName: _26t_1.del
 And as you can see, it smells like a corner case (it fails for document 
 number 912, the AIOOB happens from the deleted docs). The code to recreate it 
 is simple:
 FSDirectory dir = FSDirectory.open(new File("index"));
 IndexReader reader = IndexReader.open(dir, true);
 IndexReader[] subReaders = reader.getSequentialSubReaders();
 for (IndexReader subReader : subReaders) {
 Field field = 
 subReader.getClass().getSuperclass().getDeclaredField("si");
 field.setAccessible(true);
 SegmentInfo si = (SegmentInfo) field.get(subReader);
 System.out.println("--> " + si);
 if (si.getDocStoreSegment().contains("_26t")) {
 // this is the problematic one...
 System.out.println("problematic one...");
 FieldCache.DEFAULT.getLongs(subReader, "__documentdate", 
 FieldCache.NUMERIC_UTILS_LONG_PARSER);
 }
 }
 Here is the result of a check index on that segment:
   8 of 10: name=_26t docCount=914
 compound=true
 hasProx=true
 numFiles=2
 size (MB)=1.641
 diagnostics = {optimize=false, mergeFactor=10, 
 os.version=2.6.18-194.11.1.el5.centos.plus, os=Linux, mergeDocStores=true, 
 lucene.version=3.0.2 953716 - 2010-06-11 17:13:53, source=merge, 
 os.arch=amd64, java.version=1.6.0, java.vendor=Sun Microsystems Inc.}
 has deletions [delFileName=_26t_1.del]
 test: open reader.OK [1 deleted docs]
 test: fields..OK [32 fields]
 test: field norms.OK [32 fields]
 test: terms, freq, prox...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:127)
   at 
 org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:102)
   at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:616)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:509)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
   at TestMe.main(TestMe.java:47)
 test: stored fields...ERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 org.apache.lucene.index.ReadOnlySegmentReader.isDeleted(ReadOnlySegmentReader.java:34)
   at 
 org.apache.lucene.index.CheckIndex.testStoredFields(CheckIndex.java:684)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:512)
   at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:299)
   at TestMe.main(TestMe.java:47)
 test: term vectorsERROR [114]
 java.lang.ArrayIndexOutOfBoundsException: 114
   at org.apache.lucene.util.BitVector.get(BitVector.java:104)
   at 
 

[jira] Commented: (LUCENE-2908) clean up serialization in the codebase

2011-02-07 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991559#comment-12991559
 ] 

Michael McCandless commented on LUCENE-2908:


+1

 clean up serialization in the codebase
 --

 Key: LUCENE-2908
 URL: https://issues.apache.org/jira/browse/LUCENE-2908
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2908.patch


 We removed contrib/remote, but forgot to clean up serialization hell 
 everywhere.
 This is no longer needed, never really worked (e.g. across versions), and 
 slows 
 development (e.g. I wasted a long time debugging stupid serialization of 
 Similarity.idfExplain when trying to make a patch for the scoring system).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Resolved: (SOLR-2350) improve post.jar to handle non UTF-8 files

2011-02-07 Thread Hoss Man (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man resolved SOLR-2350.


Resolution: Fixed

Committed revision 1068149. - trunk
Committed revision 1068152. - 3x


 improve post.jar to handle non UTF-8 files
 --

 Key: SOLR-2350
 URL: https://issues.apache.org/jira/browse/SOLR-2350
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man
Assignee: Hoss Man
 Fix For: 3.1, 4.0

 Attachments: SOLR-2350.patch, SOLR-2350.patch


 Thanks to all the awesomeness Uwe did in SOLR-96, some hard-coded 
 limitations/assumptions in the simple post.jar provided for the example files 
 can be cleaned up.
 Notably: it used to deal with Readers/Writers and warned people their data 
 had to be UTF-8 (because that's all Solr supported); now it can deal with 
 raw streams.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Keyword - search statistics

2011-02-07 Thread Erick Erickson
Solr doesn't keep meta data, so if you're asking for some kind of search
logging your app has to provide that...

Best
Erick

On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com
 wrote:


  Hi

Is there any way i can get 'no of times' a key word searched in SOLR ?


 *Here is my solr package details*

 Solr Specification Version: 1.4.0
 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06
 12:33:40
 Lucene Specification Version: 2.9.1
 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

 -Selvaraj



Re: Keyword - search statistics

2011-02-07 Thread Bill Bell
You can also use Google Analytics or something like that too to get stats.

Bill Bell
Sent from mobile


On Feb 7, 2011, at 4:31 PM, Erick Erickson erickerick...@gmail.com wrote:

 Solr doesn't keep meta data, so if you're asking for some kind of search
 logging your app has to provide that...
 
 Best
 Erick
 
 On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com 
 wrote:
  
 Hi 
  
Is there any way i can get 'no of times' a key word searched in SOLR ?
 

 Here is my solr package details
 
 Solr Specification Version: 1.4.0
 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 
 12:33:40
 Lucene Specification Version: 2.9.1
 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25
 
 -Selvaraj
 


Re: Keyword - search statistics

2011-02-07 Thread Selvaraj Varadharajan
Thanks Erick.
What about having another core, intercepting the request calls, and pooling them
in that core?
Do you see any performance hit from your point of view?

-Selvaraj


On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.comwrote:

 Solr doesn't keep meta data, so if you're asking for some kind of search
 logging your app has to provide that...

 Best
 Erick


 On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan 
 selvara...@gmail.com wrote:


  Hi

Is there any way i can get 'no of times' a key word searched in SOLR ?


 *Here is my solr package details*

 Solr Specification Version: 1.4.0
 Solr Implementation Version: 1.4.0 833479 - grantingersoll -
 2009-11-06 12:33:40
 Lucene Specification Version: 2.9.1
 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

 -Selvaraj





Re: Keyword - search statistics

2011-02-07 Thread Erick Erickson
You have to explain your problem in *much* more detail for anyone to make a
really relevant comment; all we can do so far is guess what you're *really*
after

Best
Erick


On Mon, Feb 7, 2011 at 8:25 PM, Selvaraj Varadharajan
selvara...@gmail.comwrote:

 Thanks Eric.
 What about having another core and interpret the request calls and pool it
 in that core.. ?
 Do we see any performance hit form your point of view.

 -Selvaraj



 On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.comwrote:

 Solr doesn't keep meta data, so if you're asking for some kind of search
 logging your app has to provide that...

 Best
 Erick


 On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan 
 selvara...@gmail.com wrote:


  Hi

Is there any way i can get 'no of times' a key word searched in SOLR ?


 *Here is my solr package details*

 Solr Specification Version: 1.4.0
 Solr Implementation Version: 1.4.0 833479 - grantingersoll -
 2009-11-06 12:33:40
 Lucene Specification Version: 2.9.1
 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

 -Selvaraj






Umlauts as Char

2011-02-07 Thread Prescott Nasser


Hey all, 
 
So while digging into the code a bit (and pushed by Digy's Arabic conversion 
yesterday), I started looking at the various other languages we were missing 
from Java.
 
I started porting the GermanAnalyzer, but ran into an issue with the umlauts...
 
http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_9_4/contrib/analyzers/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java?revision=1040993view=co
 
in the void substitute function you'll see them:
 
else if ( buffer.charAt( c ) == 'ü' ) {
  buffer.setCharAt( c, 'u' );
}

This does not constitute a character in .NET (as far as I can figure out), and thus it 
doesn't compile. The .java file says it is encoded in UTF-8. I was thinking maybe I 
could do the same thing in VS2010, but I'm not finding a way, and searching on 
this has been difficult.
 
Any ideas?
 
~Prescott 

Re: Keyword - search statistics

2011-02-07 Thread Vijay Raj
You can have a custom SearchComponent and configure a listener to the same. 
Check out example/solr/config/solrconfig.xml regarding configuring custom 
query components before and after the default list of components; that can 
help provide some of this 'aspect' behavior. 

<arr name="first-components">
  <str>myFirstComponentName</str>
</arr>

More details about SearchComponent are available in the javadoc 
here: http://bit.ly/giz0b1 . 

Override the method prepare(ResponseBuilder rb) and do a count based on 
the q / other parameters that you get access to from the ResponseBuilder.
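
A bare-bones sketch of that method in a custom component (field and class 
names invented):

  // inside your SearchComponent subclass
  private final AtomicLong queryCount = new AtomicLong();   // java.util.concurrent.atomic

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    String q = rb.req.getParams().get(CommonParams.Q);
    if (q != null) {
      queryCount.incrementAndGet();   // or keep per-keyword counts in a ConcurrentHashMap
    }
  }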

Be aware that this gets in the way of every search request that hits the Solr 
server, so you need to be careful about how this is persisted (how frequently 
to the datastore, etc.) without being intrusive and adding to the query 
time.
--
  Vijay



From: Selvaraj Varadharajan selvara...@gmail.com
To: dev@lucene.apache.org
Sent: Mon, February 7, 2011 5:25:18 PM
Subject: Re: Keyword - search statistics

Thanks Eric.
What about having another core and interpret the request calls and pool it in 
that core.. ?
Do we see any performance hit form your point of view.

-Selvaraj



On Mon, Feb 7, 2011 at 3:31 PM, Erick Erickson erickerick...@gmail.com wrote:

Solr doesn't keep meta data, so if you're asking for some kind of search
logging your app has to provide that...


Best
Erick



On Sun, Feb 6, 2011 at 10:46 PM, Selvaraj Varadharajan selvara...@gmail.com 
wrote:

 
Hi 
 
   Is there any way i can get 'no of times' a key  word searched in SOLR ?

   
Here is my solr package details

Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 
12:33:40
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

-Selvaraj




[jira] Commented: (SOLR-2348) No error reported when using a FieldCached backed ValueSource for a field Solr knows won't work

2011-02-07 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991771#comment-12991771
 ] 

Hoss Man commented on SOLR-2348:


My hope had been that this would be really straightforward, and a simple 
inspection of the SchemaField (or FieldType) could be done inside the 
FieldCacheSource to cover all cases -- except that FieldCacheSource (and its 
subclasses) is only ever given a field name, and never gets a copy of the 
FieldType, SchemaField or even the IndexSchema to inspect to ensure that using 
the FieldCache is viable.

This means that we have to take the same basic approach as SOLR-2339 - every 
FieldType impl that utilizes a FieldCacheSource in its getValueSource method 
needs to check if the FieldCache is viable for that field (ie: indexed, not 
multivalued).

We could rename the checkSortability method I just added in SOLR-2339 into a 
checkFieldCachibility type method and use it for both purposes, but:
* it currently throws exceptions with specific messages like "can not sort on 
unindexed field: "
* I seem to recall some folks talking about the idea of expanding FieldCache to 
support more things like UninvertedField does, in which case being multivalued 
won't prevent you from using the FieldCache on a field, which would ultimately 
mean the pre-conditions for using a FieldCacheSource would change.  We could 
imagine the user specifying a function that takes in vector args to use to 
collapse the multiple values per doc on a per-usage basis (ie: in this function 
query case, use the max value of the multiple values for each doc; in this 
function query, use the average value of the multiple values for each doc; 
etc...)

With that in mind, I think for now the most straightforward thing to do is to 
add a checkFieldCacheSource(QParser) method to SchemaField that would be a 
cut/paste of checkSortability (with the error message wording changed) and make 
all of the (applicable) FieldType.getValueSource methods call it.  In the 
future, it could evolve differently than checkSortability -- either removing 
the "!multivalued" assertion completely, or introspecting the QParser context 
in some way to determine that the necessary info has been provided to know how 
to use the (hypothetical) multivalued FieldCache (hard to speculate at this point)
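
ie: roughly this shape (just a sketch of the idea, not a patch):

  // proposed SchemaField helper, cloned from checkSortability
  public void checkFieldCacheSource(QParser parser) throws SolrException {
    if (!indexed()) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "can not use FieldCache on unindexed field: " + getName());
    }
    if (multiValued()) {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
          "can not use FieldCache on multivalued field: " + getName());
    }
  }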



 No error reported when using a FieldCached backed ValueSource for a field 
 Solr knows won't work
 ---

 Key: SOLR-2348
 URL: https://issues.apache.org/jira/browse/SOLR-2348
 Project: Solr
  Issue Type: Bug
Reporter: Hoss Man
Assignee: Hoss Man
 Fix For: 3.1, 4.0


 For the same reasons outlined in SOLR-2339, Solr FieldTypes that return 
 FieldCached backed ValueSources should explicitly check for situations where 
 knows the FieldCache is meaningless.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: CustomScoreQueryWithSubqueries

2011-02-07 Thread Fernando Wasylyszyn
Robert: I'm trying to follow the steps that are mentioned in:

http://wiki.apache.org/lucene-java/HowToContribute

in order to make a patch with my contribution. But, in the source code that I 
get from:


http://svn.apache.org/repos/asf/lucene/dev/trunk/
the class org.apache.lucene.search.Searcher is missing and the only method 
available to obtain a Scorer from a Weight object is 
scorer(IndexReader.AtomicReaderContext, ScorerContext).
I just checked and the class Searcher still exists in Lucene 3.0.3. On which 
version is the trunk that I've checked out based? The patch that I want to 
submit is based on Lucene 2.9.1.
Thanks in advance.
Regards.
Fernando.










From: Robert Muir rcm...@gmail.com
To: dev@lucene.apache.org
Sent: Wednesday, February 2, 2011 16:52:58
Subject: Re: CustomScoreQueryWithSubqueries

On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn
ferw...@yahoo.com.ar wrote:
 Hi everyone. My name is Fernando and I am a researcher and developer in the
 R+D lab at Snoop Consulting S.R.L. in Argentina.
 Based on the patch suggested in LUCENE-1608
 (https://issues.apache.org/jira/browse/LUCENE-1608) and on the needs of one
 of our customers, for whom we are developing a customized search engine on
 top of Lucene and Solr, we have developed the class
 CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery
 that allows the use of arbitrary Query objects besides instances of
 ValueSourceQuery, without the need of wrapping the arbitrary query/queries
 with the QueryValueSource proposed in Jira, which has the disadvantage of
 creating an instance of an IndexSearcher in each invocation of the method
 getValues(IndexReader).
 If you think that this contribution can be useful for the Lucene community,
 please let me know the steps in order to contribute.

Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is
still an open issue.

If you have a better solution, please don't hesitate to upload a patch
file to the issue!
There are some more detailed instructions here:
http://wiki.apache.org/lucene-java/HowToContribute

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


  

Should ASCIIFoldingFilter be deprecated?

2011-02-07 Thread David Smiley (@MITRE.org)

ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
use MappingCharFilter configured with mapping-ISOLatin1Accent.txt.  By that
same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
mapping-FoldToASCII.txt ?

~ David Smiley

-
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html
Sent from the Solr - Dev mailing list archive at Nabble.com.




RE: Should ASCIIFoldingFilter be deprecated?

2011-02-07 Thread Steven A Rowe
AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides 
a superset of its mappings.

I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter 
can achieve a significantly higher throughput rate than MappingCharFilter, and 
given that, it probably makes sense to keep both, to allow people to make the 
choice about the tradeoff between the flexibility provided by the 
human-readable (and editable) mapping file and the speed provided by 
ASCIIFoldingFilter.

Steve

 -Original Message-
 From: David Smiley (@MITRE.org) [mailto:dsmi...@mitre.org]
 Sent: Monday, February 07, 2011 10:34 PM
 To: solr-...@lucene.apache.org
 Subject: Should ASCIIFoldingFilter be deprecated?
 
 
 ISOLatin1AccentFilter is deprecated, presumably because you can (and
 should)
 use MappingCharFilter configured with mapping-ISOLatin1Accent.txt.  By
 that
 same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of
 using
 mapping-FoldToASCII.txt ?
 
 ~ David Smiley
 
 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
 --
 View this message in context: http://lucene.472066.n3.nabble.com/Should-
 ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html
 Sent from the Solr - Dev mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Should ASCIIFoldingFilter be deprecated?

2011-02-07 Thread Chris Hostetter
: 
: ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
: use MappingCharFilter configured with mapping-ISOLatin1Accent.txt.  By that
: same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
: mapping-FoldToASCII.txt ?

CharFilters and TokenFilters have different purposes though...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter

(ie: If you use MappingCharFilter, you can't then tokenize on some of the 
characters you filtered away)
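
A tiny sketch of that ordering difference (3.x-era analysis APIs, untested, and 
not from Solr itself -- the mapping and input text are just made up for the 
example):

    import java.io.StringReader;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class FoldingOrderSketch {
      static void dump(String label, TokenStream ts) throws Exception {
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        System.out.print(label + ":");
        while (ts.incrementToken()) {
          System.out.print(" [" + term.term() + "]");
        }
        System.out.println();
        ts.close();
      }

      public static void main(String[] args) throws Exception {
        // CharFilter route: the mapping rewrites '-' to a space *before*
        // tokenization, so the whitespace tokenizer splits "wi-fi" in two.
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("-", " ");
        dump("charfilter", new WhitespaceTokenizer(
            new MappingCharFilter(map, CharReader.get(new StringReader("wi-fi rocks")))));

        // TokenFilter route: the tokenizer runs first on the original characters
        // and emits "wi-fi" as a single token; ASCIIFoldingFilter then folds each
        // token but never sees the text before tokenization.
        dump("tokenfilter", new ASCIIFoldingFilter(
            new WhitespaceTokenizer(new StringReader("wi-fi rocks"))));
      }
    }

The first chain should print [wi] [fi] [rocks] while the second prints 
[wi-fi] [rocks]: whatever the CharFilter rewrites is already gone (or changed) 
by the time the tokenizer runs.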



: 
: ~ David Smiley
: 
: -
:  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
: -- 
: View this message in context: 
http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html
: Sent from the Solr - Dev mailing list archive at Nabble.com.
: 
: -
: To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
: For additional commands, e-mail: dev-h...@lucene.apache.org
: 
: 

-Hoss




[HUDSON] Lucene-Solr-tests-only-trunk - Build # 4621 - Failure

2011-02-07 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/4621/

1 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexWriter.testOptimizeTempSpaceUsage

Error Message:
optimize used too much temporary space: starting usage was 60814 bytes; max 
temp usage was 244924 but should have been 243256 (= 4X starting usage)

Stack Trace:
junit.framework.AssertionFailedError: optimize used too much temporary space: 
starting usage was 60814 bytes; max temp usage was 244924 but should have been 
243256 (= 4X starting usage)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1183)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1115)
at 
org.apache.lucene.index.TestIndexWriter.testOptimizeTempSpaceUsage(TestIndexWriter.java:294)




Build Log (for compile errors):
[...truncated 3044 lines...]






Re: CustomScoreQueryWithSubqueries

2011-02-07 Thread Doron Cohen
Hi Fernando,
The wiki indeed relates mainly to trunk development.
For creating a 2.9 patch, check out code from
/repos/asf/lucene/java/branches/lucene_2_9
Regards,
Doron

As the wiki page says, Most development is done on the trunk.
You can either use that or, in order to create a 2.9 patch, check out the 
branch above.
On Tue, Feb 8, 2011 at 4:56 AM, Fernando Wasylyszyn ferw...@yahoo.com.arwrote:

 Robert: I'm trying to follow the steps that are mentioned in:


 http://wiki.apache.org/lucene-java/HowToContribute

 in order to make a patch with my contribution. But, in the source code that
 I get from:

 http://svn.apache.org/repos/asf/lucene/dev/trunk/


 the class org.apache.lucene.search.Searcher is missing and the only method
 available to obtain a Scorer from a Weight object is
 scorer(IndexReader.AtomicReaderContext, ScorerContext)
 I just checked and class Searcher still exists in Lucene 3.0.3. In which
 version the trunk that I've checkout is based? The patch that I want to
 submit is based on Lucene 2.9.1.
 Thanks in advance.
 Regards.
 Fernando.






 --
 *De:* Robert Muir rcm...@gmail.com
 *Para:* dev@lucene.apache.org
 *Enviado:* miércoles, 2 de febrero, 2011 16:52:58
 *Asunto:* Re: CustomScoreQueryWithSubqueries

 On Wed, Feb 2, 2011 at 2:37 PM, Fernando Wasylyszyn
 ferw...@yahoo.com.ar wrote:
  Hi everyone. My name is Fernando and I am a researcher and developer in
 the
  R+D lab at Snoop Consulting S.R.L. in Argentina.
  Based on the patch suggested in LUCENE-1608
  (https://issues.apache.org/jira/browse/LUCENE-1608) and in the needs of
 one
  of our customers, for who we are developing a customized search engine on
  top of Lucene and Solr, we have developed the class
  CustomScoreQueryWithSubqueries, which is a variation of CustomScoreQuery
  that allows the use of arbitrary Query objects besides instances of
  ValueSourceQuery, without the need of wrapping the arbitrary/ies
 query/ies
  with the QueryValueSource proposed in Jira, which has the disadvantage of
  create an instance of an IndexSearcher in each invocation of the method
  getValues(IndexReader).
  If you think that this contribution can be useful for the Lucene
 community,
  please let me know the steps in order to contribute.

 Hi Fernando: https://issues.apache.org/jira/browse/LUCENE-1608 is
 still an open issue.

 If you have a better solution, please don't hesitate to upload a patch
 file to the issue!
 There are some more detailed instructions here:
 http://wiki.apache.org/lucene-java/HowToContribute

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






[jira] Created: (SOLR-2351) Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed.

2011-02-07 Thread Amit Nithian (JIRA)
Allow the MoreLikeThis component to accept filters and use the already parsed 
query from previous stages (if applicable) as seed.
-

 Key: SOLR-2351
 URL: https://issues.apache.org/jira/browse/SOLR-2351
 Project: Solr
  Issue Type: Improvement
  Components: MoreLikeThis
Affects Versions: 1.5
Reporter: Amit Nithian
Priority: Minor
 Fix For: 1.5, 1.3


Currently the MLT component doesn't accept filter queries specified on the URL, 
which my application needed (I needed to restrict similar results to a lat/long 
bounding box). This patch also attempts to let the boost functions of dismax be 
used in the MLT component, by ORing the query object created by the 
QueryComponent with the query created by the MLT as part of the final query. In 
a blank dismax query with no query/phrase clauses this works, although a 
separate BF definition/parsing would be ideal.
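
A loose sketch of the combination described above (illustrative only, not the 
attached patch -- the method and variable names are made up):

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.solr.search.DocListAndSet;
    import org.apache.solr.search.SolrIndexSearcher;

    // OR the MLT-generated query with the query QueryComponent already parsed,
    // and apply the request's fq filters when fetching the similar documents.
    static DocListAndSet runMlt(SolrIndexSearcher searcher, Query mltQuery,
                                Query parsedUserQuery, List<Query> filterQueries,
                                int rows) throws IOException {
      BooleanQuery combined = new BooleanQuery();
      combined.add(mltQuery, BooleanClause.Occur.SHOULD);
      combined.add(parsedUserQuery, BooleanClause.Occur.SHOULD);
      return searcher.getDocListAndSet(combined, filterQueries, Sort.RELEVANCE, 0, rows);
    }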




[jira] Updated: (SOLR-2351) Allow the MoreLikeThis component to accept filters and use the already parsed query from previous stages (if applicable) as seed.

2011-02-07 Thread Amit Nithian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Nithian updated SOLR-2351:
---

Attachment: mlt.patch

 Allow the MoreLikeThis component to accept filters and use the already parsed 
 query from previous stages (if applicable) as seed.
 -

 Key: SOLR-2351
 URL: https://issues.apache.org/jira/browse/SOLR-2351
 Project: Solr
  Issue Type: Improvement
  Components: MoreLikeThis
Affects Versions: 1.5
Reporter: Amit Nithian
Priority: Minor
 Fix For: 1.3, 1.5

 Attachments: mlt.patch


 Currently the MLT component doesn't accept filter queries specified on the 
 URL which my application needed (I needed to restrict similar results by a 
 lat/long bounding box). This patch also attempts to solve the issue of 
 allowing the boost functions of the dismax to be used in the MLT component by 
 using the query object created by the QueryComponent to OR with the query 
 created by the MLT as part of the final query. In a blank dismax query with 
 no query/phrase clauses, this works although a separate BF definition/parsing 
 would be ideal.




[jira] Commented: (SOLR-2155) Geospatial search using geohash prefixes

2011-02-07 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991818#comment-12991818
 ] 

Lance Norskog commented on SOLR-2155:
-

The lat/long version has to be rotated away from the true poles. Then the 
calculations don't blow up at the poles or the equator. 

The real answer for doing geo that always works is to use quaternions. A 
lat/lon pair is essentially a complex number: latitude is the scalar and 
longitude wraps back around to 0. A quaternion is a 4-valued extension of the 
complex numbers: a + bi + cj + dk, where i, j, k are distinct square roots 
of -1. A geo position, projected onto quaternions, gives a subspace. 

There are a bunch of 3D algorithms which use quaternions because they don't 
have problems at the (0-1) wraparound boundary.  The classic apocryphal story 
is the jet fighter pilot on a test flight: he crossed the equator and the plane 
flipped upside down. Quaternions don't have this problem.

[SLERP|http://en.wikipedia.org/wiki/Slerp] explains the problem of distance on 
a sphere. How to do distances, box containment, etc., I don't know. I am _so_ 
not a math guy.
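
For concreteness, one common way to turn a lat/lon position into a unit 
quaternion (rotate about the polar axis by longitude, then about an equatorial 
axis by latitude) -- a generic math sketch, not code from Solr or from any of 
the attached patches:

    // Unit quaternion (w, x, y, z) for a lat/lon position, built as the Hamilton
    // product of a rotation about Z by lon and a rotation about Y by -lat.
    static double[] latLonToQuaternion(double latDeg, double lonDeg) {
      double lat = Math.toRadians(latDeg);
      double lon = Math.toRadians(lonDeg);
      double cz = Math.cos(lon / 2), sz = Math.sin(lon / 2);   // half-angle about Z
      double cy = Math.cos(-lat / 2), sy = Math.sin(-lat / 2); // half-angle about Y
      return new double[] {
          cz * cy,    // w
          -sz * sy,   // x
          cz * sy,    // y
          sz * cy     // z
      };
    }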



 Geospatial search using geohash prefixes
 

 Key: SOLR-2155
 URL: https://issues.apache.org/jira/browse/SOLR-2155
 Project: Solr
  Issue Type: Improvement
Reporter: David Smiley
 Attachments: GeoHashPrefixFilter.patch, GeoHashPrefixFilter.patch, 
 GeoHashPrefixFilter.patch


 There currently isn't a solution in Solr for doing geospatial filtering on 
 documents that have a variable number of points.  This scenario occurs when 
 there is location extraction (i.e. via a gazetteer) occurring on free text.  
 None, one, or many geospatial locations might be extracted from any given 
 document and users want to limit their search results to those occurring in a 
 user-specified area.
 I've implemented this by furthering the GeoHash based work in Lucene/Solr 
 with a geohash prefix based filter.  A geohash refers to a lat-lon box on the 
 earth.  Each successive character added further subdivides the box into a 4x8 
 (or 8x4 depending on the even/odd length of the geohash) grid.  The first 
 step in this scheme is figuring out which geohash grid squares cover the 
 user's search query.  I've added various extra methods to GeoHashUtils (and 
 added tests) to assist in this purpose.  The next step is an actual Lucene 
 Filter, GeoHashPrefixFilter, that uses these geohash prefixes in 
 TermsEnum.seek() to skip to relevant grid squares in the index.  Once a 
 matching geohash grid is found, the points therein are compared against the 
 user's query to see if it matches.  I created an abstraction GeoShape 
 extended by subclasses named PointDistance... and CartesianBox to support 
 different queried shapes so that the filter need not care about these details.
 This work was presented at LuceneRevolution in Boston on October 8th.
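
For a feel of how the prefixes behave, a small sketch (assuming the spatial 
contrib's GeoHashUtils.encode(lat, lon) helper; untested): every prefix of a 
point's geohash names a progressively coarser cell that contains the point, 
which is what lets the filter seek through the indexed terms by prefix.

    import org.apache.lucene.spatial.geohash.GeoHashUtils;

    public class GeoHashPrefixSketch {
      public static void main(String[] args) {
        // Each prefix of the full geohash identifies an enclosing grid cell,
        // from the coarsest (1 char) down to the full-precision cell.
        String hash = GeoHashUtils.encode(42.3583, -71.0603);   // a point in Boston
        for (int len = 1; len <= hash.length(); len++) {
          System.out.println(hash.substring(0, len));
        }
      }
    }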
