[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992382#comment-12992382
 ] 

Simon Willnauer commented on LUCENE-2881:
-

bq. All tests pass. I'll commit this in a day or two if nobody objects.

+1

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field naming / ordering. IW carries FI instances over from previous segments, 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging of stored fields etc., carrying over other properties might become 
 problematic with Lucene's Codec support. Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch), for instance, writes files per segment 
 and field (using the field id within the file name). Yet, if no DocValues were 
 indexed in a particular segment but a previous segment in the same IW session 
 had DocValues, FieldInfo#docValues will be true, since those values are reused 
 from previous segments.
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx, which is really something we should manage per Codec & 
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfos to SegmentInfo logically, since it's really just per-segment 
 metadata.




[jira] Updated: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2911:


Attachment: LUCENE-2911.patch

Improved the patch by using a simpler De Morgan expression Steven came up with.

I think this one is ready to commit.

 synchronize grammar/token types across StandardTokenizer, 
 UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.
 --

 Key: LUCENE-2911
 URL: https://issues.apache.org/jira/browse/LUCENE-2911
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Analysis
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 3.1

 Attachments: LUCENE-2911.patch, LUCENE-2911.patch


 I'd like to do LUCENE-2906 (better CJK support for these tokenizers) for a 
 future target such as 3.2.
 But, in 3.1 I would like to do a little cleanup first, and synchronize all 
 these token types, etc.




[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992476#comment-12992476
 ] 

Simon Willnauer commented on LUCENE-2881:
-

I just integrated this patch into the docvalues branch! It works like a charm! 
Nice work Michael, this brings docValues a huge step closer. All tests that 
failed before now pass; FieldInfo is reliable now!





[jira] Resolved: (LUCENE-1165) Reduce exposure of nightly build documentation

2011-02-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-1165.
---

Resolution: Fixed

The corresponding INFRA-3389 has been resolved: https://hudson.apache.org/robots.txt

 Reduce exposure of nightly build documentation
 --

 Key: LUCENE-1165
 URL: https://issues.apache.org/jira/browse/LUCENE-1165
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Website
Reporter: Doron Cohen
Assignee: Uwe Schindler
Priority: Minor

 From LUCENE-1157:
 "...the nightly build documentation is too prominent. A search for 
 'indexwriter api' on Google or Yahoo! returns nightly documentation before 
 released documentation."
 (https://issues.apache.org/jira/browse/LUCENE-1157?focusedCommentId=12565820#action_12565820)




[jira] Closed: (LUCENE-1165) Reduce exposure of nightly build documentation

2011-02-09 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler closed LUCENE-1165.
-






[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992524#comment-12992524
 ] 

Michael Busch commented on LUCENE-2881:
---

Awesome, thanks for letting me know!  I hope I'll be able to say the same about 
the RT branch after I tried it there... :)

 Track FieldInfo per segment instead of per-IW-session
 -

 Key: LUCENE-2881
 URL: https://issues.apache.org/jira/browse/LUCENE-2881
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: Realtime Branch, CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Michael Busch
 Fix For: Realtime Branch, CSF branch, 4.0

 Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch


 Currently FieldInfo is tracked per IW session to guarantee consistent global 
 field-naming / ordering. IW carries FI instances over from previous segments 
 which also carries over field properties like isIndexed etc. While having 
 consistent field ordering per IW session appears to be important due to bulk 
 merging stored fields etc. carrying over other properties might become 
 problematic with Lucene's Codec support.  Codecs that rely on consistent 
 properties in FI will fail if FI properties are carried over.
 The DocValuesCodec (DocValuesBranch) for instance writes files per segment 
 and field (using the field id within the file name). Yet, if a segment has no 
 DocValues indexed in a particular segment but a previous segment in the same 
 IW session had DocValues, FieldInfo#docValues will be true  since those 
 values are reused from previous segments. 
 We already work around this limitation in SegmentInfo with properties like 
 hasVectors or hasProx which is really something we should manage per Codec  
 Segment. Ideally FieldInfo would be managed per Segment and Codec such that 
 its properties are valid per segment. It also seems to be necessary to bind 
 FieldInfoS to SegmentInfo logically since its really just per segment 
 metadata.  




[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-09 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992531#comment-12992531
 ] 

Steven Rowe commented on LUCENE-2911:
-

bq. I think this one is ready to commit.

+1 

I applied the patch, jflex generates properly, and the tests pass.





[jira] Commented: (LUCENE-2911) synchronize grammar/token types across StandardTokenizer, UAX29EmailURLTokenizer, ICUTokenizer, add CJK types.

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992601#comment-12992601
 ] 

Robert Muir commented on LUCENE-2911:
-

Committed revision 1068979. Now backporting...





[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml

2011-02-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992614#comment-12992614
 ] 

Yonik Seeley commented on SOLR-2338:


Yep, sounds like a great idea!
Should we specify the similarity class in each fieldType that wants to use a 
non-default similarity:

{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity class="..."></similarity>
</fieldType>
{code}

Or use named similarities and refer to them:
{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity name="short_text"/>
</fieldType>

<similarity name="short_text" class="..."></similarity>
{code}

 improved per-field similarity integration into schema.xml
 -

 Key: SOLR-2338
 URL: https://issues.apache.org/jira/browse/SOLR-2338
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Robert Muir

 Currently since LUCENE-2236, we can enable Similarity per-field, but in 
 schema.xml there is only a 'global' factory for the SimilarityProvider.
 In my opinion this is too low-level, because to customize Similarity on a 
 per-field basis you have to set your own CustomSimilarityProvider with 
 <similarity class="..."/> and manage the per-field mapping yourself in Java 
 code.
 Instead I think it would be better if you could just specify the Similarity in 
 the FieldType, e.g. right after <analyzer>.
 As far as the example goes, one idea from LUCENE-1360 was to make a 
 "short_text" or "metadata_text" type, used by the various metadata fields in 
 the example, that has better norm quantization for its shortness...




[jira] Created: (LUCENE-2912) remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field

2011-02-09 Thread Robert Muir (JIRA)
remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, 
switch SweetSpot to per-field 
---

 Key: LUCENE-2912
 URL: https://issues.apache.org/jira/browse/LUCENE-2912
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Robert Muir
 Fix For: 4.0


In LUCENE-2236 we switched sim to per field (SimilarityProvider returns a 
per-field similarity).

But we didn't completely clean up there... I think we should now do this:
* SweetSpotSimilarity loses all its hashmaps. Instead, just configure one per 
field and return it in your SimilarityProvider. This means, for example, that 
all its TF factors can now be configured per-field too, not just the length 
normalization factors. (A sketch of this follows the list.)
* computeNorm and scorePayload lose their field parameter, as it's redundant and 
confusing.
* the UOE'd obsolete lengthNorm is removed. I also updated javadocs that were 
pointing to it (this is bad!).
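
A rough sketch of the per-field idea (class and field names are illustrative; 
SimilarityProvider's exact interface and SweetSpotSimilarity's setter signatures 
are assumed from memory and may differ on trunk):

{code}
// Sketch only: configure one SweetSpotSimilarity per field up front and hand the
// right instance out from the provider, instead of registering per-field factors
// in SweetSpotSimilarity's internal hashmaps.
public class PerFieldSweetSpotProvider /* implements SimilarityProvider */ {
  private final SweetSpotSimilarity titleSim = new SweetSpotSimilarity();
  private final SweetSpotSimilarity bodySim = new SweetSpotSimilarity();

  public PerFieldSweetSpotProvider() {
    titleSim.setLengthNormFactors(1, 10, 0.5f, true);   // short titles
    bodySim.setLengthNormFactors(50, 500, 0.5f, true);  // long body text
  }

  public Similarity get(String field) {
    return "title".equals(field) ? titleSim : bodySim;
  }
  // coord()/queryNorm() and any other required provider methods omitted for brevity.
}
{code}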






[jira] Updated: (LUCENE-2912) remove field param from computeNorm, scorePayload ; remove UOE'd lengthNorm, switch SweetSpot to per-field

2011-02-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2912:


Attachment: LUCENE-2912.patch

Attached is an initial patch, all tests pass.






[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992670#comment-12992670
 ] 

Robert Muir commented on SOLR-2338:
---

Doesn't matter to me really, but what is the advantage of the named 
similarities?

This would be a bit inconsistent with how you configure analyzers (and an 
additional level of indirection that might be confusing)... or am I missing 
something?


 improved per-field similarity integration into schema.xml
 -

 Key: SOLR-2338
 URL: https://issues.apache.org/jira/browse/SOLR-2338
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Affects Versions: 4.0
Reporter: Robert Muir

 Currently since LUCENE-2236, we can enable Similarity per-field, but in 
 schema.xml there is only a 'global' factory
 for the SimilarityProvider.
 In my opinion this is too low-level because to customize Similarity on a 
 per-field basis, you have to set your own
 CustomSimilarityProvider with similarity class=.../ and manage the 
 per-field mapping yourself in java code.
 Instead I think it would be better if you just specify the Similarity in the 
 FieldType, like after analyzer.
 As far as the example, one idea from LUCENE-1360 was to make a short_text 
 or metadata_text used by the
 various metadata fields in the example that has better norm quantization for 
 its shortness...




[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml

2011-02-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992676#comment-12992676
 ] 

Yonik Seeley commented on SOLR-2338:


Other components in solrconfig use that indirection, but
I'm fine w/ the approach taken by tokenizer / token filter config.







[jira] Created: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging

2011-02-09 Thread Sami Siren (JIRA)
SpellCheckCollator uses org.mortbay.log.Log for logging
---

 Key: SOLR-2353
 URL: https://issues.apache.org/jira/browse/SOLR-2353
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Reporter: Sami Siren
Priority: Trivial


SLF4j should be used instead.
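
For illustration, the change amounts to swapping the logger declaration and 
calls over to the standard SLF4J API (the surrounding class is only sketched 
here, not the actual SpellCheckCollator code):

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SpellCheckCollator {
  // Instead of org.mortbay.log.Log, declare a standard SLF4J logger:
  private static final Logger LOG = LoggerFactory.getLogger(SpellCheckCollator.class);

  void collateExample(Exception e) {
    // Parameterized logging; the exception is passed as the last argument.
    LOG.warn("Exception trying out spellcheck collation", e);
  }
}
{code}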




[jira] Updated: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging

2011-02-09 Thread Sami Siren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sami Siren updated SOLR-2353:
-

Attachment: SOLR-2353.patch

 SpellCheckCollator uses org.mortbay.log.Log for logging
 ---

 Key: SOLR-2353
 URL: https://issues.apache.org/jira/browse/SOLR-2353
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Reporter: Sami Siren
Priority: Trivial
 Attachments: SOLR-2353.patch


 SLF4j should be used instead.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992687#comment-12992687
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Robert and Michael

In order to test whether ByteBuffer/IntBuffer works better than manual 
int[] <-> byte[] conversion, I now separate them into 3 different codecs. All 
of them use the same PForDelta implementation, except that they use 
IndexInput/IndexOutput differently, as follows (a sketch of the two conversion 
approaches follows the list):

1. PatchedFrameOfRef3 - uses in.readBytes(); it converts between int[] and 
byte[] manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java

2. PatchedFrameOfRef4 - uses in.readBytes(); it converts between int[] and 
byte[] via ByteBuffer/IntBuffer. Its corresponding Java code is: 
PForDeltaFixedIntBlockWithByteBufferCodec.java

3. PatchedFrameOfRef5 - uses in.readInt() in a loop; it does not need any 
conversion. Its corresponding Java code is: 
PForDeltaFixedIntBlockWithReadIntCodec.java
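
For reference, here is a minimal sketch (not the attached codec code) of the 
first two conversion strategies being compared: manual shifting versus a 
ByteBuffer with an IntBuffer view.

{code}
import java.nio.ByteBuffer;

public class IntBlockConversion {
  // (a) Manual conversion: shift each int into four bytes.
  static byte[] manualToBytes(int[] block) {
    byte[] out = new byte[block.length * 4];
    for (int i = 0; i < block.length; i++) {
      int v = block[i];
      out[4 * i]     = (byte) (v >>> 24);
      out[4 * i + 1] = (byte) (v >>> 16);
      out[4 * i + 2] = (byte) (v >>> 8);
      out[4 * i + 3] = (byte) v;
    }
    return out;
  }

  // (b) ByteBuffer/IntBuffer: the IntBuffer view writes straight into the backing byte[].
  static byte[] bufferToBytes(int[] block) {
    ByteBuffer bytes = ByteBuffer.allocate(block.length * 4);
    bytes.asIntBuffer().put(block);
    return bytes.array();
  }
}
{code}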

I tested them against BulkVInt on MacOS. The detailed results are attached. 
Here are the conclusions:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer is faster 
than my manual conversion between byte[] and int[]. I guess the reason I thought 
it was worse is that I did not separate the codecs before, so the test results 
were not stable due to JVM/JIT.

2) Now, PatchedFrameOfRef4 is still worse than BulkVInt for many kinds of 
queries. However, it seems that it can do better for fuzzy queries and 
wildcard queries.

3) Of course, PatchedFrameOfRef3/4/5 are all better than PatchedFrameOfRef and 
FrameOfRef for almost all queries.

4) The new patch is just uploaded, please check it out.

The following are the experimental results for 0.1M data.

(1) bulkVInt VS patchedFrameOfRef4 (withByteBuffer, in.readBytes(..) )

Query                              QPS bulkVInt  QPS patchedFrameOfRef4 (ByteBuffer)  Pct diff
 united states  389.26  361.79 -7.1%
   united states~3  234.52  228.99 -2.4%
   +nebraska +states 1138.95  992.06-12.9%
 +united +states  670.69  603.86-10.0%
doctimesecnum:[1 TO 6]  415.28  447.83  7.8%
doctitle:.*[Uu]nited.*  496.03  522.47  5.3%
  spanFirst(unit, 5) 1176.47 1086.96 -7.6%
spanNear([unit, state], 10, true)  502.26  423.73-15.6%
  states 1612.90 1453.49 -9.9%
 u*d  167.95  171.17  1.9%
un*d  260.69  275.33  5.6%
uni*  602.41  577.37 -4.2%
   unit* 1016.26 1041.67  2.5%
   united states  617.28  549.45-11.0%
  united~0.6   12.22   12.93  5.9%
 united~0.75   53.88   56.78  5.4%
unit~0.5   12.58   13.19  4.9%
unit~0.7   52.41   54.93  4.8%

(2) bulkVInt VS patchedFrameOfRef3 (with my own int[] <-> byte[] conversion, 
still using in.readBytes(..))

Query                              QPS bulkVInt  QPS patchedFrameOfRef3 (manual conversion)  Pct diff
 united states  388.50  363.24 -6.5%
   united states~3  234.80  223.56 -4.8%
   +nebraska +states 1138.95 1016.26-10.8%
 +united +states  671.14  607.90 -9.4%
doctimesecnum:[1 TO 6]  418.24  441.89  5.7%
doctitle:.*[Uu]nited.*  489.00  522.74  6.9%
  spanFirst(unit, 5) 1246.88 1127.40 -9.6%
spanNear([unit, state], 10, true)  514.14  473.71 -7.9%
  states 1612.90 1488.10 -7.7%
 u*d  170.77  167.31 -2.0%
un*d  261.37  264.48  1.2%
uni*  609.38  602.41 -1.1%
   unit* 1028.81 1052.63  2.3%
   united states  614.25  564.33 -8.1%
  united~0.6   12.05   12.11  0.5%
 united~0.75   53.16   54.97  3.4%
unit~0.5   12.43   12.50  0.6%
unit~0.7   52.81   53.23  0.8%


(3) bulkVInt VS patchedFrameOfRef5 (with in.readInt() in a loop, no int[]/byte[] 
conversion needed)

Query                              QPS bulkVInt  QPS patchedFrameOfRef5 (readInt)  Pct diff
 united states  391.24  366.70 -6.3%
   united states~3  235.40  235.07 -0.1%
   +nebraska +states 1137.66 1072.96 -5.7%
 +united +states  673.40  642.26 -4.6%
doctimesecnum:[1 TO 6]  414.25  407.66 -1.6%
doctitle:.*[Uu]nited.*  492.61  538.21  9.3%
  spanFirst(unit, 5) 1253.13 1175.09 -6.2%
spanNear([unit, state], 10, true)  511.25  483.56 -5.4%
  states 1642.04 1490.31 -9.2%
 u*d  166.78  160.28 -3.9%
un*d  261.64  255.36 -2.4%
uni*  609.38  593.47 -2.6%
   unit* 1026.69 

[jira] Commented: (SOLR-1942) Ability to select codec per field

2011-02-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992690#comment-12992690
 ] 

Grant Ingersoll commented on SOLR-1942:
---

Hey Simon,

Any progress on this?  Seems like a useful feature.  I haven't had time to 
review, but if you feel it's ready, I say go for it.

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




[jira] Commented: (SOLR-1942) Ability to select codec per field

2011-02-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992691#comment-12992691
 ] 

Grant Ingersoll commented on SOLR-1942:
---

FWIW, I like the syntax of the schema and solrconfig.xml configuration that you 
showed in the example here.  Would be good to add to the wiki once you commit 
it.

 Ability to select codec per field
 -

 Key: SOLR-1942
 URL: https://issues.apache.org/jira/browse/SOLR-1942
 Project: Solr
  Issue Type: New Feature
Affects Versions: 4.0
Reporter: Yonik Seeley
 Fix For: 4.0

 Attachments: SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, 
 SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch, SOLR-1942.patch


 We should use PerFieldCodecWrapper to allow users to select the codec 
 per-field.




[jira] Created: (SOLR-2354) CoreAdminRequest#createCore should allow you to specify the data dir

2011-02-09 Thread Mark Miller (JIRA)
CoreAdminRequest#createCore should allow you to specify the data dir


 Key: SOLR-2354
 URL: https://issues.apache.org/jira/browse/SOLR-2354
 Project: Solr
  Issue Type: Improvement
  Components: clients - java
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 4.0







[jira] Created: (SOLR-2355) simple distrib update processor

2011-02-09 Thread Yonik Seeley (JIRA)
simple distrib update processor
---

 Key: SOLR-2355
 URL: https://issues.apache.org/jira/browse/SOLR-2355
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor


Here's a simple update processor for distributed indexing that I implemented 
years ago.
It implements a simple hash(id) MOD nservers and just fails if any servers are 
down.
Given the recent activity in distributed indexing, I thought this might be at 
least a good source for ideas.
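
As a rough illustration of the policy described above (a hypothetical helper, 
not the attached DistributedUpdateProcessorFactory code):

{code}
import java.util.List;

public class SimpleHashPolicy {
  // hash(id) MOD nservers: every document id maps deterministically to one shard;
  // there is no failover, so the update simply fails if that server is down.
  static String selectShard(String docId, List<String> shards) {
    int slot = (docId.hashCode() & 0x7fffffff) % shards.size(); // non-negative slot
    return shards.get(slot);
  }
}
{code}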




[jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session

2011-02-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992707#comment-12992707
 ] 

Simon Willnauer commented on LUCENE-2881:
-

I actually have some intermittent failures on docvalues which seem to be caused 
by some mixed-up codec IDs. I found one in FieldInfo#clone(), where the codec ID 
is lost; fixing it resolved most of the cases. But there must be some other 
situation where the codec ID gets mixed up. I will be AFK for 10 days at least, 
so I have to leave you alone with that. Seems that we need a test for that 
though. The docValues tests bring that up quickly :(





[jira] Updated: (SOLR-2355) simple distrib update processor

2011-02-09 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated SOLR-2355:
---

Attachment: TestDistributedUpdate.java
DistributedUpdateProcessorFactory.java

Here's the processor class and the test class (not in patch form, I just pulled 
these files straight from our commercial product).


 simple distrib update processor
 ---

 Key: SOLR-2355
 URL: https://issues.apache.org/jira/browse/SOLR-2355
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor
 Attachments: DistributedUpdateProcessorFactory.java, 
 TestDistributedUpdate.java


 Here's a simple update processor for distributed indexing that I implemented 
 years ago.
 It implements a simple hash(id) MOD nservers and just fails if any servers 
 are down.
 Given the recent activity in distributed indexing, I thought this might be at 
 least a good source for ideas.




[jira] Commented: (SOLR-2355) simple distrib update processor

2011-02-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992711#comment-12992711
 ] 

Yonik Seeley commented on SOLR-2355:


Some sample configuration:

{code}
  <updateRequestProcessorChain name="distrib">
    <processor class="com.lucid.update.DistributedUpdateProcessorFactory">
      <!-- example configuration...
           shards should be in the *same* order for every server
           in a cluster.  Only self should change to represent
           what server *this* is.

        <str name="self">localhost:8983/solr</str>
        <arr name="shards">
          <str>localhost:8983/solr</str>
          <str>localhost:7574/solr</str>
        </arr>
      -->
    </processor>
    <processor class="solr.LogUpdateProcessorFactory">
      <int name="maxNumToLog">10</int>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>
{code}

Now on any update command, you can set update.processor=distrib and have 
distributed indexing controlled by the "shards" and "self" params, either 
configured in solrconfig, or passed in with the update command.

Or if you don't want to have to specify update.processor=distrib, you can set 
it as the default update processor for any update request handlers:
{code}
  <!-- CSV update handler, loaded on demand -->
  <requestHandler class="solr.CSVRequestHandler" name="/update/csv" startup="lazy">
    <lst name="defaults">
      <str name="update.processor">distrib</str>
    </lst>
  </requestHandler>
{code}





 simple distrib update processor
 ---

 Key: SOLR-2355
 URL: https://issues.apache.org/jira/browse/SOLR-2355
 Project: Solr
  Issue Type: New Feature
Reporter: Yonik Seeley
Priority: Minor
 Attachments: DistributedUpdateProcessorFactory.java, 
 TestDistributedUpdate.java


 Here's a simple update processor for distributed indexing that I implemented 
 years ago.
 It implements a simple hash(id) MOD nservers and just fails if any servers 
 are down.
 Given the recent activity in distributed indexing, I thought this might be at 
 least a good source for ideas.




Re: Distributed Indexing

2011-02-09 Thread Yonik Seeley
I haven't had time to follow all of this discussion, but this issue might help:
https://issues.apache.org/jira/browse/SOLR-2355

It's an implementation of the basic
http://localhost:8983/solr/update/csv?shards=shard1,shard2...

-Yonik
http://lucidimagination.com

On Mon, Feb 7, 2011 at 8:55 AM, Upayavira u...@odoko.co.uk wrote:
 Surely you want to be implementing an UpdateRequestProcessor, rather than a
 RequestHandler.

 The ContentStreamHandlerBase, in the handleRequestBody method, gets an
 UpdateRequestProcessor and uses it to process the request. What we need is for
 that handleRequestBody method to, as you have suggested, check the shards
 parameter and, if necessary, call a different UpdateRequestProcessor (a
 DistributedUpdateRequestProcessor).

 I don't think we really need it to be configurable at this point. The
 ContentStreamHandlerBase could just use a single hardwired implementation.
 If folks want choice of DistributedUpdateRequestProcessor, it can be added
 later.

 For configuration, the DistributedUpdateRequestProcessor should get its
 config from the parent RequestHandler. The configuration I'm most interested
 in is the DistributionPolicy. And that can be done with a
 distributionPolicyClass=solr.IDHashDistributionPolicy request parameter,
 which could potentially be configured in solrconfig.xml as an invariant, or
 provided in the request by the user if necessary.

 So, I'd avoid another thing that needs to be configured unless there are
 real benefits to it (which there don't seem to me to be right now).

 Upayavira

 On Sun, 06 Feb 2011 23:08 +, Alex Cowell alxc...@gmail.com wrote:

 Hey,

 We're making good progress, but our DistributedUpdateRequestHandler is
 having a bit of an identity crisis, so we thought we'd ask what other
 people's opinions are. The current situation is as follows:

 We've added a method to ContentStreamHandlerBase to check if an update
 request is distributed or not (based on the presence/validity of the
 'shards' parameter). So a non-distributed request will proceed as normal but
 a distributed request would be passed on to the
 DistributedUpdateRequestHandler to deal with.

 The reason this choice is made in the ContentStreamHandlerBase is so that
 the DistributedUpdateRequestHandler can use the URL the request came in on
 to determine where to distribute update requests. Eg. an update request is
 sent to:
 http://localhost:8983/solr/update/csv?shards=shard1,shard2...
 then the DistributedUpdateRequestHandler knows to send requests to:
 shard1/update/csv
 shard2/update/csv

 Alternatively, if the request wasn't distributed, it would simply be handled
 by whichever request handler /update/csv uses.

 Herein lies the problem. The DistributedUpdateRequestHandler is not really a
 request handler in the same way as the CSVRequestHandler or
 XmlUpdateRequestHandlers are. If anything, it's more like a plugin for the
 various existing update request handlers, to allow them to deal with
 distributed requests - a distributor if you will. It isn't designed to be
 able to receive and handle requests directly.

 We would like this DistributedUpdateRequestHandler to be defined in the
 solrconfig to allow flexibility for setting up multiple different
 DistributedUpdateRequestHandlers with different ShardDistributionPolicies
 etc., and also to allow us to get the appropriate instance from the core in
 the code. There seem to be two paths for doing this:

 1. Leave it as an implementation of SolrRequestHandler and hope the user
 doesn't directly send update requests to it (ie. a request to
 http://localhost:8983/solr/<distrib update handler path> would most likely
 cripple something). So it would be defined in the solrconfig something like:
 <requestHandler name="distrib-update"
 class="solr.DistributedUpdateRequestHandler" />

 2. Create a new plugin type for the solrconfig, say
 <updateRequestDistributor>, which would involve creating a new interface for
 the DistributedUpdateRequestHandler to implement, then registering it with
 the core. It would be defined in the solrconfig something like:
 <updateRequestDistributor name="distrib-update"
 class="solr.DistributedUpdateRequestHandler">
   <lst name="defaults">
     <str name="policy">solr.HashedDistributionPolicy</str>
   </lst>
 </updateRequestDistributor>

 This would mean that it couldn't directly receive requests, but that an
 instance could still easily be retrieved from the core to handle the
 distribution of update requests.

 Any thoughts on the above issue (or a more succinct, descriptive name for
 the class) are most welcome!

 Alex

 ---
 Enterprise Search Consultant at Sourcesense UK,
 Making Sense of Open Source




[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml

2011-02-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992728#comment-12992728
 ] 

Hoss Man commented on SOLR-2338:


Most existing situations where plugins are dereferenced by name are so we can 
reuse the exact same object instance (ie: for recording stats, or because they 
are heavyweight to construct on the fly)

in the case of similarity, the main advantage I can think of would be if we 
wanted true per-field similarity declaration, not just per field type, ie...

{code}
<similarity name="S_XX" class="..."></similarity>
<similarity name="S_YY" class="..."></similarity>
...
<fieldType name="FT_AA">
  <analyzer>...</analyzer>
  <similarity name="S_XX"/>
</fieldType>
...
<field name="F_111" type="FT_AA" /><!-- implied S_XX -->
<field name="F_222" type="FT_AA" similarity="S_YY" />
{code}

...but even if we don't do that, I suppose it's also conceivable that someone 
might have their own Similarity implementation that is expensive to instantiate 
(ie: maintains some big in-memory data structures?) and might want to be able 
to declare one instance and then refer to it by name in many different 
fieldType declarations.

I think for now just supporting the first example Yonik cited...

{code}
<fieldType>
  <analyzer>...</analyzer>
  <similarity class="..."></similarity>
</fieldType>
{code}

would be a huge win, and we can always enhance it to add name dereferencing later.





[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE-2903.patch

This patch further improves the PForDelta codec (PForDeltaFixedIntBlockCodec). 
I used 3 different implementations (3 codecs) for the IndexInput/IndexOutput 
handling. In particular, 

1. PatchedFrameOfRef3 - uses in.readBytes(); it converts between int[] and 
byte[] manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java

2. PatchedFrameOfRef4 - uses in.readBytes(); it converts between int[] and 
byte[] by ByteBuffer/IntBuffer. Its corresponding Java code is: 
PForDeltaFixedIntBlockWithByteBufferCodec.java

3. PatchedFrameOfRef5 - uses in.readInt() with a loop; it does not need 
conversion. Its corresponding Java code is: 
PForDeltaFixedIntBlockWithReadIntCodec.java




 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (may result in huge index size).
 The PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 The PatchedFrameOfRef2 is my previous implementation, which is improved this 
 time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since
 the Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
 two PForDelta implementations in the bulk branch (FrameOfRef and 
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef 
 for almost all kinds of queries, and slightly worse than BulkVInt.
 2) My NewPForDelta codec can result in the smallest index size among all 4 
 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results are achieved by running with -server 
 instead of -client.




[jira] Commented: (SOLR-2338) improved per-field similarity integration into schema.xml

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992735#comment-12992735
 ] 

Robert Muir commented on SOLR-2338:
---

{quote}
...but even if we don't do that, i suppose it's also conceivable that someone 
might have their own Similarity implementation that is expensive to instantiate 
(ie: maintains some big in memory data structures?) and might want to be able 
to declare one instance and then refer to it by name in many different 
fieldType declarations.
{quote}

I don't think this is really a use case we need to support: the purpose of 
Similarity today is to do term weighting, not to be a huge data-structure 
holder.

While I know Mike's original patch went this way with LUCENE-2392 (e.g. norms), 
I'm not sure I like it being in Similarity in the future either.

Otherwise concepts like lazy-loading norms and all this other stuff get pushed 
onto the sim, which is an awkward place (imagine if you have many fields). 

So, I think we shouldn't really design for abuses of the API. If there are 
other use cases for named similarity that have to do with term weighting, I'm 
interested.






[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992757#comment-12992757
 ] 

Robert Muir commented on LUCENE-2903:
-

Hello, 

I don't see the new files you referred to in the patch.
Maybe the new files were not added to svn with 'svn add' before making the 
patch?






[jira] Updated: (LUCENE-2892) Add QueryParser.newFieldQuery

2011-02-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2892:


Fix Version/s: 4.0

 Add QueryParser.newFieldQuery
 -

 Key: LUCENE-2892
 URL: https://issues.apache.org/jira/browse/LUCENE-2892
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2892.patch


 Note: this patch changes no behavior, just makes QP more subclassable.
 Currently we have Query getFieldQuery(String field, String queryText, boolean 
 quoted)
 This contains very hairy methods for producing a query from QP's analyzer.
 I propose we factor this into newFieldQuery(Analyzer analyzer, String field, 
 String queryText, boolean quoted)
 Then getFieldQuery just calls newFieldQuery(this.analyzer, field, queryText, 
 quoted);
 The reasoning is: it can be quite useful to consider the double quote as more 
 than phrases, but as a more exact search.
 In the case the user quoted the terms, you might want to analyze the text 
 with an alternate analyzer that:
 doesn't produce synonyms, doesn't decompose compounds, doesn't use 
 WordDelimiterFilter 
 (you would need to be using preserveOriginal=true at index time for the WDF 
 one), etc etc.
 This is similar to the way Google's double quote operator works: it's not 
 defined as "phrase" but as "this exact wording or phrase".
 For example compare results for a query of "tests" versus tests.
 Currently you can do this without heavy code duplication, but really only if 
 you make a separate field (which is wasteful),
 and make your custom QP lie about its field... in the examples I listed above 
 you can do this with a single field, yet still
 have a more exact phrase search.
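
 To make the intent concrete, here is a sketch of a subclass using the proposed 
 hook; the subclass name and the choice of "exact" analyzer are hypothetical, 
 and newFieldQuery is the method proposed above.

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ExactQuotesQueryParser extends QueryParser {
  private final Analyzer exactAnalyzer; // e.g. no synonyms, no decompounding, no WDF

  public ExactQuotesQueryParser(Version matchVersion, String field,
                                Analyzer defaultAnalyzer, Analyzer exactAnalyzer) {
    super(matchVersion, field, defaultAnalyzer);
    this.exactAnalyzer = exactAnalyzer;
  }

  @Override
  protected Query getFieldQuery(String field, String queryText, boolean quoted)
      throws ParseException {
    // Quoted text goes through the stricter analyzer; everything else stays as before.
    return newFieldQuery(quoted ? exactAnalyzer : getAnalyzer(), field, queryText, quoted);
  }
}
{code}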




[jira] Commented: (LUCENE-2892) Add QueryParser.newFieldQuery

2011-02-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992759#comment-12992759
 ] 

Robert Muir commented on LUCENE-2892:
-

I think this is an easy win for some common use cases from the API perspective.

If no one objects I'll commit in a few days.





[jira] Assigned: (LUCENE-2892) Add QueryParser.newFieldQuery

2011-02-09 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-2892:
---

Assignee: Robert Muir





[jira] Assigned: (SOLR-2353) SpellCheckCollator uses org.mortbay.log.Log for logging

2011-02-09 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned SOLR-2353:
-

Assignee: Mark Miller

 SpellCheckCollator uses org.mortbay.log.Log for logging
 ---

 Key: SOLR-2353
 URL: https://issues.apache.org/jira/browse/SOLR-2353
 Project: Solr
  Issue Type: Bug
  Components: spellchecker
Reporter: Sami Siren
Assignee: Mark Miller
Priority: Trivial
 Attachments: SOLR-2353.patch


 SLF4j should be used instead.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12992809#comment-12992809
 ] 

hao yan commented on LUCENE-2903:
-

just uploaded. Sorry. 





LUCENE-1608 CustomScoreQuery should support arbitrary queries.

2011-02-09 Thread Fernando Wasylyszyn
Hi everyone. I don't know yet (I am new on the discussion list) if updates in 
Jira are automatically notified to the list, so that is why I'm writing this 
mail. I just uploaded a class for demonstration purposes in LUCENE-1608: 
CustomScoreQuery should support arbitrary queries 
(https://issues.apache.org/jira/browse/LUCENE-1608).

Comments and criticism are more than welcome.

Thank you.

Regards.
Fernando.

PS: sorry if this is not the correct way to announce an update of a ticket in 
Jira.