[jira] Commented: (SOLR-1731) ArrayIndexOutOfBoundsException when highlighting

2010-07-21 Thread Leonhard Maylein (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12891018#action_12891018
 ] 

Leonhard Maylein commented on SOLR-1731:


We have the same problem whenever we search for a word which has synonyms 
defined.

> ArrayIndexOutOfBoundsException when highlighting
> 
>
> Key: SOLR-1731
> URL: https://issues.apache.org/jira/browse/SOLR-1731
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Tim Underwood
>Priority: Minor
>
> I'm seeing a java.lang.ArrayIndexOutOfBoundsException when trying to 
> highlight for certain queries.  The error seems to be an issue with the 
> combination of the ShingleFilterFactory, PositionFilterFactory and the 
> LengthFilterFactory. 
> Here's my fieldType definition:
>  omitNorms="true">
>   
> 
>  generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
> catenateAll="1"/>
> 
> 
> 
>   
>   
>   
>outputUnigrams="true"/>
>   
>generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
> catenateAll="1"/>
>   
>   
>
> 
> 
> Here's the field definition:
>  omitNorms="true"/>
> Here's a sample doc:
> 
> 
>   1
>   A 1280 C
> 
> 
> Doing a query for sku_new:"A 1280 C" and requesting highlighting throws the 
> exception (full stack trace below):  
> http://localhost:8983/solr/select/?q=sku_new%3A%22A+1280+C%22&version=2.2&start=0&rows=10&indent=on&&hl=on&hl.fl=sku_new&fl=*
> If I comment out the LengthFilterFactory from my query analyzer section 
> everything seems to work.  Commenting out just the PositionFilterFactory also 
> makes the exception go away and seems to work for this specific query.
> Full stack trace:
> java.lang.ArrayIndexOutOfBoundsException: -1
> at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:202)
> at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414)
> at 
> org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216)
> at 
> org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184)
> at 
> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226)
> at 
> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335)
> at 
> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> at 
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> at 
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> at 
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at 
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> at 
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> at 
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> at 
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at 
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> at org.mortbay.jetty.Server.handle(Server.java:285)
> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> at 
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
> at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
> at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
> at 
> org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
> at 
> org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
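
For context, a minimal sketch of the failing call path, written against the 2.9-era highlighter API with placeholder query/token stream (this is illustration, not code from the report):

{code}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// Minimal sketch: highlight a stored field value against a query. The
// exception above is raised from WeightedSpanTermExtractor while the
// QueryScorer initializes, before any fragments are produced.
public class HighlightRepro {
    public static String highlight(Query query, TokenStream tokens, String text)
            throws Exception {
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(scorer);
        return highlighter.getBestFragments(tokens, text, 3, "...");
    }
}
{code}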




Build failed in Hudson: Lucene-trunk #1245

2010-07-21 Thread Apache Hudson Server
See 

Changes:

[yonik] LUCENE-2542: remove final from some TopDocsCollector methods

--
[...truncated 2706 lines...]
[junit] Testsuite: org.apache.lucene.search.TestPrefixQuery
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.006 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestQueryTermVector
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.006 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestQueryWrapperFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.009 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestRegexpQuery
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.054 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestRegexpRandom
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 41.17 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestRegexpRandom2
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 130.688 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestScoreCachingWrappingScorer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.006 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestScorerPerf
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.655 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSetNorm
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.005 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSimilarity
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.008 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSimpleExplanations
[junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 2.646 sec
[junit] 
[junit] Testsuite: 
org.apache.lucene.search.TestSimpleExplanationsOfNonMatches
[junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 0.198 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSloppyPhraseQuery
[junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.293 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSort
[junit] Tests run: 24, Failures: 0, Errors: 0, Time elapsed: 5.806 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestSpanQueryFilter
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.015 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTermRangeFilter
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 12.233 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTermRangeQuery
[junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 0.04 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTermScorer
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.013 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTermVectors
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.294 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestThreadSafe
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 6.209 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTimeLimitingCollector
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 1.308 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTopDocsCollector
[junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.013 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestTopScoreDocCollector
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.005 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestWildcard
[junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.039 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.TestWildcardRandom
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 77.928 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.function.TestCustomScoreQuery
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 7.133 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.function.TestDocValues
[junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.008 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.function.TestFieldScoreQuery
[junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 0.223 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.function.TestOrdValues
[junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 0.115 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.function.TestValueSource
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.024 sec
[junit] 
[junit] Testsuite: org.apache.lucene.search.payloads.TestPayloadNearQuery
[junit] Tests run: 4

[jira] Updated: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search

2010-07-21 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2346:
--

Fix Version/s: Realtime Branch
   (was: 4.0)

> Explore other in-memory postinglist formats for realtime search
> ---
>
> Key: LUCENE-2346
> URL: https://issues.apache.org/jira/browse/LUCENE-2346
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Realtime Branch
>
>
> The current in-memory posting list format might not be optimal for searching. 
> VInt decoding performance and the lack of skip lists would arguably be the 
> biggest bottlenecks.
> For LUCENE-2312 we should investigate other formats.
> Some ideas:
> - PFOR or packed ints for posting slices?
> - Maybe even int[] slices instead of byte slices? This would be great for 
> search performance, but the additional memory overhead might not be 
> acceptable.
> - For realtime search it's usually desirable to evaluate the most recent 
> documents first.  So using backward pointers instead of forward pointers and 
> having the postinglist pointer point to the most recent docID in a list is 
> something to consider.
> - Skipping: if we use fixed-length postings ([packed] ints) we can do binary 
> search within a slice.  We can also locate a pointer then without scanning 
> and thus skip entire slices quickly.  Is that sufficient or would we need 
> more skipping layers, so that it's possible to skip directly to particular 
> slices?
> It would be awesome to find a format that doesn't slow down "normal" 
> indexing, but is very efficient for in-memory searches.  If we can't find 
> such a fits-all format, we should have a separate indexing chain for 
> real-time indexing.
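
On the skipping point, a rough sketch (hypothetical names, not a Lucene API) of binary search within one fixed-length posting slice:

{code}
// Illustrative only: binary search within one fixed-length int[] posting
// slice for the first docID >= target; -1 means the whole slice can be
// skipped without scanning.
static int seekCeil(int[] slice, int len, int target) {
    int lo = 0, hi = len - 1, found = -1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (slice[mid] >= target) { found = mid; hi = mid - 1; }
        else lo = mid + 1;
    }
    return found;
}
{code}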




[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Fix Version/s: Realtime Branch
   (was: 4.0)

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: Realtime Branch
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
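
A hypothetical sketch of the thread-affinity idea (stand-in names, not the committed DocumentsWriterPerThread code): each indexing thread is bound to its own in-RAM segment writer, so segments can be built and flushed independently.

{code}
import java.util.concurrent.ConcurrentHashMap;

// Stand-in names only: one private RAM segment writer per indexing
// thread, so there is no cross-thread locking while buffering docs.
class PerThreadWriterPool {
    private final ConcurrentHashMap<Thread, RamSegmentWriter> writers =
        new ConcurrentHashMap<Thread, RamSegmentWriter>();

    RamSegmentWriter writerForCurrentThread() {
        Thread t = Thread.currentThread();
        RamSegmentWriter w = writers.get(t);
        if (w == null) {
            w = new RamSegmentWriter();
            writers.put(t, w); // safe: only this thread uses this key
        }
        return w;
    }

    static class RamSegmentWriter { /* buffers and flushes one private segment */ }
}
{code}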




[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-07-21 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2312:
--

Fix Version/s: Realtime Branch
   (was: 4.0)

> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: Realtime Branch
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




Hudson build is back to normal : Lucene-3.x #72

2010-07-21 Thread Apache Hudson Server
See 






[jira] Commented: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890915#action_12890915
 ] 

Jason Rutherglen commented on LUCENE-2346:
--

Are there any additional thoughts on this one?

> Explore other in-memory postinglist formats for realtime search
> ---
>
> Key: LUCENE-2346
> URL: https://issues.apache.org/jira/browse/LUCENE-2346
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
>
> The current in-memory posting list format might not be optimal for searching. 
> VInt decoding performance and the lack of skip lists would arguably be the 
> biggest bottlenecks.
> For LUCENE-2312 we should investigate other formats.
> Some ideas:
> - PFOR or packed ints for posting slices?
> - Maybe even int[] slices instead of byte slices? This would be great for 
> search performance, but the additional memory overhead might not be 
> acceptable.
> - For realtime search it's usually desirable to evaluate the most recent 
> documents first.  So using backward pointers instead of forward pointers and 
> having the postinglist pointer point to the most recent docID in a list is 
> something to consider.
> - Skipping: if we use fixed-length postings ([packed] ints) we can do binary 
> search within a slice.  We can also locate a pointer then without scanning 
> and thus skip entire slices quickly.  Is that sufficient or would we need 
> more skipping layers, so that it's possible to skip directly to particular 
> slices?
> It would be awesome to find a format that doesn't slow down "normal" 
> indexing, but is very efficient for in-memory searches.  If we can't find 
> such a fits-all format, we should have a separate indexing chain for 
> real-time indexing.




Re: Sequence IDs for NRT deletes

2010-07-21 Thread Jason Rutherglen
> long[] is probably safe

Yeah it's safe for most things...

> short[]

That could be a much better option for minimizing RAM usage; we'd then have
to implement wraparound (see the sketch after the quoted thread below).

On Wed, Jul 21, 2010 at 3:12 AM, Michael McCandless
 wrote:
> On Tue, Jul 20, 2010 at 4:21 PM, Jason Rutherglen
>  wrote:
>>> Right, much less GC if app frequently reopens.  But a 32X increase in
>>> RAM usage is not trivial; I think we shouldn't enable it by default?
>>
>> Right, the RAM usage is quite high!  Is there a more compact
>> representation we could use?  Ah well, either way for good RT
>> performance, there are some users who may want to use this option.
>
> Well, packed ints are more compact, but the decode cost would probably
> be catastrophic :)
>
> Maybe you could also use a smaller type (byte[], short[]) for sequence
> ids, but, you'd then have to handle wraparound/overflow.  (In fact
> even w/ int[] you have to handle wraparound?  long[] is probably safe
> :) )  EG on overflow, you'd have to allocate all new (zero'd) arrays
> for the next re-opened reader?
>
>>> Have you tested?
>>
>> The test would be a basic benchmark of queries against BV vs. an int[]
>> of deletes?
>
> Yes, in a normal reader  (ie, not testing NRT -- just testing cost of
> applying deletes via int cmp instead of BV lookup).
>
> Mike
>
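To make the BV vs. sequence-id comparison concrete, here is a rough Java
sketch (field names are illustrative, not Lucene code) of the two per-doc
delete checks being discussed:

class DeletesCheck {
    // 0 = never deleted.  One id per doc instead of one bit is where the
    // 32X RAM increase above comes from (64X for long[]).
    long[] deleteSeqIds;
    long readerSeqId;    // snapshot taken when the reader was (re)opened

    boolean isDeletedBySeqId(int docId) {
        long s = deleteSeqIds[docId];
        return s != 0 && s <= readerSeqId;   // one numeric compare per doc
    }

    java.util.BitSet deletedBits = new java.util.BitSet();

    boolean isDeletedByBitVector(int docId) {
        return deletedBits.get(docId);       // one bit lookup per doc
    }
}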



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890909#action_12890909
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

We need to fill in the blanks on the terms dictionary
implementation. Michael B. has some good ideas on implementing it
using parallel arrays and dynamically updating a linked list
implemented as a parallel AtomicIntegerArray. A question we have
is regarding the use of a btree to quickly find the point of
insertion for a new term. The btree would replace the term index
which is binary searched and the term dictionary linearly
scanned. Perhaps there's a better data structure for concurrent
update and lookups?

Another use of the AtomicIntegerArray could be the deletes
sequence id int[]. However, is it needed, and would the lookup
be fast enough?
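
To make that concrete, a hedged sketch (illustrative names, not a patch) of keeping sorted term order in a linked list whose next-pointers live in an AtomicIntegerArray alongside the parallel arrays:

{code}
import java.util.concurrent.atomic.AtomicIntegerArray;

// Illustrative sketch: terms live in parallel arrays indexed by termID;
// sorted order is a singly-linked list whose next-pointers sit in an
// AtomicIntegerArray, so the single writer can publish an insert with
// one atomic write while readers traverse concurrently.
class SortedTermList {
    static final int END = -1;
    final AtomicIntegerArray next;   // next.get(termID) = following termID
    volatile int head = END;         // smallest term in sorted order

    SortedTermList(int maxTerms) {
        next = new AtomicIntegerArray(maxTerms);
    }

    // Single-writer insert after prevTermID (END = insert at head).
    // Finding prevTermID is the lookup the btree idea would accelerate.
    void insertAfter(int prevTermID, int newTermID) {
        if (prevTermID == END) {
            next.set(newTermID, head);
            head = newTermID;                 // publish at head
        } else {
            next.set(newTermID, next.get(prevTermID));
            next.set(prevTermID, newTermID);  // publish to readers
        }
    }
}
{code}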



> Search on IndexWriter's RAM Buffer
> --
>
> Key: LUCENE-2312
> URL: https://issues.apache.org/jira/browse/LUCENE-2312
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 3.0.1
>Reporter: Jason Rutherglen
>Assignee: Michael Busch
> Fix For: 4.0
>
>
> In order to offer users near-realtime search, without incurring
> an indexing performance penalty, we can implement search on
> IndexWriter's RAM buffer. This is the buffer that is filled in
> RAM as documents are indexed. Currently the RAM buffer is
> flushed to the underlying directory (usually disk) before being
> made searchable. 
> Today's Lucene-based NRT systems must incur the cost of merging
> segments, which can slow indexing. 
> Michael Busch has good suggestions regarding how to handle deletes using max 
> doc ids.  
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
> The area that isn't fully fleshed out is the terms dictionary,
> which needs to be sorted prior to queries executing. Currently
> IW implements a specialized hash table. Michael B has a
> suggestion here: 
> https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890905#action_12890905
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

{quote}Implement logic to discard deletes from the deletes
buffer{quote}

Michael, where in the code is this supposed to occur?

{quote}Implement flush-by-ram logic{quote}

I'll make a go of this.

{quote}Maybe change delete logic: currently deletes are applied
when a segment is flushed. Maybe we can keep it this way in the
realtime-branch though, because that's most likely what we want
to do once the RAM buffer is searchable and deletes are cheaper
as they can then be done in-memory before flush{quote}

I think we'll keep things this way for this issue (ie, per
thread document writers), however for LUCENE-2312 I think we'll
want to implement foreground deletes (eg, updating the deleted
docs sequences int[]).
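
For LUCENE-2312, a hedged sketch (illustrative names only) of what a foreground delete could look like:

{code}
// Illustrative sketch: instead of buffering deletes until flush, stamp
// the deleted doc's slot in an in-memory sequence-id array at delete
// time, so an NRT reader opened afterwards sees the delete immediately.
class ForegroundDeletes {
    final long[] deleteSeqIds;   // one slot per buffered docID; 0 = live
    long seqIdGenerator;

    ForegroundDeletes(int maxBufferedDocs) {
        deleteSeqIds = new long[maxBufferedDocs];
    }

    synchronized long deleteDoc(int docId) {
        long seq = ++seqIdGenerator;
        deleteSeqIds[docId] = seq;   // visible to readers with snapshot >= seq
        return seq;
    }
}
{code}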

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




Re: API changes between 2.9.2 and 2.9.3

2010-07-21 Thread Bill Janssen
Andi Vajda  wrote:

> 
> On Jul 21, 2010, at 19:59, Bill Janssen  wrote:
> 
> > Bill Janssen  wrote:
> >
> >> What's crashing with PyLucene 2.9.3 is this code:
> >>
> >> for field in x.getFields():
> >>
> >> where "x" is an instance of org.apache.lucene.document.Document.  I
> >> can
> >> print x and it looks OK, but an attempt to iterate over the list of
> >> fields seems broken.  Is this another iterator change?
> >
> > I see that I also can't iterate over x.getFields().listIterator(),
> > presumably because, in the Java 1.4 that Lucene 2.9.x uses,
> > java.util.Iterator doesn't "implement" java.lang.Iterable.  A tad
> > ridiculous.
> 
> Not ridiculous, impossible, since java.lang.Iterable appeared in Java
> 1.5 and Lucene 2.x claims Java 1.4 compatibility.

No, I mean it's ridiculous that I can't subscript or iterate, in Python,
a value of java.util.List.  You need something in jcc to support Java
1.4 sequence types as well as the new code for 1.5 sequence types.  I
presume that there will be a 2.9.4 in the future, right?

I looked at the jcc code a bit.  In jcc/python.py, this bit

if env.java_version >= '1.5':
    iterable = findClass('java/lang/Iterable')
    iterator = findClass('java/util/Iterator')
else:
    iterable = iterator = None

could perhaps become

if env.java_version >= '1.5':
    iterable = findClass('java/lang/Iterable')
    iterator = findClass('java/util/Iterator')
else:
    iterable = findClass('java/lang/Object')
    iterator = findClass('java/util/Iterator')

Not sure.  Probably more is required.

> > Certainly java.util.List should be a sequence of some sort.
> 
> It can be if you declare it via the --sequence jcc command line flag.

So a change to the Makefile would be in order, for the 2.x branch.

Is there an auto-downcasting switch for jcc?  That is, it would be nice,
for sequences and mappings, if the "get" method would automatically cast
the retrieved value to the Pythonic representation of the most specific
type.

Bill


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890893#action_12890893
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Looks like we're not using MergeDocIDRemapper anymore?

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




Hudson build is back to normal : Solr-trunk #1208

2010-07-21 Thread Apache Hudson Server
See 






[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890848#action_12890848
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

{quote}
Essentially, the dependency on the smart chinese is optional in the sense that 
the lack of it will degrade the quality of clustering in Chinese, but will not 
break it. Let me see if I can make it optionally loadable in 
LuceneLanguageModelFactory too.
{quote}

I think we could handle this in a similar way as in Carrot2: attempt to load 
the Chinese tokenizer and fall back to the default one in case of class loading 
exceptions. The easiest implementation route would be to include smart chinese 
as a dependency when compiling the clustering plugin, with the understanding 
that the library may or may not be available at runtime. Is that possible with 
the current Solr compilation scripts?
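
For illustration, a sketch of that fallback (the analyzer class name is the real smartcn one; the surrounding shape is hypothetical and assumes a no-arg constructor):

{code}
// Hypothetical sketch: reflectively load the smart chinese analyzer and
// signal the caller to fall back to the default tokenizer if the jar is
// absent from the classpath at runtime.
class OptionalSmartChinese {
    static Object newAnalyzerOrNull() {
        try {
            Class clazz = Class.forName(
                "org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer");
            return clazz.newInstance();   // jar present
        } catch (ClassNotFoundException e) {
            return null;                  // jar absent: use the default tokenizer
        } catch (Exception e) {
            return null;                  // instantiation failed: same fallback
        }
    }
}
{code}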

> Upgrade Carrot2 to 3.2.0
> 
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
> SOLR-1804-carrot2-3.4.0-dev.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL free, which means we should be able to bundle the binary!




Hudson build is back to normal : Solr-3.x #66

2010-07-21 Thread Apache Hudson Server
See 






[jira] Resolved: (LUCENE-2542) TopDocsCollector should be abstract super class that is the real "TopDocsCollector" contract, a subclass should implement the priority-queue logic. e.g. PQTopDocsCollect

2010-07-21 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley resolved LUCENE-2542.
--

Resolution: Fixed

committed.

> TopDocsCollector should be abstract super class that is the real 
> "TopDocsCollector" contract, a subclass should implement the priority-queue 
> logic. e.g. PQTopDocsCollector
> ---
>
> Key: LUCENE-2542
> URL: https://issues.apache.org/jira/browse/LUCENE-2542
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Woody Anderson
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2542.patch, LUCENE-2542.patch, LUCENE-2542.patch, 
> LUCENE_3.0.2-2542.patch
>
>
> TopDocsCollector is both an abstract interface for producing TopDocs and a 
> PriorityQueue-based implementation.
> Not all Collectors that could produce TopDocs must use a PriorityQueue, and 
> it would be advantageous to allow the TopDocsCollector to be an "interface" 
> type abstract class, with a PQTopDocsCollector sub-class.
> While doing this, it'd be good to clean up the generics uses in these 
> classes, as it's odd to create a TopFieldCollector and have to cast the 
> TopDocs object when this can be fixed with generics.
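
For illustration, a sketch of the proposed split using stand-in types (this renders the proposal, not Lucene's committed API):

{code}
import java.util.Comparator;
import java.util.PriorityQueue;

// Stand-in types: the abstract class carries only the "produce top docs"
// contract; the priority queue is an implementation detail of a subclass.
abstract class TopDocsContractSketch {
    protected int totalHits;
    public final int getTotalHits() { return totalHits; }
    public abstract int[] topDocIds(int howMany); // best doc first
}

class PQTopDocsCollectorSketch extends TopDocsContractSketch {
    static final class Hit { int doc; float score; }

    private final int size;
    // min-heap on score: the root is the weakest competitive hit
    private final PriorityQueue<Hit> pq;

    PQTopDocsCollectorSketch(int size) {
        this.size = size;
        pq = new PriorityQueue<Hit>(size, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(a.score, b.score);
            }
        });
    }

    void collect(int doc, float score) {
        totalHits++;
        Hit h = new Hit(); h.doc = doc; h.score = score;
        pq.add(h);
        if (pq.size() > size) pq.poll();  // evict the weakest hit
    }

    public int[] topDocIds(int howMany) {
        // drains the queue weakest-first; keeps only the strongest howMany
        int total = pq.size();
        int n = Math.min(howMany, total);
        int[] out = new int[n];
        for (int i = total - 1; i >= 0; i--) {
            int doc = pq.poll().doc;
            if (i < n) out[i] = doc;      // out[0] ends up the best hit
        }
        return out;
    }
}
{code}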




Re: API changes between 2.9.2 and 2.9.3

2010-07-21 Thread Bill Janssen
Bill Janssen  wrote:

> Bill Janssen  wrote:
> 
> > What's crashing with PyLucene 2.9.3 is this code:
> > 
> >  for field in x.getFields():
> > 
> > where "x" is an instance of org.apache.lucene.document.Document.  I can
> > print x and it looks OK, but an attempt to iterate over the list of
> > fields seems broken.  Is this another iterator change?
> 
> I see that I also can't iterate over x.getFields().listIterator(),
> presumably because, in the Java 1.4 that Lucene 2.9.x uses,
> java.util.Iterator doesn't "implement" java.lang.Iterable.  A tad
> ridiculous.  Certainly java.util.List should be a sequence of some sort.

I looked into this a bit further.  The common-build.xml file for Lucene
2.9.x specifies

  <property name="javac.source" value="1.4"/>
  <property name="javac.target" value="1.4"/>

and in 1.4 the java.util.Iterable class from Java 1.5 doesn't exist.

The docs for JCC still say this:

``When generating wrappers for Python, JCC attempts to detect which
classes can be made iterable:

  * When a class declares to implement java.util.Iterator or something
compatible with it, JCC makes it iterable from Python.

  * When a Java class declares a method called iterator() with no
arguments returning a type compatible with java.util.Iterator, this
class is made iterable from Python.

  * When a Java class declares a method called next() with no arguments
returning an object type, this class is made iterable. Its next()
method is assumed to terminate iteration by returning null.''
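
For example, under the second rule a (hypothetical) class like this would
be wrapped as a Python iterable:

import java.util.Arrays;
import java.util.Iterator;

// Hypothetical example: a no-arg iterator() returning java.util.Iterator,
// which JCC (per the docs quoted above) makes iterable from Python.
public class FieldList {
    private final String[] names = { "id", "title", "body" };

    public Iterator iterator() {
        return Arrays.asList(names).iterator();
    }
}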

Presumably that's no longer the case with JCC 2.6.  Probably should be
updated to whatever the current version does.  Or perhaps versioned and
checked into the source tree.

Bill


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890812#action_12890812
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
We need to update the indexing chain comment in DocumentsWriterPerThread 
{quote}

There's a lot of code cleanup to do.  I just wanted to checkpoint what I have 
so far.


{quote}
Before I/we forget, maybe we can describe what we discussed about RT features 
such as the terms dictionary (the AtomicIntArray linked list, possible usage of 
a btree), the multi-level skip list (static levels), and other such features 
at: https://issues.apache.org/jira/browse/LUCENE-2312
{quote}

Yeah, I will.  But first I need to catch up on sleep :)

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




Build failed in Hudson: Lucene-trunk #1244

2010-07-21 Thread Apache Hudson Server
See 

Changes:

[rmuir] LUCENE-2514: consume tokenstreams in QP like the indexer: dont create 
intermediate string

[uschindler] revert accidental commit by buschmi

[buschmi] LUCENE-2324: Committing second version of the patch to the real-time 
branch.  It's not done yet, but easier to track progress using the branch.

[gsingers] SOLR-1568: moved DistanceUtils up one package, as it isn't tier 
specific

[uschindler] LUCENE-2523: Modify the test to check only for failure on 
IR.open(), add TODO to make IW.ctor() fail early, too

[uschindler] LUCENE-2523: Add IndexFormatTooOldException and 
IndexFormatTooNewException

[rmuir] LUCENE-2458: queryparser turns all CJK queries into phrase queries

[uschindler] LUCENE-2549: Fix TimeLimitingCollector#TimeExceededException to 
record the absolute docid

[uschindler] Generics and Classloader Policeman fixes

[uschindler] A small improvement in FuzzyTermsEnum, as the ArrayList was 
initially allocated with one element too few, so the internal array was 
reallocated every time. This removes the use of ArrayList entirely.

[uschindler] LUCENE-2541: Improve readability of TestNumericUtils by using 
autoboxing and varargs

[uschindler] LUCENE-2541: Fix NumericRangeQuery that returned incorrect results 
with endpoints near Long.MIN_VALUE and Long.MAX_VALUE

--
[...truncated 14626 lines...]
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.541 sec
[junit] 

junit-parallel:

common.test:
 [echo] Building wordnet...

common.init:

build-lucene:

init:

test:
 [echo] Building wordnet...

common.init:

build-lucene:

init:

compile-test:
 [echo] Building wordnet...

compile-analyzers-common:

common.init:

build-lucene:

init:

clover.setup:
[clover-setup] Clover Version 2.6.3, built on November 20 2009 (build-778)
[clover-setup] Loaded from: 
/export/home/hudson/tools/clover/clover2latest/clover-2.6.3.jar
[clover-setup] Clover: Open Source License registered to Apache.
[clover-setup] Clover is enabled with initstring 
'

clover.info:

clover:

common.compile-core:

compile-core:

common.compile-test:

junit-mkdir:
[mkdir] Created dir: 


junit-sequential:
[junit] Testsuite: org.apache.lucene.wordnet.TestSynonymTokenFilter
[junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 5.159 sec
[junit] 
[junit] Testsuite: org.apache.lucene.wordnet.TestWordnet
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.082 sec
[junit] 
[junit] - Standard Output ---
[junit] Opening Prolog file 

[junit] [1/2] Parsing 

[junit] 2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
[junit] 4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
[junit] 8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
[junit] [2/2] Building index to store synonyms,  map sizes are 8 and 4
[junit] row=1/8 doc= Document 
stored,indexed>
[junit] row=2/8 doc= Document 
stored,omitNorms stored,indexed>
[junit] row=4/8 doc= Document 
stored,indexed>
[junit] Optimizing..
[junit] Opening Prolog file 

[junit] [1/2] Parsing 

[junit] 2 s(10001,1,'woods',n,1,0). 0 0 ndecent=0
[junit] 4 s(10001,3,'forest',n,1,0). 2 1 ndecent=0
[junit] 8 s(10003,2,'baron',n,1,1). 6 3 ndecent=0
[junit] [2/2] Building index to store synonyms,  map sizes are 8 and 4
[junit] row=1/8 doc= Document 
stored,indexed>
[junit] row=2/8 doc= Document 
stored,omitNorms stored,indexed>
[junit] row=4/8 doc= Document 
stored,indexed>
[junit] Optimizing..
[junit] -  ---

junit-parallel:

common.test:
 [echo] Building xml-query-parser...

common.init:

build-lucene:

init:

test:
 [echo] Building xml-query-parser...

common.init:

build-lucene:

init:

compile-test:
 [echo] Building xml-query-parser...

build-queries:

common.init:

build-lucene:

init:

clover.setup:
[clover-setup] Clover Version 2.6.3, built on November 20 2009 (build-778)
[clover-setup] Loaded from: 
/export/home/hudson/tools/clover/clover2latest/clover-2.6.3.jar

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890790#action_12890790
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

We need to update the indexing chain comment in DocumentsWriterPerThread

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890789#action_12890789
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, thanks for posting and committing the patch.  I'll be taking a look.

Before I/we forget, maybe we can describe what we discussed about RT features 
such as the terms dictionary (the AtomicIntArray linked list, possible usage of 
a btree), the multi-level skip list (static levels), and other such features 
at: https://issues.apache.org/jira/browse/LUCENE-2312

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




Re: API changes between 2.9.2 and 2.9.3

2010-07-21 Thread Bill Janssen
> I'm going back to 2.9.2 :-).

For some reason, 2.9.2 installs JCC 2.4.1.  Is that right?  Shouldn't it
be 2.5.1?

Bill

holmes : /tmp/pylucene-2.9.2-1/jcc 99 % sudo python setup.py install
sudo python setup.py install
running install
running bdist_egg
running egg_info
writing JCC.egg-info/PKG-INFO
writing top-level names to JCC.egg-info/top_level.txt
writing dependency_links to JCC.egg-info/dependency_links.txt
reading manifest file 'JCC.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'JCC.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.5-i386/egg
running install_lib
running build_py
copying jcc/config.py -> build/lib.macosx-10.5-i386-2.5/jcc
copying jcc/classes/org/apache/jcc/PythonVM.class -> 
build/lib.macosx-10.5-i386-2.5/jcc/classes/org/apache/jcc
copying jcc/classes/org/apache/jcc/PythonException.class -> 
build/lib.macosx-10.5-i386-2.5/jcc/classes/org/apache/jcc
running build_ext
creating build/bdist.macosx-10.5-i386
creating build/bdist.macosx-10.5-i386/egg
creating build/bdist.macosx-10.5-i386/egg/jcc
copying build/lib.macosx-10.5-i386-2.5/jcc/__init__.py -> 
build/bdist.macosx-10.5-i386/egg/jcc
copying build/lib.macosx-10.5-i386-2.5/jcc/__main__.py -> 
build/bdist.macosx-10.5-i386/egg/jcc
copying build/lib.macosx-10.5-i386-2.5/jcc/_jcc.so -> 
build/bdist.macosx-10.5-i386/egg/jcc
creating build/bdist.macosx-10.5-i386/egg/jcc/classes
creating build/bdist.macosx-10.5-i386/egg/jcc/classes/org
creating build/bdist.macosx-10.5-i386/egg/jcc/classes/org/apache
creating build/bdist.macosx-10.5-i386/egg/jcc/classes/org/apache/jcc
copying 
build/lib.macosx-10.5-i386-2.5/jcc/classes/org/apache/jcc/PythonException.class 
-> build/bdist.macosx-10.5-i386/egg/jcc/classes/org/apache/jcc
copying 
build/lib.macosx-10.5-i386-2.5/jcc/classes/org/apache/jcc/PythonVM.class -> 
build/bdist.macosx-10.5-i386/egg/jcc/classes/org/apache/jcc
copying build/lib.macosx-10.5-i386-2.5/jcc/config.py -> 
build/bdist.macosx-10.5-i386/egg/jcc
copying build/lib.macosx-10.5-i386-2.5/jcc/cpp.py -> 
build/bdist.macosx-10.5-i386/egg/jcc
creating build/bdist.macosx-10.5-i386/egg/jcc/patches
copying build/lib.macosx-10.5-i386-2.5/jcc/patches/patch.4195 -> 
build/bdist.macosx-10.5-i386/egg/jcc/patches
copying build/lib.macosx-10.5-i386-2.5/jcc/patches/patch.43.0.6c11 -> 
build/bdist.macosx-10.5-i386/egg/jcc/patches
copying build/lib.macosx-10.5-i386-2.5/jcc/patches/patch.43.0.6c7 -> 
build/bdist.macosx-10.5-i386/egg/jcc/patches
copying build/lib.macosx-10.5-i386-2.5/jcc/python.py -> 
build/bdist.macosx-10.5-i386/egg/jcc
creating build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/functions.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/functions.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JArray.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JArray.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/jcc.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JCCEnv.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JCCEnv.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/jccfuncs.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JObject.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/JObject.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/macros.h -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/jcc/sources/types.cpp -> 
build/bdist.macosx-10.5-i386/egg/jcc/sources
copying build/lib.macosx-10.5-i386-2.5/libjcc.dylib -> 
build/bdist.macosx-10.5-i386/egg
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/__init__.py to __init__.pyc
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/__main__.py to __main__.pyc
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/config.py to config.pyc
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/cpp.py to cpp.pyc
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/python.py to python.pyc
creating stub loader for jcc/_jcc.so
byte-compiling build/bdist.macosx-10.5-i386/egg/jcc/_jcc.py to _jcc.pyc
creating build/bdist.macosx-10.5-i386/egg/EGG-INFO
copying JCC.egg-info/PKG-INFO -> build/bdist.macosx-10.5-i386/egg/EGG-INFO
copying JCC.egg-info/SOURCES.txt -> build/bdist.macosx-10.5-i386/egg/EGG-INFO
copying JCC.egg-info/dependency_links.txt -> 
build/bdist.macosx-10.5-i386/egg/EGG-INFO
copying JCC.egg-info/not-zip-safe -> build/bdist.macosx-10.5-i386/egg/EGG-INFO
copying JCC.egg-info/top_level.txt -> build/bdist.macosx-10.5-i386/egg/EGG-IN

[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890757#action_12890757
 ] 

Stanislaw Osinski commented on SOLR-1804:
-

{quote}
Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on 
Lucene, and we can re-enable this component in trunk, and simply maintain the 
LuceneLanguageModelFactory? 
{quote}

Correct. The only dependency on Lucene is {{LuceneLanguageModelFactory}}, which 
is now part of the Solr code base. In fact, we could also try bringing back the 
clustering plugin to Solr trunk, though I haven't tried that yet.

{quote}
As far as the smart chinese, it's currently not included with Solr, so I think 
this is why you have trouble. But could we enable a carrot2 factory for it that 
reflects it, in case the user puts the jar in the classpath?
{quote}

Essentially, the dependency on the smart chinese is optional in the sense that 
the lack of it will degrade the quality of clustering in Chinese, but will not 
break it. Let me see if I can make it optionally loadable in 
{{LuceneLanguageModelFactory}} too. If not, we'll have to live with degraded 
clustering quality for Chinese.

> Upgrade Carrot2 to 3.2.0
> 
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
> SOLR-1804-carrot2-3.4.0-dev.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890748#action_12890748
 ] 

Robert Muir commented on SOLR-1804:
---

Hi Stanislaw: this looks cool! So, carrot2 jars don't depend directly on 
Lucene, and we can re-enable this component in trunk, and simply maintain the 
LuceneLanguageModelFactory?

As far as the smart chinese, it's currently not included with Solr, so I think 
this is why you have trouble. But could we enable a carrot2 factory for it that 
reflects it, in case the user puts the jar in the classpath?


> Upgrade Carrot2 to 3.2.0
> 
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
> SOLR-1804-carrot2-3.4.0-dev.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL free, which means we should be able to bundle the binary!




Re: API changes between 2.9.2 and 2.9.3

2010-07-21 Thread Bill Janssen
Thomas Koch  wrote:

> > ...
> > I realize that PyLucene doesn't make that easy because it doesn't warn
> > about deprecated API use.
> > 
> [Thomas Koch] Well, this is a general drawback of Python as an interpreted
> language, I guess - wrong interfaces are only detected at runtime and are
> thus harder to test (unless you describe the interfaces and use tools such
> as pylint...)
> I wouldn't expect PyLucene to provide direct support here.
> 
> > One thing I could add to JCC is a command line flag to _not_ wrap any
> > deprecated APIs. With that applied to PyLucene, one could then find all
> > errors they'd be hitting when upgrading to 3.x. That being said, I don't
> > see
> > the difference between this and just upgrading to 3.x and looking for
> > the
> > very same errors since, by definition, 3.0 == 2.9 - deprecations. This
> > explains why I haven't implemented this feature so far.
> > 
> > Andi..
> > 
> [Thomas Koch] Thanks for the explanation - that makes it clearer to me
> now.
> 
> The question remains if it's feasible to support 2.x *and* 3.x  - as Bill
> mentioned "... I'd like to make it work on both." - me too.  I did fear that
> this makes things much more complicated and you end up with code "if
> lucene.VERSION.split('.')[0]>2: ... else ..." - we did that some time ago
> during GCJ and JCC based versions of PyLucene, but at that time it was
> merely a matter of different imports and init stuff (initVM).
> 
> But I understand now that as long as you remove deprecated code from 2.9 it
> *should* work with 2.9 and 3.0 as well! Right?
> 
> e.g.
> Hits search(Query query)
>    is now deprecated as 
> "Hits will be removed in Lucene 3.0" 
> 
> 2.9 already supports
> TopDocs search(Query, Filter, int) 
> which one should use instead.
> 
> The problem here is that - as far as I understand - you can make it work
> with 2.9 and 3.0 - but then you lose backward compatibility with any 2.x
> version before 2.9. The point is that you may then end up forcing your
> users (admins) to install a newer version of PyLucene - which people may not
> want to do...

I changed my code to this:

try:
    from lucene import TopDocs
except ImportError:
    _have_topdocs = False
else:
    _have_topdocs = True

[...]

if _have_topdocs:
    topdocs = s.search(parsed_query, count or 100)
    for hit in topdocs.scoreDocs:
        doc = s.doc(hit.doc)
        score = hit.score
        rval.append((doc.get("id"), score,))
else:
    hits = s.search(parsed_query)
    for hit in hits:
        doc = Hit.cast_(hit).getDocument()
        score = Hit.cast_(hit).getScore()
        rval.append((doc.get("id"), score,))

Unfortunately, 2.9.3 now coredumps on me (OS X 10.5.8, system python 2.5):

Exception Type:  EXC_BAD_ACCESS (SIGBUS)
Exception Codes: KERN_PROTECTION_FAILURE at 0x
Crashed Thread:  14

VM state:not at safepoint (normal execution)
VM Mutex/Monitor currently owned by a thread: None

Heap
 def new generation   total 4544K, used 2441K [0x0d5a, 0x0da8, 
0x0fd0)
  eden space 4096K,  48% used [0x0d5a, 0x0d7926a8, 0x0d9a)
  from space 448K, 100% used [0x0da1, 0x0da8, 0x0da8)
  to   space 448K,   0% used [0x0d9a, 0x0d9a, 0x0da1)
 tenured generation   total 60544K, used 722K [0x0fd0, 0x1382, 
0x2d5a)
   the space 60544K,   1% used [0x0fd0, 0x0fdb49c0, 0x0fdb4a00, 0x1382)
 compacting perm gen  total 8192K, used 2246K [0x2d5a, 0x2dda, 
0x315a)
   the space 8192K,  27% used [0x2d5a, 0x2d7d1ba8, 0x2d7d1c00, 0x2dda)
ro space 8192K,  63% used [0x315a, 0x31abcf60, 0x31abd000, 0x31da)
rw space 12288K,  43% used [0x31da, 0x322d35a8, 0x322d3600, 0x329a)

Virtual Machine arguments:
 JVM args: -Xms64m -Xmx512m -Xss100m -Djava.awt.headless=true
 Java command: 
 launcher type: generic

Thread 14 Crashed:
0   libjvm.dylib0x019b81cb 0x1915000 + 668107
1   libjvm.dylib0x01b23c47 JNI_CreateJavaVM_Impl + 96759
2   libjcc.dylib0x0073368d 
JCCEnv::callObjectMethod(_jobject*, _jmethodID*, ...) const + 73
3   libjcc.dylib0x00733254 JCCEnv::iterator(_jobject*) 
const + 34
4   _lucene.so  0x013c0e65 _object* 
get_iterator(java::util::t_List*) + 59
5   org.python.python   0x00121dfd PyObject_GetIter + 107
6   org.python.python   0x0018edbd PyEval_EvalFrameEx + 15227
7   org.python.python   0x00191173 PyEval_EvalCodeEx + 1638

I'm going back to 2.9.2 :-).

Bill


[jira] Commented: (LUCENE-2514) Change Term to use bytes

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890740#action_12890740
 ] 

Robert Muir commented on LUCENE-2514:
-

Committed LUCENE-2514_qp.patch revision 966254

> Change Term to use bytes
> 
>
> Key: LUCENE-2514
> URL: https://issues.apache.org/jira/browse/LUCENE-2514
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Affects Versions: 4.0
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Attachments: LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514_qp.patch
>
>
> In LUCENE-2426, the sort order was changed to codepoint order.
> Unfortunately, Term is still using String internally, and more importantly 
> its compareTo() uses the wrong order (UTF-16).
> So MultiTermQuery, etc. (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes, 
> such as numerics, instead of using strange string encodings.
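
For readers unfamiliar with the issue, here is a tiny self-contained demo (not 
part of any patch here) of how UTF-16 code unit order disagrees with codepoint 
order for supplementary characters:

{code}
// U+FB01 (a BMP character) vs. U+1D11E (above the BMP, stored as a surrogate
// pair). String.compareTo() compares UTF-16 code units, so the surrogate pair
// (0xD834 0xDD1E) sorts *before* 0xFB01, while codepoint order says
// U+FB01 < U+1D11E.
public class SortOrderDemo {
  public static void main(String[] args) {
    String bmp = "\uFB01";
    String supp = new String(Character.toChars(0x1D11E));
    System.out.println("UTF-16 order:    "
        + Integer.signum(bmp.compareTo(supp)));                      // prints 1
    System.out.println("codepoint order: "
        + Integer.signum(bmp.codePointAt(0) - supp.codePointAt(0))); // prints -1
  }
}
{code}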

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (SOLR-1240) Numerical Range faceting

2010-07-21 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890728#action_12890728
 ] 

Yonik Seeley commented on SOLR-1240:


Thanks for the example; it makes it much easier to review casually.

Rather than embedding "meta" in the list containing the counts, perhaps we 
should bite the bullet and add an additional level for the counts.  It would 
have been useful for other faceting types as well (and still would be in the 
future, I think).  It should be much easier (and more consistent) for clients 
to handle than trying to exclude the element called "meta" when building the 
list of counts returned.

{code}

3
13
0
0
18

3
2
13
7
2


{code}
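
To make the proposed shape concrete, here is a rough sketch of how such a 
response could be assembled with Solr's NamedList API; the field, bucket, and 
meta names below are illustrative, not the patch's actual output:

{code}
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.SimpleOrderedMap;

// Illustrative only: the counts get their own dedicated "counts" level,
// and the range metadata sits beside that level instead of inside the list.
class RangeFacetResponseSketch {
  static NamedList<Object> buildPriceFacet() {
    NamedList<Object> counts = new SimpleOrderedMap<Object>();
    counts.add("0.0", 3);      // hypothetical bucket -> count pairs
    counts.add("100.0", 13);

    NamedList<Object> field = new SimpleOrderedMap<Object>();
    field.add("counts", counts); // the extra level for the counts
    field.add("gap", 100.0f);    // meta lives beside, not among, the counts
    field.add("start", 0.0f);
    field.add("end", 200.0f);
    return field;
  }
}
{code}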

Also, I've never been a fan of adding the empty "facet_range" list when there 
are no facet.range commands... but I understand it's consistent with the other 
facet types.

> Numerical Range faceting
> 
>
> Key: SOLR-1240
> URL: https://issues.apache.org/jira/browse/SOLR-1240
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Gijs Kunze
>Priority: Minor
> Attachments: SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch, 
> SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch, SOLR-1240.patch
>
>
> For faceting numerical ranges using many facet.query query arguments leads to 
> unmanageably large queries as the fields you facet over increase. Adding the 
> same faceting parameter for numbers which already exists for dates should fix 
> this.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev-libs.zip

Libs required for the Carrot2 3.4.0 update.

> Upgrade Carrot2 to 3.2.0
> 
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev-libs.zip, 
> SOLR-1804-carrot2-3.4.0-dev.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL-free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (SOLR-1804) Upgrade Carrot2 to 3.2.0

2010-07-21 Thread Stanislaw Osinski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stanislaw Osinski updated SOLR-1804:


Attachment: SOLR-1804-carrot2-3.4.0-dev.patch

Hi,

As we're nearing the 3.4.0 release of Carrot2, I'm including a patch that 
upgrades the clustering plugin. The most notable changes are:

* [3.4.0] Carrot2 core no longer depends on Lucene APIs, so the {{build.xml}} 
can be enabled again. The only class that makes use of the Lucene API, 
{{LuceneLanguageModelFactory}}, is now included in the plugin's code, so there 
shouldn't be any problems with refactoring. In fact, I've already updated 
{{LuceneLanguageModelFactory}} to remove the use of deprecated APIs.
* [3.3.0] The STC algorithm has seen some [significant scalability 
improvements|http://project.carrot2.org/release-3.3.0-notes.html].
* [3.2.0] Carrot2 core no longer depends on LGPL libraries, so all the JARs can 
now be included in Solr SVN and SOLR-2007 won't need fixing.

Included is a patch against r966211. A ZIP with the JARs will follow in a sec.

A couple of notes:

* The upgrade requires moving from Google Collections to Guava. This is a 
drop-in replacement: all tests pass for me after the upgrade, and the switch is 
[recommended|http://code.google.com/p/google-collections/] on the original 
Google Collections site.
* The patch includes a Carrot2 3.4.0-dev JAR, but I guess it's worth committing 
already to avoid the library download hassle (SOLR-2007).
* Out of the box, Carrot2 supports clustering of Chinese content based on the 
Smart Chinese Tokenizer. This tokenizer would have to be referenced from the 
{{LuceneLanguageModelFactory}} class in Solr. However, when compiling the code 
with Ant, smartcn doesn't seem to be available on the classpath. Is it a matter 
of modifying the build files, or is there a policy on dependencies between 
plugins?

Let me know if you have any problems applying the patch.

Thanks!

S.


> Upgrade Carrot2 to 3.2.0
> 
>
> Key: SOLR-1804
> URL: https://issues.apache.org/jira/browse/SOLR-1804
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Clustering
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Attachments: SOLR-1804-carrot2-3.4.0-dev.patch
>
>
> http://project.carrot2.org/release-3.2.0-notes.html
> Carrot2 is now LGPL-free, which means we should be able to bundle the binary!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890718#action_12890718
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. But... ICU's license is compatible w/ ASL (I think), and includes a working 
impl of BOCU-1, so aren't we in the clear here? I.e., we are free to take that 
impl, tweak it, add it to our sources, and include ICU's license in our 
LICENSE/NOTICE?

I don't know... personally I wouldn't feel comfortable committing something 
without getting guidance first. But we can explore the technicals with patches 
on this JIRA issue without checking the license-grant box, and I think that is 
all OK for now.


> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890709#action_12890709
 ] 

Michael McCandless commented on LUCENE-1799:


{quote}

> Is there any reason not to make BOCU-1 Lucene's default encoding?

in my opinion, just IBM :)
{quote}

But... ICU's license is compatible w/ ASL (I think), and includes a working 
impl of BOCU-1, so aren't we in the clear here?  I.e., we are free to take that 
impl, tweak it, add it to our sources, and include ICU's license in our 
LICENSE/NOTICE?

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890706#action_12890706
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. Is there any reason not to make BOCU-1 Lucene's default encoding?

in my opinion, just IBM :) But maybe we can make a strong implementation and 
they will approve it and grant us the patent license:

http://unicode.org/notes/tn6/#Intellectual_Property

bq. UTF-8 penalizes non-English languages, and BOCU-1 does not, and it sounds 
like we expect little to no indexing or searching perf penalty (once we have a 
faster interface to BOCU-1, e.g. our own private impl, like UnicodeUtil).

I'd like to play with swapping it in as the default, just to see what problems 
(if any) there are, and to make sure all queries are supported, etc. I can 
upload a new patch that does it this way and we can play.


> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890703#action_12890703
 ] 

Michael McCandless commented on LUCENE-1799:


Is there any reason not to make BOCU-1 Lucene's default encoding?

UTF-8 penalizes non-English languages, and BOCU-1 does not, and it sounds like 
we expect little to no indexing or searching perf penalty (once we have a 
faster interface to BOCU-1, e.g. our own private impl, like UnicodeUtil).
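
As a quick, informal way to check the size difference, one could run a sketch 
like the following; it assumes icu4j-charset.jar is on the classpath so that 
the BOCU-1 charset is registered with java.nio:

{code}
// Informal size check, assuming icu4j-charset.jar is on the classpath so the
// "BOCU-1" charset is registered (otherwise Charset.forName() throws).
import java.nio.charset.Charset;

public class BocuVsUtf8 {
  public static void main(String[] args) {
    Charset utf8 = Charset.forName("UTF-8");
    Charset bocu = Charset.forName("BOCU-1");
    // One Latin and one Cyrillic sample term.
    String[] terms = { "compression", "\u0441\u0436\u0430\u0442\u0438\u0435" };
    for (String term : terms) {
      System.out.println(term + ": UTF-8=" + term.getBytes(utf8).length
          + " bytes, BOCU-1=" + term.getBytes(bocu).length + " bytes");
    }
  }
}
{code}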

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890654#action_12890654
 ] 

Robert Muir commented on LUCENE-1799:
-

bq. You can use any Charset to encode your terms. The javadocs should only 
note that the byte[] order must be correct for range queries to work

I don't think we should add support for any non-Unicode character sets.

bq. If you want your complete index e.g. in ISO-8859-1

I am 100% against doing this.

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2514) Change Term to use bytes

2010-07-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890646#action_12890646
 ] 

Robert Muir commented on LUCENE-2514:
-

bq. This would also mean the BOCU-1 encoding could be used drop-in w/ 
QueryParser for basic (Term, Phrase) queries right?

Yes, they should then work (or there is a bug!)

> Change Term to use bytes
> 
>
> Key: LUCENE-2514
> URL: https://issues.apache.org/jira/browse/LUCENE-2514
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Affects Versions: 4.0
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Attachments: LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514_qp.patch
>
>
> In LUCENE-2426, the sort order was changed to codepoint order.
> Unfortunately, Term is still using String internally, and more importantly 
> its compareTo() uses the wrong order (UTF-16).
> So MultiTermQuery, etc. (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes, 
> such as numerics, instead of using strange string encodings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890643#action_12890643
 ] 

Michael Busch commented on LUCENE-2324:
---

OK, I committed to the branch.  I'll try tomorrow to merge trunk into the 
branch.  I was already warned that there will most likely be lots of conflicts 
- so help is welcome! :)  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-07-21 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: lucene-2324.patch

Finally a new version of the patch! (Sorry for keeping you guys waiting...)

It's not done yet, but it compiles (against the realtime branch!) and >95% of 
the core test cases pass.

Work done in addition to the last patch:

- Added DocumentsWriterPerThread
- Reimplemented big parts of DocumentsWriter
- Added DocumentsWriterThreadPool, which is an extension point for different 
  pool implementations.  The default impl is the 
  ThreadAffinityDocumentsWriterThreadPool, which does what the old code did 
  (try to always assign a DWPT to the same thread); a rough sketch follows at 
  the end of this comment.  It should be easy now to add 
  Document#getSourceID() and another pool that can assign threads based on 
  the sourceID.
- Initial implementation of sequenceIDs.  Currently they're only used to keep 
  track of deletes and not yet for e.g. NRT readers.
- Lots of other changes here and there.

TODOs:

- Implement flush-by-RAM logic
- Implement logic to discard deletes from the deletes buffer
- Finish sequenceID handling: IW#commit() and IW#close() should return the ID 
  of the last flushed sequenceID
- Maybe change delete logic: currently deletes are applied when a segment is 
  flushed.  Maybe we can keep it this way in the realtime branch though, 
  because that's most likely what we want to do once the RAM buffer is 
  searchable and deletes are cheaper, as they can then be done in-memory 
  before flush
- Fix unit tests (mostly exception handling and thread safety)
- New test cases, e.g. for sequenceID testing
- Simplify code: in some places I copied code around, which can probably be 
  further simplified
- I started removing some of the old setters/getters in IW which are not in 
  IndexWriterConfig - need to finish that, or revert those changes and use a 
  different patch
- Fix nocommits
- Performance testing

I'm planning to commit this soon to the realtime branch, even though it's 
obviously not done yet.  But it's a big 
patch and changes will be easier to track with an svn history.
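
For orientation, here is a rough sketch of the thread-affinity idea described 
above. The DocumentsWriterPerThread name comes from this patch, but the body 
below is purely illustrative, not the patch's code:

{code}
import java.util.concurrent.ConcurrentHashMap;

// Sketch: each indexing thread is sticky-assigned to one DWPT, so its
// documents always go into the same private in-memory segment.
class ThreadAffinityPoolSketch {
  private final ConcurrentHashMap<Thread, DocumentsWriterPerThread> affinity =
      new ConcurrentHashMap<Thread, DocumentsWriterPerThread>();

  DocumentsWriterPerThread acquire() {
    Thread current = Thread.currentThread();
    DocumentsWriterPerThread dwpt = affinity.get(current);
    if (dwpt == null) {
      dwpt = new DocumentsWriterPerThread(); // create (or borrow) a free DWPT
      affinity.put(current, dwpt);           // remember the assignment
    }
    return dwpt; // the caller indexes its document into this DWPT
  }

  // Placeholder standing in for the real class in the patch.
  static class DocumentsWriterPerThread {}
}
{code}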

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 4.0
>
> Attachments: lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Sequence IDs for NRT deletes

2010-07-21 Thread Michael McCandless
On Tue, Jul 20, 2010 at 4:21 PM, Jason Rutherglen
 wrote:
>> Right, much less GC if app frequently reopens.  But a 32X increase in
>> RAM usage is not trivial; I think we shouldn't enable it by default?
>
> Right, the RAM usage is quite high!  Is there a more compact
> representation we could use?  Ah well, either way for good RT
> performance, there are some users who may want to use this option.

Well, packed ints are more compact, but the decode cost would probably
be catastrophic :)

Maybe you could also use a smaller type (byte[], short[]) for sequence
ids, but you'd then have to handle wraparound/overflow.  (In fact,
even w/ int[] you have to handle wraparound? long[] is probably safe
:) )  E.g., on overflow you'd have to allocate all new (zero'd) arrays
for the next re-opened reader?

>> Have you tested?
>
> The test would be a basic benchmark of queries against BV vs. an int[]
> of deletes?

Yes, in a normal reader (i.e., not testing NRT -- just testing the cost of
applying deletes via int cmp instead of a BV lookup).
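
As a minimal sketch of the two checks being compared (the field names here are 
hypothetical):

// BitVector-style bit lookup vs. int[] sequence-id comparison. The int[]
// costs 32 bits per doc instead of 1, hence the 32X RAM figure above.
class DeletesCheckSketch {
  final long[] deletedBits;  // 1 bit per doc, packed into longs
  final int[] deleteSeqIds;  // seq id at which each doc was deleted; 0 = live
  final int readerSeqId;     // the snapshot this reader was opened at

  DeletesCheckSketch(long[] bits, int[] seqIds, int readerSeqId) {
    this.deletedBits = bits;
    this.deleteSeqIds = seqIds;
    this.readerSeqId = readerSeqId;
  }

  boolean isDeletedByBit(int doc) {
    return (deletedBits[doc >> 6] & (1L << (doc & 63))) != 0;
  }

  boolean isDeletedBySeqId(int doc) {
    int seq = deleteSeqIds[doc];
    return seq != 0 && seq <= readerSeqId; // plain int compare, no bit math
  }
}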

Mike

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1799) Unicode compression

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890628#action_12890628
 ] 

Michael McCandless commented on LUCENE-1799:


This is fabulous!  And a great example of what's now possible w/ the cutover to 
opaque binary terms w/ flex -- makes it easy to swap out how terms are encoded.

BOCU-1 is a much more compact encoding than UTF-8 for non-Latin languages.

This encoding would also naturally reduce the RAM required for the terms index 
and the Terms/TermsIndex FieldCache (used when you sort by a string field), 
since Lucene just loads the [opaque] term bytes into RAM.

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2514) Change Term to use bytes

2010-07-21 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890630#action_12890630
 ] 

Michael McCandless commented on LUCENE-2514:


+1 to commit

This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser 
for basic (Term, Phrase) queries right?

> Change Term to use bytes
> 
>
> Key: LUCENE-2514
> URL: https://issues.apache.org/jira/browse/LUCENE-2514
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Search
>Affects Versions: 4.0
>Reporter: Robert Muir
>Assignee: Uwe Schindler
> Attachments: LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, 
> LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, 
> LUCENE-2514_qp.patch
>
>
> In LUCENE-2426, the sort order was changed to codepoint order.
> Unfortunately, Term is still using String internally, and more importantly 
> its compareTo() uses the wrong order (UTF-16).
> So MultiTermQuery, etc. (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes, 
> such as numerics, instead of using strange string encodings.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: is there any resource for improve lucene index/search performance

2010-07-21 Thread Michael McCandless
Lucene's dev list and the issue tracking system are the places for ideas
on improving indexing and search performance.

We are always looking to improve performance.

Switching to int mult, using bitmaps, both sound interesting :)

Mike

On Tue, Jul 20, 2010 at 10:59 PM, Li Li  wrote:
> Or where can I find any improvement proposals for Lucene?
> E.g., I want to change the floating-point multiplication to integer
> multiplication, or use bitmaps for high-frequency terms, or something
> else like this. Is there any place where I can find such resources or
> people?
> Thanks.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1799) Unicode compression

2010-07-21 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

A new patch that completely separates the BOCU factory from the implementation 
(which moves to common/miscellaneous). This has the following advantages:

- You can use any Charset to encode your terms. The javadocs should only note 
that the byte[] order must be correct for range queries to work.
- Theoretically you could remove the BOCU classes altogether; anyone who wants 
to use them can simply get the Charset from ICU's factory and pass it to the 
AttributeFactory. The convenience class is still useful, especially if we can 
later natively implement the encoding without NIO (when the patent issues are 
solved...).
- The test for the CustomCharsetTermAttributeFactory uses UTF-8 as the charset 
and verifies that the created BytesRefs have the same format as a BytesRef 
created using UnicodeUtil.
- The test also checks that encoding errors are bubbled up as RuntimeExceptions.

TODO:

- docs
- make handling of encoding errors configurable (replace with a replacement 
char?)
- If you want your complete index e.g. in ISO-8859-1, there should also be 
convenience methods that take CharSequences/char[] in the factory/attribute to 
quickly convert strings to BytesRefs, like UnicodeUtil does - this makes it 
possible to create TermQueries directly using e.g. the ISO-8859-1 encoding.
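
As a rough illustration of the charset-based encoding path (the class and 
method names here are illustrative, not the patch's actual API):

{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative only: encode a term's chars into bytes via an arbitrary
// Charset, roughly what a charset-based term attribute would do.
class CharsetTermEncoderSketch {
  private final CharsetEncoder encoder;

  CharsetTermEncoderSketch(Charset charset) {
    this.encoder = charset.newEncoder();
  }

  byte[] encode(char[] term, int offset, int length) {
    try {
      ByteBuffer bytes = encoder.encode(CharBuffer.wrap(term, offset, length));
      byte[] result = new byte[bytes.remaining()];
      bytes.get(result);
      return result;
    } catch (CharacterCodingException e) {
      // bubble encoding errors up as unchecked, as the test above verifies
      throw new RuntimeException(e);
    }
  }
}
{code}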

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1799) Unicode compression

2010-07-21 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1799:
--

Attachment: LUCENE-1799.patch

Here is a 100% legally valid implementation:

- Linking to icu4j-charsets is done dynamically via reflection. If you don't 
have the ICU4J charsets in your classpath, the attribute throws an explaining 
exception.
- We don't need to ship the rather large JAR file with Lucene just for this 
class.
- We don't have legal patent problems, as we neither ship the API nor use it 
directly.
- The downside is that the test simply prints a warning but passes, so the 
class is not tested until you install icu4j-charsets.jar. We can put the JAR 
file on Hudson so it can be used during nightly builds, or download it 
dynamically during the build.

I added further improvements to the encoder itself:
- fewer variables
- correct error handling for encoding errors
- removed floating point from the main loop
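
Roughly, the reflection-based linking could look like the sketch below; 
com.ibm.icu.charset.CharsetProviderICU is ICU4J's real provider class, the 
rest is illustrative:

{code}
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;

// Sketch: obtain the BOCU-1 charset via reflection so there is no
// compile-time dependency on icu4j-charset.jar.
final class BocuCharsetLoaderSketch {
  static Charset loadBocu1() {
    try {
      Class<?> clazz = Class.forName("com.ibm.icu.charset.CharsetProviderICU");
      CharsetProvider provider = (CharsetProvider) clazz.newInstance();
      return provider.charsetForName("BOCU-1");
    } catch (Exception e) {
      // the "explaining exception" when icu4j-charsets is missing
      throw new UnsupportedOperationException(
          "BOCU-1 support requires icu4j-charset.jar on the classpath", e);
    }
  }
}
{code}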

> Unicode compression
> ---
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.4.1
>Reporter: DM Smith
>Priority: Minor
> Attachments: LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, 
> LUCENE-1799.patch, LUCENE-1799.patch
>
>
> In lucene-1793, there is the off-topic suggestion to provide compression of 
> Unicode data. The motivation was a custom encoding in a Russian analyzer. The 
> original supposition was that it provided a more compact index.
> This led to the comment that a different or compressed encoding would be a 
> generally useful feature. 
> BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM 
> with an implementation in ICU. If Lucene provides its own implementation, a 
> freely available, royalty-free license would need to be obtained.
> SCSU is another Unicode compression algorithm that could be used. 
> An advantage of these methods is that they work on the whole of Unicode. If 
> that is not needed an encoding such as iso8859-1 (or whatever covers the 
> input) could be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org