[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs

2011-04-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019234#comment-13019234
 ] 

Simon Willnauer commented on LUCENE-2956:
-

bq. Though it worries me a little how complex the whole delete/update logic is 
becoming (not only the part this patch adds).
I couldn't agree more. It's been very complex making all the tests pass and 
figuring out all the nifty little corner cases here. A different, somewhat 
simpler approach would be great. Eventually, for Searchable RAM Buffers, we might 
need to switch to seq. ids anyway, but I think for landing DWPT on trunk we can 
go with the current approach. 
I will update the latest patch, commit it to the branch, and merge with trunk 
again. Once that is done I will set up a Hudson build for RT so we give it a 
little exercise while we prepare moving to trunk.

 

 Support updateDocument() with DWPTs
 ---

 Key: LUCENE-2956
 URL: https://issues.apache.org/jira/browse/LUCENE-2956
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2956.patch


 With separate DocumentsWriterPerThreads (DWPT) it can currently happen that 
 the delete part of an updateDocument() is flushed and committed separately 
 from the corresponding new document.
 We need to make sure that updateDocument() is always an atomic operation from 
 an IW.commit() and IW.getReader() perspective.  See LUCENE-2324 for more 
 details.
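
 A minimal sketch of the invariant (illustrative user-level code, not from the 
 patch; dir and conf stand for any Directory and IndexWriterConfig):

{code:java}
IndexWriter w = new IndexWriter(dir, conf);
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
w.addDocument(doc);
w.commit();

// updateDocument() is internally a delete-by-term followed by an add. With
// per-thread DWPTs the two halves could otherwise land in different flushes;
// this issue guarantees no commit/getReader sees the state in between.
w.updateDocument(new Term("id", "1"), doc);

IndexReader r = IndexReader.open(w, true); // NRT reader
// must always hold: exactly one live document for id:1 -- never zero
// (delete flushed alone) and never two (add flushed alone)
assert r.numDocs() == 1;
{code}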

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch for http_proxy support in solr-ruby client

2011-04-13 Thread Duncan Robertson
Hi Otis,

The fork you're talking about is mine! But the repo I forked is not official, so
I am trying to find out where the official version is so I can patch it.

D


On 13/04/2011 04:45, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

 Hi,
 
 Hm, maybe you are asking where solr-ruby actually lives and is being
 developed?
 I'm not sure.  I see it under solr/client/ruby/solr-ruby (no new development
 in 
 ages?), but I also see an *active* solr-ruby fork over on
 https://github.com/bbcrd/solr-ruby .  So if you want to contribute to
 solr-ruby 
 on Github, get yourself a Github account, fork that solr-ruby, make your
 change, 
 and submit it via the pull request.  This is separate from Solr @ Apache.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 From: Duncan Robertson duncan.robert...@bbc.co.uk
 To: dev@lucene.apache.org
 Sent: Tue, April 12, 2011 4:36:17 AM
 Subject: Patch for http_proxy support in solr-ruby client
 
 Hi,
 
 I have a patch for adding http_proxy support to the solr-ruby client.  I
 thought the project was managed via Github, but this turns out not to be the
 case. Is the process the same as for Solr itself?
 
 https://github.com/bbcrd/solr-ruby/compare/5b06e66f4e%5E...a76aee983e
 
 Best,
 Duncan
 
 
 http://www.bbc.co.uk/
 This  e-mail (and any attachments) is confidential and may contain personal
 views  which are not the views of the BBC unless specifically stated.
 If you have  received it in error, please delete it from your system.
 Do not use, copy or  disclose the information in any way nor act in reliance
 on 
 it and notify the  sender immediately.
 Please note that the BBC monitors e-mails sent or  received.
 Further communication will signify your consent to  this.
 
 
 -
 To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2956) Support updateDocument() with DWPTs

2011-04-13 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2956:


Attachment: LUCENE-2956.patch

Here is an updated patch that fixes some spellings, adds atomic updates for Term[] 
and Query[], and removes the LogMergePolicy restriction from TestRollingUpdates.

 Support updateDocument() with DWPTs
 ---

 Key: LUCENE-2956
 URL: https://issues.apache.org/jira/browse/LUCENE-2956
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2956.patch, LUCENE-2956.patch


 With separate DocumentsWriterPerThreads (DWPT) it can currently happen that 
 the delete part of an updateDocument() is flushed and committed separately 
 from the corresponding new document.
 We need to make sure that updateDocument() is always an atomic operation from 
 an IW.commit() and IW.getReader() perspective.  See LUCENE-2324 for more 
 details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-trunk - Build # 7061 - Failure

2011-04-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/7061/

14 tests failed.
REGRESSION:  org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety

Error Message:
Error occurred in thread Thread-110: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/1/test8006296579247039339tmp/_e_2.doc
 (Too many open files in system)

Stack Trace:
junit.framework.AssertionFailedError: Error occurred in thread Thread-110:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/1/test8006296579247039339tmp/_e_2.doc
 (Too many open files in system)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/1/test8006296579247039339tmp/_e_2.doc
 (Too many open files in system)
at 
org.apache.lucene.index.TestIndexReaderReopen.testThreadSafety(TestIndexReaderReopen.java:833)


REGRESSION:  
org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeDocCount0

Error Message:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/3/test8275723700845306539tmp/_0_0.tiv
 (Too many open files in system)

Stack Trace:
java.io.FileNotFoundException: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/3/test8275723700845306539tmp/_0_0.tiv
 (Too many open files in system)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at 
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:448)
at 
org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:312)
at 
org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:348)
at 
org.apache.lucene.index.codecs.VariableGapTermsIndexWriter.<init>(VariableGapTermsIndexWriter.java:161)
at 
org.apache.lucene.index.codecs.standard.StandardCodec.fieldsConsumer(StandardCodec.java:58)
at 
org.apache.lucene.index.PerFieldCodecWrapper$FieldsWriter.<init>(PerFieldCodecWrapper.java:64)
at 
org.apache.lucene.index.PerFieldCodecWrapper.fieldsConsumer(PerFieldCodecWrapper.java:54)
at 
org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:78)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:103)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:65)
at 
org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:55)
at 
org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:567)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2497)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2462)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1211)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1180)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.addDoc(TestIndexWriterMergePolicy.java:221)
at 
org.apache.lucene.index.TestIndexWriterMergePolicy.testMergeDocCount0(TestIndexWriterMergePolicy.java:189)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)


REGRESSION:  
org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull

Error Message:
addIndexes(Directory[]) + optimize() hit IOException after disk space was freed 
up

Stack Trace:
junit.framework.AssertionFailedError: addIndexes(Directory[]) + optimize() hit 
IOException after disk space was freed up
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
at 
org.apache.lucene.index.TestIndexWriterOnDiskFull.testAddIndexOnDiskFull(TestIndexWriterOnDiskFull.java:327)


REGRESSION:  org.apache.lucene.index.TestLongPostings.testLongPostings

Error Message:
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/5/longpostings.6978566692871504462/_14_0.tib
 (Too many open files in system)

Stack Trace:
java.io.FileNotFoundException: 
/usr/home/hudson/hudson-slave/workspace/Lucene-Solr-tests-only-trunk/checkout/lucene/build/test/5/longpostings.6978566692871504462/_14_0.tib
 (Too many open files in system)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at 
org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:448)
at 

TestIndexWriterDelete#testUpdatesOnDiskFull can false fail

2011-04-13 Thread Simon Willnauer
In TestIndexWriterDelete#testUpdatesOnDiskFull, especially between lines
538 and 553, we can get a random exception from the
MockDirectoryWrapper which makes the test fail, since we are not
catching / expecting those exceptions.
I can't make this fail on trunk even in 1000 runs, but on realtime it
fails quickly after I merged this morning. I think we should just
disable the random exceptions for this part and re-enable them after we are
done; see the patch below! - Thoughts?


Index: lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java
===
--- lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java  
(revision
1091721)
+++ lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java  (working
copy)
@@ -536,7 +536,9 @@
 fail(testName + " hit IOException after disk space was freed up");
   }
 }
-
+// prevent throwing a random exception here!!
+final double randomIOExceptionRate = dir.getRandomIOExceptionRate();
+dir.setRandomIOExceptionRate(0.0);
 if (!success) {
   // Must force the close else the writer can have
   // open files which cause exc in MockRAMDir.close
@@ -549,6 +551,7 @@
   _TestUtil.checkIndex(dir);
  TestIndexWriter.assertNoUnreferencedFiles(dir, "after writer.close");
 }
+dir.setRandomIOExceptionRate(randomIOExceptionRate);

 // Finally, verify index is not corrupt, and, if
 // we succeeded, we see all docs changed, and if

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs

2011-04-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019298#comment-13019298
 ] 

Simon Willnauer commented on LUCENE-2956:
-

I committed that patch and merged with trunk.

 Support updateDocument() with DWPTs
 ---

 Key: LUCENE-2956
 URL: https://issues.apache.org/jira/browse/LUCENE-2956
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2956.patch, LUCENE-2956.patch


 With separate DocumentsWriterPerThreads (DWPT) it can currently happen that 
 the delete part of an updateDocument() is flushed and committed separately 
 from the corresponding new document.
 We need to make sure that updateDocument() is always an atomic operation from 
 an IW.commit() and IW.getReader() perspective.  See LUCENE-2324 for more 
 details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Numerical ids for terms?

2011-04-13 Thread Toke Eskildsen
On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
 Hi -- has there been any effort to create a numerical representation of Lucene 
 indices? That is, to use the Lucene Directory backend as a large term-document 
 matrix at index level. As this would require a bijective mapping between terms 
 (per-field, as customary in Lucene) and a numerical index (integer, monotonic 
 from 0 to numTerms()-1), I guess this requires some special modifications 
 to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
It provides ordinal access to terms, represented as longs. In order to
make the access at index level rather than segment level you will have
to merge the ordinals from the different segments.

Unfortunately, it is optional whether a codec supports ordinal-based
term access, and the default codec does not, so you will have to
explicitly select a codec when you build your index.
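
For illustration, per-segment ordinal access looks roughly like this on trunk
(a hedged sketch; the exact method names have shifted between 4.0 snapshots,
and ord() throws if the codec does not support ords):

for (IndexReader seg : reader.getSequentialSubReaders()) {
  TermsEnum te = seg.fields().terms("body").iterator();
  BytesRef term;
  while ((term = te.next()) != null) {
    long ord = te.ord(); // segment-local ordinal
    // index-level ids require merging the (segment, ord) pairs into one
    // global ordering yourself, as noted above
  }
}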


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3018) Lucene Native Directory implementation needs automated build

2011-04-13 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019326#comment-13019326
 ] 

Simon Willnauer commented on LUCENE-3018:
-

varun,

pastebin links are not ideal for working on issues here; you can post small 
snippets directly or upload a patch so we can review it.
nevertheless, the example you added to pastebin looks like a generic 
example. Can you try to integrate it into the trunk/lucene/contrib/misc/build.xml 
file and make it compile NativePosixUtil.cpp? Once you have that you can create a 
patch with svn diff > LUCENE-3018.patch and upload it. If you need 3rd-party 
libs like ant-contrib you can upload them here too.

simon

 Lucene Native Directory implementation needs automated build
 ---

 Key: LUCENE-3018
 URL: https://issues.apache.org/jira/browse/LUCENE-3018
 Project: Lucene - Java
  Issue Type: Wish
  Components: Build
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Varun Thacker
Priority: Minor
 Fix For: 4.0


 Currently the native directory impl in contrib/misc requires manual action to 
 compile the C code, (partially) documented in 
  
 https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/overview.html
 It would be nice if we had an Ant task and documentation for all 
 platforms on how to compile it and set up the prerequisites.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch for http_proxy support in solr-ruby client

2011-04-13 Thread Erik Hatcher
Duncan -

I'm the original creator of solr-ruby and put it under Solr's svn.  But many 
folks are now using RSolr, and even in our own (JRuby-based) product we simply 
use Net::HTTP rather than a library like solr-ruby or RSolr.  

I don't personally have an incentive to continue to maintain solr-ruby, so maybe 
your fork is now official?   Though the git craze has made me wary, 
because so many official versions are simply someone's personal fork.

We can pull solr-ruby from Solr's svn eventually, as something else more 
official takes its place.

Erik



On Apr 13, 2011, at 04:13 , Duncan Robertson wrote:

 Hi Otis,
 
 The fork you're talking about is mine! But the repo I forked is not official, so
 I am trying to find out where the official version is so I can patch it.
 
 D
 
 
 On 13/04/2011 04:45, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
 
 Hi,
 
 Hm, maybe you are asking where solr-ruby actually lives and is being
 developed?
 I'm not sure.  I see it under solr/client/ruby/solr-ruby (no new development
 in 
 ages?), but I also see an *active* solr-ruby fork over on
 https://github.com/bbcrd/solr-ruby .  So if you want to contribute to
 solr-ruby 
 on Github, get yourself a Github account, fork that solr-ruby, make your
 change, 
 and submit it via the pull request.  This is separate from Solr @ Apache.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 From: Duncan Robertson duncan.robert...@bbc.co.uk
 To: dev@lucene.apache.org
 Sent: Tue, April 12, 2011 4:36:17 AM
 Subject: Patch for http_proxy support in solr-ruby client
 
 Hi,
 
 I have a patch for adding http_proxy support to the solr-ruby client.  I
 thought the project was managed via Github, but this turns out not to be the
 case. Is the process the same as for Solr itself?
 
 https://github.com/bbcrd/solr-ruby/compare/5b06e66f4e%5E...a76aee983e
 
 Best,
 Duncan
 
 
 http://www.bbc.co.uk/
 This  e-mail (and any attachments) is confidential and may contain personal
 views  which are not the views of the BBC unless specifically stated.
 If you have  received it in error, please delete it from your system.
 Do not use, copy or  disclose the information in any way nor act in reliance
 on 
 it and notify the  sender immediately.
 Please note that the BBC monitors e-mails sent or  received.
 Further communication will signify your consent to  this.
 
 
 -
 To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 http://www.bbc.co.uk/
 This e-mail (and any attachments) is confidential and may contain personal 
 views which are not the views of the BBC unless specifically stated.
 If you have received it in error, please delete it from your system.
 Do not use, copy or disclose the information in any way nor act in reliance 
 on it and notify the sender immediately.
 Please note that the BBC monitors e-mails sent or received.
 Further communication will signify your consent to this.
   
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3018) Lucene Native Directory implementation needs automated build

2011-04-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019329#comment-13019329
 ] 

Uwe Schindler commented on LUCENE-3018:
---

Hi,
I suggest using ANT Contrib for compiling the C parts. It integrates fine into 
our build infrastructure and supplies ANT tasks for compiling and linking: 
[http://ant-contrib.sourceforge.net/cpptasks/index.html]

I think your pastebin example is using this. We only need to add the JAR to the 
lib folder of Lucene, so ANT can load the plugin.

 Lucene Native Directory implementation needs automated build
 ---

 Key: LUCENE-3018
 URL: https://issues.apache.org/jira/browse/LUCENE-3018
 Project: Lucene - Java
  Issue Type: Wish
  Components: Build
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Varun Thacker
Priority: Minor
 Fix For: 4.0


 Currently the native directory impl in contrib/misc requires manual action to 
 compile the C code, (partially) documented in 
  
 https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/overview.html
 It would be nice if we had an Ant task and documentation for all 
 platforms on how to compile it and set up the prerequisites.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3018) Lucene Native Directory implementation needs automated build

2011-04-13 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019356#comment-13019356
 ] 

Varun Thacker commented on LUCENE-3018:
---

I made the mistake of adding only the ant-contrib jar and trying to compile. This 
requires cpptasks, which is not part of ant-contrib. 
Link to the cpptasks jar: 
http://sourceforge.net/projects/ant-contrib/files/ant-contrib/cpptasks-1.0-beta4/

After adding this jar, I was able to compile the code. 


 Lucene Native Directory implementation needs automated build
 ---

 Key: LUCENE-3018
 URL: https://issues.apache.org/jira/browse/LUCENE-3018
 Project: Lucene - Java
  Issue Type: Wish
  Components: Build
Affects Versions: 4.0
Reporter: Simon Willnauer
Assignee: Varun Thacker
Priority: Minor
 Fix For: 4.0


 Currently the native directory impl in contrib/misc requires manual action to 
 compile the C code, (partially) documented in 
  
 https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/misc/src/java/overview.html
 It would be nice if we had an Ant task and documentation for all 
 platforms on how to compile it and set up the prerequisites.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [HUDSON] Lucene-trunk - Build # 1528 - Still Failing

2011-04-13 Thread Michael McCandless
GC overhead limit exceeded...

Mike

http://blog.mikemccandless.com

On Tue, Apr 12, 2011 at 10:43 PM, Apache Hudson Server
hud...@hudson.apache.org wrote:
 Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1528/

 1 tests failed.
 REGRESSION:  org.apache.lucene.index.TestNRTThreads.testNRTThreads

 Error Message:
 Some threads threw uncaught exceptions!

 Stack Trace:
 junit.framework.AssertionFailedError: Some threads threw uncaught exceptions!
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1232)
        at 
 org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1160)
        at 
 org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:521)




 Build Log (for compile errors):
 [...truncated 11900 lines...]



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: TestIndexWriterDelete#testUpdatesOnDiskFull can false fail

2011-04-13 Thread Michael McCandless
+1

Mike

http://blog.mikemccandless.com

On Wed, Apr 13, 2011 at 5:58 AM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 In TestIndexWriterDelete#testUpdatesOnDiskFull, especially between lines
 538 and 553, we can get a random exception from the
 MockDirectoryWrapper which makes the test fail, since we are not
 catching / expecting those exceptions.
 I can't make this fail on trunk even in 1000 runs, but on realtime it
 fails quickly after I merged this morning. I think we should just
 disable the random exceptions for this part and re-enable them after we are
 done; see the patch below! - Thoughts?


 Index: lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java
 ===
 --- lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java  
 (revision
 1091721)
 +++ lucene/src/test/org/apache/lucene/index/TestIndexWriterDelete.java  
 (working
 copy)
 @@ -536,7 +536,9 @@
             fail(testName + " hit IOException after disk space was freed up");
           }
         }
 -
 +        // prevent throwing a random exception here!!
 +        final double randomIOExceptionRate = dir.getRandomIOExceptionRate();
 +        dir.setRandomIOExceptionRate(0.0);
         if (!success) {
           // Must force the close else the writer can have
           // open files which cause exc in MockRAMDir.close
 @@ -549,6 +551,7 @@
           _TestUtil.checkIndex(dir);
           TestIndexWriter.assertNoUnreferencedFiles(dir, "after writer.close");
         }
 +        dir.setRandomIOExceptionRate(randomIOExceptionRate);

         // Finally, verify index is not corrupt, and, if
         // we succeeded, we see all docs changed, and if

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2956) Support updateDocument() with DWPTs

2011-04-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019370#comment-13019370
 ] 

Jason Rutherglen commented on LUCENE-2956:
--

Simon, nice work.  I agree with Michael B. that the deletes are super complex.  
We had discussed using sequence ids for all segments (not just the RT-enabled 
DWPT ones); however, we never worked out a specification, e.g., for things like 
wraparound if a primitive short[] was used.

Shall we start again on LUCENE-2312?  I think we still need/want to use 
sequence ids there.  The RT DWPTs shouldn't have so many documents that using a 
long[] for the sequence ids is too RAM-consuming?
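
For concreteness, a hypothetical sketch of the sequence-id scheme being 
discussed (invented names, not code from the branch):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Every operation draws a monotonically increasing id; the delete half of an
// update then only applies to documents indexed before it, which keeps
// delete+add atomic without tracking flush boundaries.
class SeqIdSketch {
  private final AtomicLong seqIdGen = new AtomicLong();
  private final long[] docSeqIds; // one slot per buffered docID; 8 bytes/doc,
                                  // whereas a short[] would wrap at 32767

  SeqIdSketch(int maxBufferedDocs) {
    docSeqIds = new long[maxBufferedDocs];
  }

  long add(int docID) {
    long seq = seqIdGen.incrementAndGet();
    docSeqIds[docID] = seq; // remember when this doc was indexed
    return seq;
  }

  boolean deleteApplies(int docID, long deleteSeqId) {
    return docSeqIds[docID] < deleteSeqId;
  }
}
{code}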

 Support updateDocument() with DWPTs
 ---

 Key: LUCENE-2956
 URL: https://issues.apache.org/jira/browse/LUCENE-2956
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: Realtime Branch
Reporter: Michael Busch
Assignee: Simon Willnauer
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2956.patch, LUCENE-2956.patch


 With separate DocumentsWriterPerThreads (DWPT) it can currently happen that 
 the delete part of an updateDocument() is flushed and committed separately 
 from the corresponding new document.
 We need to make sure that updateDocument() is always an atomic operation from 
 an IW.commit() and IW.getReader() perspective.  See LUCENE-2324 for more 
 details.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 7062 - Failure

2011-04-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7062/

1 tests failed.
REGRESSION:  org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message:
Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2894)
at 
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
at java.lang.StringBuffer.append(StringBuffer.java:337)
at 
java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
at 
org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
at 
org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
at 
org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1076)
at 
org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1008)




Build Log (for compile errors):
[...truncated 9243 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



need help in constructing a query

2011-04-13 Thread Ramamurthy, Premila
Need help in constructing a Solr query:

I need the values for a field, and I want only values that do not contain an 
embedded space; that is, the value of the indexed field should not have an 
embedded space.


Please help.

Thanks
Premila


[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019377#comment-13019377
 ] 

Tommaso Teofili commented on SOLR-2436:
---

Hello Koji,
your patch seems fine to me from the functional point of view.

However, I don't think the SolrUIMAConfigurationReader should be emptied; rather 
than removing it, I would assign it the simple responsibility of reading the 
args, without the previous explicit Node traversal but, as you did, using the 
Solr way.
I also made some fixes to remove warnings while getting objects from the 
NamedList.

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath <config>. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated SOLR-2436:
--

Attachment: SOLR-2436-3.patch

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath <config>. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Patch for http_proxy support in solr-ruby client

2011-04-13 Thread Duncan Robertson
Thanks Erik,

I hadn't seen RSolr, and it looks like it fixes all the problems I was having.
Maybe rather than keeping many solutions, I'll just take a look at this one.

Duncan


On 13/04/2011 14:51, Erik Hatcher erik.hatc...@gmail.com wrote:

 Duncan -
 
 I'm the original creator of solr-ruby and put it under Solr's svn.  But many
 folks are now using RSolr, and even in our own (JRuby-based) product we simply
 use Net::HTTP rather than a library like solr-ruby or RSolr.
 
 I don't personally have an incentive to continue to maintain solr-ruby, so maybe
 your fork is now official?   Though the git craze has made me wary,
 because so many official versions are simply someone's personal fork.
 
 We can pull solr-ruby from Solr's svn eventually, as something else more
 official takes its place.
 
 Erik
 
 
 
 On Apr 13, 2011, at 04:13 , Duncan Robertson wrote:
 
 Hi Otis,
 
 The fork you're talking about is mine! But the repo I forked is not official, so
 I am trying to find out where the official version is so I can patch it.
 
 D
 
 
 On 13/04/2011 04:45, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
 
 Hi,
 
 Hm, maybe you are asking where solr-ruby actually lives and is being
 developed?
 I'm not sure.  I see it under solr/client/ruby/solr-ruby (no new development
 in 
 ages?), but I also see an *active* solr-ruby fork over on
 https://github.com/bbcrd/solr-ruby .  So if you want to contribute to
 solr-ruby 
 on Github, get yourself a Github account, fork that solr-ruby, make your
 change, 
 and submit it via the pull request.  This is separate from Solr @ Apache.
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 
 - Original Message 
 From: Duncan Robertson duncan.robert...@bbc.co.uk
 To: dev@lucene.apache.org
 Sent: Tue, April 12, 2011 4:36:17 AM
 Subject: Patch for http_proxy support in solr-ruby client
 
 Hi,
 
 I have a patch for adding http_proxy support to the solr-ruby client.  I
 thought the project was managed via Github, but this turns out not to be the
 case. Is the process the same as for Solr itself?
 
 https://github.com/bbcrd/solr-ruby/compare/5b06e66f4e%5E...a76aee983e
 
 Best,
 Duncan
 
 
 http://www.bbc.co.uk/
 This  e-mail (and any attachments) is confidential and may contain personal
 views  which are not the views of the BBC unless specifically stated.
 If you have  received it in error, please delete it from your system.
 Do not use, copy or  disclose the information in any way nor act in
 reliance
 on 
 it and notify the  sender immediately.
 Please note that the BBC monitors e-mails sent or  received.
 Further communication will signify your consent to  this.
 
 
 -
 To  unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For  additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 http://www.bbc.co.uk/
 This e-mail (and any attachments) is confidential and may contain personal
 views which are not the views of the BBC unless specifically stated.
 If you have received it in error, please delete it from your system.
 Do not use, copy or disclose the information in any way nor act in reliance
 on it and notify the sender immediately.
 Please note that the BBC monitors e-mails sent or received.
 Further communication will signify your consent to this.
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



GSoC: LUCENE-2308: Separately specify a field's type

2011-04-13 Thread Nikola Tanković
Hi all,

if everything goes well I'll be delighted to be part of this project this
summer together with my assigned mentor Mike. My task will be to introduce
new classes to the Lucene core that separate a Field's Lucene
properties from its value (
https://issues.apache.org/jira/browse/LUCENE-2308).

As you can assume, this will largely impact Lucene & Solr, so we need to think
this through thoroughly.

Changes will include:

   - Introduction of a FieldType class that will hold all the extra
   properties now stored inside a Field instance, other than the field value itself.
   - A new FieldTypeAttribute interface will be added to handle extension with
   new field properties, inspired by IndexWriterConfig.
   - Refactoring and dividing of settings for term frequency and positioning
   can also be done (LUCENE-2048, https://issues.apache.org/jira/browse/LUCENE-2048).
   - Discuss possible effects of completion of LUCENE-2310
   (https://issues.apache.org/jira/browse/LUCENE-2310) on this project.
   - An adequate factory class for easier configuration of new Field instances
   together with manually added new FieldTypeAttributes.
   - FieldType, once instantiated, is read-only; only the field's value can be
   changed (see the sketch after this list).
   - A simple hierarchy of Field classes with core properties logically
   predefaulted, e.g.:
  - NumberField,
  - StringField,
  - TextField,
  - NonIndexedField

My questions and issues:

   - Backward compatibility? Will this go to Lucene 3.0?
   - What is the best way to break this into small baby steps?


Kindly,
Nikola Tanković


[jira] [Created] (LUCENE-3026) smartcn analysis throws NullPointerException when the length of analysed text is over 32767

2011-04-13 Thread wangzhenghang (JIRA)
smartcn analysis throws NullPointerException when the length of analysed text 
is over 32767


 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang


That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
makeIndex() method:
  public List<SegToken> makeIndex() {
    List<SegToken> result = new ArrayList<SegToken>();
    int s = -1, count = 0, size = tokenListTable.size();
    List<SegToken> tokenList;
    short index = 0;
    while (count < size) {
      if (isStartExist(s)) {
        tokenList = tokenListTable.get(s);
        for (SegToken st : tokenList) {
          st.index = index;
          result.add(st);
          index++;
        }
        count++;
      }
      s++;
    }
    return result;
  }

'short index = 0;' should be 'int index = 0;'. And that's reported here 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11; the 
author XiaoPingGao has already fixed this bug: 
http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3026) smartcn analysis throws NullPointerException when the length of analysed text is over 32767

2011-04-13 Thread wangzhenghang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhenghang updated LUCENE-3026:
--

Description: 
That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
makeIndex() method:
  public List<SegToken> makeIndex() {
    List<SegToken> result = new ArrayList<SegToken>();
    int s = -1, count = 0, size = tokenListTable.size();
    List<SegToken> tokenList;
    short index = 0;
    while (count < size) {
      if (isStartExist(s)) {
        tokenList = tokenListTable.get(s);
        for (SegToken st : tokenList) {
          st.index = index;
          result.add(st);
          index++;
        }
        count++;
      }
      s++;
    }
    return result;
  }

here 'short index = 0;' should be 'int index = 0;'. And that's reported here 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11; the 
author XiaoPingGao has already fixed this bug: 
http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

  was:
That all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
makeIndex() method:
  public List<SegToken> makeIndex() {
    List<SegToken> result = new ArrayList<SegToken>();
    int s = -1, count = 0, size = tokenListTable.size();
    List<SegToken> tokenList;
    short index = 0;
    while (count < size) {
      if (isStartExist(s)) {
        tokenList = tokenListTable.get(s);
        for (SegToken st : tokenList) {
          st.index = index;
          result.add(st);
          index++;
        }
        count++;
      }
      s++;
    }
    return result;
  }

'short index = 0;' should be 'int index = 0;'. And that's reported here 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2, 
http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11, the 
author XiaoPingGao have already fixed this 
bug:http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java


 smartcn analysis throws NullPointerException when the length of analysed text 
 is over 32767
 

 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang

 That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
 makeIndex() method:
   public List<SegToken> makeIndex() {
     List<SegToken> result = new ArrayList<SegToken>();
     int s = -1, count = 0, size = tokenListTable.size();
     List<SegToken> tokenList;
     short index = 0;
     while (count < size) {
       if (isStartExist(s)) {
         tokenList = tokenListTable.get(s);
         for (SegToken st : tokenList) {
           st.index = index;
           result.add(st);
           index++;
         }
         count++;
       }
       s++;
     }
     return result;
   }
 here 'short index = 0;' should be 'int index = 0;'. And that's reported here 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11; the 
 author XiaoPingGao has already fixed this bug: 
 http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2312) Search on IndexWriter's RAM Buffer

2011-04-13 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019391#comment-13019391
 ] 

Jason Rutherglen commented on LUCENE-2312:
--

In the current patch, I'm copying the parallel array for the end of a term's 
postings on each reader [re]open.  However, in the case where we're opening a 
reader after each indexed document, this is wasteful.  We can simply queue the 
term ids from the last indexed document and only copy the newly updated values 
over to the 'read' only consistent parallel array.
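
A hypothetical sketch of that bookkeeping (invented names, not code from the 
patch):

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Instead of copying the whole parallel array on every reader reopen, queue
// the term ids touched since the last reopen and copy only those slots into
// the read-only snapshot that searches run against.
class PostingsSnapshotSketch {
  private final int[] writeEnds; // live array, mutated while indexing
  private final int[] readEnds;  // consistent snapshot used by searches
  private final Deque<Integer> dirtyTermIds = new ArrayDeque<Integer>();

  PostingsSnapshotSketch(int maxTerms) {
    writeEnds = new int[maxTerms];
    readEnds = new int[maxTerms];
  }

  void onTermIndexed(int termId, int newEnd) {
    writeEnds[termId] = newEnd;
    dirtyTermIds.add(termId); // queue the term ids from the last document
  }

  void publish() { // called on reader [re]open
    Integer termId;
    while ((termId = dirtyTermIds.poll()) != null) {
      readEnds[termId] = writeEnds[termId]; // copy only the updated values
    }
  }
}
{code}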

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Numerical ids for terms?

2011-04-13 Thread Gregor Heinrich

Thanks Toke and Kirill -- I guess that's the way to go (at least until v4.0).

Best regards

gregor

On 4/13/11 3:42 PM, Toke Eskildsen wrote:

On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:

Hi -- has there been any effort to create a numerical representation of Lucene
indices? That is, to use the Lucene Directory backend as a large term-document
matrix at index level. As this would require a bijective mapping between terms
(per-field, as customary in Lucene) and a numerical index (integer, monotonic
from 0 to numTerms()-1), I guess this requires some special modifications
to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
It provides ordinal access to terms, represented as longs. In order to
make the access at index level rather than segment level you will have
to merge the ordinals from the different segments.

Unfortunately, it is optional whether a codec supports ordinal-based
term access, and the default codec does not, so you will have to
explicitly select a codec when you build your index.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-64) strict hierarchical facets

2011-04-13 Thread Relephant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Relephant updated SOLR-64:
--

Attachment: SOLR-64_3.1.0.diff

Hi all, we have just tried to apply solr-64 to 3.1. Attached 
SOLR-64_3.1.0.diff. 

Hope that helps.

 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
Assignee: Koji Sekiguchi
 Fix For: 4.0

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
 SOLR-64.patch, SOLR-64_3.1.0.diff


 Strict Facet Hierarchies... each tag has at most one parent (a tree).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-64) strict hierarchical facets

2011-04-13 Thread Relephant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Relephant updated SOLR-64:
--

Attachment: (was: SOLR-64_3.1.0.diff)

 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
Assignee: Koji Sekiguchi
 Fix For: 4.0

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
 SOLR-64.patch, SOLR-64_3.1.0.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-64) strict hierarchical facets

2011-04-13 Thread Relephant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Relephant updated SOLR-64:
--

Attachment: SOLR-64_3.1.0.patch

 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
Assignee: Koji Sekiguchi
 Fix For: 4.0

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
 SOLR-64.patch, SOLR-64_3.1.0.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-64) strict hierarchical facets

2011-04-13 Thread Relephant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019399#comment-13019399
 ] 

Relephant edited comment on SOLR-64 at 4/13/11 4:04 PM:


Hi all, we have just tried to apply solr-64 to 3.1. Attached 
SOLR-64_3.1.0.patch. 

Hope that helps.

  was (Author: relephant):
Hi all, we have just tried to apply solr-64 to 3.1. Attached 
SOLR-64_3.1.0.diff. 

Hope that helps.
  
 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
Assignee: Koji Sekiguchi
 Fix For: 4.0

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch, 
 SOLR-64.patch, SOLR-64_3.1.0.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2939) Highlighter should try and use maxDocCharsToAnalyze in WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as when using CachingTokenStream

2011-04-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019421#comment-13019421
 ] 

Mark Miller commented on LUCENE-2939:
-

Okay - I'm going to commit to trunk shortly.

 Highlighter should try and use maxDocCharsToAnalyze in 
 WeightedSpanTermExtractor when adding a new field to MemoryIndex as well as 
 when using CachingTokenStream
 

 Key: LUCENE-2939
 URL: https://issues.apache.org/jira/browse/LUCENE-2939
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 3.1.1, 3.2, 4.0

 Attachments: LUCENE-2939.patch, LUCENE-2939.patch, LUCENE-2939.patch, 
 LUCENE-2939.patch


 huge documents can be drastically slower than they need to be, because the 
 entire field is added to the memory index;
 this cost can be greatly reduced in many cases if we try to respect 
 maxDocCharsToAnalyze;
 things can be improved even further by respecting this setting with 
 CachingTokenStream

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



PayloadProcessorProvider Usage

2011-04-13 Thread Shai Erera
Hey,

In Lucene 3.1 we've introduced PayloadProcessorProvider, which allows you to
rewrite the payloads of terms during merge. The main scenario is when you merge
indexes and want to rewrite/remap the payloads of the incoming indexes, but
one can certainly use it to rewrite the payloads of a term in a given
index.

When we worked on it, we thought of two ways the user can rewrite payloads
when he merges indexes:

1) Set PPP on the target IW, call addIndexes(IndexReader), while PPP will be
applied on the incoming directories only.
2) Set PPP on the source IW, call IW.optimize(), then use
targetIW.addIndexes(Directory).

The latter is better since in both cases the incoming segments are rewritten
anyway; however, in the first case you might end up merging segments of the
target index as well, something you might want to avoid (that was the
purpose of optimizing before addIndexes(Directory)).

But it turns out the latter is not so easy to achieve. If the source index
has only one segment (at least in my case, ~100% of the time), then calling
optimize() doesn't do anything, because the MP thinks the index is already
optimized and returns no MergeSpec. To overcome this, I wrote a
ForceOptimizeMP which extends LogMP and forces an optimize even if there is
only one segment.

Another option is to set the noCFSRatio to 1.0 and flip the useCompoundFile
flag (i.e., if the source is compound, create non-compound and vice versa). That
can work too, but I don't think it's very good, because the source index will be
changed from compound to non-compound (or vice versa), which is something the
app didn't want.

So I think option 1 is better, but I wanted to ask if someone knows of a
better way to achieve this?
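
For reference, option 1 boils down to something like this (a hedged sketch;
MyRemappingProvider is a made-up PayloadProcessorProvider subclass, and the
method names follow the 3.1 API as I recall it):

IndexWriter target = new IndexWriter(targetDir, conf);
target.setPayloadProcessorProvider(new MyRemappingProvider());
// the provider is applied to the incoming readers' payloads during the add
target.addIndexes(IndexReader.open(sourceDir));
target.close();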

Shai


[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019465#comment-13019465
 ] 

Uwe Schindler commented on SOLR-2436:
-

I just looked at the patch. Is SOLR-2436_2.patch still active, or has it been 
replaced by Koji's?

I ask because:
{noformat}
+try {
+  final InputSource is = new InputSource(loader.openConfig(uimaConfigFile));
+  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
+  // only enable xinclude, if SystemId is present (makes no sense otherwise)
+  if (is.getSystemId() != null) {
+    try {
+      dbf.setXIncludeAware(true);
+      dbf.setNamespaceAware(true);
+    } catch( UnsupportedOperationException e ) {
+      LOG.warn( "XML parser doesn't support XInclude option" );
+    }
+  }
{noformat}

This XInclude handling is broken (the if-clause never gets executed, because no 
SystemId is ever set on the InputSource). We have a new framework that makes 
XML loading from ResourceLoaders work correctly, even with relative paths! Just 
look at the example committed during the cleanup issue (look at other places in 
Solr where DocumentBuilders or XMLStreamReaders are instantiated). The new Solr 
way to load such files is a special URI scheme that is internally used to 
resolve ResourceLoader resources correctly (see SOLR-1656).

The latest patch looks fine; it embeds the config directly, which seems much 
more consistent.

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019470#comment-13019470
 ] 

Uwe Schindler commented on SOLR-2436:
-

Here is the new way to load XML from ResourceLoaders in Solr (taken from 
Config). This code also intercepts errors and warnings and logs them correctly 
(parsers tend to write them to System.err):

{code:java}
is = new InputSource(loader.openConfig(name));
is.setSystemId(SystemIdResolver.createSystemIdFromResourceName(name));

// only enable XInclude, if a SystemId is available
if (is.getSystemId() != null) {
  try {
    dbf.setXIncludeAware(true);
    dbf.setNamespaceAware(true);
  } catch (UnsupportedOperationException e) {
    log.warn(name + " XML parser doesn't support XInclude option");
  }
}

final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new SystemIdResolver(loader));
db.setErrorHandler(xmllog);
try {
  doc = db.parse(is);
} finally {
  // some XML parsers are broken and don't close the byte stream
  // (but they should according to spec)
  IOUtils.closeQuietly(is.getByteStream());
}
{code}

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Issue Comment Edited] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019470#comment-13019470
 ] 

Uwe Schindler edited comment on SOLR-2436 at 4/13/11 6:14 PM:
--

Here is the new way to load XML from ResourceLoaders in Solr (taken from 
Config). This code also intercepts errors and warnings and logs them correctly 
(parsers tend to write them to System.err):

{code:java}
public static final Logger log = LoggerFactory.getLogger(Config.class);
private static final XMLErrorLogger xmllog = new XMLErrorLogger(log);

...

final InputSource is = new InputSource(loader.openConfig(name));
is.setSystemId(SystemIdResolver.createSystemIdFromResourceName(name));

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// only enable XInclude, if a SystemId is available
if (is.getSystemId() != null) {
  try {
    dbf.setXIncludeAware(true);
    dbf.setNamespaceAware(true);
  } catch (UnsupportedOperationException e) {
    log.warn(name + " XML parser doesn't support XInclude option");
  }
}

final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new SystemIdResolver(loader));
db.setErrorHandler(xmllog);
try {
  doc = db.parse(is);
} finally {
  // some XML parsers are broken and don't close the byte stream
  // (but they should according to spec)
  IOUtils.closeQuietly(is.getByteStream());
}
{code}

  was (Author: thetaphi):
Here is the new way to load XML from ResourceLoaders in Solr (taken from 
Config). This code also intercepts errors and warnings and logs them correctly 
(parsers tend to write them to System.err):

{code:java}
is = new InputSource(loader.openConfig(name));
is.setSystemId(SystemIdResolver.createSystemIdFromResourceName(name));

// only enable XInclude, if a SystemId is available
if (is.getSystemId() != null) {
  try {
    dbf.setXIncludeAware(true);
    dbf.setNamespaceAware(true);
  } catch (UnsupportedOperationException e) {
    log.warn(name + " XML parser doesn't support XInclude option");
  }
}

final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(new SystemIdResolver(loader));
db.setErrorHandler(xmllog);
try {
  doc = db.parse(is);
} finally {
  // some XML parsers are broken and don't close the byte stream
  // (but they should according to spec)
  IOUtils.closeQuietly(is.getByteStream());
}
{code}
  
 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019473#comment-13019473
 ] 

Uwe Schindler commented on SOLR-2436:
-

Maybe we should add my last comment to the Wiki as "Howto load XML from Solr's 
config resources", to prevent broken code from appearing again (if this is no 
longer an issue here, that's fine, I was just alarmed). I had a hard time 
fixing all the XML handling in Solr (DIH is still broken with charsets), but 
XInclude now works as expected everywhere.

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019474#comment-13019474
 ] 

Mark Miller commented on SOLR-2436:
---

bq. Maybe we should add my last comment to the Wiki

+1

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019476#comment-13019476
 ] 

Mark Miller commented on SOLR-2436:
---

Or perhaps we need a utility method and a pointer to that?
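
Such a utility could be little more than a wrapper around the snippet Uwe 
posted above (a sketch only; the method name and its home class are 
hypothetical):

{code:java}
// Sketch: the name is made up, the body mirrors the snippet from this issue.
// Uses the same Solr classes as Config (ResourceLoader, SystemIdResolver,
// XMLErrorLogger) plus javax.xml.parsers and org.w3c.dom.
public static Document loadConfigDocument(ResourceLoader loader, String name,
    XMLErrorLogger xmllog) throws Exception {
  final InputSource is = new InputSource(loader.openConfig(name));
  is.setSystemId(SystemIdResolver.createSystemIdFromResourceName(name));

  final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
  // only enable XInclude, if a SystemId is available
  if (is.getSystemId() != null) {
    try {
      dbf.setXIncludeAware(true);
      dbf.setNamespaceAware(true);
    } catch (UnsupportedOperationException e) {
      // parser without XInclude support: continue without it
    }
  }

  final DocumentBuilder db = dbf.newDocumentBuilder();
  db.setEntityResolver(new SystemIdResolver(loader));
  db.setErrorHandler(xmllog);
  try {
    return db.parse(is);
  } finally {
    // some XML parsers don't close the byte stream, although the spec
    // says they should
    IOUtils.closeQuietly(is.getByteStream());
  }
}
{code}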

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: An IDF variation with penalty for very rare terms

2011-04-13 Thread Marvin Humphrey
On Wed, Apr 13, 2011 at 01:01:09AM +0400, Earwin Burrfoot wrote:
 Excuse me for somewhat of an offtopic, but has anybody ever seen/used -subj-?
 Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
 Traditional log(N/x) tail, but when nearing zero freq, instead of
 going to +inf you do a nice round bump (with controlled
 height/location/sharpness) and drop down to -inf (or zero).
 
I haven't used that technique, nor can I quote academic literature blessing
it.  Nevertheless, what you're doing makes sense to me.

 Rationale is that - most good, discriminating terms are found in at
 least a certain percentage of your documents, but there are lots of
 mostly unique crapterms, which at some collection sizes stop being
 strictly unique and with IDF's help explode your scores.

So you've designed a heuristic that allows you to filter a certain kind of
noise.  It sounds a lot like how people tune length normalization to adapt to
their document collections.  Many tuning techniques are corpus-specific.
Whatever works, works!
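
For concreteness, here's a rough sketch of one way to get that shape (purely
illustrative: the sigmoid gate and its parameters are my own assumptions, not
anything from Lucene or the literature):

{code:java}
// Illustrative only: classic idf multiplied by a sigmoid gate that turns
// negative below a chosen document-frequency pivot. Assumes df >= 1.
static double idfWithRareTermPenalty(double df, double numDocs,
                                     double pivot, double sharpness) {
  double idf = Math.log(numDocs / df);  // traditional log(N/x) tail
  // gate is ~1 well above the pivot, ~0 well below it;
  // sharpness controls how fast the transition happens
  double gate = 1.0 / (1.0 + Math.exp(-sharpness * (df - pivot)));
  // above the pivot this approaches plain idf; below it the score turns
  // negative and heads toward -inf as df -> 0 instead of exploding to +inf
  return idf * (2.0 * gate - 1.0);
}
{code}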

Marvin Humphrey


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 7075 - Failure

2011-04-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7075/

1 tests failed.
REGRESSION:  org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe

Error Message:
Java heap space

Stack Trace:
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2894)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:117)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:589)
    at java.lang.StringBuffer.append(StringBuffer.java:337)
    at java.text.RuleBasedCollator.getCollationKey(RuleBasedCollator.java:617)
    at org.apache.lucene.collation.CollationKeyFilter.incrementToken(CollationKeyFilter.java:93)
    at org.apache.lucene.collation.CollationTestBase.assertThreadSafe(CollationTestBase.java:304)
    at org.apache.lucene.collation.TestCollationKeyAnalyzer.testThreadSafe(TestCollationKeyAnalyzer.java:89)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1082)
    at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1010)




Build Log (for compile errors):
[...truncated 5276 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3026) smartcn analyzer throw NullPointer exception when the length of analysed text over 32767

2011-04-13 Thread wangzhenghang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhenghang updated LUCENE-3026:
--

Summary: smartcn analyzer throw NullPointer exception when the length of 
analysed text over 32767  (was: smartcn analysis throw NullPointer exception 
when the length of analysed text over 32767)

 smartcn analyzer throw NullPointer exception when the length of analysed text 
 over 32767
 

 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang

 That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
 makeIndex() method:
   public List<SegToken> makeIndex() {
     List<SegToken> result = new ArrayList<SegToken>();
     int s = -1, count = 0, size = tokenListTable.size();
     List<SegToken> tokenList;
     short index = 0;
     while (count < size) {
       if (isStartExist(s)) {
         tokenList = tokenListTable.get(s);
         for (SegToken st : tokenList) {
           st.index = index;
           result.add(st);
           index++;
         }
         count++;
       }
       s++;
     }
     return result;
   }
 Here 'short index = 0;' should be 'int index = 0;': with more than 32767 
 tokens the short overflows to a negative index, which later triggers the 
 exception. That was reported at 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11, and the 
 author XiaoPingGao has already fixed this 
 bug: http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3026) smartcn analyzer throw NullPointer exception when the length of analysed text over 32767

2011-04-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019636#comment-13019636
 ] 

Robert Muir commented on LUCENE-3026:
-

This sounds like a bug; do you want to try your hand at contributing a patch?

See http://wiki.apache.org/lucene-java/HowToContribute for some instructions.


 smartcn analyzer throw NullPointer exception when the length of analysed text 
 over 32767
 

 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang

 That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
 makeIndex() method:
   public List<SegToken> makeIndex() {
     List<SegToken> result = new ArrayList<SegToken>();
     int s = -1, count = 0, size = tokenListTable.size();
     List<SegToken> tokenList;
     short index = 0;
     while (count < size) {
       if (isStartExist(s)) {
         tokenList = tokenListTable.get(s);
         for (SegToken st : tokenList) {
           st.index = index;
           result.add(st);
           index++;
         }
         count++;
       }
       s++;
     }
     return result;
   }
 Here 'short index = 0;' should be 'int index = 0;': with more than 32767 
 tokens the short overflows to a negative index, which later triggers the 
 exception. That was reported at 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11, and the 
 author XiaoPingGao has already fixed this 
 bug: http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect

2011-04-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019637#comment-13019637
 ] 

Robert Muir commented on LUCENE-3022:
-

This sounds like a bug; do you want to try your hand at contributing a patch?

See http://wiki.apache.org/lucene-java/HowToContribute for some instructions.


 DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
 -

 Key: LUCENE-3022
 URL: https://issues.apache.org/jira/browse/LUCENE-3022
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 2.9.4, 3.1
Reporter: Johann Höchtl
Priority: Minor
   Original Estimate: 5m
  Remaining Estimate: 5m

 When using the DictionaryCompoundWordTokenFilter with a German dictionary, I 
 got a strange behaviour:
 The German word "streifenbluse" (blouse with stripes) was decompounded to 
 "streifen" (stripe), "reifen" (tire), which makes no sense at all.
 I thought the flag onlyLongestMatch would fix this, because "streifen" is 
 longer than "reifen", but it had no effect.
 So I reviewed the source code and found the problem:
 [code]
 protected void decomposeInternal(final Token token) {
   // Only words longer than minWordSize get processed
   if (token.length() < this.minWordSize) {
     return;
   }

   char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.buffer());

   for (int i = 0; i < token.length() - this.minSubwordSize; ++i) {
     Token longestMatchToken = null;
     for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
       if (i + j > token.length()) {
         break;
       }
       if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
         if (this.onlyLongestMatch) {
           if (longestMatchToken != null) {
             if (longestMatchToken.length() < j) {
               longestMatchToken = createToken(i, j, token);
             }
           } else {
             longestMatchToken = createToken(i, j, token);
           }
         } else {
           tokens.add(createToken(i, j, token));
         }
       }
     }
     if (this.onlyLongestMatch && longestMatchToken != null) {
       tokens.add(longestMatchToken);
     }
   }
 }
 [/code]
 should be changed to 
 [code]
 protected void decomposeInternal(final Token token) {
   // Only words longer than minWordSize get processed
   if (token.termLength() < this.minWordSize) {
     return;
   }
   char[] lowerCaseTermBuffer = makeLowerCaseCopy(token.termBuffer());
   Token longestMatchToken = null;
   for (int i = 0; i < token.termLength() - this.minSubwordSize; ++i) {
     for (int j = this.minSubwordSize - 1; j < this.maxSubwordSize; ++j) {
       if (i + j > token.termLength()) {
         break;
       }
       if (dictionary.contains(lowerCaseTermBuffer, i, j)) {
         if (this.onlyLongestMatch) {
           if (longestMatchToken != null) {
             if (longestMatchToken.termLength() < j) {
               longestMatchToken = createToken(i, j, token);
             }
           } else {
             longestMatchToken = createToken(i, j, token);
           }
         } else {
           tokens.add(createToken(i, j, token));
         }
       }
     }
   }
   if (this.onlyLongestMatch && longestMatchToken != null) {
     tokens.add(longestMatchToken);
   }
 }
 [/code]
 This way only the longest token is actually indexed, and the onlyLongestMatch 
 flag makes sense.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019649#comment-13019649
 ] 

Koji Sekiguchi commented on SOLR-2436:
--

Hi Uwe,

The problematic snippet regarding XInclude handling was first introduced in my 
patch; I borrowed it from DIH and missed something when I did. Thank you for 
the alarm.

Now that we are embedding the config in the update processor instead of 
loading it from outside solrconfig.xml, the problematic snippet is gone.

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-2436) move uimaConfig to under the uima's update processor in solrconfig.xml

2011-04-13 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-2436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019652#comment-13019652
 ] 

Koji Sekiguchi commented on SOLR-2436:
--

The patch looks good, Tommaso!

If it is going to be committed, it breaks back-compat. I think we need a note 
for users in CHANGES.txt.

 move uimaConfig to under the uima's update processor in solrconfig.xml
 --

 Key: SOLR-2436
 URL: https://issues.apache.org/jira/browse/SOLR-2436
 Project: Solr
  Issue Type: Improvement
Affects Versions: 3.1
Reporter: Koji Sekiguchi
Priority: Minor
 Attachments: SOLR-2436-3.patch, SOLR-2436.patch, SOLR-2436.patch, 
 SOLR-2436.patch, SOLR-2436_2.patch


 Solr contrib UIMA has its config just beneath config. I think it should 
 move to uima's update processor tag.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-2467) Custom analyzer load exceptions are not logged.

2011-04-13 Thread Alexander Kistanov (JIRA)
Custom analyzer load exceptions are not logged.
---

 Key: SOLR-2467
 URL: https://issues.apache.org/jira/browse/SOLR-2467
 Project: Solr
  Issue Type: Bug
Affects Versions: 3.1
Reporter: Alexander Kistanov
Priority: Minor


If any exception occurs while loading a custom analyzer, the following catch 
block is executed:

{code:title=solr/src/java/org/apache/solr/schema/IndexSchema.java}
  } catch (Exception e) {
    throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
      "Cannot load analyzer: " + analyzerName );
  }
{code}

The analyzer load exception 'e' is not logged at all.
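
A minimal sketch of a possible fix, assuming SolrException offers a 
constructor that also takes the cause (chaining it makes the original stack 
trace visible):

{code:title=possible fix (sketch)}
  } catch (Exception e) {
    // pass the original exception along as the cause, so it is no longer
    // silently swallowed and shows up in the logs
    throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
      "Cannot load analyzer: " + analyzerName, e );
  }
{code}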


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[HUDSON] Lucene-Solr-tests-only-3.x - Build # 7082 - Failure

2011-04-13 Thread Apache Hudson Server
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/7082/

No tests ran.

Build Log (for compile errors):
[...truncated 118 lines...]



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3026) smartcn analyzer throw NullPointer exception when the length of analysed text over 32767

2011-04-13 Thread wangzhenghang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13019671#comment-13019671
 ] 

wangzhenghang commented on LUCENE-3026:
---

It's done

 smartcn analyzer throw NullPointer exception when the length of analysed text 
 over 32767
 

 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang
 Attachments: LUCENE-3026.patch


 That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
 makeIndex() method:
   public List<SegToken> makeIndex() {
     List<SegToken> result = new ArrayList<SegToken>();
     int s = -1, count = 0, size = tokenListTable.size();
     List<SegToken> tokenList;
     short index = 0;
     while (count < size) {
       if (isStartExist(s)) {
         tokenList = tokenListTable.get(s);
         for (SegToken st : tokenList) {
           st.index = index;
           result.add(st);
           index++;
         }
         count++;
       }
       s++;
     }
     return result;
   }
 Here 'short index = 0;' should be 'int index = 0;': with more than 32767 
 tokens the short overflows to a negative index, which later triggers the 
 exception. That was reported at 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11, and the 
 author XiaoPingGao has already fixed this 
 bug: http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-3026) smartcn analyzer throw NullPointer exception when the length of analysed text over 32767

2011-04-13 Thread wangzhenghang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhenghang updated LUCENE-3026:
--

Attachment: LUCENE-3026.patch

 smartcn analyzer throw NullPointer exception when the length of analysed text 
 over 32767
 

 Key: LUCENE-3026
 URL: https://issues.apache.org/jira/browse/LUCENE-3026
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/analyzers
Affects Versions: 3.1, 4.0
Reporter: wangzhenghang
 Attachments: LUCENE-3026.patch


 That's all because of org.apache.lucene.analysis.cn.smart.hhmm.SegGraph's 
 makeIndex() method:
   public List<SegToken> makeIndex() {
     List<SegToken> result = new ArrayList<SegToken>();
     int s = -1, count = 0, size = tokenListTable.size();
     List<SegToken> tokenList;
     short index = 0;
     while (count < size) {
       if (isStartExist(s)) {
         tokenList = tokenListTable.get(s);
         for (SegToken st : tokenList) {
           st.index = index;
           result.add(st);
           index++;
         }
         count++;
       }
       s++;
     }
     return result;
   }
 Here 'short index = 0;' should be 'int index = 0;': with more than 32767 
 tokens the short overflows to a negative index, which later triggers the 
 exception. That was reported at 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=2 and 
 http://code.google.com/p/imdict-chinese-analyzer/issues/detail?id=11, and the 
 author XiaoPingGao has already fixed this 
 bug: http://code.google.com/p/imdict-chinese-analyzer/source/browse/trunk/src/org/apache/lucene/analysis/cn/smart/hhmm/SegGraph.java

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org