[jira] Resolved: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-1540. - Resolution: Fixed OK, no new failures; closing as fixed. Thanks Shai and Robert for your help here! > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1395) Integrate Katta
[ https://issues.apache.org/jira/browse/SOLR-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991279#comment-12991279 ] JohnWu commented on SOLR-1395: -- TomLiu: as you said:QueryComponent returns DocSlice, but XMLWrite or EmbeddedServer returns SolrDocumentList from DocList. I set the requestHandler to solr.MultiEmbeddedSearchHandler but the queryComponent still return the DocSlice. Can you give me some advices? Thanks! JohnWu > Integrate Katta > --- > > Key: SOLR-1395 > URL: https://issues.apache.org/jira/browse/SOLR-1395 > Project: Solr > Issue Type: New Feature >Affects Versions: 1.4 >Reporter: Jason Rutherglen >Priority: Minor > Fix For: Next > > Attachments: SOLR-1395.patch, SOLR-1395.patch, SOLR-1395.patch, > back-end.log, front-end.log, hadoop-core-0.19.0.jar, katta-core-0.6-dev.jar, > katta-solrcores.jpg, katta.node.properties, katta.zk.properties, > log4j-1.2.13.jar, solr-1395-1431-3.patch, solr-1395-1431-4.patch, > solr-1395-1431-katta0.6.patch, solr-1395-1431-katta0.6.patch, > solr-1395-1431.patch, solr-1395-katta-0.6.2-1.patch, > solr-1395-katta-0.6.2-2.patch, solr-1395-katta-0.6.2-3.patch, > solr-1395-katta-0.6.2.patch, test-katta-core-0.6-dev.jar, > zkclient-0.1-dev.jar, zookeeper-3.2.1.jar > > Original Estimate: 336h > Remaining Estimate: 336h > > We'll integrate Katta into Solr so that: > * Distributed search uses Hadoop RPC > * Shard/SolrCore distribution and management > * Zookeeper based failover > * Indexes may be built using Hadoop -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reassigned LUCENE-2909: -- Assignee: Koji Sekiguchi > NGramTokenFilter may generate offsets that exceed the length of original text > - > > Key: LUCENE-2909 > URL: https://issues.apache.org/jira/browse/LUCENE-2909 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9.4 >Reporter: Shinya Kasatani >Assignee: Koji Sekiguchi >Priority: Minor > Attachments: TokenFilterOffset.patch > > > When using NGramTokenFilter combined with CharFilters that lengthen the > original text (such as "ß" -> "ss"), the generated offsets exceed the length > of the original text. > This causes InvalidTokenOffsetsException when you try to highlight the text > in Solr. > While it is not possible to know the accurate offset of each character once > you tokenize the whole text with tokenizers like KeywordTokenizer, > NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
[ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shinya Kasatani updated LUCENE-2909: Attachment: TokenFilterOffset.patch The patch that fixes the problem, including tests. > NGramTokenFilter may generate offsets that exceed the length of original text > - > > Key: LUCENE-2909 > URL: https://issues.apache.org/jira/browse/LUCENE-2909 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9.4 >Reporter: Shinya Kasatani >Priority: Minor > Attachments: TokenFilterOffset.patch > > > When using NGramTokenFilter combined with CharFilters that lengthen the > original text (such as "ß" -> "ss"), the generated offsets exceed the length > of the original text. > This causes InvalidTokenOffsetsException when you try to highlight the text > in Solr. > While it is not possible to know the accurate offset of each character once > you tokenize the whole text with tokenizers like KeywordTokenizer, > NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
NGramTokenFilter may generate offsets that exceed the length of original text - Key: LUCENE-2909 URL: https://issues.apache.org/jira/browse/LUCENE-2909 Project: Lucene - Java Issue Type: Bug Components: contrib/analyzers Affects Versions: 2.9.4 Reporter: Shinya Kasatani Priority: Minor When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as "ß" -> "ss"), the generated offsets exceed the length of the original text. This causes InvalidTokenOffsetsException when you try to highlight the text in Solr. While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
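The defensive fix the report asks for can be sketched in isolation. This is an illustrative, stand-alone class (the names are mine, not the attached patch or Lucene's API): when a CharFilter lengthens text, end offsets computed on the filtered text can point past the end of the original input, so the filter should clamp them.

```java
// Illustrative only: the real fix lives in Lucene's contrib/analyzers;
// this stand-alone class just shows the clamping idea.
public class OffsetClamp {
    /** Never let a token's end offset exceed the original text length. */
    static int clampEndOffset(int endOffset, int originalLength) {
        return Math.min(endOffset, originalLength);
    }

    public static void main(String[] args) {
        String original = "Straße";   // 6 chars before char filtering
        String filtered = "Strasse";  // "ß" -> "ss" makes it 7 chars
        // An n-gram spanning the whole filtered text would claim end
        // offset 7, which is past the end of the original input.
        System.out.println(clampEndOffset(filtered.length(), original.length())); // 6
    }
}
```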
Keyword - search statistics
Hi, Is there any way I can get the number of times a keyword was searched in Solr? *Here are my Solr package details* Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 -Selvaraj
Re: Distributed Indexing
Hey, We're making good progress, but our DistributedUpdateRequestHandler is having a bit of an identity crisis, so we thought we'd ask what other people's opinions are. The current situation is as follows: We've added a method to ContentStreamHandlerBase to check if an update request is distributed or not (based on the presence/validity of the 'shards' parameter). So a non-distributed request will proceed as normal, but a distributed request would be passed on to the DistributedUpdateRequestHandler to deal with. The reason this choice is made in the ContentStreamHandlerBase is so that the DistributedUpdateRequestHandler can use the URL the request came in on to determine where to distribute update requests. E.g. an update request is sent to: http://localhost:8983/solr/update/csv?shards=shard1,shard2... then the DistributedUpdateRequestHandler knows to send requests to: shard1/update/csv shard2/update/csv Alternatively, if the request wasn't distributed, it would simply be handled by whichever request handler "/update/csv" uses. Herein lies the problem. The DistributedUpdateRequestHandler is not really a request handler in the same way as the CSVRequestHandler or XmlUpdateRequestHandlers are. If anything, it's more like a "plugin" for the various existing update request handlers, to allow them to deal with distributed requests - a "distributor" if you will. It isn't designed to be able to receive and handle requests directly. We would like this "DistributedUpdateRequestHandler" to be defined in the solrconfig to allow flexibility for setting up multiple different DistributedUpdateRequestHandlers with different ShardDistributionPolicies etc., and also to allow us to get the appropriate instance from the core in the code. There seem to be two paths for doing this: 1. Leave it as an implementation of SolrRequestHandler and hope the user doesn't directly send update requests to it (i.e. a request to http://localhost:8983/solr/ would most likely cripple something). 
So it would be defined in the solrconfig something like: 2. Create a new plugin type for the solrconfig, say "updateRequestDistributor" which would involve creating a new interface for the DistributedUpdateRequestHandler to implement, then registering it with the core. It would be defined in the solrconfig something like: solr.HashedDistributionPolicy This would mean that it couldn't directly receive requests, but that an instance could still easily be retrieved from the core to handle the distribution of update requests. Any thoughts on the above issue (or a more succinct, descriptive name for the class) are most welcome! Alex
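The URL mapping described in the message above (same handler path fanned out to each shard) can be sketched with plain string handling. ShardUrlMapper is a hypothetical name for illustration, not a Solr class:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the routing described above, not Solr's API:
// an update arriving at /solr/update/csv?shards=shard1,shard2 is fanned
// out to shard1/update/csv and shard2/update/csv.
public class ShardUrlMapper {
    static List<String> shardUrls(String handlerPath, String shardsParam) {
        List<String> urls = new ArrayList<>();
        for (String shard : shardsParam.split(",")) {
            urls.add(shard + handlerPath); // same handler path on every shard
        }
        return urls;
    }
}
```

For example, shardUrls("/update/csv", "shard1,shard2") yields [shard1/update/csv, shard2/update/csv].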
[jira] Issue Comment Edited: (SOLR-2341) Shard distribution policy
[ https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991225#comment-12991225 ] William Mayor edited comment on SOLR-2341 at 2/6/11 10:00 PM: -- This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into its own package. was (Author: williammayor): This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into it's own package. > Shard distribution policy > - > > Key: SOLR-2341 > URL: https://issues.apache.org/jira/browse/SOLR-2341 > Project: Solr > Issue Type: New Feature >Reporter: William Mayor >Priority: Minor > Attachments: SOLR-2341.patch, SOLR-2341.patch > > > A first crack at creating policies to be used for determining to which of a > list of shards a document should go. See discussion on "Distributed Indexing" > on dev-list. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Arabic Analyzer
Here is a port of lucene.java's arabic analyzer ( https://issues.apache.org/jira/browse/LUCENENET-392 ) You can safely remove nunit dependency and test cases from the project. DIGY -Original Message- From: Ben Foster [mailto:b...@planetcloud.co.uk] Sent: Sunday, February 06, 2011 5:47 PM To: lucene-net-...@lucene.apache.org Subject: Re: Arabic Analyzer Is it still possible to use fixed term queries in Arabic (i.e. NOT using an Analyzer)? Thanks Ben On 6 February 2011 00:51, Prescott Nasser wrote: > > Unfortunately, I don't think we have that. We're working on creating a new > port of the java lucene code, but I don't know the timeline yet - I'm sure > there will be a lot of chatter on this mailing list soon. > > ~Prescott > > > > > > > > Date: Sat, 5 Feb 2011 22:57:11 + > > Subject: Arabic Analyzer > > From: b...@planetcloud.co.uk > > To: lucene-net-...@lucene.apache.org > > > > Is there an Arabic Analyzer available for Lucene.NET. I see there has > been > > one contributed to the Java project but wasn't sure if this has been > ported. > > > > Thanks, > > > > Ben > -- Ben Foster planetcloud The Elms, Hawton Newark-on-Trent Nottinghamshire NG24 3RL www.planetcloud.co.uk
Re: Distributed Indexing
Hi Good call about the policies being deterministic, should've thought of that earlier. We've changed the patch to include this and I've removed the random assignment one (for obvious reasons). Take a look and let me know what's to do. ( https://issues.apache.org/jira/browse/SOLR-2341) Cheers William On Thu, Feb 3, 2011 at 5:00 PM, Upayavira wrote: > > On Thu, 03 Feb 2011 15:12 +, "Alex Cowell" wrote: > > Hi all, > > Just a couple of questions that have arisen. > > 1. For handling non-distributed update requests (shards param is not > present or is invalid), our code currently > >- assumes the user would like the data indexed, so gets the request >handler assigned to "/update" >- executes the request using core.execute() for the SolrCore associated >with the original request > > Is this what we want it to do and is using core.execute() from within a > request handler a valid method of passing on the update request? > > > Take a look at how it is done in > handler.component.SearchHandler.handleRequestBody(). I'd say try to follow > as similar approach as possible. E.g. it is the SearchHandler that does much > of the work, branching depending on whether it found a shards parameter. > > > 2. We have partially implemented an update processor which actually > generates and sends the split update requests to each specified shard (as > designated by the policy). As it stands, the code shares a lot in common > with the HttpCommComponent class used for distributed search. Should we look > at "opening up" the HttpCommComponent class so it could be used by our > request handler as well or should we continue with our current > implementation and worry about that later? > > > I agree that you are going to want to implement an UpdateRequestProcessor. > However, it would seem to me that, unlike search, you're not going to want > to bother with the existing processor and associated component chain, you're > going to want to replace the processor with a distributed version. 
> > As to the HttpCommComponent, I'd suggest you make your own educated > decision. How similar is the class? Could one serve both needs effectively? > > > 3. Our update processor uses a MultiThreadedHttpConnectionManager to send > parallel updates to shards, can anyone give some appropriate values to be > used for the defaultMaxConnectionsPerHost and maxTotalConnections params? > Won't the values used for distributed search be a little high for > distributed indexing? > > > You are right, these will likely be lower for distributed indexing, however > I'd suggest not worrying about it for now, as it is easy to tweak later. > > Upayavira > > --- > Enterprise Search Consultant at Sourcesense UK, > Making Sense of Open Source >
[jira] Updated: (SOLR-2341) Shard distribution policy
[ https://issues.apache.org/jira/browse/SOLR-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] William Mayor updated SOLR-2341: Attachment: SOLR-2341.patch This patch makes the implemented policy deterministic. This is missing from the previous patch. The policy code has also been refactored into its own package. > Shard distribution policy > - > > Key: SOLR-2341 > URL: https://issues.apache.org/jira/browse/SOLR-2341 > Project: Solr > Issue Type: New Feature >Reporter: William Mayor >Priority: Minor > Attachments: SOLR-2341.patch, SOLR-2341.patch > > > A first crack at creating policies to be used for determining to which of a > list of shards a document should go. See discussion on "Distributed Indexing" > on dev-list. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
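A deterministic policy of the kind this patch describes can be sketched as a pure function of the document id. The class below is illustrative only (the name and API are mine, not the SOLR-2341 patch):

```java
// Illustrative sketch, not the SOLR-2341 patch itself: a deterministic
// hash-based policy maps the same document id to the same shard every
// time, which is what makes distributed indexing reproducible.
public class HashedDistributionPolicy {
    private final int numShards;

    public HashedDistributionPolicy(int numShards) {
        this.numShards = numShards;
    }

    /** Shard index for a document id; stable across calls and JVMs
     *  because String.hashCode() is specified by the JDK. */
    public int shardFor(String docId) {
        // floorMod keeps the index non-negative even for negative hashes
        return Math.floorMod(docId.hashCode(), numShards);
    }
}
```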
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991224#comment-12991224 ] Doron Cohen commented on LUCENE-1540: - Following suggestions by Robert, brought back case insensitivity of path names by upper casing with Locale.ENGLISH as suggested in [toUpperCase()|http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase%28%29]. Committed: - r1067764 - 3x - r1067772 - trunk > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
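The pitfall behind this fix can be reproduced with the JDK alone: under the Turkish locale, upper-casing "i" yields the dotted capital İ (U+0130), so a lookup keyed on names like "FBIS" silently fails, while Locale.ENGLISH gives the same result on every machine. A minimal sketch (the class name is mine):

```java
import java.util.Locale;

// Demonstrates why toUpperCase(Locale.ENGLISH) is needed for
// locale-insensitive strings such as TREC directory names.
public class LocaleUpperCase {
    public static void main(String[] args) {
        String dirName = "fbis";
        // Default-locale behavior varies by machine; Turkish is the
        // classic counterexample: 'i' upper-cases to 'İ' (U+0130).
        String turkish = dirName.toUpperCase(new Locale("tr", "TR")); // "FBİS"
        String english = dirName.toUpperCase(Locale.ENGLISH);          // "FBIS"
        System.out.println(turkish.equals(english)); // false
    }
}
```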
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991221#comment-12991221 ] hao yan commented on LUCENE-2903: - Hi, Paul I tested ByteBuffer->IntBuffer; it is not faster than converting int[] <-> byte[]. > Improvement of PForDelta Codec > -- > > Key: LUCENE-2903 > URL: https://issues.apache.org/jira/browse/LUCENE-2903 > Project: Lucene - Java > Issue Type: Improvement >Reporter: hao yan > Attachments: LUCENE_2903.patch, LUCENE_2903.patch > > > There are 3 versions of PForDelta implementations in the Bulk Branch: > FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. > The FrameOfRef is a very basic one which is essentially a binary encoding > (may result in huge index size). > The PatchedFrameOfRef is the implementation based on the original version of > PForDelta in the literature. > The PatchedFrameOfRef2 is my previous implementation, which has been improved this > time. (The codec name has been changed to NewPForDelta.) > In particular, the changes are: > 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the > old PForDelta does not support very large exceptions (since > the Simple16 does not support very large numbers). Now this has been fixed in > the new LCPForDelta. > 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other > two PForDelta implementations in the bulk branch (FrameOfRef and > PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the > CodecProvider and PForDeltaFixedIntBlockCodec. > 3. The performance test results are: > 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef > for almost all kinds of queries, slightly worse than BulkVInt. 
> 2) My "NewPForDelta" codec can result in the smallest index size among all four > methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). > 3) All performance test results are achieved by running with "-server" > instead of "-client" -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
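The two conversion strategies being compared can be sketched in isolation (the class and method names are mine, for illustration); the comment above reports that the IntBuffer view was not faster in practice than manual conversion:

```java
import java.nio.ByteBuffer;

// Two ways to read ints out of a byte[]: via a big-endian IntBuffer view
// of the bytes, and via manual shifting. Both decode the same values.
public class IntBytes {
    static int firstIntViaView(byte[] bytes) {
        // ByteBuffer defaults to big-endian, matching the manual version
        return ByteBuffer.wrap(bytes).asIntBuffer().get(0);
    }

    static int firstIntManually(byte[] b) {
        return ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
             | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
    }
}
```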
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991222#comment-12991222 ] hao yan commented on LUCENE-2903: - And it certainly complicates the PForDelta algorithm a lot to use IntBuffer.set/get. > Improvement of PForDelta Codec > -- > > Key: LUCENE-2903 > URL: https://issues.apache.org/jira/browse/LUCENE-2903 > Project: Lucene - Java > Issue Type: Improvement >Reporter: hao yan > Attachments: LUCENE_2903.patch, LUCENE_2903.patch > > > There are 3 versions of PForDelta implementations in the Bulk Branch: > FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. > The FrameOfRef is a very basic one which is essentially a binary encoding > (may result in huge index size). > The PatchedFrameOfRef is the implementation based on the original version of > PForDelta in the literature. > The PatchedFrameOfRef2 is my previous implementation, which has been improved this > time. (The codec name has been changed to NewPForDelta.) > In particular, the changes are: > 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the > old PForDelta does not support very large exceptions (since > the Simple16 does not support very large numbers). Now this has been fixed in > the new LCPForDelta. > 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other > two PForDelta implementations in the bulk branch (FrameOfRef and > PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the > CodecProvider and PForDeltaFixedIntBlockCodec. > 3. The performance test results are: > 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef > for almost all kinds of queries, slightly worse than BulkVInt. 
> 2) My "NewPForDelta" codec can result in the smallest index size among all four > methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). > 3) All performance test results are achieved by running with "-server" > instead of "-client" -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991220#comment-12991220 ] hao yan commented on LUCENE-2903: - Hi Michael, did you try FrameOfRef and PatchedFrameOfRef? > Improvement of PForDelta Codec > -- > > Key: LUCENE-2903 > URL: https://issues.apache.org/jira/browse/LUCENE-2903 > Project: Lucene - Java > Issue Type: Improvement >Reporter: hao yan > Attachments: LUCENE_2903.patch, LUCENE_2903.patch > > > There are 3 versions of PForDelta implementations in the Bulk Branch: > FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. > The FrameOfRef is a very basic one which is essentially a binary encoding > (may result in huge index size). > The PatchedFrameOfRef is the implementation based on the original version of > PForDelta in the literature. > The PatchedFrameOfRef2 is my previous implementation, which has been improved this > time. (The codec name has been changed to NewPForDelta.) > In particular, the changes are: > 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the > old PForDelta does not support very large exceptions (since > the Simple16 does not support very large numbers). Now this has been fixed in > the new LCPForDelta. > 2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other > two PForDelta implementations in the bulk branch (FrameOfRef and > PatchedFrameOfRef). The codec's name is "NewPForDelta", as you can see in the > CodecProvider and PForDeltaFixedIntBlockCodec. > 3. The performance test results are: > 1) My "NewPForDelta" codec is faster than FrameOfRef and PatchedFrameOfRef > for almost all kinds of queries, slightly worse than BulkVInt. 
> 2) My "NewPForDelta" codec can result in the smallest index size among all four > methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself). > 3) All performance test results are achieved by running with "-server" > instead of "-client" -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1067699 - in /lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src: java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java test/org/apache/lucene/benchmark/byTask/feed
Interesting... Thanks Robert for pointing this out! > "To obtain correct results for locale insensitive strings, use toUpperCase(Locale.ENGLISH)" Actually this is one of the things I tried and it did solve it - with toUpperCase(Locale.US) - not exactly Locale.ENGLISH but quite similar I assume - and as you suggest it felt wrong, for the wrong reasons... Perhaps I'll change it like this; case insensitivity is a good thing when running on various OSes. On Sun, Feb 6, 2011 at 6:55 PM, Robert Muir wrote: > Thanks for catching this Doron. Another option if you want to keep the > case-insensitive feature here would be to use > toUpperCase(Locale.ENGLISH) > > It might look bad, but its actually recommended by the JDK for > locale-insensitive strings: > > http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase() > > On Sun, Feb 6, 2011 at 11:43 AM, wrote: > > Author: doronc > > Date: Sun Feb 6 16:43:54 2011 > > New Revision: 1067699 > > > > URL: http://svn.apache.org/viewvc?rev=1067699&view=rev > > Log: > > LUCENE-1540: Improvements to contrib.benchmark for TREC collections - fix > test failures in some locales due to toUpperCase() > > > > Modified: > > > > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > > > > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip > > > > Modified: > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > > URL: > http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java?rev=1067699&r1=1067698&r2=1067699&view=diff > > > == > > --- > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > (original) > > +++ > 
lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > Sun Feb 6 16:43:54 2011 > > @@ -29,7 +29,12 @@ import java.util.Map; > > public abstract class TrecDocParser { > > > > /** Types of trec parse paths, */ > > - public enum ParsePathType { GOV2, FBIS, FT, FR94, LATIMES } > > + public enum ParsePathType { GOV2("gov2"), FBIS("fbis"), FT("ft"), > FR94("fr94"), LATIMES("latimes"); > > +public final String dirName; > > +private ParsePathType(String dirName) { > > + this.dirName = dirName; > > +} > > + } > > > > /** trec parser type used for unknown extensions */ > > public static final ParsePathType DEFAULT_PATH_TYPE = > ParsePathType.GOV2; > > @@ -46,7 +51,7 @@ public abstract class TrecDocParser { > > static final Map pathName2Type = new > HashMap(); > > static { > > for (ParsePathType ppt : ParsePathType.values()) { > > - pathName2Type.put(ppt.name(),ppt); > > + pathName2Type.put(ppt.dirName,ppt); > > } > > } > > > > @@ -59,7 +64,7 @@ public abstract class TrecDocParser { > > public static ParsePathType pathType(File f) { > > int pathLength = 0; > > while (f != null && ++pathLength < MAX_PATH_LENGTH) { > > - ParsePathType ppt = pathName2Type.get(f.getName().toUpperCase()); > > + ParsePathType ppt = pathName2Type.get(f.getName()); > > if (ppt!=null) { > > return ppt; > > } > > > > Modified: > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip > > URL: > http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip?rev=1067699&r1=1067698&r2=1067699&view=diff > > > == > > Binary files - no diff available. > > > > > > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991210#comment-12991210 ] Doron Cohen commented on LUCENE-1540: - bq. I wish we knew of a good solution, because I hate it when things aren't completely reproducible everywhere. Thanks Robert, I am actually very pleased with this array of testing with various parameters like locale and others randomly selected - it is very powerful, and since the failure printed all the parameters used and even the ant line to reproduce(!) - it was possible to reproduce in 3x, and, once I understood what the problem was, also possible to reproduce in trunk - to me this is testing's heaven... > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. 
- For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2609) Generate jar containing test classes.
[ https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera resolved LUCENE-2609. Resolution: Fixed Committed revision 1067738. Thanks all for your comments and help ! > Generate jar containing test classes. > - > > Key: LUCENE-2609 > URL: https://issues.apache.org/jira/browse/LUCENE-2609 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.2 >Reporter: Drew Farris >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, > LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, > LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch > > > The test classes are useful for writing unit tests for code external to the > Lucene project. It would be helpful to build a jar of these classes and > publish them as a maven dependency. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991198#comment-12991198 ] Uwe Schindler commented on LUCENE-2907: --- Thanks, really nice now :-) > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907.patch, > LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, > seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-2907. - Resolution: Fixed Assignee: Robert Muir Committed revision 1067720. > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir >Assignee: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907.patch, > LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, > seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2256) CommonsHttpSolrServer.deleteById(emptyList) causes SolrException: missing_content_stream
[ https://issues.apache.org/jira/browse/SOLR-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991191#comment-12991191 ] Stevo Slavic commented on SOLR-2256: I've experienced similar behavior with SolrJ 1.4.1 - later discovered that the actual problem was an outdated index schema: it was missing a field that was present in the document. > CommonsHttpSolrServer.deleteById(emptyList) causes SolrException: > missing_content_stream > > > Key: SOLR-2256 > URL: https://issues.apache.org/jira/browse/SOLR-2256 > Project: Solr > Issue Type: Bug > Components: clients - java >Affects Versions: 1.4.1 >Reporter: Maxim Valyanskiy >Priority: Minor > > Call to deleteById method of CommonsHttpSolrServer with empty list causes > following exception: > org.apache.solr.common.SolrException: missing_content_stream > missing_content_stream > request: http://127.0.0.1:8983/solr/update/javabin > at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435) > at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) > at > org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105) > at org.apache.solr.client.solrj.SolrServer.deleteById(SolrServer.java:106) > at > ru.org.linux.spring.SearchQueueListener.reindexMessage(SearchQueueListener.java:89) > Here is TCP stream captured by Wireshark: > = > POST /solr/update HTTP/1.1 > Content-Type: application/x-www-form-urlencoded; charset=UTF-8 > User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0 > Host: 127.0.0.1:8983 > Content-Length: 20 > wt=javabin&version=1 > = > HTTP/1.1 400 missing_content_stream > Content-Type: text/html; charset=iso-8859-1 > Content-Length: 1401 > Server: Jetty(6.1.3) > = [ html reply skipped ] === -- This message is automatically generated by JIRA. 
- For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
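Until the server-side behavior changes, a caller-side guard avoids the 400 entirely: skip the request when the id list is empty, since an empty delete posts a request with no content stream. The sketch below is illustrative only — `DeleteClient` is a hypothetical stand-in for the SolrJ client, not its real interface; only the guard logic is the point.

```java
import java.util.Collections;
import java.util.List;

public class SafeDelete {
    // Hypothetical stand-in for the SolrJ client; only the guard matters here.
    interface DeleteClient {
        void deleteById(List<String> ids) throws Exception;
    }

    // Returns true only if a delete request was actually issued.
    static boolean deleteByIdSafe(DeleteClient client, List<String> ids)
            throws Exception {
        if (ids == null || ids.isEmpty()) {
            return false; // skip: an empty delete would post no content stream
        }
        client.deleteById(ids);
        return true;
    }

    public static void main(String[] args) throws Exception {
        DeleteClient noop = ids -> { /* pretend to talk to Solr */ };
        System.out.println(deleteByIdSafe(noop, Collections.<String>emptyList()));
        System.out.println(deleteByIdSafe(noop, Collections.singletonList("doc1")));
    }
}
```

The same check could of course live inside deleteById itself, returning an empty UpdateResponse instead of posting.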
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991187#comment-12991187 ] Uwe Schindler commented on LUCENE-2908: --- +1 > clean up serialization in the codebase > -- > > Key: LUCENE-2908 > URL: https://issues.apache.org/jira/browse/LUCENE-2908 > Project: Lucene - Java > Issue Type: Task >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-2908.patch > > > We removed contrib/remote, but forgot to cleanup serialization hell > everywhere. > this is no longer needed, never really worked (e.g. across versions), and > slows > development (e.g. i wasted a long time debugging stupid serialization of > Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK
[ https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991186#comment-12991186 ] Robert Muir commented on LUCENE-2906: - {quote} How will this differ from the SmartChineseAnalyzer? {quote} The SmartChineseAnalyzer is for Simplified Chinese only... this is about the language-independent technique similar to what CJKAnalyzer does today. {quote} I doubt it but can this be in 3.1? {quote} Well i hate the way CJKAnalyzer treats things like supplementary characters (wrongly). This is definitely a bug, and fixed here. Part of me wants to fix this as quickly as possible. At the same time though, I would prefer 3.2... otherwise I would feel like I am rushing things. I don't think 3.2 needs to come a year after 3.1... in fact since we have a stable branch I think its stupid to make bugfix releases like 3.1.1 when we could just push out a new minor version (3.2) with bugfixes instead. The whole branch is intended to be stable changes, so I think this is better use of our time. But this is just my opinion, we can discuss it later on the list as one idea to promote more rapid releases. > Filter to process output of ICUTokenizer and create overlapping bigrams for > CJK > > > Key: LUCENE-2906 > URL: https://issues.apache.org/jira/browse/LUCENE-2906 > Project: Lucene - Java > Issue Type: New Feature > Components: Analysis >Reporter: Tom Burton-West >Priority: Minor > Attachments: LUCENE-2906.patch > > > The ICUTokenizer produces unigrams for CJK. We would like to use the > ICUTokenizer but have overlapping bigrams created for CJK as in the CJK > Analyzer. This filter would take the output of the ICUtokenizer, read the > ScriptAttribute and for selected scripts (Han, Kana), would produce > overlapping bigrams. -- This message is automatically generated by JIRA. 
- For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991181#comment-12991181 ] Doron Cohen commented on LUCENE-1540: - Fix for the locale issue merged to trunk at r1076605. Keeping open for a day or so to make sure there are no more failures. > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991180#comment-12991180 ] Robert Muir commented on LUCENE-2907: - bq. I am not sure if CompiledAutomaton is a good name since it is not really an automaton is it? it is a compiled form of the automaton... and it is a dfa, mathematically. At the end of the day this CompiledAutomaton is an internal api, we can change its name at any time. > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907.patch, > LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, > seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991179#comment-12991179 ] Robert Muir commented on LUCENE-1540: - Hi Doron, about the test random seeds: It is complicated (though maybe we could fix this!) for the same random seed in trunk to work just like 3.x But for the locales: the way it picks a random locale is from the available system locales. This changes from jre to jre, so unfortunately we cannot guarantee that the same seed chooses the same locale randomly... Its the same with timezones too... and these even change in minor jdk updates! I wish we knew of a good solution, because I hate it when things aren't completely reproducible everywhere. > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. 
- For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
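Why a seed alone cannot pin down the locale can be seen in a small sketch (a hypothetical picker, not the actual test framework code): the seed fixes the index into the candidate pool, but the pool itself comes from `Locale.getAvailableLocales()`, which varies by JRE vendor and version, so the same seed can land on a different locale elsewhere.

```java
import java.util.Locale;
import java.util.Random;

public class RandomLocalePick {
    // A seeded pick is only reproducible if the candidate pool is identical.
    // Locale.getAvailableLocales() differs between JRE vendors and versions,
    // so the same seed may select a different locale on another machine.
    static Locale pick(long seed) {
        Locale[] pool = Locale.getAvailableLocales();
        return pool[new Random(seed).nextInt(pool.length)];
    }

    public static void main(String[] args) {
        // Deterministic within one JVM, but not across JREs.
        System.out.println(pick(42L));
    }
}
```

Passing the chosen locale explicitly (as `-Dtests.locale=tr_TR` does) sidesteps the pool-dependence entirely.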
[jira] Commented: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991178#comment-12991178 ] Simon Willnauer commented on LUCENE-2907: - patch looks good - just being super picky: you don't need all the this.bla in CompiledAutomaton ;) I am not sure if CompiledAutomaton is a good name since it is not really an automaton is it? simon > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907.patch, > LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, > seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991177#comment-12991177 ] Simon Willnauer commented on LUCENE-2908: - big +1 to get rid of Serializable its broken anyway, slow and not really working across versions! Folks that want to send stuff through the wire using java serialization should put api sugar on top. > clean up serialization in the codebase > -- > > Key: LUCENE-2908 > URL: https://issues.apache.org/jira/browse/LUCENE-2908 > Project: Lucene - Java > Issue Type: Task >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-2908.patch > > > We removed contrib/remote, but forgot to cleanup serialization hell > everywhere. > this is no longer needed, never really worked (e.g. across versions), and > slows > development (e.g. i wasted a long time debugging stupid serialization of > Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991176#comment-12991176 ] Doron Cohen commented on LUCENE-1540: - I am able to reproduce this on Linux. The test fails with *locale tr_TR* because TrecDocParser was upper-casing the file names for deciding which parser to apply. Problem with this is that toUpperCase is locale sensitive, and so the file name no longer matched the enum name. Fixed by adding a lower case dirName member to the enums. Also recreated the test files zip with '-UN u' for UTF8 handling of file names in the zip. committed at r1067699 for 3x. In trunk the test passes with same args also in Linux, but fails if you pass the locale that was randomly selected in 3x, i.e. like this: ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes -Dtests.locale=tr_TR Will merge the fix to trunk shortly. > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. 
> I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
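The locale-sensitive toUpperCase() pitfall behind this failure fits in a few lines of Java (illustrative only, not the actual TrecDocParser code): under tr_TR, lowercase 'i' upper-cases to dotted capital İ (U+0130), so the upper-cased directory name no longer matches the enum name.

```java
import java.util.Locale;

public class TurkishUpperCaseDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        String dirName = "fbis"; // one of the TREC parse-path directory names

        // Locale-insensitive mapping matches the enum name:
        System.out.println(dirName.toUpperCase(Locale.ENGLISH)); // FBIS

        // Under tr_TR, 'i' becomes dotted capital I (U+0130),
        // so the result no longer equals "FBIS":
        System.out.println(dirName.toUpperCase(turkish)); // FBİS
    }
}
```

Either storing a lower-case dirName per enum (as committed) or calling toUpperCase(Locale.ENGLISH) avoids the dependence on the default locale.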
[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2907: Attachment: LUCENE-2907.patch here's the same patch, but cleaned up a bit (e.g. making some things private, final, etc) > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907.patch, > LUCENE-2907_repro.patch, correct_seeks.txt, incorrect_seeks.txt, > seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc
[ https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991173#comment-12991173 ] Steven Rowe commented on LUCENE-2894: - Both of the nightly Hudson Maven builds failed because javadoc jars were not produced by the Ant build (scroll down to the bottom to see the error about javadoc jars not being available to deploy): https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/17/consoleText https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-3.x/16/consoleText > Use of google-code-prettify for Lucene/Solr Javadoc > --- > > Key: LUCENE-2894 > URL: https://issues.apache.org/jira/browse/LUCENE-2894 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, > LUCENE-2894.patch > > > My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in > Javadoc for syntax highlighting: > http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html > I think we can use it for Lucene javadoc (java sample code in overview.html > etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our > life. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: svn commit: r1067699 - in /lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src: java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java test/org/apache/lucene/benchmark/byTask/feed
Thanks for catching this Doron. Another option if you want to keep the case-insensitive feature here would be to use toUpperCase(Locale.ENGLISH) It might look bad, but its actually recommended by the JDK for locale-insensitive strings: http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase() On Sun, Feb 6, 2011 at 11:43 AM, wrote: > Author: doronc > Date: Sun Feb 6 16:43:54 2011 > New Revision: 1067699 > > URL: http://svn.apache.org/viewvc?rev=1067699&view=rev > Log: > LUCENE-1540: Improvements to contrib.benchmark for TREC collections - fix > test failures in some locales due to toUpperCase() > > Modified: > > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip > > Modified: > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > URL: > http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java?rev=1067699&r1=1067698&r2=1067699&view=diff > == > --- > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > (original) > +++ > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/TrecDocParser.java > Sun Feb 6 16:43:54 2011 > @@ -29,7 +29,12 @@ import java.util.Map; > public abstract class TrecDocParser { > > /** Types of trec parse paths, */ > - public enum ParsePathType { GOV2, FBIS, FT, FR94, LATIMES } > + public enum ParsePathType { GOV2("gov2"), FBIS("fbis"), FT("ft"), > FR94("fr94"), LATIMES("latimes"); > + public final String dirName; > + private ParsePathType(String dirName) { > + this.dirName = dirName; > + } > + } > > /** trec parser type used for unknown 
extensions */ > public static final ParsePathType DEFAULT_PATH_TYPE = ParsePathType.GOV2; > @@ -46,7 +51,7 @@ public abstract class TrecDocParser { > static final Map pathName2Type = new > HashMap(); > static { > for (ParsePathType ppt : ParsePathType.values()) { > - pathName2Type.put(ppt.name(),ppt); > + pathName2Type.put(ppt.dirName,ppt); > } > } > > @@ -59,7 +64,7 @@ public abstract class TrecDocParser { > public static ParsePathType pathType(File f) { > int pathLength = 0; > while (f != null && ++pathLength < MAX_PATH_LENGTH) { > - ParsePathType ppt = pathName2Type.get(f.getName().toUpperCase()); > + ParsePathType ppt = pathName2Type.get(f.getName()); > if (ppt!=null) { > return ppt; > } > > Modified: > lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip > URL: > http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/benchmark/src/test/org/apache/lucene/benchmark/byTask/feeds/trecdocs.zip?rev=1067699&r1=1067698&r2=1067699&view=diff > == > Binary files - no diff available. > > > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1799) Unicode compression
[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991170#comment-12991170 ] DM Smith commented on LUCENE-1799: -- Any idea as to when this will be released? > Unicode compression > --- > > Key: LUCENE-1799 > URL: https://issues.apache.org/jira/browse/LUCENE-1799 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Affects Versions: 2.4.1 >Reporter: DM Smith >Priority: Minor > Attachments: Benchmark.java, Benchmark.java, Benchmark.java, > LUCENE-1779.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, > LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, > LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, > LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799.patch, LUCENE-1799_big.patch > > > In lucene-1793, there is the off-topic suggestion to provide compression of > Unicode data. The motivation was a custom encoding in a Russian analyzer. The > original supposition was that it provided a more compact index. > This led to the comment that a different or compressed encoding would be a > generally useful feature. > BOCU-1 was suggested as a possibility. This is a patented algorithm by IBM > with an implementation in ICU. If Lucene provides its own implementation, a > freely available, royalty-free license would need to be obtained. > SCSU is another Unicode compression algorithm that could be used. > An advantage of these methods is that they work on the whole of Unicode. If > that is not needed an encoding such as iso8859-1 (or whatever covers the > input) could be used. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK
[ https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991169#comment-12991169 ] DM Smith commented on LUCENE-2906: -- Two questions: How will this differ from the SmartChineseAnalyzer? I doubt it but can this be in 3.1? > Filter to process output of ICUTokenizer and create overlapping bigrams for > CJK > > > Key: LUCENE-2906 > URL: https://issues.apache.org/jira/browse/LUCENE-2906 > Project: Lucene - Java > Issue Type: New Feature > Components: Analysis >Reporter: Tom Burton-West >Priority: Minor > Attachments: LUCENE-2906.patch > > > The ICUTokenizer produces unigrams for CJK. We would like to use the > ICUTokenizer but have overlapping bigrams created for CJK as in the CJK > Analyzer. This filter would take the output of the ICUtokenizer, read the > ScriptAttribute and for selected scripts (Han, Kana), would produce > overlapping bigrams. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
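The overlapping-bigram technique under discussion can be sketched independently of the analysis APIs (this is just the core idea, not the eventual filter): each adjacent pair of CJK code points in a run becomes one token, and iterating by code point keeps supplementary characters intact — the bug Robert mentions in CJKAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingBigrams {
    // A run of CJK unigrams "ABCD" becomes the overlapping bigrams
    // "AB", "BC", "CD". Code-point iteration (not char iteration) ensures
    // supplementary characters outside the BMP form whole bigrams.
    static List<String> bigrams(String run) {
        int[] cps = run.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("一二三四")); // [一二, 二三, 三四]
    }
}
```

In the real filter this would apply only to tokens whose ScriptAttribute is Han or Kana, passing other scripts through untouched.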
[jira] Created: (LUCENE-2908) clean up serialization in the codebase
clean up serialization in the codebase -- Key: LUCENE-2908 URL: https://issues.apache.org/jira/browse/LUCENE-2908 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Fix For: 4.0 Attachments: LUCENE-2908.patch We removed contrib/remote, but forgot to cleanup serialization hell everywhere. this is no longer needed, never really worked (e.g. across versions), and slows development (e.g. i wasted a long time debugging stupid serialization of Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2908: Attachment: LUCENE-2908.patch attached is a patch. all tests pass. > clean up serialization in the codebase > -- > > Key: LUCENE-2908 > URL: https://issues.apache.org/jira/browse/LUCENE-2908 > Project: Lucene - Java > Issue Type: Task >Reporter: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-2908.patch > > > We removed contrib/remote, but forgot to cleanup serialization hell > everywhere. > this is no longer needed, never really worked (e.g. across versions), and > slows > development (e.g. i wasted a long time debugging stupid serialization of > Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc
[ https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991165#comment-12991165 ] Koji Sekiguchi commented on LUCENE-2894: On my mac, the prettify directory is correctly under the api directory after ant package: {code} $ cd solr $ ant clean set-fsdir package $ ls build/docs/api/ allclasses-frame.html deprecated-list.html package-list allclasses-noframe.html help-doc.html prettify constant-values.html index-all.html resources contrib-solr-analysis-extras index.html serialized-form.html contrib-solr-cell org solr contrib-solr-clustering overview-frame.html solrj contrib-solr-dataimporthandler overview-summary.html stylesheet+prettify.css contrib-solr-uima overview-tree.html {code} > Use of google-code-prettify for Lucene/Solr Javadoc > --- > > Key: LUCENE-2894 > URL: https://issues.apache.org/jira/browse/LUCENE-2894 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, > LUCENE-2894.patch > > > My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in > Javadoc for syntax highlighting: > http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html > I think we can use it for Lucene javadoc (java sample code in overview.html > etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our > life. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Reopened: (LUCENE-2894) Use of google-code-prettify for Lucene/Solr Javadoc
[ https://issues.apache.org/jira/browse/LUCENE-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Sekiguchi reopened LUCENE-2894: Reopening the issue. Lucene javadoc on hudson looks fine (syntax highlighting works correctly): https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/overview-summary.html but Solr javadoc on hudson looks not good: https://hudson.apache.org/hudson/job/Solr-trunk/javadoc/org/apache/solr/handler/component/TermsComponent.html Building of both javadocs on my local is working fine. > Use of google-code-prettify for Lucene/Solr Javadoc > --- > > Key: LUCENE-2894 > URL: https://issues.apache.org/jira/browse/LUCENE-2894 > Project: Lucene - Java > Issue Type: Improvement > Components: Javadocs >Reporter: Koji Sekiguchi >Assignee: Koji Sekiguchi >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2894.patch, LUCENE-2894.patch, LUCENE-2894.patch, > LUCENE-2894.patch > > > My company, RONDHUIT uses google-code-prettify (Apache License 2.0) in > Javadoc for syntax highlighting: > http://www.rondhuit-demo.com/RCSS/api/com/rondhuit/solr/analysis/JaReadingSynonymFilterFactory.html > I think we can use it for Lucene javadoc (java sample code in overview.html > etc) and Solr javadoc (Analyzer Factories etc) to improve or simplify our > life. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[HUDSON-MAVEN] Lucene-Solr-Maven-trunk #17: POMs out of sync
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-trunk/17/ No tests ran. Build Log (for compile errors): [...truncated 7757 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4561 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4561/ 1 tests failed. REGRESSION: org.apache.solr.client.solrj.TestLBHttpSolrServer.testReliability Error Message: No live SolrServers available to handle this request Stack Trace: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:222) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at org.apache.solr.client.solrj.TestLBHttpSolrServer.testReliability(TestLBHttpSolrServer.java:177) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977) Caused by: org.apache.solr.client.solrj.SolrServerException: java.net.SocketTimeoutException: Read timed out at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:484) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245) at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:206) Caused by: java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:146) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78) at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106) at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116) at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413) at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973) at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735) at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098) at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398) at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:428) Build Log (for compile errors): [...truncated 10090 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2906) Filter to process output of ICUTokenizer and create overlapping bigrams for CJK
[ https://issues.apache.org/jira/browse/LUCENE-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2906: Attachment: LUCENE-2906.patch Here's a patch going in a slightly different direction (though we can still add some special icu-only stuff here). Instead, the patch synchronizes the token types of ICUTokenizer with StandardTokenizer, adds the necessary types to both, and then adds the bigramming logic to StandardFilter. This way, CJK works easily out of the box, for all of Unicode (e.g. supplementaries) and plays well with other languages. I deprecated CJKTokenizer in the patch and pulled out its special full-width filter into a separate TokenFilter. > Filter to process output of ICUTokenizer and create overlapping bigrams for > CJK > > > Key: LUCENE-2906 > URL: https://issues.apache.org/jira/browse/LUCENE-2906 > Project: Lucene - Java > Issue Type: New Feature > Components: Analysis >Reporter: Tom Burton-West >Priority: Minor > Attachments: LUCENE-2906.patch > > > The ICUTokenizer produces unigrams for CJK. We would like to use the > ICUTokenizer but have overlapping bigrams created for CJK as in the CJK > Analyzer. This filter would take the output of the ICUTokenizer, read the > ScriptAttribute and for selected scripts (Han, Kana), would produce > overlapping bigrams. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
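The overlapping-bigram behavior described in the patch above can be sketched in a few lines. This is an illustrative standalone helper, not the actual StandardFilter code; it works on chars for brevity, whereas the real filter would need to work on codepoints to cover supplementary characters:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigramSketch {
    // Turn a run of ideographic characters into overlapping bigrams,
    // CJKAnalyzer-style: "ABCD" -> "AB", "BC", "CD".
    // A lone character is emitted as-is so it stays searchable.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() <= 1) {
            if (!run.isEmpty()) out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("一二三四")); // [一二, 二三, 三四]
    }
}
```

In the actual token stream, the filter would consult the ScriptAttribute and apply this only to runs tagged as selected scripts (Han, Kana), passing other scripts through untouched.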
Re: [HUDSON] Lucene-Solr-tests-only-3.x - Build # 4555 - Failure
checking... On Sun, Feb 6, 2011 at 2:19 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > I think this is happening because of LUCENE-1540... > > Mike > > On Sun, Feb 6, 2011 at 5:25 AM, Apache Hudson Server > wrote: > > Build: > https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4555/ > > > > 1 tests failed. > > REGRESSION: > > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes > > > > Error Message: > > expected: but was: > > > > Stack Trace: > >at > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70) > >at > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369) > >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045) > >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977) > > > > > > > > > > Build Log (for compile errors): > > [...truncated 6504 lines...] > > > > > > > > - > > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > > For additional commands, e-mail: dev-h...@lucene.apache.org > > > > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > >
[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2907: Attachment: LUCENE-2907.patch attached is a patch. I removed all the transient/synchronized stuff from the query. Instead: AutomatonTermsEnum only takes an immutable, compiled form of the automaton (essentially a sorted transitions array). the query computes this compiled form (or any other simpler rewritten form) in its ctor. > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907.patch, LUCENE-2907_repro.patch, > correct_seeks.txt, incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
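The fix described above follows a general concurrency pattern: do all mutable compilation work once in the constructor and expose only an immutable result. A minimal sketch of that pattern (class and field names are illustrative, not the actual Lucene code; a sorted character set stands in for the compiled automaton's sorted transitions array):

```java
import java.util.Arrays;

// Compile once in the ctor; afterwards the state is never written, so any
// number of searcher threads can share one instance with no synchronization
// and no lazily populated cache to race on.
final class CompiledCharSet {
    private final char[] sorted; // immutable after construction

    CompiledCharSet(char[] accepted) {
        this.sorted = accepted.clone(); // defensive copy
        Arrays.sort(this.sorted);       // the one-time "compile" step
    }

    // Read-only lookup: safe to call concurrently.
    boolean accepts(char c) {
        return Arrays.binarySearch(sorted, c) >= 0;
    }
}
```

Each per-segment enumerator then only reads from the compiled form, which is what makes the enum stateless across threads.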
[jira] Updated: (LUCENE-2907) automaton termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2907: Summary: automaton termsenum bug when running with multithreaded search (was: termsenum bug when running with multithreaded search) editing description so its not confusing, sorry :) > automaton termsenum bug when running with multithreaded search > -- > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [HUDSON] Lucene-Solr-tests-only-3.x - Build # 4555 - Failure
I think this is happening because of LUCENE-1540... Mike On Sun, Feb 6, 2011 at 5:25 AM, Apache Hudson Server wrote: > Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4555/ > > 1 tests failed. > REGRESSION: > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes > > Error Message: > expected: but was: > > Stack Trace: > at > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70) > at > org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369) > at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045) > at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977) > > > > > Build Log (for compile errors): > [...truncated 6504 lines...] > > > > - > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections
[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991145#comment-12991145 ] Michael McCandless commented on LUCENE-1540: I think this commit has caused a failure on at least 3.x? {noformat} [junit] Testcase: testTrecFeedDirAllTypes(org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest): Caused an ERROR [junit] expected: but was: [junit] at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70) [junit] at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045) [junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977) [junit] [junit] [junit] Tests run: 6, Failures: 0, Errors: 1, Time elapsed: 0.488 sec [junit] [junit] - Standard Error - [junit] WARNING: test method: 'testBadDate' left thread running: Thread[Thread-6,5,main] [junit] RESOURCE LEAK: test method: 'testBadDate' left 1 thread(s) running [junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testBadDate -Dtests.seed=-1485993969467368126:6510043524258948665 -Dtests.multiplier=5 [junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes -Dtests.seed=-1485993969467368126:-9055415333820766139 -Dtests.multiplier=5 [junit] NOTE: test params are: locale=tr_TR, timezone=Europe/Zagreb [junit] NOTE: all tests run in this JVM: [junit] [TrecContentSourceTest] [junit] NOTE: FreeBSD 8.2-RC2 amd64/Sun Microsystems Inc. 
1.6.0 (64-bit)/cpus=16,threads=1,free=66439840,total=86376448 [junit] - --- {noformat} > Improvements to contrib.benchmark for TREC collections > -- > > Key: LUCENE-1540 > URL: https://issues.apache.org/jira/browse/LUCENE-1540 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Tim Armstrong >Assignee: Doron Cohen >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, > LUCENE-1540.patch, trecdocs.zip > > > The benchmarking utilities for TREC test collections (http://trec.nist.gov) > are quite limited and do not support some of the variations in format of > older TREC collections. > I have been doing some benchmarking work with Lucene and have had to modify > the package to support: > * Older TREC document formats, which the current parser fails on due to > missing document headers. > * Variations in query format - newlines after tag causing the query > parser to get confused. > * Ability to detect and read in uncompressed text collections > * Storage of document numbers by default without storing full text. > I can submit a patch if there is interest, although I will probably want to > write unit tests for the new functionality first. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
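One detail worth noting in the reproduce lines above: the failing run used locale=tr_TR. Whether or not that is the root cause of this particular failure, locale-sensitive case conversion under the Turkish locale is a classic source of locale-dependent test breaks, because upper-case 'I' lower-cases to dotless 'ı' (U+0131) rather than 'i':

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    public static void main(String[] args) {
        // toLowerCase(Locale) is locale-sensitive: under tr_TR, 'I' -> 'ı',
        // so comparisons against ASCII literals silently fail.
        String turkish = "TITLE".toLowerCase(new Locale("tr", "TR"));
        String neutral = "TITLE".toLowerCase(Locale.ROOT);
        System.out.println(turkish); // tıtle
        System.out.println(neutral); // title
    }
}
```

This is why code parsing fixed ASCII markers (like TREC document tags) should use Locale.ROOT, or avoid locale-sensitive conversion entirely.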
[HUDSON-MAVEN] Lucene-Solr-Maven-3.x #16: POMs out of sync
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-Maven-3.x/16/ No tests ran. Build Log (for compile errors): [...truncated 8390 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[HUDSON] Lucene-Solr-tests-only-3.x - Build # 4555 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-3.x/4555/ 1 tests failed. REGRESSION: org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes Error Message: expected: but was: Stack Trace: at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70) at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977) Build Log (for compile errors): [...truncated 6504 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991141#comment-12991141 ] Uwe Schindler commented on LUCENE-2907: --- Yes, the numbered-states cache was always bugging me. But at least it is now synchronized (it was not even that at the beginning). I think the problem may be that parallel queries are doing different segments with different numbered states at the same time. +1 to remove the cache and calculate in the ctor; then it's really stateless! > termsenum bug when running with multithreaded search > > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991140#comment-12991140 ] Robert Muir commented on LUCENE-2907: - in combination with other things. in my opinion the problem is the cache in getNumberedStates. But the real solution (in my opinion) is to clean up all this crap so the termsenum only takes a completely immutable view of what it needs and for the Query to compile once in its ctor, and remove any stupid caching. So, this is what I am working on now. > termsenum bug when running with multithreaded search > > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991137#comment-12991137 ] Uwe Schindler commented on LUCENE-2907: --- A bug in automaton that only happens in multi-threaded search? So it's the cache there? > termsenum bug when running with multithreaded search > > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991133#comment-12991133 ] Robert Muir commented on LUCENE-2907: - bq. Have you found out what happens or where a thread-safety issue could be? Yes, i found the bug... unfortunately it is actually my automaton problem :( I will create a nice patch today. bq. The information on this issue is too small, there seems to be lots of IRC/GTalk communication in parallel. what do you mean? mike was working a long time on the bug, but quickly had to stop working on it, so he emailed me all of his state. I took over from there for a while, and i opened this issue with my debugging... though I didn't have much time to work on it yesterday (only like 1 hour), because I already had plans. I tried to be completely open and dump all of my state/debugging information/brainstorming on this JIRA issue, but it only resulted in me reporting misleading and confusing information... so I think the information on this issue is actually too much? > termsenum bug when running with multithreaded search > > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2609) Generate jar containing test classes.
[ https://issues.apache.org/jira/browse/LUCENE-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991132#comment-12991132 ] Shai Erera commented on LUCENE-2609: Thanks Steven ! Committed revision 1067623 (3x). Merging to trunk now ... > Generate jar containing test classes. > - > > Key: LUCENE-2609 > URL: https://issues.apache.org/jira/browse/LUCENE-2609 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.2 >Reporter: Drew Farris >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, > LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch, > LUCENE-2609.patch, LUCENE-2609.patch, LUCENE-2609.patch > > > The test classes are useful for writing unit tests for code external to the > Lucene project. It would be helpful to build a jar of these classes and > publish them as a maven dependency. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2907) termsenum bug when running with multithreaded search
[ https://issues.apache.org/jira/browse/LUCENE-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991130#comment-12991130 ] Uwe Schindler commented on LUCENE-2907: --- Have you found out what happens or where a thread-safety issue could be? Each thread and each query should have its own TermsEnum! Is there maybe a cache on codec-level involved? At least there are no multiple-instance (static) caches in the search-side of the TermsEnums, so there must be a multi-threading issue in the underlying SegmentReaders. To conclude: At least Robert found out that FieldCacheTermsEnum always works correct? Is this true? The information on this issue is too small, there seems to be lots of IRC/GTalk communication in parallel. > termsenum bug when running with multithreaded search > > > Key: LUCENE-2907 > URL: https://issues.apache.org/jira/browse/LUCENE-2907 > Project: Lucene - Java > Issue Type: Bug >Reporter: Robert Muir > Attachments: LUCENE-2907_repro.patch, correct_seeks.txt, > incorrect_seeks.txt, seeks_diff.txt > > > This one popped in hudson (with a test that runs the same query against > fieldcache, and with a filter rewrite, and compares results) > However, its actually worse and unrelated to the fieldcache: you can set both > to filter rewrite and it will still fail. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org