[jira] Issue Comment Edited: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929488#action_12929488 ] Aaron Powell edited comment on LUCENENET-380 at 11/8/10 4:50 AM: - I've created an external repository to make it easier for managing the testing of the different tools available for converting Java to .NET, available here: https://hg.slace.biz/lucene-porting Note this is only for finding a suitable tool for the conversion and will be rolled back to ASF once a tool is found. was (Author: slace): I've created an external repository to make it easier for managing the testing of the different tools available for converting Java to .NET, available here: https://bitbucket.org/slace/lucene-porting Note this is only for finding a suitable tool for the conversion and will be rolled back to ASF once a tool is found. Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929525#action_12929525 ] Aaron Powell commented on LUCENENET-380: I've started off the wiki over at bitbucket - http://hg.slace.biz/lucene-porting/wiki/Home It's also just a mercurial repo so anyone can update it and send back pull requests: http://hg.slace.biz/lucene-porting/wiki Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: lucene 4.0 release date
thank you. 2010/11/8 Uwe Schindler u...@thetaphi.de: You have to also use Solr 4.0 :-) - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Monday, November 08, 2010 8:47 AM To: dev@lucene.apache.org; simon.willna...@gmail.com Subject: Re: lucene 4.0 release date thank you. So if I want to use a new compress/decompress algorithm, I must use lucene 4.0 from svn? Is there any patch for an old release such as 2.9? Because I need solr 1.4, which is based on lucene 2.9. 2010/11/8 Simon Willnauer simon.willna...@googlemail.com: Li Li, there is no official / unofficial release date for lucene 4.0. If you want to use the latest and greatest features you need to check out trunk or use a nightly build. My guess would be that there are at least 6 to 8 months to the next release, but I could be wrong (more likely it might take even longer). For PFoR etc. you should look into: https://issues.apache.org/jira/browse/LUCENE-1410 https://issues.apache.org/jira/browse/LUCENE-2723 to get started - and read Mike's blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html There is also S9 https://issues.apache.org/jira/browse/LUCENE-2189 and GroupVInt impls https://issues.apache.org/jira/browse/LUCENE-2735 simon On Mon, Nov 8, 2010 at 4:59 AM, Li Li fancye...@gmail.com wrote: hi all, when will lucene 4.0 be released? I want to replace VInt compression with a faster scheme such as PForDelta. In my application, decompressing a docList of 10M takes about 300ms. In "Performance of Compressed Inverted List Caching in Search Engines" (J. Zhang and X. Long, 17th International World Wide Web Conference (WWW), April 2008), the authors say PForDelta is much faster than VInt. I also found a Java implementation at http://code.google.com/p/integer-array-compress-kit/ ; its speed is about 500M ints/sec. But to achieve this, I would have to modify the index file format. I found http://wiki.apache.org/lucene-java/FlexibleIndexing ; lucene 4.0 will support more flexible index formats. I want to know when it will be released, to decide whether to wait for it or do it myself. Thank you. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
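For readers following the VInt vs. PForDelta comparison in the thread above: the heart of the difference is the decode loop. Below is a minimal Java sketch of the classic variable-byte decode, with the same semantics as Lucene's IndexInput.readVInt(); the per-byte read and data-dependent branch are what block-oriented codecs like PForDelta avoid by unpacking whole fixed-width frames at once. This is an illustrative sketch, not the thread's benchmark code.

import java.io.DataInput;
import java.io.IOException;

public final class VIntExample {
    // Mirrors Lucene's readVInt(): 7 payload bits per byte, high bit set
    // on every byte except the last one of each encoded integer.
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}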
Re: lucene 4.0 release date
A question about the svn structure of lucene: I visited http://svn.apache.org/repos/asf/lucene/ and it contains many things: .htaccess board-reports/ dev/ java/ lucene.net/ mahout/ openrelevance/ pylucene/ sandbox/ site/ solr/ I just want to use lucene/java + solr. Which directories should I check out? It seems http://svn.apache.org/repos/asf/lucene/dev/ is the currently developed version and http://svn.apache.org/repos/asf/lucene/java/ is the old version from before 3.0. So do I just need http://svn.apache.org/repos/asf/lucene/dev/? 2010/11/8 Li Li fancye...@gmail.com: [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: lucene 4.0 release date
Apache Lucene and Apache Solr merged to one checkout at: http://svn.apache.org/repos/asf/lucene/dev/ The combined projects now share the same version numbers: Lucene 3.x: - http://svn.apache.org/repos/asf/lucene/dev/branches/branch3.x Lucene trunk: - http://svn.apache.org/repos/asf/lucene/dev/trunk - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Li Li [mailto:fancye...@gmail.com] Sent: Monday, November 08, 2010 9:19 AM To: dev@lucene.apache.org Subject: Re: lucene 4.0 release date [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 7:26 AM, Uwe Schindler u...@thetaphi.de wrote: No updates on the Hudson issue until now. What should we do? Disable Clover report generation for now? +1 - test / CI-build success is more important to me! I have no idea what else we could do. Uwe --- Uwe Schindler Generics Policeman Bremen, Germany - Reply message - From: Apache Hudson Server hud...@hudson.apache.org Date: Mon., Nov. 8, 2010 06:55 Subject: Solr-3.x - Build # 160 - Failure To: dev@lucene.apache.org Build: https://hudson.apache.org/hudson/job/Solr-3.x/160/ All tests passed Build Log (for compile errors): [...truncated 18776 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
-1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:04 AM To: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 10:14 AM, Uwe Schindler u...@thetaphi.de wrote: -1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! Ah, you are right, we have other builds - that still confuses me, never mind. But I disagree that a broken clover is important; it's just annoying. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:04 AM To: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
Clover is not broken, only the Hudson plugin that links the clover report in the workspace. And it is important to have at least one version of the clover report; I use it quite often to verify coverage. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:44 AM To: Uwe Schindler Cc: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
RE: Antw.: Solr-3.x - Build # 160 - Failure
We got a response to our Clover Hudson bug (see attached mail). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Simon Willnauer [mailto:simon.willna...@googlemail.com] Sent: Monday, November 08, 2010 10:44 AM To: Uwe Schindler Cc: dev@lucene.apache.org Subject: Re: Antw.: Solr-3.x - Build # 160 - Failure [...] ---BeginMessage--- [ http://issues.hudson-ci.org/browse/HUDSON-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stubbs updated HUDSON-7836: --- Attachment: HUDSON-7836-stacktrace.txt Stack trace from the master Hudson's log at the time the build failed with this error. Clover and cobertura parsing on hudson master fails because of invalid XML -- Key: HUDSON-7836 URL: http://issues.hudson-ci.org/browse/HUDSON-7836 Project: Hudson Issue Type: Bug Components: clover, cobertura Affects Versions: current Reporter: thetaphi Assignee: stephenconnolly Priority: Critical Attachments: HUDSON-7836-stacktrace.txt Since a few days ago, on our Apache Hudson installation, parsing of Clover's clover.xml or Cobertura's coverage.xml file fails (but not in all cases; sometimes it simply passes with the same build and same job configuration). This only happens after transferring to the master; the reports and xml file are created on a Hudson slave. It seems like the network code somehow breaks the xml file during transfer to the master. Downloading the clover.xml from the workspace to my local computer and validating it confirms that it is correctly formatted and has no XML parse errors. - Here are errors that appear during clover publishing: [https://hudson.apache.org/hudson/job/Lucene-trunk/1336/console] - For cobertura: [https://hudson.apache.org/hudson/view/Directory/job/dir-shared-metrics/34/console] -- This message is automatically generated by JIRA.
---End Message--- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Rethinking spatial implementation
Some questions: @Grant: Can you clarify what you mean with the Sinusoidal projection is broken? Would it be possible to use a LGPL library like the Java Topology Suite (JTS: http://www.vividsolutions.com/jts/JTSHome.htm)? Neo4j is using JTS for creating a spatial index (code is here: https://github.com/neo4j/neo4j-spatial)... (I've just seen that JTS has some index creation classes, but I'm not at all familiar with them) Christopher On Mon, Nov 8, 2010 at 1:10 AM, Grant Ingersoll gsing...@apache.org wrote: On Nov 6, 2010, at 5:23 PM, Christopher Schmidt wrote: Hi Ryan, thx for your answer. You mean there is room for improvement and volunteers? We've been looking at replacing it with the Military Grid system. The primary issue with the current is that the Sinusoidal projection is broken which then breaks almost all the tests. I worked on it for a while trying to straighten it out, but gave up and now think it is easier to implement clean. I definitely would like to see a tier/grid implementation. On Friday, November 5, 2010, Ryan McKinley ryan...@gmail.com wrote: Hi Christopher - I do not believe there is any active work on this. From what I understand, the Tier implementation works OK within some constraints, but we could not get it to pass more robust testing that the other methods were using. However, LatLonType and GeoHashField are well tested and work well -- the Tier type may have better performance when your index is really large, but no active developers understand it and no-one has stepped up to figure it out. ryan On Wed, Nov 3, 2010 at 3:16 PM, Christopher Schmidt fakod...@googlemail.com wrote: Hi all, I saw a mail thread Rethinking Cartesian Tiers implementation (here). Is there any work in progress regarding this? If yes, is the current implementation deprecated or do you plan some enhancements (other projections or spatial indexes) ? I am asking because I want to use Lucene's spatial indexing in a production system... -- Christopher twitter: @fakod blog: http://blog.fakod.eu - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Christopher twitter: @fakod blog: http://blog.fakod.eu - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem docs using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Christopher twitter: @fakod blog: http://blog.fakod.eu
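For context on the projection being asked about above: the textbook sinusoidal (Sanson-Flamsteed) projection is shown below as a small Java sketch. This is the standard formula (the one on Wikipedia), not the contrib/spatial implementation under discussion, so it only illustrates what that code is expected to compute.

public final class SinusoidalExample {
    // x compresses east-west distance by cos(latitude); y is latitude itself.
    static double[] project(double lonDeg, double latDeg, double centralMeridianDeg) {
        double lat = Math.toRadians(latDeg);
        double lon = Math.toRadians(lonDeg - centralMeridianDeg);
        return new double[] { lon * Math.cos(lat), lat };
    }
}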
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929556#action_12929556 ] M Alexander commented on LUCENE-2745: - {quote} I think that ArabicLetterTokenizer, which is the tokenizer used by ArabicAnalyzer, is obsolete (as of version 3.1), since StandardTokenizer, which implements the Unicode word segmentation rules from UAX#29, should be able to properly tokenize Arabic. StandardTokenizer recognizes email addresses, hostnames, and URLs, so your concern would be addressed. (See LUCENE-2167, though, which was just reopened to turn off full URL output.) You can test this by composing your own analyzer, if you're willing to try using the as-yet-unreleased branch_3x, from which 3.1 will be cut (hopefully fairly soon): just copy the ArabicAnalyzer class and swap in StandardTokenizer for ArabicLetterTokenizer {quote} I tried to test this and failed (miserably). I think I struggled to patch LUCENE-2167 correctly through my Eclipse setup. I might just wait for the branch_3x release to make my life easier. I will then create my own Analyzer to perform Arabic text analysis and another one for Farsi text analysis. Both Analyzers will have the ability to handle diacritics as well as email addresses, hostnames and so on. I will close this issue for now (and will re-open it in the future if needed). Quick question - any thoughts on handling Arabic email addresses and hostnames in the future? Thanks to both of you for the time taken; I shall wait for the branch release to solve my issue. ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com]. It would be great if the ArabicAnalyzer could tokenise this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so. Thanks, MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
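For anyone wanting to try the swap Steven describes without waiting for the release, a rough sketch is below. It is written against branch_3x-era APIs (Version.LUCENE_31 and the contrib analyzers in org.apache.lucene.analysis.ar) and omits stopword handling to stay short; treat it as a starting point, not a finished ArabicAnalyzer replacement.

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
import org.apache.lucene.analysis.ar.ArabicStemFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class StandardArabicAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // StandardTokenizer (UAX#29 rules) in place of ArabicLetterTokenizer.
        TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
        stream = new ArabicNormalizationFilter(stream); // normalize letter forms, strip diacritics
        return new ArabicStemFilter(stream);            // light Arabic stemming
    }
}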
[jira] Closed: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M Alexander closed LUCENE-2745. --- Resolution: Later Will wait for the release, which should have the solution within it. ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929558#action_12929558 ] M Alexander commented on LUCENE-2745: - Oh, do you have a rough timing of the branch_3X release date? ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com] It would be great if the ArabicAnalyzer can tokenises this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so Thanks MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929560#action_12929560 ] Robert Muir commented on LUCENE-2167: - {quote} In theory, you should just feed the initial text as a single monster token from hell into the analysis chain, and then you only have TokenFilters, none/one/some of which might split this token. If there are no TokenFilters at all, you get a NOT_ANALYZED case without extra flags, yahoo! The only problem here is the need for the ability to wrap an arbitrary Reader in a TermAttribute :/ {quote} No thanks, I don't want to read my entire documents into RAM and have massive gc'ing going on. We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good general-purpose tokenizer. Implement StandardTokenizer with the UAX#29 Standard Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Affects Versions: 3.1, 4.0 Reporter: Shyamal Prasad Assignee: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-jflex-tld-macro-gen.patch, LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, standard.zip, StandardTokenizerImpl.jflex Original Estimate: 0.5h Remaining Estimate: 0.5h It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with jflex. Then its name would actually make sense. Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims: bq. This should be a good tokenizer for most European-language documents The new StandardTokenizer could then say bq. This should be a good tokenizer for most languages. All the english/euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the european analyzers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2745) ArabicAnalyzer - the ability to recognise email addresses host names and so on
[ https://issues.apache.org/jira/browse/LUCENE-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929566#action_12929566 ] Steven Rowe commented on LUCENE-2745: - bq. Oh, do you have a rough timing of the branch_3X release date? Wild guess: January 2011 ArabicAnalyzer - the ability to recognise email addresses host names and so on -- Key: LUCENE-2745 URL: https://issues.apache.org/jira/browse/LUCENE-2745 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.9.2, 2.9.3, 3.0, 3.0.1, 3.0.2 Environment: All Reporter: M Alexander The ArabicAnalyzer does not recognise email addresses, hostnames and so on. For example, a...@hotmail.com will be tokenised to [adam] [hotmail] [com] It would be great if the ArabicAnalyzer can tokenises this to [a...@hotmail.com]. The same applies to hostnames and so on. Can this be resolved? I hope so Thanks MAA -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929567#action_12929567 ] George Aroush commented on LUCENENET-380: - A few points: 1) Work on ASF projects needs to be done at ASF. Please use this JIRA issue and the mailing list to communicate questions, report progress and share results. 2) The converted files need to be attached to this JIRA issue, so we have a record of them and they can be evaluated by all. 3) Prescott's point about highlighting pre-/post-processing work is a good and important one. Please write this up as you work on this task. 4) More than one person can work on this JIRA issue; just keep everyone posted. My expected outcome of this JIRA issue is: 1) What pre-/post-processing did you use, if any? It would also help to show the raw output with and without the pre-processing. 2) How close is the result for those 5 attached files to the existing converted C# files? This includes the layout of the code (was anything lost or considerably changed?) but, most importantly, are the public APIs consistent? The reason I picked those 5 files is that they are the ones JLCA had some of the most issues with, so they should be a good barometer of how Sharpen does. Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-2222) Merge duplicates documents with uniqueKey
Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2222) Merge duplicates documents with uniqueKey
[ https://issues.apache.org/jira/browse/SOLR-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929574#action_12929574 ] Koji Sekiguchi commented on SOLR-2222: -- I think this is expected behavior, because Solr just calls Lucene's IndexWriter.addIndexes() to merge indexes and Lucene doesn't care about uniqueKeys. Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
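To make Koji's point concrete: the uniqueKey semantics live in the add path, which goes through IndexWriter.updateDocument (a delete-then-add), while core merging is a raw segment copy that never looks at field values. A sketch against the Lucene 2.9 API used by Solr 1.4; the field name "id" is just an example:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

public final class MergeDedupExample {
    // The dedup-aware path: what happens on a normal Solr add.
    static void addWithDedup(IndexWriter writer, Document doc, String key) throws IOException {
        writer.updateDocument(new Term("id", key), doc); // removes any older doc with this id
    }

    // The raw merge path: duplicates in `other` survive untouched.
    // (addIndexesNoOptimize is the 2.9-era name; later versions call it addIndexes.)
    static void mergeCore(IndexWriter writer, Directory other) throws IOException {
        writer.addIndexesNoOptimize(new Directory[] { other });
    }
}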
Re: Rethinking spatial implementation
Neo4j is using JTS for creating a spatial index (code is here: https://github.com/neo4j/neo4j-spatial)... (I've just seen that JTS has some index creation classes, but I'm not at all familiar with them) JTS does not have a spatial index -- it is good for spatial operations (check if some shape is within/intersects/etc another shape) In Neo4j, they use JTS to build an RTree that is stored in their native graph format: https://github.com/neo4j/neo4j-spatial/blob/master/src/main/java/org/neo4j/gis/spatial/RTreeIndex.java Building an RTree in lucene is a bit more difficult since we can not easily update the value of a given field. I'd like to figure some way to do this though. ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
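A tiny example of the role Ryan describes for JTS above (predicates over shapes, not index structures), using the com.vividsolutions.jts API of that era:

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;

public final class JtsPredicateExample {
    public static void main(String[] args) {
        GeometryFactory gf = new GeometryFactory();
        Geometry point = gf.createPoint(new Coordinate(-122.4, 37.8));
        // A bounding box as a closed ring (first coordinate == last).
        Geometry box = gf.createPolygon(gf.createLinearRing(new Coordinate[] {
            new Coordinate(-123, 37), new Coordinate(-122, 37),
            new Coordinate(-122, 38), new Coordinate(-123, 38),
            new Coordinate(-123, 37) }), null);
        System.out.println(box.contains(point));   // true
        System.out.println(box.intersects(point)); // true
    }
}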
Re: Rethinking spatial implementation
Hi All, FYI, Apache SIS [1], currently incubating, is working on building an ASLv2-licensed library comparable to JTS or GeoTools. You'll notice (or at least I did) that most of the GIS-related libs out there are GPL or LGPL, so I decided to do something about it. If anyone else is interested in joining the cause, we'd welcome you over there. At present, we have code that implements QuadTree storage and does point-radius and bounding box computations, as well as a RESTful web service to handle spatial location based on those 2 methods. We're close to making an 0.1-incubating release. Cheers, Chris [1] http://incubator.apache.org/sis/ On 11/8/10 2:40 AM, Chris Male gento...@gmail.com wrote: Hi, I'll jump in and give my opinion: Can you clarify what you mean by "the Sinusoidal projection is broken"? Inside Spatial Lucene's Cartesian codebase is an implementation of the Sinusoidal projection. Grant discovered, while working on improving the test coverage of the code, that the implementation doesn't actually match the formula specified on Wikipedia. When we tried to change it, many tests broke, since the overall logic somehow depends on this broken implementation. Would it be possible to use an LGPL library like the Java Topology Suite (JTS: http://www.vividsolutions.com/jts/JTSHome.htm)? This is something we've talked about using. I think it would be nice to offload some of the geography-specific code from Lucene, so using another library would be good. At the same time it limits our options for optimizations and the like. I'm certainly looking into it though. Thanks, Chris [...] ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
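The point-radius computation mentioned for Apache SIS reduces to a great-circle distance test; below is a generic haversine sketch (not SIS code) for readers unfamiliar with the operation:

public final class PointRadiusExample {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    static boolean withinRadius(double lat1, double lon1,
                                double lat2, double lon2, double radiusKm) {
        return haversineKm(lat1, lon1, lat2, lon2) <= radiusKm;
    }
}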
Re: Antw.: Solr-3.x - Build # 160 - Failure
On Mon, Nov 8, 2010 at 4:44 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Mon, Nov 8, 2010 at 10:14 AM, Uwe Schindler u...@thetaphi.de wrote: -1. For test success we already have other running and working builds. For me the clover report is more important, and that one works! Ah, you are right, we have other builds - that still confuses me, never mind. But I disagree that a broken clover is important; it's just annoying. When it works, it works... I don't think we should disable it; it's useful for finding untested things / bugs. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2222) Merge duplicates documents with uniqueKey
[ https://issues.apache.org/jira/browse/SOLR-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929577#action_12929577 ] Andreas Laager commented on SOLR-2222: -- I've read that Lucene does not care about the unique key. But where does the uniqueKey configuration in the schema.xml come from? Is that part of SOLR? If yes, then SOLR should also care about it when merging cores. Our system uses solr with a live core dedicated to inserts that gets merged into a search core from time to time; we expect better search performance out of this. I expect a negative performance impact if I have to handle all the duplicated documents after the merge. Merge duplicates documents with uniqueKey - Key: SOLR-2222 URL: https://issues.apache.org/jira/browse/SOLR-2222 Project: Solr Issue Type: Bug Affects Versions: 1.4.1 Reporter: Andreas Laager When merging one core into another, one can get multiple documents for one uniqueKey. As a result the facet counts are wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard
[ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929587#action_12929587 ] Earwin Burrfoot commented on LUCENE-2167: - bq. No thanks, I don't want to read my entire documents into RAM and have massive gc'ing going on. This is obvious. And that's why I was talking about wrapping the Reader in an Attribute, not copying its contents. How to do so is much less obvious. And that's why I called it a problem. bq. We don't need to have a mega-tokenizer that solves everyone's problems... this is just supposed to be a good general-purpose tokenizer. Exactly. That's why I'm thinking of a way to get some composability, instead of having to fully rewrite the tokenizer once you want extras. Implement StandardTokenizer with the UAX#29 Standard Key: LUCENE-2167 URL: https://issues.apache.org/jira/browse/LUCENE-2167 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929590#action_12929590 ] Peter Karich commented on SOLR-792: --- Hi Toke and all, maybe I am a bit evil or stupid, but could someone enlighten me as to why this patch is necessary? Why can't we use the existing mechanisms in Solr (facets!) plus a bit of logic at indexing time: http://markmail.org/message/2aza6nnsiw3l4bbb#query:+page:1+mid:3j3ttojacpjoyfg5+state:results This approach has no performance problems when using tons of categories; we are already using it with lots of categories. It works out of the box with nearly unlimited depth (either you need a DB, which is unlimited, or the URL length is the limit). The only drawback of this approach is that you won't be able to display two or more 'branches' at the same time. Only one current branch with the currently possible categories is shown, which is no limitation in our case, because the UI would be unusable if too many items were visible at the same time. One could introduce a special update component for this feature which uses a category tree (in RAM) built from a json or xml definition. I could create such a component if someone is interested. Regards, Peter. Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 Project: Solr Issue Type: New Feature Reporter: Erik Hatcher Assignee: Yonik Seeley Priority: Minor Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch A component to do multi-level faceting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
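A sketch of the index-time trick Peter refers to: encode every ancestor path of a document's category, depth-prefixed, into an ordinary multivalued string field, then drill down with plain facets plus facet.prefix. Field names here are hypothetical; the SolrJ calls are the standard 1.4-era API.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

public final class PathFacetExample {
    static SolrInputDocument docForCategory() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "prod-42");
        // One token per level; the numeric prefix pins the depth.
        doc.addField("cat_path", "0/Electronics");
        doc.addField("cat_path", "1/Electronics/Cameras");
        doc.addField("cat_path", "2/Electronics/Cameras/DSLR");
        return doc;
    }

    // Children of Electronics only: facet on the path field, restricted
    // to depth-1 entries under the current branch.
    static SolrQuery childrenOfElectronics() {
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("cat_path");
        q.setFacetPrefix("1/Electronics/");
        return q;
    }
}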
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929596#action_12929596 ] Grant Ingersoll commented on SOLR-792: -- Hi Peter, I like to think of it as "what if" faceting; it doesn't require the categories to be defined up front. You can solve this through hierarchical faceting, too, but this (pivot) approach doesn't require a traditional relationship description like hierarchical faceting does. Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-792) Pivot (ie: Decision Tree) Faceting Component
[ https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929606#action_12929606 ] Toke Eskildsen commented on SOLR-792: - I'd be interested to hear what the focus of SOLR-792 is, as opposed to SOLR-64. Or to put it another way: If SOLR-64 was adapted to accept a list of fields for the hierarchy, what would the purpose of SOLR-792 be? Pivot (ie: Decision Tree) Faceting Component Key: SOLR-792 URL: https://issues.apache.org/jira/browse/SOLR-792 Project: Solr Issue Type: New Feature Reporter: Erik Hatcher Assignee: Yonik Seeley Priority: Minor Attachments: SOLR-792-as-helper-class.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-PivotFaceting.patch, SOLR-792-raw-type.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, SOLR-792.patch A component to do multi-level faceting. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2746) Implement PMC Branding Guidelines
Implement PMC Branding Guidelines - Key: LUCENE-2746 URL: https://issues.apache.org/jira/browse/LUCENE-2746 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Per the Trademark committee's Branding Requirements, there are a number of things we need to do across our projects to comply. See http://www.apache.org/foundation/marks/pmcs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2746) Implement PMC Branding Guidelines
[ https://issues.apache.org/jira/browse/LUCENE-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-2746: Attachment: LUCENE-2746.patch Work in the guidelines. Implement PMC Branding Guidelines - Key: LUCENE-2746 URL: https://issues.apache.org/jira/browse/LUCENE-2746 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Attachments: LUCENE-2746.patch Per the Trademark committee's Branding Requirements, there are a number of things we need to do across our projects to comply. See http://www.apache.org/foundation/marks/pmcs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929622#action_12929622 ] Grant Ingersoll commented on LUCENENET-379: --- Please see https://issues.apache.org/jira/browse/LUCENE-2746. Also, keep in mind we will probably be dumping Forrest at some point in the near future in favor of the ASF house CMS. Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out of date design. This JIRA task is to bring it up to date with other ASF project's web page. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adopting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Lucene-Solr-tests-only-trunk - Build # 1135 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1135/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize Error Message: expected:248 but was:256 Stack Trace: junit.framework.AssertionFailedError: expected:248 but was:256 at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:119) at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:141) Build Log (for compile errors): [...truncated 3079 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929629#action_12929629 ] Jason Rutherglen commented on LUCENE-2680: -- I'm running test-core multiple times and am seeing some lurking test failures (thanks to the randomized tests that have been recently added). I'm guessing they're related to the syncs on IW and DW not being in sync some of the time. I will clean up the patch so that others may properly review it and hopefully we can figure out what's going on. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
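For illustration, a minimal sketch of the generation scheme the description above outlines (class and method names are hypothetical, not taken from any patch on this issue): each merge pinches off the current buffer, and the segment the merge produces is stamped with the next generation so already-applied buffers never touch it again.

{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.Term;

// Hypothetical sketch of generation-stamped buffered deletes.
class BufferedDeletes {
  final long gen;  // generation this buffer belongs to
  final Map<Term,Integer> terms = new HashMap<Term,Integer>();
  BufferedDeletes(long gen) { this.gen = gen; }
}

class DeleteGenerations {
  private long nextGen;
  private final List<BufferedDeletes> frozen = new ArrayList<BufferedDeletes>();
  private BufferedDeletes current = new BufferedDeletes(nextGen++);

  // Buffer a delete-by-term against the current generation.
  synchronized void delete(Term t, int docIDUpto) {
    current.terms.put(t, docIDUpto);
  }

  // Called as a merge kicks off: freeze the current buffer and return
  // the generation to stamp on the segment the merge will produce.
  synchronized long pinchOff() {
    frozen.add(current);
    current = new BufferedDeletes(nextGen++);
    return current.gen;
  }

  // Buffers still applicable to a segment created at segmentGen: only
  // generations at or after its creation, so a newly merged segment
  // never re-applies deletes that were already folded into the merge.
  synchronized List<BufferedDeletes> applicable(long segmentGen) {
    List<BufferedDeletes> out = new ArrayList<BufferedDeletes>();
    for (BufferedDeletes b : frozen)
      if (b.gen >= segmentGen) out.add(b);
    return out;
  }
}
{code}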
Lucene-Solr-tests-only-trunk - Build # 1137 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1137/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:201) Build Log (for compile errors): [...truncated 8752 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant thanks ryan On Sat, Sep 25, 2010 at 5:42 PM, mark harwood markharw...@yahoo.co.uk wrote: Both these on disk data structures and the ones in a B+ tree have seek offsets into files that require disk seeks. And both could use document ids as key values. Yep. However my approach doesn't use a doc id as a key that is searched in any B+ tree index (which involves disk seeks) - it is used as a direct offset into a file to get the pointer into a links data structure. But do these disk data structures support dynamic addition and deletion of (larger numbers of) document links? Yes, the slide deck I linked to shows how links (like documents) spend the early stages of life being merged frequently in the smaller, newer segments and over time migrate into larger, more stable segments as part of Lucene transactions. That's the theory - I'm currently benchmarking an early prototype. - Original Message From: Paul Elschot paul.elsc...@xs4all.nl To: dev@lucene.apache.org Sent: Sat, 25 September, 2010 22:03:28 Subject: Re: Document links On Saturday 25 September 2010 15:23:39, Mark Harwood wrote: My starting point in the solution I propose was to eliminate linking via any type of key. Key lookups mean indexes and indexes mean disk seeks. Graph traversals have exponential numbers of links and so all these index disk seeks start to stack up. The solution I propose uses doc ids as more-or-less direct pointers into file structures avoiding any index lookup. I've started coding up some tests using the file structures I outlined and will compare that with a traditional key-based approach. Both these on disk data structures and the ones in a B+ tree have seek offsets into files that require disk seeks. And both could use document ids as key values. But do these disk data structures support dynamic addition and deletion of (larger numbers of) document links? B+ trees are a standard solution for problems like this one, and it would probably not be easy to outperform them. It may be possible to improve performance of B+ trees somewhat by specializing for the fairly simple keys that would be needed, and by encoding very short lists of links for a single document directly into a seek offset to avoid the actual seek, but that's about it. Regards, Paul Elschot For reference - playing the Kevin Bacon game on a traditional Lucene index of IMDB data took 18 seconds to find a short path that Neo4j finds in 200 milliseconds on the same data (and this was a disk-based graph of 3m nodes, 10m edges). Going from actor->movies->actors->movies produces a lot of key lookups and the difference between key indexes and direct node pointers becomes clear. I know path finding analysis is perhaps not a typical Lucene application but other forms of link analysis e.g. recommendation engines require similar performance. Cheers Mark On 25 Sep 2010, at 11:41, Paul Elschot wrote: On Friday 24 September 2010 17:57:45, mark harwood wrote: While not exactly equivalent, it reminds me of our earlier discussion around layered segments for dealing with field updates Right. Fast discovery of document relations is a foundation on which lots of things like this can build. Relations can be given types to support a number of different use cases. How about using this (bsd licenced) tree as a starting point: http://bplusdotnet.sourceforge.net/ It has various keys: a.o. byte array, String and long. 
A fixed-size byte array as a key seems to be just fine: two bytes for a field number, four for the segment number and four for the in-segment document id. The separate segment number would allow minimizing the updates in the tree during merges. One could also use the normal doc id directly. The value could then be similar to the key, but without the field number, and with an indication of the direction of the link. Or perhaps the direction of the link should be added to the key. A link would be present twice, once for each direction. Also both directions could have their own payloads. It could be put in its own file as a separate 'segment', or maybe each segment could allow for allocation of a part of this tree. I like this somehow; if it is done right one might never need a relational database again. Well, almost... Regards, Paul Elschot - Original Message From: Grant Ingersoll gsing...@apache.org To: dev@lucene.apache.org Sent: Fri, 24 September, 2010 16:26:27 Subject: Re: Document links While not exactly equivalent, it reminds me of our earlier discussion around layered segments for dealing with field updates [1], [2], albeit this is a bit more generic since one could not only use the links for relating documents, but one could use special links
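As a concrete sketch of the fixed-size key layout Paul describes above (field number in two bytes, segment number in four, in-segment doc id in four; the helper name is made up):

{code}
import java.nio.ByteBuffer;

// Pack a 10-byte link key: field (2 bytes) + segment (4) + doc id (4).
// ByteBuffer's default big-endian order keeps lexicographic byte order
// consistent with numeric order for non-negative values, which is what
// a B+ tree comparing raw byte arrays relies on.
static byte[] linkKey(short fieldNum, int segment, int docId) {
  return ByteBuffer.allocate(10)
      .putShort(fieldNum)
      .putInt(segment)
      .putInt(docId)
      .array();
}
{code}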
Re: Document links
I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. While it might work for relatively static indexes, any index with a reasonable number of updates or deletes will invalidate any stored document references in ways which are very hard to track. Lucene's compaction shuffles IDs without taking care to preserve identity, unlike graph DBs like Neo4j (see recycling IDs here: http://goo.gl/5UbJi ) Cheers, Mark - Original Message From: Ryan McKinley ryan...@gmail.com To: dev@lucene.apache.org Sent: Mon, 8 November, 2010 19:03:59 Subject: Re: Document links Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant thanks ryan
[jira] Commented: (LUCENE-2729) Index corruption after 'read past EOF' under heavy update load and snapshot export
[ https://issues.apache.org/jira/browse/LUCENE-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929702#action_12929702 ] John Wang commented on LUCENE-2729: --- zoie does not touch index files, only adds an index.directory file containing version information. Index corruption after 'read past EOF' under heavy update load and snapshot export -- Key: LUCENE-2729 URL: https://issues.apache.org/jira/browse/LUCENE-2729 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1, 3.0.2 Environment: Happens on both OS X 10.6 and Windows 2008 Server. Integrated with zoie (using a zoie snapshot from 2010-08-06: zoie-2.0.0-snapshot-20100806.jar). Reporter: Nico Krijnen Attachments: 2010-11-02 IndexWriter infoStream log.zip We have a system running lucene and zoie. We use lucene as a content store for a CMS/DAM system. We use the hot-backup feature of zoie to make scheduled backups of the index. This works fine for small indexes and when there are not a lot of changes to the index when the backup is made. On large indexes (about 5 GB to 19 GB), when a backup is made while the index is being changed a lot (lots of document additions and/or deletions), we almost always get a 'read past EOF' at some point, followed by lots of 'Lock obtain timed out'. At that point we get lots of 0 KB files in the index, data gets lost, and the index is unusable. When we stop our server, remove the 0 KB files and restart our server, the index is operational again, but data has been lost. I'm not sure if this is a zoie or a lucene issue, so I'm posting it to both. Hopefully someone has some ideas where to look to fix this. Some more details... Stack trace of the read past EOF and following Lock obtain timed out: {code} 78307 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.BaseSearchIndex - read past EOF java.io.IOException: read past EOF at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:154) at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:39) at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:37) at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:69) at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:245) at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:166) at org.apache.lucene.index.DirectoryReader.doCommit(DirectoryReader.java:725) at org.apache.lucene.index.IndexReader.commit(IndexReader.java:987) at org.apache.lucene.index.IndexReader.commit(IndexReader.java:973) at org.apache.lucene.index.IndexReader.decRef(IndexReader.java:162) at org.apache.lucene.index.IndexReader.close(IndexReader.java:1003) at proj.zoie.impl.indexing.internal.BaseSearchIndex.deleteDocs(BaseSearchIndex.java:203) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:223) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) at proj.zoie.impl.indexing.internal.BatchedIndexDataLoader$LoaderThread.run(BatchedIndexDataLoader.java:373) 579336 [proj.zoie.impl.indexing.internal.realtimeindexdataloa...@31ca5085] ERROR proj.zoie.impl.indexing.internal.LuceneIndexDataLoader - Problem copying segments: Lock 
obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: org.apache.lucene.store.singleinstancel...@5ad0b895: write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:84) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1060) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:957) at proj.zoie.impl.indexing.internal.DiskSearchIndex.openIndexWriter(DiskSearchIndex.java:176) at proj.zoie.impl.indexing.internal.BaseSearchIndex.loadFromIndex(BaseSearchIndex.java:228) at proj.zoie.impl.indexing.internal.LuceneIndexDataLoader.loadFromIndex(LuceneIndexDataLoader.java:153) at proj.zoie.impl.indexing.internal.DiskLuceneIndexDataLoader.loadFromIndex(DiskLuceneIndexDataLoader.java:134) at proj.zoie.impl.indexing.internal.RealtimeIndexDataLoader.processBatch(RealtimeIndexDataLoader.java:171) at
[jira] Created: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
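For reference, the suggested char filter can be sketched with the existing MappingCharFilter; a minimal, untested fragment (the Version constant is assumed, since 3.1 is not yet released):

{code}
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Map ZWNJ (U+200C) to a space before UAX#29 tokenization, so a
// converted PersianAnalyzer keeps breaking tokens on ZWNJ.
NormalizeCharMap map = new NormalizeCharMap();
map.add("\u200C", " ");
Reader in = new StringReader("...");  // Persian text containing ZWNJ
Tokenizer tok = new StandardTokenizer(Version.LUCENE_31,
    new MappingCharFilter(map, CharReader.get(in)));
{code}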
Re: Document links
On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
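A small sketch of the bit-merging Paul describes, folding two 32-bit dimensions into one 64-bit value that could then be indexed via NumericField (helper name made up):

{code}
// Interleave the bits of a and b as a0 b0 a1 b1 .. a31 b31,
// with a's bits in the even positions and b's in the odd ones.
static long interleave(int a, int b) {
  long result = 0L;
  for (int i = 0; i < 32; i++) {
    result |= (((long) (a >>> i)) & 1L) << (2 * i);
    result |= (((long) (b >>> i)) & 1L) << (2 * i + 1);
  }
  return result;
}
{code}

Points close in both dimensions then share long bit prefixes, which is the property geohash-style range queries exploit.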
Re: Document links
On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Something like the geohash algorithm but with n dimensions? The linking work that Mark discussed seems nice since it would give faster access to navigating the tree -- finding N nearest neighbors etc... - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2223) Separate out generic Solr site from release specific content.
Separate out generic Solr site from release specific content. --- Key: SOLR-2223 URL: https://issues.apache.org/jira/browse/SOLR-2223 Project: Solr Issue Type: Task Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor It would be useful for deployment purposes if we separated out the Solr site that is non-release specific from the release specific content. This would make it easier to apply updates, etc. while still keeping release specific info handy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2748) Convert all Lucene web properties to use the ASF CMS
Convert all Lucene web properties to use the ASF CMS Key: LUCENE-2748 URL: https://issues.apache.org/jira/browse/LUCENE-2748 Project: Lucene - Java Issue Type: Bug Reporter: Grant Ingersoll The new CMS has a lot of nice features (and some kinks to still work out) and Forrest just doesn't cut it anymore, so we should move to the ASF CMS: http://apache.org/dev/cms.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
On Mon, Nov 8, 2010 at 2:52 PM, mark harwood markharw...@yahoo.co.uk wrote: I came to the conclusion that the transient meaning of document ids is too deeply ingrained in Lucene's design to use them to underpin any reliable linking. What about if we define an id field (like in solr)? Whatever does the traversal would need to make a Map<id,docID>, but that is still better than needing to do a query for each link. While it might work for relatively static indexes, any index with a reasonable number of updates or deletes will invalidate any stored document references in ways which are very hard to track. Lucene's compaction shuffles IDs without taking care to preserve identity, unlike graph DBs like Neo4j (see recycling IDs here: http://goo.gl/5UbJi ) oh ya -- and it is even more awkward since each subreader often reuses the same docId ryan - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
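A rough sketch of building such a map with the 3.x API (the unique field name "id" is assumed; and, as noted elsewhere in this thread, the map is only valid for the life of the reader, since docIDs shift on merges):

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Build id -> docID for a unique "id" field.
static Map<String,Integer> buildIdMap(IndexReader reader) throws IOException {
  Map<String,Integer> idToDoc = new HashMap<String,Integer>();
  TermEnum terms = reader.terms(new Term("id", ""));
  TermDocs termDocs = reader.termDocs();
  try {
    do {
      Term t = terms.term();
      if (t == null || !"id".equals(t.field())) break;  // past the id field
      termDocs.seek(t);
      if (termDocs.next()) idToDoc.put(t.text(), termDocs.doc());
    } while (terms.next());
  } finally {
    termDocs.close();
    terms.close();
  }
  return idToDoc;
}
{code}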
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929725#action_12929725 ] Aaron Powell commented on LUCENENET-380: George, The reason I spun up the external repo is so that it's easy to track changes and have a collaborative effort trying to find the right tool for the job. Can we spin up a repo under the ASF so we can collaboratively work on a solution? Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, Lucene.Net.Sharpen20101104.zip, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2747: Attachment: LUCENE-2747.patch here's a quick stab at a patch. I had to add at least minimal support to ReusableAnalyzerBase in case you want charfilters, since it doesn't have any today. maybe there is a better way to do it though. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
On Monday 08 November 2010 21:34:18 Ryan McKinley wrote: On Mon, Nov 8, 2010 at 3:20 PM, Paul Elschot paul.elsc...@xs4all.nl wrote: On Monday 08 November 2010 20:03:59 Ryan McKinley wrote: Any updates/progress with this? I'm looking at ways to implement an RTree with lucene -- and this discussion seems relevant Did you consider merging the bits of each dimension into a NumericField? For example: one dimension a0 a1 .. an and a second dimension b0 b1 .. bn into a0 b0 a1 b1 .. an bn and then index this number as a NumericField. Something like the geohash algorithm but with n dimensions? Yes. It is also a simple bounded volume hierarchy. Regards, Paul Elschot - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929738#action_12929738 ] Robert Muir commented on LUCENE-2747: - bq. CharFilter must at least also implement read() to read one char. That's incorrect. Only read(char[] cbuf, int off, int len) is abstract in Reader. CharStream extends Reader, but only adds correctOffset. CharFilter extends CharStream, but only delegates read(char[] cbuf, int off, int len). So implementing read() only adds useless code duplication here. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
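To make that concrete, a made-up CharFilter showing that the array read is the only thing a subclass overrides; the single-char read() inherited from java.io.Reader is already defined in terms of read(char[], int, int):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.CharStream;

// Hypothetical example: uppercase the stream. Only the array read is
// overridden; Reader.read() delegates to it, so duplicating read()
// here would add nothing.
public class UpperCaseCharFilter extends CharFilter {
  public UpperCaseCharFilter(CharStream in) { super(in); }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    int n = input.read(cbuf, off, len);
    for (int i = off; i < off + n; i++)
      cbuf[i] = Character.toUpperCase(cbuf[i]);
    return n;
  }
}
{code}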
Lucene-Solr-tests-only-trunk - Build # 1145 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1145/ 4 tests failed. FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: this writer hit an OutOfMemoryError; cannot complete optimize Stack Trace: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot complete optimize at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2394) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2316) at org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:129) at org.apache.lucene.search.TestNumericRangeQuery64.beforeClass(TestNumericRangeQuery64.java:90) FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: null Stack Trace: java.lang.NullPointerException at org.apache.lucene.search.TestNumericRangeQuery64.afterClass(TestNumericRangeQuery64.java:97) FAILED: junit.framework.TestSuite.org.apache.lucene.search.TestNumericRangeQuery64 Error Message: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) Stack Trace: junit.framework.AssertionFailedError: directory of test was not closed, opened from: org.apache.lucene.util.LuceneTestCase.newDirectory(LuceneTestCase.java:653) at org.apache.lucene.util.LuceneTestCase.afterClassLuceneTestCaseJ4(LuceneTestCase.java:331) REGRESSION: org.apache.lucene.search.TestPrefixFilter.testPrefixFilter Error Message: ConcurrentMergeScheduler hit unhandled exceptions Stack Trace: junit.framework.AssertionFailedError: ConcurrentMergeScheduler hit unhandled exceptions at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:458) Build Log (for compile errors): [...truncated 3116 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2749) Lexically sorted shingle filter
Lexically sorted shingle filter --- Key: LUCENE-2749 URL: https://issues.apache.org/jira/browse/LUCENE-2749 Project: Lucene - Java Issue Type: New Feature Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Priority: Minor Fix For: 3.1, 4.0 Sometimes people want to know if words have co-occurred within a specific window onto the token stream, but don't care what the order is. A Lucene token filter (LexicallySortedWindowFilter?), perhaps implemented as a ShingleFilter sub-class, could provide this functionality. This feature would allow for exact term set equality queries (in the case of a full-field-width window). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
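As a toy sketch of the idea on plain strings, leaving out the TokenStream plumbing a real ShingleFilter subclass would need (method name made up):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// One lexically sorted shingle per window: "quick brown" and
// "brown quick" produce the same term, and a window as wide as the
// whole field gives exact term-set equality.
static List<String> sortedShingles(List<String> tokens, int window) {
  List<String> shingles = new ArrayList<String>();
  for (int i = 0; i + window <= tokens.size(); i++) {
    List<String> w = new ArrayList<String>(tokens.subList(i, i + window));
    Collections.sort(w);
    StringBuilder sb = new StringBuilder();
    for (String s : w) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(s);
    }
    shingles.add(sb.toString());
  }
  return shingles;
}
{code}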
[jira] Updated: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2680: - Attachment: LUCENE-2680.patch Here's a cleaned-up patch; please take a look. I ran 'ant test-core' 5 times with no failures; however, running the below several times does eventually produce a failure. ant test-core -Dtestcase=TestThreadedOptimize -Dtestmethod=testThreadedOptimize -Dtests.seed=1547315783637080859:5267275843141383546 ant test-core -Dtestcase=TestIndexWriterMergePolicy -Dtestmethod=testMaxBufferedDocsChange -Dtests.seed=7382971652679988823:-6672235304390823521 Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1709) Distributed Date Faceting
[ https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929760#action_12929760 ] Peter Karich commented on SOLR-1709: Hi Peter Sturge, what are the limitations of this patch? Only that earlier + later isn't supported? What are the issues before committing this into trunk? Distributed Date Faceting - Key: SOLR-1709 URL: https://issues.apache.org/jira/browse/SOLR-1709 Project: Solr Issue Type: Improvement Components: SearchComponents - other Affects Versions: 1.4 Reporter: Peter Sturge Priority: Minor Attachments: FacetComponent.java, FacetComponent.java, ResponseBuilder.java, solr-1.4.0-solr-1709.patch This patch is for adding support for date facets when using distributed searches. Date faceting across multiple machines exposes some time-based issues that anyone interested in this behaviour should be aware of: Any time and/or time-zone differences are not accounted for in the patch (i.e. merged date facets are at a time-of-day, not necessarily at a universal 'instant-in-time', unless all shards are time-synced to the exact same time). The implementation uses the first encountered shard's facet_dates as the basis for subsequent shards' data to be merged in. This means that if subsequent shards' facet_dates are skewed in relation to the first by 1 'gap', these 'earlier' or 'later' facets will not be merged in. There are several reasons for this: * Performance: It's faster to check facet_date lists against a single map's data, rather than against each other, particularly if there are many shards * If 'earlier' and/or 'later' facet_dates are added in, this will make the time range larger than that which was requested (e.g. a request for one hour's worth of facets could bring back 2, 3 or more hours of data) This could be dealt with if timezone and skew information was added, and the dates were normalized. One possibility for adding such support is to [optionally] add 'timezone' and 'now' parameters to the 'facet_dates' map. This would tell requesters what time and TZ the remote server thinks it is, and so multiple shards' time data can be normalized. The patch affects 2 files in the Solr core: org.apache.solr.handler.component.FacetComponent.java org.apache.solr.handler.component.ResponseBuilder.java The main changes are in FacetComponent - ResponseBuilder is just to hold the completed SimpleOrderedMap until the finishStage. One possible enhancement is to perhaps make this an optional parameter, but really, if facet.date parameters are specified, it is assumed they are desired. Comments & suggestions welcome. As a favour to ask, if anyone could take my 2 source files and create a PATCH file from it, it would be greatly appreciated, as I'm having a bit of trouble with svn (don't shoot me, but my environment is a Redmond-based os company). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-380) Evaluate Sharpen as a port tool
[ https://issues.apache.org/jira/browse/LUCENENET-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929768#action_12929768 ] Mauricio Scheffer commented on LUCENENET-380: - @Aaron Powell: the ASF has official Git mirrors at github, see https://github.com/apache/lucene.net It's outdated, so there seems to be a problem with the ASF sync; I'd notify the ASF infrastructure team about it. See also http://www.apache.org/dev/git.html Evaluate Sharpen as a port tool --- Key: LUCENENET-380 URL: https://issues.apache.org/jira/browse/LUCENENET-380 Project: Lucene.Net Issue Type: Task Reporter: George Aroush Attachments: IndexWriter.java, Lucene.Net.Sharpen20101104.zip, NIOFSDirectory.java, QueryParser.java, TestBufferedIndexInput.java, TestDateFilter.java This task is to evaluate Sharpen as a port tool for Lucene.Net. The files to be evaluated are attached. We need to run those files (which are off Java Lucene 2.9.2) against Sharpen and compare the result against JLCA result. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Burton-West updated SOLR-2211: -- Attachment: SOLR-2211.patch Patch implements Solr UAX29TokenizerFactory and TestUAX29TokenizerFactory. Tom Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Priority: Minor Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Assigned: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned SOLR-2211: - Assignee: Robert Muir Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Document links
What about if we define an id field (like in solr)? Last time I floated the idea of supporting primary keys as a core concept in Lucene (in the context of helping doc updates, not linking) there were objections along the lines of "lucene shouldn't try to be a database" On 8 Nov 2010, at 20:47, Ryan McKinley ryan...@gmail.com wrote: What about if we define an id field (like in solr)? Whatever does the traversal would need to make a Map<id,docID>, but that is still better than needing to do a query for each link. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-1709) Distributed Date Faceting
[ https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929786#action_12929786 ] Peter Sturge commented on SOLR-1709: Hi Peter, Thanks for your message. There's of course the issue of 'now' as described in some of the above comments. This is perhaps a little ancillary to this issue, but not totally irrelevant. The issue of time zone/skew on distributed shards is currently handled by SOLR-1729 by passing a 'facet.date.now=epochtime' parameter in the search query. This is then used by the participating shards as 'now'. Of course, there are a number of ways to skin that one, but this is a straightforward solution that is backward compatible and still easy to implement in client code. Note that the facet.date.now change is not part of this patch - see SOLR-1729 for a separate patch for this parameter. (kept separate because it's, strictly speaking, a separate issue generally for distributed search) It's not that earlier/later aren't supported - the date facet 'edges' are fine, it's just the patch will 'quantize the ends' of the start/end date facets if the time is skewed from the calling server. This is where SOLR-1729 comes into play, so that this doesn't happen. As this is a pre-3x/4x branch patch, the testing is a bit limited on the latest trunk(s). Having said that, I have this (and SOLR-1729) building/running fine on my svn 3x branch release copy. Any other questions, or info you need, please do let me know. Thanks! Peter Distributed Date Faceting - Key: SOLR-1709 URL: https://issues.apache.org/jira/browse/SOLR-1709 Project: Solr Issue Type: Improvement Components: SearchComponents - other Affects Versions: 1.4 Reporter: Peter Sturge Priority: Minor Attachments: FacetComponent.java, FacetComponent.java, ResponseBuilder.java, solr-1.4.0-solr-1709.patch This patch is for adding support for date facets when using distributed searches. Date faceting across multiple machines exposes some time-based issues that anyone interested in this behaviour should be aware of: Any time and/or time-zone differences are not accounted for in the patch (i.e. merged date facets are at a time-of-day, not necessarily at a universal 'instant-in-time', unless all shards are time-synced to the exact same time). The implementation uses the first encountered shard's facet_dates as the basis for subsequent shards' data to be merged in. This means that if subsequent shards' facet_dates are skewed in relation to the first by 1 'gap', these 'earlier' or 'later' facets will not be merged in. There are several reasons for this: * Performance: It's faster to check facet_date lists against a single map's data, rather than against each other, particularly if there are many shards * If 'earlier' and/or 'later' facet_dates are added in, this will make the time range larger than that which was requested (e.g. a request for one hour's worth of facets could bring back 2, 3 or more hours of data) This could be dealt with if timezone and skew information was added, and the dates were normalized. One possibility for adding such support is to [optionally] add 'timezone' and 'now' parameters to the 'facet_dates' map. This would tell requesters what time and TZ the remote server thinks it is, and so multiple shards' time data can be normalized. 
The patch affects 2 files in the Solr core: org.apache.solr.handler.component.FacetComponent.java org.apache.solr.handler.component.ResponseBuilder.java The main changes are in FacetComponent - ResponseBuilder is just to hold the completed SimpleOrderedMap until the finishStage. One possible enhancement is to perhaps make this an optional parameter, but really, if facet.date parameters are specified, it is assumed they are desired. Comments & suggestions welcome. As a favour to ask, if anyone could take my 2 source files and create a PATCH file from it, it would be greatly appreciated, as I'm having a bit of trouble with svn (don't shoot me, but my environment is a Redmond-based os company). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
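For readers following along, a hypothetical request showing where the SOLR-1729 parameter would sit next to the standard date-facet parameters (all values made up):

{code}
/select?q=*:*&facet=true&facet.date=timestamp
  &facet.date.start=NOW/HOUR-1HOUR&facet.date.end=NOW/HOUR
  &facet.date.gap=%2B1MINUTE&facet.date.now=1289258888000
{code}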
[jira] Resolved: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved SOLR-2211. --- Resolution: Fixed Fix Version/s: 4.0 3.1 Committed revision 1032776, 1032779 (3x). Thanks Tom! Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved unicode processing without necessarily including the ip address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2040) is ConcurrentLRUCache really a thread-safe/LRU implementation?
[ https://issues.apache.org/jira/browse/SOLR-2040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929800#action_12929800 ] Yonik Seeley commented on SOLR-2040: ConcurrentLRUCache is not strictly bounded by the size - when the max size is hit, we still allow other puts to proceed while evicting the oldest entries - by design for greater concurrency. If adds to the cache are very cheap to generate, this is not an appropriate cache to use since evictions won't keep up with additions. The uses in Solr are all appropriate however, so trying to fix this via additional synchronization will only result in lower throughput. is ConcurrentLRUCache really a thread-safe/LRU implementation? -- Key: SOLR-2040 URL: https://issues.apache.org/jira/browse/SOLR-2040 Project: Solr Issue Type: Bug Reporter: lszwycn hi, i wrote a simple test
{code}
package lru.solr;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentLRUCacheTest {
    static final int loop = 1;
    static final int threadCount = 500;
    static final ConcurrentLRUCache<Integer, Integer> lruMap =
        new ConcurrentLRUCache<Integer, Integer>(128, 80, 100, 100, false, false, null);
    static final ExecutorService exec = Executors.newFixedThreadPool(threadCount);
    static final AtomicInteger totalRuncounter = new AtomicInteger();
    static final AtomicInteger putCounter = new AtomicInteger();
    static final AtomicInteger sizeCounter = new AtomicInteger();
    static long totalTime = 0;

    public static void main(String[] args) throws Exception {
        List<Callable<Long>> callList = new ArrayList<Callable<Long>>();
        for (int i = 0; i < threadCount; i++) {
            callList.add(new Callable<Long>() {
                int maxCacheSize = 0;
                int maxCacheInternalMapSize = 0;

                public Long call() throws Exception {
                    final long begin = System.nanoTime();
                    Random r = new Random();
                    for (int j = 0; j < loop; j++) {
                        totalRuncounter.getAndIncrement();
                        int n = r.nextInt(1);
                        int currentCacheSize = lruMap.size();
                        int currentCacheInternalMapSize = lruMap.getMap().size();
                        maxCacheSize = Math.max(currentCacheSize, maxCacheSize);
                        maxCacheInternalMapSize = Math.max(currentCacheInternalMapSize, maxCacheInternalMapSize);
                        if (null == lruMap.get(n)) {
                            lruMap.put(n, j);
                            putCounter.getAndIncrement();
                        } else {
                            lruMap.size();
                            sizeCounter.getAndIncrement();
                        }
                    }
                    System.out.println("maxCacheSize:" + maxCacheSize
                        + ",maxCacheInternalMapSize:" + maxCacheInternalMapSize);
                    final long end = System.nanoTime();
                    return (end - begin);
                }
            });
        }
        List<Future<Long>> futureList = exec.invokeAll(callList);
        for (Future<Long> future : futureList) {
            totalTime += future.get();
        }
        System.out.println("final cache size:" + lruMap.size());
        System.out.println("final cache internal map size:" + lruMap.getMap().size());
        System.out.println("total get:" + totalRuncounter + " spend time=" + totalTime / 1000
            + ", put:" + putCounter.get() + ", size:
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929810#action_12929810 ] Jason Rutherglen commented on LUCENE-2680: -- The problem could be that IW deleteDocument is not synced on IW; when I tried adding the sync, there was deadlock, perhaps from DW waitReady. We could be adding pending deletes to segments that are not quite current because we're not adding them in an IW sync block. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENENET-379) Clean up Lucene.Net website
[ https://issues.apache.org/jira/browse/LUCENENET-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929813#action_12929813 ] Prescott Nasser commented on LUCENENET-379: --- Any objections then to just digging into the ASF CMS system? Also, in terms of what the page should look like, do we still want to mimic the other lucene pages? or should we go with the skeleton that apache.org uses? Clean up Lucene.Net website --- Key: LUCENENET-379 URL: https://issues.apache.org/jira/browse/LUCENENET-379 Project: Lucene.Net Issue Type: Task Reporter: George Aroush The existing Lucene.Net home page at http://lucene.apache.org/lucene.net/ is still based on the incubation, out of date design. This JIRA task is to bring it up to date with other ASF project's web page. The existing website is here: https://svn.apache.org/repos/asf/lucene/lucene.net/site/ See http://www.apache.org/dev/project-site.html to get started. It would be best to start by cloning an existing ASF project's website and adopting it for Lucene.Net. Some examples, https://svn.apache.org/repos/asf/lucene/pylucene/site/ and https://svn.apache.org/repos/asf/lucene/java/site/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Lucene-3.x - Build # 175 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-3.x/175/ All tests passed Build Log (for compile errors): [...truncated 21371 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929826#action_12929826 ] Uwe Schindler commented on LUCENE-2747: --- Ay, ay, Code Dup Policeman. From a perf standpoint, for real FilterReaders in java.io that would be a no-go, but here it's fine as Tokenizers always buffer. Also, java.io's FilterReaders are different and delegate this method, but CharFilter does not. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1149 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1149/ 1 tests failed. REGRESSION: org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration Error Message: expected:<2> but was:<3> Stack Trace: junit.framework.AssertionFailedError: expected:<2> but was:<3> at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.solr.cloud.CloudStateUpdateTest.testCoreRegistration(CloudStateUpdateTest.java:201) Build Log (for compile errors): [...truncated 8711 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929834#action_12929834 ] Robert Muir commented on LUCENE-2747: - Wait, that's an interesting point, any advantage to actually using real FilterReaders for this API? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
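The char filter approach mentioned in the issue description (mapping ZWNJ to a space ahead of StandardTokenizer) can be sketched with the stock 3.x analysis classes. This is a minimal sketch, assuming MappingCharFilter and NormalizeCharMap keep their usual 3.x signatures; it is illustrative only and not the attached patch:
{noformat}
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class ZwnjCharFilterSketch {
  public static void main(String[] args) {
    String text = "foo\u200Cbar"; // hypothetical input containing a ZWNJ
    // Map ZWNJ (U+200C) to a space before tokenization, so the UAX#29
    // rules see a word boundary where ArabicLetterTokenizer used to break.
    NormalizeCharMap map = new NormalizeCharMap();
    map.add("\u200C", " ");
    Reader filtered = new MappingCharFilter(map, CharReader.get(new StringReader(text)));
    StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_31, filtered);
  }
}
{noformat}
Because CharFilter corrects offsets, highlighting against the original text should still line up after the mapping.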
[jira] Commented: (SOLR-2211) Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support
[ https://issues.apache.org/jira/browse/SOLR-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929849#action_12929849 ] Robert Muir commented on SOLR-2211: --- Great, I look forward to the results. By the way, on SOLR-2210 I also added the ICU filters; you could consider replacing LowerCaseFilterFactory with ICUNormalizer2Factory (just use the defaults). In addition to better lowercasing (e.g. ß -> ss), this would also bring the advantages described in http://unicode.org/reports/tr15/ Alternatively, if you are already using both LowerCaseFilterFactory and ASCIIFoldingFilterFactory, you can replace both with ICUFoldingFilterFactory, which goes further and also incorporates http://www.unicode.org/reports/tr30/tr30-4.html Create Solr FilterFactory for Lucene StandardTokenizer with UAX#29 support --- Key: SOLR-2211 URL: https://issues.apache.org/jira/browse/SOLR-2211 Project: Solr Issue Type: New Feature Affects Versions: 3.1 Reporter: Tom Burton-West Assignee: Robert Muir Priority: Minor Fix For: 3.1, 4.0 Attachments: SOLR-2211.patch The Lucene 3.x StandardTokenizer with UAX#29 support provides benefits for non-English tokenizing. Presently it can be invoked by using the StandardTokenizerFactory and setting the Version to 3.1. However, it would be useful to be able to use the improved Unicode processing without necessarily including the IP address and email address processing of StandardAnalyzer. A FilterFactory that allowed the use of the StandardTokenizer with UAX#29 support on its own would be useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
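To make the suggested swap concrete at the Lucene level: those factories wrap filters from the contrib/icu module, so replacing LowerCaseFilter plus ASCIIFoldingFilter collapses two links of the chain into one. A rough sketch, assuming the single-TokenStream constructor of ICUFoldingFilter from that module (not taken from the SOLR-2210 patch):
{noformat}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class IcuFoldingSketch {
  public static void main(String[] args) {
    // One filter instead of LowerCaseFilter + ASCIIFoldingFilter:
    // folds case (e.g. ß -> ss) and strips diacritics in a single pass.
    TokenStream ts = new StandardTokenizer(Version.LUCENE_31,
        new StringReader("Straße RÉSUMÉ"));
    ts = new ICUFoldingFilter(ts);
  }
}
{noformat}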
Lucene-Solr-tests-only-trunk - Build # 1152 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1152/ 1 tests failed. REGRESSION: org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety Error Message: null Stack Trace: junit.framework.AssertionFailedError: at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.index.TestIndexWriter.testCommitThreadSafety(TestIndexWriter.java:2385) Build Log (for compile errors): [...truncated 3107 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
add Kamikaze 3.0.1 into Lucene -- Key: LUCENE-2750 URL: https://issues.apache.org/jira/browse/LUCENE-2750 Project: Lucene - Java Issue Type: Sub-task Components: contrib/* Reporter: hao yan Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate this highly efficient PForDelta implementation into a Lucene Codec. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
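For background on why this matters: PForDelta delta-encodes a block of doc IDs and bit-packs most gaps at one fixed small width, spilling rare large gaps to an exception list, which is what makes decompression so much faster than byte-at-a-time VInt decoding. A toy illustration of the frame-of-reference step in plain Java (nothing here is Kamikaze's actual API):
{noformat}
public class ForDeltaToy {
  public static void main(String[] args) {
    int[] docIds = {3, 7, 11, 40, 41}; // one postings block
    int[] deltas = new int[docIds.length];
    int prev = 0, bits = 0;
    for (int i = 0; i < docIds.length; i++) {
      deltas[i] = docIds[i] - prev; // gaps: 3, 4, 4, 29, 1
      prev = docIds[i];
      bits = Math.max(bits, 32 - Integer.numberOfLeadingZeros(deltas[i]));
    }
    // Every gap fits in 'bits' bits (5 here), versus at least 8 bits per
    // VInt byte; real PForDelta picks an even smaller width and stores
    // outliers like 29 separately as exceptions.
    System.out.println("bits per packed delta: " + bits);
  }
}
{noformat}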
Lucene-trunk - Build # 1357 - Still Failing
Build: https://hudson.apache.org/hudson/job/Lucene-trunk/1357/ All tests passed Build Log (for compile errors): [...truncated 18287 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Lucene-Solr-tests-only-trunk - Build # 1155 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1155/ 1 tests failed. REGRESSION: org.apache.solr.TestDistributedSearch.testDistribSearch Error Message: Some threads threw uncaught exceptions! Stack Trace: junit.framework.AssertionFailedError: Some threads threw uncaught exceptions! at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:437) at org.apache.solr.SolrTestCaseJ4.tearDown(SolrTestCaseJ4.java:78) at org.apache.solr.BaseDistributedSearchTestCase.tearDown(BaseDistributedSearchTestCase.java:144) Build Log (for compile errors): [...truncated 8857 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs
TermVectorComponent did not return results when using distributedProcess in distribution envs - Key: SOLR-2224 URL: https://issues.apache.org/jira/browse/SOLR-2224 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0 Environment: JDK1.6/Tomcat6 Reporter: tom liu When using a distributed query, TVRH did not return any results. In distributedProcess, tv creates one request that uses TermVectorParams.DOC_IDS, for example tv.docIds=10001, but QueryComponent returns ids that are uniqueKeys, not doc IDs. So, in distributed environments, distributedProcess must not be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (SOLR-2224) TermVectorComponent did not return results when using distributedProcess in distribution envs
[ https://issues.apache.org/jira/browse/SOLR-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929916#action_12929916 ] tom liu commented on SOLR-2224: --- We can delete the distributedProcess method and add a modifyRequest method:
{noformat}
public void modifyRequest(ResponseBuilder rb, SearchComponent who, ShardRequest sreq) {
  // Only ask shards for term vectors during the GET_FIELDS stage.
  if (rb.stage == ResponseBuilder.STAGE_GET_FIELDS)
    sreq.params.set("tv", true);
  else
    sreq.params.set("tv", false);
}
{noformat}
TermVectorComponent did not return results when using distributedProcess in distribution envs - Key: SOLR-2224 URL: https://issues.apache.org/jira/browse/SOLR-2224 Project: Solr Issue Type: Bug Components: SearchComponents - other Affects Versions: 4.0 Environment: JDK1.6/Tomcat6 Reporter: tom liu When using a distributed query, TVRH did not return any results. In distributedProcess, tv creates one request that uses TermVectorParams.DOC_IDS, for example tv.docIds=10001, but QueryComponent returns ids that are uniqueKeys, not doc IDs. So, in distributed environments, distributedProcess must not be used. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2680) Improve how IndexWriter flushes deletes against existing segments
[ https://issues.apache.org/jira/browse/LUCENE-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929927#action_12929927 ] Jason Rutherglen commented on LUCENE-2680: -- Ok, TestThreadedOptimize works when the DW sync'ed pushSegmentInfos method isn't called anymore (no extra per-segment deleting is going on), and stops working when pushSegmentInfos is turned back on. Something about the sync on DW is causing a problem. Hmm... We need another way to pass segment infos around consistently. Improve how IndexWriter flushes deletes against existing segments - Key: LUCENE-2680 URL: https://issues.apache.org/jira/browse/LUCENE-2680 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch, LUCENE-2680.patch IndexWriter buffers up all deletes (by Term and Query) and only applies them if 1) commit or NRT getReader() is called, or 2) a merge is about to kick off. We do this because, for a large index, it's very costly to open a SegmentReader for every segment in the index. So we defer as long as we can. We do it just before merge so that the merge can eliminate the deleted docs. But, most merges are small, yet in a big index we apply deletes to all of the segments, which is really very wasteful. Instead, we should only apply the buffered deletes to the segments that are about to be merged, and keep the buffer around for the remaining segments. I think it's not so hard to do; we'd have to have generations of pending deletions, because the newly merged segment doesn't need the same buffered deletions applied again. So every time a merge kicks off, we pinch off the current set of buffered deletions, open a new set (the next generation), and record which segment was created as of which generation. This should be a very sizable gain for large indices that mix deletes, though less so in flex, since opening the terms index is much faster. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929934#action_12929934 ] DM Smith commented on LUCENE-2747: -- I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too. Then I'd be happy. Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929936#action_12929936 ] Steven Rowe commented on LUCENE-2747: - Robert, your patch looks good - I have a couple of questions: * You removed {{TestHindiFilters.testTokenizer()}}, {{TestIndicTokenizer.testBasics()}} and {{TestIndicTokenizer.testFormat()}}, but these would be useful in {{TestStandardAnalyzer}} and {{TestUAX29Tokenizer}}, wouldn't they? * You did not remove {{ArabicLetterTokenizer}} and {{IndicTokenizer}}, presumably so that they can be used with Lucene 4.0+ when the supplied {{Version}} is less than 3.1 -- good catch, I had forgotten this requirement -- but when can we actually get rid of these? Since they will be staying, shouldn't their tests remain too, but using {{Version.LUCENE_30}} instead of {{TEST_VERSION_CURRENT}}? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2747) Deprecate/remove language-specific tokenizers in favor of StandardTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12929945#action_12929945 ] Steven Rowe commented on LUCENE-2747: - bq. I'm not too keen on this. For classics and ancient texts the standard analyzer is not as good as the simple analyzer. I think it is important to have a tokenizer that does not try to be too smart. I think it'd be good to have a SimpleAnalyzer based upon UAX#29, too. {{UAX29Tokenizer}} could be combined with {{LowercaseFilter}} to provide that, no? Robert is arguing in the reopened LUCENE-2167 for {{StandardTokenizer}} to be stripped down so that it only implements UAX#29 rules (i.e., dropping URL+Email recognition), so if that comes to pass, {{StandardAnalyzer}} would just be UAX#29+lowercase+stopword (with English stopwords by default, but those can be overridden in the ctor) -- would that make you happy? Deprecate/remove language-specific tokenizers in favor of StandardTokenizer --- Key: LUCENE-2747 URL: https://issues.apache.org/jira/browse/LUCENE-2747 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2747.patch As of Lucene 3.1, StandardTokenizer implements UAX#29 word boundary rules to provide language-neutral tokenization. Lucene contains several language-specific tokenizers that should be replaced by UAX#29-based StandardTokenizer (deprecated in 3.1 and removed in 4.0). The language-specific *analyzers*, by contrast, should remain, because they contain language-specific post-tokenization filters. The language-specific analyzers should switch to StandardTokenizer in 3.1. Some usages of language-specific tokenizers will need additional work beyond just replacing the tokenizer in the language-specific analyzer. For example, PersianAnalyzer currently uses ArabicLetterTokenizer, and depends on the fact that this tokenizer breaks tokens on the ZWNJ character (zero-width non-joiner; U+200C), but in the UAX#29 word boundary rules, ZWNJ is not a word boundary. Robert Muir has suggested using a char filter converting ZWNJ to spaces prior to StandardTokenizer in the converted PersianAnalyzer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
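If it helps the discussion: the UAX#29-plus-lowercase combination could be wired up along the following lines. A sketch only: UAX29Tokenizer is the class exercised by the attached patch's tests, and a plain Reader constructor is assumed here rather than confirmed against the patch.
{noformat}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.UAX29Tokenizer;
import org.apache.lucene.util.Version;

// A "simple" analyzer: UAX#29 word boundaries plus lowercasing,
// with no URL/email recognition and no stopword removal.
public final class UAX29SimpleAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(Version.LUCENE_40, new UAX29Tokenizer(reader));
  }
}
{noformat}
A stopword filter could then be layered on top to get the stripped-down StandardAnalyzer behavior described above.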
Lucene-Solr-tests-only-trunk - Build # 1158 - Failure
Build: https://hudson.apache.org/hudson/job/Lucene-Solr-tests-only-trunk/1158/ 1 tests failed. REGRESSION: org.apache.solr.cloud.BasicDistributedZkTest.testDistribSearch Error Message: .response.numFound:35!=67 Stack Trace: junit.framework.AssertionFailedError: .response.numFound:35!=67 at org.apache.solr.BaseDistributedSearchTestCase.compareResponses(BaseDistributedSearchTestCase.java:553) at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:307) at org.apache.solr.cloud.BasicDistributedZkTest.doTest(BasicDistributedZkTest.java:127) at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:562) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:878) at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:844) Build Log (for compile errors): [...truncated 8715 lines...] - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org