[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719422#action_12719422 ] Shai Erera commented on LUCENE-1677:

I think test-core is broken too ...

Remove GCJ IndexReader specializations
--------------------------------------
Key: LUCENE-1677
URL: https://issues.apache.org/jira/browse/LUCENE-1677
Project: Lucene - Java
Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
Fix For: 2.9

These specializations are outdated, unsupported, most probably pointless due to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you are going to ask people on java-user; anybody replied that they need it?). While giving nothing, they make SegmentReader instantiation code look real ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used
[ https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719435#action_12719435 ] Adrian Hempel commented on LUCENE-1474:

Hi Michael,

The index that Erik was working with contained segments created with a pre-2.4.1 version of Lucene, so we don't believe this is a regression.

Regards,
Adrian

Incorrect SegmentInfo.delCount when IndexReader.flush() is used
---------------------------------------------------------------
Key: LUCENE-1474
URL: https://issues.apache.org/jira/browse/LUCENE-1474
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4
Reporter: Marcel Reutegger
Assignee: Michael McCandless
Fix For: 2.4.1, 2.9
Attachments: CheckIndex.txt, IndexReaderTest.java

When deleted documents are flushed using IndexReader.flush(), the delCount in SegmentInfo is updated based on the current value and SegmentReader.pendingDeleteCount (introduced by LUCENE-1267). It seems that pendingDeleteCount is not reset after the commit, which means that after a second flush() or close() of an index reader the delCount in SegmentInfo is incorrect. A subsequent IndexReader.open() call will fail with an error when assertions are enabled. E.g.:

java.lang.AssertionError: delete count mismatch: info=3 vs BitVector=2
    at org.apache.lucene.index.SegmentReader.loadDeletedDocs(SegmentReader.java:405)
    [...]
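The counter bug described above is easy to see in isolation. Below is a minimal, hypothetical sketch (not Lucene's actual SegmentReader code; all names are illustrative): a pending-delete counter that is folded into the committed count on flush must be reset afterwards, or a second flush double-counts the same deletions.

```java
// Hypothetical sketch of the LUCENE-1474 bug pattern: a pending counter
// folded into the committed count on flush must be reset afterwards,
// or a second flush double-counts the same deletions.
public class PendingDeleteSketch {
    int delCount;            // committed deletion count (as in SegmentInfo)
    int pendingDeleteCount;  // deletions buffered since the last commit

    void deleteDoc() {
        pendingDeleteCount++;
    }

    // Buggy variant: forgets to reset pendingDeleteCount after the commit.
    void flushBuggy() {
        delCount += pendingDeleteCount;
    }

    // Fixed variant: reset the counter once its value is committed.
    void flushFixed() {
        delCount += pendingDeleteCount;
        pendingDeleteCount = 0;
    }

    public static void main(String[] args) {
        PendingDeleteSketch buggy = new PendingDeleteSketch();
        buggy.deleteDoc();
        buggy.flushBuggy();
        buggy.flushBuggy();                  // same deletion counted twice
        System.out.println(buggy.delCount);  // prints 2, though only 1 doc was deleted

        PendingDeleteSketch fixed = new PendingDeleteSketch();
        fixed.deleteDoc();
        fixed.flushFixed();
        fixed.flushFixed();                  // no-op: counter was reset
        System.out.println(fixed.delCount);  // prints 1
    }
}
```

With the buggy variant, the committed count drifts away from the actual number of deleted documents, which is exactly the "delete count mismatch" assertion seen above.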
Re: Build failed in Hudson: Lucene-trunk #859
The wrong artifact name is in the pom.xml.template for contrib/remote. Here is a diff with a patch:

Index: contrib/remote/pom.xml.template
===================================================================
--- contrib/remote/pom.xml.template	(revision 784550)
+++ contrib/remote/pom.xml.template	(working copy)
@@ -28,7 +28,7 @@
     <version>@version@</version>
   </parent>
   <groupId>org.apache.lucene</groupId>
-  <artifactId>lucene-regex</artifactId>
+  <artifactId>lucene-remote</artifactId>
   <name>Lucene Remote</name>
   <version>@version@</version>
   <description>Remote Searchable based on RMI</description>

simon

On Mon, Jun 15, 2009 at 4:19 AM, Apache Hudson Server <hud...@hudson.zones.apache.org> wrote:
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/859/changes

Changes:
[mikemccand] LUCENE-1539: add DeleteByPercent, FlushReader tasks, and ability to open reader on a labelled commit point
[mikemccand] LUCENE-1571: fix LatLongDistanceFilter to respect deleted docs
[mikemccand] LUCENE-979: remove a few more old benchmark things
[mikemccand] revert accidental commit
[mikemccand] LUCENE-1677: deprecate gcj specializations, and the system properties that let you specify which SegmentReader impl class to use
[mikemccand] LUCENE-1407: move RemoteSearchable out of core into contrib/remote

--
[...truncated 6620 lines...]
build-lucene:
build-lucene-tests:
init:
clover.setup:
clover.info:
clover:
compile-core:
jar-src:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/lucene-misc-2.9-SNAPSHOT-src.jar
dist-maven:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc
    [artifact:install-provider] Installing provider: org.apache.maven.wagon:wagon-ssh:jar:1.0-beta-2:runtime
    [artifact:pom] Error downloading parent pom org.apache.lucene:lucene-contrib::2.9-SNAPSHOT: Missing:
    [artifact:pom] --
    [artifact:pom] 1) org.apache.lucene:lucene-contrib:pom:2.9-SNAPSHOT
    [artifact:pom]    Path to dependency:
    [artifact:pom]    1) unspecified:unspecified:jar:0.0
    [artifact:pom]    2) org.apache.lucene:lucene-contrib:pom:2.9-SNAPSHOT
    [artifact:pom] --
    [artifact:pom] 1 required artifact is missing.
    [artifact:pom] for artifact: unspecified:unspecified:jar:0.0
    [artifact:pom] from the specified remote repositories:
    [artifact:pom]    central (http://repo1.maven.org/maven2)
    [artifact:deploy] Deploying to file://http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/dist/maven
    [artifact:deploy] [INFO] Retrieving previous build number from remote
    [artifact:deploy] [INFO] repository metadata for: 'snapshot org.apache.lucene:lucene-misc:2.9-SNAPSHOT' could not be found on repository: remote, so will be created
    [artifact:deploy] Uploading: org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1.jar to remote
    [artifact:deploy] Uploaded 52K
    [artifact:deploy] [INFO] Retrieving previous metadata from remote
    [artifact:deploy] [INFO] repository metadata for: 'artifact org.apache.lucene:lucene-misc' could not be found on repository: remote, so will be created
    [artifact:deploy] [INFO] Uploading repository metadata for: 'artifact org.apache.lucene:lucene-misc'
    [artifact:deploy] [INFO] Uploading project information for lucene-misc 2.9-20090615.021803-1
    [artifact:deploy] [INFO] Retrieving previous metadata from remote
    [artifact:deploy] [INFO] repository metadata for: 'snapshot org.apache.lucene:lucene-misc:2.9-SNAPSHOT' could not be found on repository: remote, so will be created
    [artifact:deploy] [INFO] Uploading repository metadata for: 'snapshot org.apache.lucene:lucene-misc:2.9-SNAPSHOT'
    [artifact:deploy] [INFO] Retrieving previous build number from remote
    [artifact:deploy] Uploading: org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1-sources.jar to remote
    [artifact:deploy] Uploaded 53K
    [artifact:deploy] [INFO] Retrieving previous build number from remote
    [artifact:deploy] Uploading: org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1-javadoc.jar to remote
    [artifact:deploy] Uploaded 142K
    [echo] Building queries...
javacc-uptodate-check:
javacc-notice:
jflex-uptodate-check:
jflex-notice:
common.init:
build-lucene:
build-lucene-tests:
init:
clover.setup:
clover.info:
clover:
compile-core:
jar-src:
    [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/lucene-queries-2.9-SNAPSHOT-src.jar
dist-maven:
    [copy] Copying 1 file to http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries
    [artifact:install-provider] Installing provider: org.apache.maven.wagon:wagon-ssh:jar:1.0-beta-2:runtime
    [artifact:pom]
[jira] Created: (LUCENE-1691) An index copied over another index can result in corruption
An index copied over another index can result in corruption
-----------------------------------------------------------
Key: LUCENE-1691
URL: https://issues.apache.org/jira/browse/LUCENE-1691
Project: Lucene - Java
Issue Type: Improvement
Components: Store
Reporter: Adrian Hempel
Priority: Minor
Fix For: 2.4.1

After restoring an older backup of an index over the top of a newer version of the index, attempts to open the index can result in CorruptIndexExceptions, such as:

{noformat}
Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _ed: fieldsReader shows 1137 but segmentInfo shows 1389
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:362)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:228)
    at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:55)
    at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:27)
    at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:102)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
    at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:237)
{noformat}

The apparent cause is the strategy of taking the maximum of the ID in the segments.gen file and the IDs of the apparently valid segment files (see lines 523-593 [here|http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/index/SegmentInfos.java?annotate=751393]), and using this as the current generation of the index. This will include stale segments that existed before the backup was restored.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
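The generation-selection strategy described in the report can be illustrated in isolation. The following is a hypothetical sketch, not the actual SegmentInfos code: if the current generation is chosen as the maximum over segments.gen and every segments_N file found in the directory, a stale segments_N left over from before the restore wins over the restored, older commit point.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the LUCENE-1691 report (not Lucene code):
// choosing the current generation as the maximum of segments.gen and all
// segments_N files in the directory lets a stale, pre-restore segments_N
// shadow the restored commit point.
public class GenerationSketch {
    static long chooseGeneration(long genFromSegmentsGen, List<Long> gensFromDirListing) {
        long max = genFromSegmentsGen;
        for (long g : gensFromDirListing) {
            max = Math.max(max, g);  // stale generations participate here
        }
        return max;
    }

    public static void main(String[] args) {
        // Backup restored segments_5 (and segments.gen pointing at 5), but a
        // stale segments_9 from the newer index was not removed first.
        long chosen = chooseGeneration(5, Arrays.asList(5L, 9L));
        System.out.println("opens generation " + chosen); // 9: the stale commit
    }
}
```

With no stale files in the directory, the maximum agrees with segments.gen and the correct generation is opened; the corruption only appears when old files survive the restore.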
Re: Build failed in Hudson: Lucene-trunk #859
Thanks, Simon! I just committed the fix.

Michael
RE: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/jav
Hi Mike,

After adding a new contrib, I think we should also add it to the site docs and to the javadoc generation in the main build.xml. Should I prepare this? I have done this for spatial and trie in the past, too.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-----Original Message-----
From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
Sent: Sunday, June 14, 2009 1:13 PM
To: java-comm...@lucene.apache.org
Subject: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/java/org/apache/lucene/ contrib/remote/src/java/org/apache/...

Author: mikemccand
Date: Sun Jun 14 11:13:04 2009
New Revision: 784540
URL: http://svn.apache.org/viewvc?rev=784540&view=rev

Log:
LUCENE-1407: move RemoteSearchable out of core into contrib/remote

Added:
    lucene/java/trunk/contrib/remote/
    lucene/java/trunk/contrib/remote/build.xml
    lucene/java/trunk/contrib/remote/pom.xml.template
    lucene/java/trunk/contrib/remote/src/
    lucene/java/trunk/contrib/remote/src/java/
    lucene/java/trunk/contrib/remote/src/java/org/
    lucene/java/trunk/contrib/remote/src/java/org/apache/
    lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/
    lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/
    lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RMIRemoteSearchable.java
    lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
      - copied, changed from r784216, lucene/java/trunk/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
    lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RemoteSearchable.java
      - copied, changed from r784216, lucene/java/trunk/src/java/org/apache/lucene/search/RemoteSearchable.java
    lucene/java/trunk/contrib/remote/src/test/
    lucene/java/trunk/contrib/remote/src/test/org/
    lucene/java/trunk/contrib/remote/src/test/org/apache/
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
      - copied unchanged from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
      - copied, changed from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteSearchable.java
      - copied, changed from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteSearchable.java
    lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteSort.java

Removed:
    lucene/java/trunk/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
    lucene/java/trunk/src/java/org/apache/lucene/search/RemoteSearchable.java
    lucene/java/trunk/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
    lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
    lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteSearchable.java

Modified:
    lucene/java/trunk/CHANGES.txt
    lucene/java/trunk/build.xml
    lucene/java/trunk/common-build.xml
    lucene/java/trunk/src/java/org/apache/lucene/search/CachingSpanFilter.java
    lucene/java/trunk/src/java/org/apache/lucene/search/CachingWrapperFilter.java
    lucene/java/trunk/src/java/org/apache/lucene/search/FilterManager.java
    lucene/java/trunk/src/java/org/apache/lucene/search/Searchable.java
    lucene/java/trunk/src/test/org/apache/lucene/search/TestSort.java

Modified: lucene/java/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=784540&r1=784539&r2=784540&view=diff
==============================================================================
--- lucene/java/trunk/CHANGES.txt (original)
+++ lucene/java/trunk/CHANGES.txt Sun Jun 14 11:13:04 2009
@@ -196,6 +196,11 @@
 were deprecated. You should instantiate the Directory manually before
 and pass it to these classes (LUCENE-1451, LUCENE-1658). (Uwe Schindler)
+
+21. LUCENE-1407: Move RemoteSearchable, RemoteCachingWrapperFilter out
+of Lucene's core into new contrib/remote package. Searchable no
+longer extends java.rmi.Remote (Simon Willnauer via Mike
+McCandless)

 Bug fixes

Modified: lucene/java/trunk/build.xml
URL: http://svn.apache.org/viewvc/lucene/java/trunk/build.xml?rev=784540&r1=784539&r2=784540&view=diff
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719491#action_12719491 ] Michael McCandless commented on LUCENE-1677:

bq. I think test-core is broken too ...

It should be fixed now? (I reverted it.)
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719504#action_12719504 ] Shai Erera commented on LUCENE-1677:

You're right. I updated build.xml, but the change for test-core was actually in common-build.xml. Sorry for the false alarm.
Re: Build failed in Hudson: Lucene-trunk #859
FYI, Simon, you are still a contrib committer ;-)
Re: Build failed in Hudson: Lucene-trunk #859
Uh! I didn't know that I can commit to all contribs. Good to know, but I have been inactive for a while, so I want to use my power with care!

simon

On Mon, Jun 15, 2009 at 12:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
FYI, Simon, you are still a contrib committer ;-)
[jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption
[ https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719513#action_12719513 ] Michael McCandless commented on LUCENE-1691: Copying over an existing index, without first removing all files in that index, is not a supported use case for Lucene. Ie, to restore from backup you should make an empty dir and copy back your index files. An index copied over another index can result in corruption --- Key: LUCENE-1691 URL: https://issues.apache.org/jira/browse/LUCENE-1691 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Adrian Hempel Priority: Minor Fix For: 2.4.1 After restoring an older backup of an index over the top of a newer version of the index, attempts to open the index can result in CorruptIndexExceptions, such as: {noformat} Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _ed: fieldsReader shows 1137 but segmentInfo shows 1389 at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:362) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306) at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:228) at org.apache.lucene.index.MultiSegmentReader.init(MultiSegmentReader.java:55) at org.apache.lucene.index.ReadOnlyMultiSegmentReader.init(ReadOnlyMultiSegmentReader.java:27) at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:102) at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653) at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115) at org.apache.lucene.index.IndexReader.open(IndexReader.java:316) at org.apache.lucene.index.IndexReader.open(IndexReader.java:237) {noformat} The apparent cause is the strategy of taking the maximum of the ID in the segments.gen file, and the IDs of the apparently valid segment files (See lines 523-593 
[here|http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/index/SegmentInfos.java?annotate=751393]), and using this as the current generation of the index. This will include stale segments that existed before the backup was restored. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
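Mike's restore advice above can be sketched as a small standalone helper (hypothetical code, not part of Lucene): clear the index directory completely before copying the backup in, so no stale segments_N or segments.gen file survives to inflate the generation that SegmentInfos detects.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the supported restore procedure described
// above: empty the target directory first, then copy the backup files in.
// That way no stale segment file from the newer index can be picked up by
// SegmentInfos' generation detection.
class RestoreIndex {
    public static void restore(File backupDir, File indexDir) throws IOException {
        File[] stale = indexDir.listFiles();
        if (stale != null) {
            for (File f : stale) {
                // remove every pre-existing file, not just known extensions
                if (!f.delete()) {
                    throw new IOException("could not delete stale file: " + f);
                }
            }
        }
        File[] backup = backupDir.listFiles();
        if (backup == null) {
            throw new IOException("backup dir unreadable: " + backupDir);
        }
        for (File f : backup) {
            Files.copy(f.toPath(), new File(indexDir, f.getName()).toPath(),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Copying over the top without the delete step is exactly the unsupported case that produces the doc-count mismatch in the report.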
[jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption
[ https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719520#action_12719520 ] Adrian Hempel commented on LUCENE-1691: --- I realised that would probably be the case, but in the real world, this will be a common occurrence. Hence my raising this issue as an Improvement rather than a Bug.
Re: [jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption
Adrian Hempel (JIRA) wrote:
> [ https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719520#action_12719520 ]
> Adrian Hempel commented on LUCENE-1691:
> ---
> I realised that would probably be the case, but in the real world, this will be a common occurrence.

Delete the index you are copying over first?

> Hence my raising this issue as an Improvement rather than a Bug.

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719535#action_12719535 ] Shai Erera commented on LUCENE-1630: Ok, I was just about to post the patch when the Spatial tests failed. After some investigation I found the following, and would appreciate your suggestions. In IndexSearcher.search(QueryWeight weight, Filter filter, final int nDocs, Sort sort, boolean fillFields) I wrote the following code:
{code}
// try to create a Scorer in out-of-order mode, just to know which TFC
// version to instantiate.
boolean docsScoredInOrder = false;
if (subReaders.length > 0) {
  docsScoredInOrder = !weight.scorer(subReaders[0], false, false).scoresOutOfOrder();
}
TopFieldCollector collector = TopFieldCollector.create(sort, nDocs, fillFields,
    fieldSortDoTrackScores, fieldSortDoMaxScore, docsScoredInOrder);
search(weight, filter, collector);
{code}
For clarification - I need to know which TFC instance to create (in-order / out-of-order). For that, I need to first create a Scorer, asking for an out-of-order one, and then check whether the Scorer is indeed out-of-order or not. That's a dummy Scorer, as I never use it afterwards, but since we didn't want to add scoresOutOfOrder to Weight, only to Scorer, I don't have any other choice. For Spatial, this creates a problem. One of the tests uses ConstantScoreQuery and passes in a Filter. CSQ.scorer() creates a new Scorer and keeps a reference to the given Filter. In Spatial, every time Filter.getDocIdSet() is called, the internal filter populates a WeakHashMap of distances (with the doc id as key) and doesn't clear it between invocations. It also updates the base of the key to handle multiple readers. Therefore the docs of the first reader are added twice - once for the dummy invocation, and a second time since the base is updated (LatLongDistanceFilter.java, line 222) to reader.maxDoc().
I tried to create a new distances map on every invocation, but then another test fails. I don't know this code very well, and I don't know which is the best solution:
* Complicate the code in IndexSearcher to create a Scorer, collect it, and then proceed w/ iterating on the readers from the 2nd one forward. This is a really ugly change - I tried it and quickly reverted it. It also breaks the current beauty of having all the search methods call search(Weight, Filter, Collector).
* Fix the LatLongDistanceFilter code to check if reader.maxDoc() == nextOffset, and then do nextOffset -= reader.maxDoc(). This is not pretty either, since it assumes a certain implementation and use of it.
* Add scoresOutOfOrder to Weight - but I don't know if we want to add this knowledge to Weight, and it fits nicely in Scorer.
Any suggestions? Perhaps a different fix to Spatial? Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while score(reader) calls score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight.
For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default to false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer
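Shai's third option - moving the out-of-order hint onto the weight itself so no throw-away Scorer is needed - might look roughly like this (names hypothetical; a sketch of the proposal, not Lucene's actual API):

```java
// Sketch of the "add scoresOutOfOrder to Weight" option: the weight reports
// the orderness of the Scorers it will create, so IndexSearcher can pick the
// right TopFieldCollector variant without instantiating a dummy Scorer.
abstract class QueryWeightSketch {
    /** Conservative default: docs are scored in order unless a subclass says otherwise. */
    public boolean scoresDocsOutOfOrder() {
        return false;
    }
}

// A BooleanQuery-style weight would override the hint: BooleanScorer scores
// out of order, BooleanScorer2 in order, depending on the allowed mode.
class BooleanWeightSketch extends QueryWeightSketch {
    private final boolean allowOutOfOrder;

    BooleanWeightSketch(boolean allowOutOfOrder) {
        this.allowOutOfOrder = allowOutOfOrder;
    }

    @Override
    public boolean scoresDocsOutOfOrder() {
        return allowOutOfOrder;
    }
}
```

IndexSearcher could then compute `boolean docsScoredInOrder = !weight.scoresDocsOutOfOrder();` before creating any Scorer, which avoids the Spatial double-invocation problem entirely.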
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539 ] Earwin Burrfoot commented on LUCENE-1630: - I like the last option most. Creating dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future. Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while score(reader) calls score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default to false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. 
#* Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. * Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. 
One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
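The collector-creation decision the list above keeps returning to reduces to a single boolean dispatch; a toy sketch (hypothetical names, returning class names as strings for illustration) of the proposed static create method:

```java
// Toy sketch of the proposed TFC/TSDC static create(...) dispatch: a single
// orderness flag - obtained from the Weight or Scorer - selects between the
// in-order and out-of-order collector implementations.
class CollectorFactorySketch {
    public static String create(boolean docsScoredInOrder) {
        // in-order collectors can use cheaper doc-id tie-breaking in the queue
        return docsScoredInOrder ? "InOrderTopFieldCollector"
                                 : "OutOfOrderTopFieldCollector";
    }
}
```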
[jira] Issue Comment Edited: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719539#action_12719539 ] Earwin Burrfoot edited comment on LUCENE-1630 at 6/15/09 5:36 AM: -- I like the last option (move scoresOutOfOrder to Weight) most. Creating a dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future. was (Author: earwin): I like the last option most. Creating dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.
Re: svn commit: r784758 - in /lucene/java/trunk: ./ docs/ docs/lucene-sandbox/ src/site/src/documentation/content/xdocs/
Thanks, Uwe! simon

On Mon, Jun 15, 2009 at 2:45 PM, uschind...@apache.org wrote:
> Author: uschindler
> Date: Mon Jun 15 12:45:05 2009
> New Revision: 784758
> URL: http://svn.apache.org/viewvc?rev=784758&view=rev
> Log: LUCENE-1407: move RemoteSearchable out of core into contrib/remote (add javadocs to developer resources)
>
> Modified: lucene/java/trunk/build.xml, docs/benchmarks.html, docs/contributions.html, docs/demo.html, docs/demo2.html, docs/demo3.html, docs/demo4.html, docs/fileformats.html, docs/gettingstarted.html, docs/index.html, docs/linkmap.html, docs/linkmap.pdf, docs/lucene-sandbox/index.html, docs/queryparsersyntax.html, docs/scoring.html, src/site/src/documentation/content/xdocs/site.xml
>
> --- lucene/java/trunk/build.xml (original)
> +++ lucene/java/trunk/build.xml Mon Jun 15 12:45:05 2009
> @@ -309,6 +309,7 @@
>    <packageset dir="contrib/miscellaneous/src/java"/>
>    <packageset dir="contrib/queries/src/java"/>
>    <packageset dir="contrib/regex/src/java"/>
> +  <packageset dir="contrib/remote/src/java"/>
>    <packageset dir="contrib/snowball/src/java"/>
>    <packageset dir="contrib/spatial/src/java"/>
>    <packageset dir="contrib/spellchecker/src/java"/>
>
> The same generated-menu addition is applied in each of the modified docs/*.html files:
>
> @@ -163,6 +163,9 @@
>  <a href="api/contrib-regex/index.html">Regex</a>
>  </div>
>  <div class="menuitem">
> +<a href="api/contrib-remote/index.html">Remote</a>
> +</div>
> +<div class="menuitem">
>  <a href="api/contrib-snowball/index.html">Snowball</a>
>  </div>
>  <div class="menuitem">
[jira] Updated: (LUCENE-1691) An index copied over another index can result in corruption
[ https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Hempel updated LUCENE-1691: -- Fix Version/s: (was: 2.4.1) Affects Version/s: 2.4.1 An index copied over another index can result in corruption --- Key: LUCENE-1691 URL: https://issues.apache.org/jira/browse/LUCENE-1691 Project: Lucene - Java Issue Type: Improvement Components: Store Affects Versions: 2.4.1 Reporter: Adrian Hempel Priority: Minor
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719555#action_12719555 ] Mark Miller commented on LUCENE-1595: - Okay, how about something like this: we document the changes and the conversion process in the benchmark CHANGES, and then maybe check for removed alg properties in the algorithms and throw an exception pointing people to the CHANGES file if we find one? Or something along those lines? I'd like to make the transition as smooth as possible. Split DocMaker into ContentSource and DocMaker -- Key: LUCENE-1595 URL: https://issues.apache.org/jira/browse/LUCENE-1595 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch This issue proposes some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object. These two should actually be split into ContentSource and DocMaker, which will use a ContentSource instance. ContentSource will implement all the methods of DocMaker, like getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 1591, by having a basic ContentSource that offers input stream services, and wraps a file (for example) with bzip or gzip streams etc. DocMaker will implement the makeDocument methods, reusing DocState etc. The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. Same goes for Trec and Reuters collections, as well as LineDocMaker. In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState).
That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just for the reuse of DocState). Also, other DocMakers do not use that DocState today - something they could have gotten for free with the proposed refactoring. So by having an EnwikiContentSource, ReutersContentSource and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), passing to them a ContentSource instance, and reuse the same DocMaking algorithm with many content sources, as well as the same ContentSource algorithm with many DocMaker implementations. This will also give us the opportunity to perf-test content sources alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object. I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have. I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
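The split described above can be sketched in a few lines (all names hypothetical and heavily simplified - the real proposal deals in DocData and Lucene Document objects): ContentSource only reads raw content from a collection, while DocMaker turns raw content into an indexable document, so one doc-making strategy composes with any source.

```java
// Simplified sketch of the proposed ContentSource / DocMaker split.
// ContentSource: where the raw data comes from (Enwiki, TREC, Reuters, line file).
// DocMakerSketch: how a document is built from that data (real code would use DocState).
class DocMakerSketch {
    interface ContentSource {
        /** Returns the next raw document body, or null when the collection is exhausted. */
        String next();
    }

    private final ContentSource source;

    DocMakerSketch(ContentSource source) {
        this.source = source;
    }

    /** Stands in for makeDocument(): one doc-building strategy, any source. */
    String makeDocument() {
        String raw = source.next();
        return raw == null ? null : "doc[" + raw + "]";
    }
}
```

With this shape, benchmarking a bzip source against a gzip source is just swapping the ContentSource while the DocMaker stays fixed, which is exactly the perf-testing opportunity the description mentions.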
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719558#action_12719558 ] Mark Miller commented on LUCENE-1518: - This issue is marked as part of LUCENE-1345, which has been pushed to 3.1. Also, it has not yet found an assignee. Speak out, or I will push this to 3.1. Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all the methods of ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do - he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used at all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly and the weight is useless (as it is currently, too). The constant-score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()).
For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects if a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API, without taking on the burden of the ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter:
- User runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents.
- User runs Searcher.search() using a Query as the only parameter: no change, all is the same as before.
- User runs Searcher.search() using a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything is as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant-score query. If the BooleanQuery contains clauses with Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters.
For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries - he can just combine clauses. The scorer/weight logic then identifies the cases where to use the filter or the query weight API, just like the query optimizer of an RDB. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
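Uwe's core idea - a Filter that is also usable as a constant-score query - can be reduced to a tiny sketch (hypothetical class, greatly simplified from the actual patch): the filtering contract stays abstract, and a default constant-score method kicks in only when the filter runs stand-alone as a query.

```java
// Simplified sketch of "Filter extends Query": subclasses implement only the
// matching test; the constant-score logic (score 1.0 for matches) is inherited
// and used only when the filter is passed where a Query is expected.
abstract class FilterQuerySketch {
    /** The filter contract: does this doc match? (stands in for getDocIdSet()/bits()) */
    public abstract boolean accept(int doc);

    /** Constant-score default, used when the filter runs as a stand-alone query. */
    public float score(int doc) {
        return accept(doc) ? 1.0f : 0.0f;
    }
}
```

When the API wants a plain Filter, only accept() is consulted and the scoring path is never touched, mirroring the "weight is useless" case in the description.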
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719561#action_12719561 ] Mark Miller commented on LUCENE-1313: - What's the verdict on this one, Mike? Got the impression this was a likely 3.1 ... Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram. IW.optimize and IW.commit flush the ramdir to the primary directory; all other operations try to keep segments in ram until there is no more space. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719565#action_12719565 ] Mark Miller commented on LUCENE-1595: - bq. Does this make sense? Okay, sounds good. Silence is consent around here, so I think we are good to go with this patch as soon as I go over it a bit. I'll wait till you post this last one. Split DocMaker into ContentSource and DocMaker -- Key: LUCENE-1595 URL: https://issues.apache.org/jira/browse/LUCENE-1595 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch This issue proposes some refactoring to the benchmark package. Today, DocMaker has two roles: collecting documents from a collection and preparing a Document object. These two should actually be split into ContentSource and DocMaker, where DocMaker will use a ContentSource instance. ContentSource will implement all the methods of DocMaker, like getNextDocData, raw size in bytes tracking, etc. This can actually fit well w/ 1591, by having a basic ContentSource that offers input stream services and wraps a file (for example) with bzip or gzip streams etc. DocMaker will implement the makeDocument methods, reusing DocState etc. The idea is that collecting the Enwiki documents, for example, should be the same whether I create documents using DocState, add payloads or index additional metadata. Same goes for Trec and Reuters collections, as well as LineDocMaker. In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 99% the same and 99% different. Most of their differences lie in the way they read the data, while most of the similarity lies in the way they create documents (using DocState). That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker (just the reuse of DocState). 
Also, other DocMakers do not use that DocState today, something they could have gotten for free with this proposed refactoring. So by having an EnwikiContentSource, a ReutersContentSource and others (TREC, Line, Simple), I can write several DocMakers, such as DocStateMaker, ConfigurableDocMaker (one which accepts all kinds of config options) and custom DocMakers (payload, facets, sorting), passing them a ContentSource instance, and reuse the same DocMaking algorithm with many content sources, as well as the same ContentSource algorithm with many DocMaker implementations. This will also give us the opportunity to perf test content sources alone (i.e., compare bzip, gzip and regular input streams), w/o the overhead of creating a Document object. I've already done so in my code environment (I extend the benchmark package for my application's purposes) and I like the flexibility I have. I think this can be a nice contribution to the benchmark package, which can result in some code cleanup as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
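The proposed split can be sketched as two small interfaces: a ContentSource that only fetches raw DocData, and a DocMaker that turns DocData into an index document, so any maker can be paired with any source. The interfaces and class names below are illustrative stand-ins, not the committed benchmark code.

```java
// Rough sketch of the proposed ContentSource/DocMaker split. A ContentSource
// only knows how to fetch raw DocData; a DocMaker turns DocData into a
// document. Names are illustrative; the real benchmark classes differ.
import java.util.Arrays;
import java.util.Iterator;

class BenchmarkSketch {
    static class DocData {                 // raw content, independent of indexing
        final String title, body;
        DocData(String title, String body) { this.title = title; this.body = body; }
    }

    interface ContentSource {              // e.g. Enwiki, Trec, Reuters, Line...
        DocData getNextDocData();          // null when exhausted
    }

    interface DocMaker {                   // e.g. plain, payload, faceted...
        String makeDocument(ContentSource source);
    }

    // A line-oriented source: "title<TAB>body" per line.
    static class LineContentSource implements ContentSource {
        private final Iterator<String> lines;
        LineContentSource(String... lines) { this.lines = Arrays.asList(lines).iterator(); }
        public DocData getNextDocData() {
            if (!lines.hasNext()) return null;
            String[] parts = lines.next().split("\t", 2);
            return new DocData(parts[0], parts[1]);
        }
    }

    static class SimpleDocMaker implements DocMaker {
        public String makeDocument(ContentSource source) {
            DocData d = source.getNextDocData();
            return d == null ? null : "doc[title=" + d.title + ", body=" + d.body + "]";
        }
    }

    public static void main(String[] args) {
        ContentSource src = new LineContentSource("T1\tfirst body", "T2\tsecond body");
        DocMaker maker = new SimpleDocMaker();   // same maker works with any source
        System.out.println(maker.makeDocument(src));  // prints "doc[title=T1, body=first body]"
    }
}
```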
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719567#action_12719567 ] Uwe Schindler commented on LUCENE-1518: --- Push to 3.1! -- Uwe Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters so that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used at all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly and the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic so that it detects whether a BooleanClause is a Filter (using instanceof) and then uses the Filter API directly, without taking on the burden of the ConstantScoreQuery (see LUCENE-1345). 
Here are some ideas on how to implement Searcher.search() with Query and Filter: - The user runs Searcher.search() with a Filter as the only parameter: as every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - The user runs Searcher.search() with a Query as the only parameter: no change, everything works as before. - The user runs Searcher.search() with a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything works as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant score query. If the BooleanQuery contains clauses with both Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters. For the user this has the main advantage that he can construct his query with a simplified API without thinking about Filters or Queries; he can just combine clauses. The scorer/weight logic then identifies the cases in which to use the filter or the query weight API, just like the query optimizer of an RDBMS. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reassigned LUCENE-1606: - Assignee: Uwe Schindler Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable). Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640ms. Some use cases I envision: 1. lexicography/etc. on large text corpora 2. looking for things such as URLs where the prefix is not constant (http:// or ftp://) The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter enumerates terms in a special way, by using the underlying state machine. Here is my short description from the comments: The algorithm here is pretty basic. Enumerate terms, but instead of a binary accept/reject: 1. Look at the portion that is OK (did not enter a reject state in the DFA) 2. Generate the next possible String and seek to that. The Query simply wraps the filter with ConstantScoreQuery. I did not include the automaton.jar inside the patch, but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
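The enumeration trick described in the issue can be illustrated with a self-contained toy: run a term through a DFA, keep the longest prefix that has not hit a dead state, and derive the next candidate string to seek to in the term dictionary. The tiny hard-coded DFA below accepts "ab" followed by anything (regex ab.*); it stands in for the BRICS-generated automaton, and the method names are invented for the sketch.

```java
// Toy illustration of DFA-driven term enumeration: instead of testing every
// term with a binary accept/reject, find the longest OK prefix and jump the
// enumeration past the whole doomed range of terms.
class DfaSeekSketch {
    static final int DEAD = -1;

    // Hard-coded DFA for ab.* ; states: 0 start, 1 saw 'a', 2 accept (loops).
    static int step(int state, char c) {
        if (state == 0) return c == 'a' ? 1 : DEAD;
        if (state == 1) return c == 'b' ? 2 : DEAD;
        if (state == 2) return 2;
        return DEAD;
    }

    // Length of the longest prefix of term that does not enter the dead state.
    static int okPrefixLen(String term) {
        int state = 0;
        for (int i = 0; i < term.length(); i++) {
            state = step(state, term.charAt(i));
            if (state == DEAD) return i;
        }
        return term.length();
    }

    // "Generate the next possible String and seek to that": keep the OK prefix
    // and increment the first rejected character, skipping every term between.
    static String nextSeekTerm(String rejected) {
        int ok = okPrefixLen(rejected);
        if (ok == rejected.length()) return rejected;   // term did not fail
        char bumped = (char) (rejected.charAt(ok) + 1);
        return rejected.substring(0, ok) + bumped;
    }

    public static void main(String[] args) {
        // "ac..." can never match ab.*, so instead of scanning every "ac*" term
        // the enumeration seeks straight to "ad".
        System.out.println(nextSeekTerm("account"));    // prints "ad"
    }
}
```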
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719570#action_12719570 ] Uwe Schindler commented on LUCENE-1606: --- I'll take it; I think it is almost finished. The only open problem at the moment is bundling the external library in contrib: it is BSD-licensed, are there any problems with that? If not, I can manage the inclusion into the regex contrib. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1518: Fix Version/s: (was: 2.9) 3.1 Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 3.1 Attachments: LUCENE-1518.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719563#action_12719563 ] Shai Erera commented on LUCENE-1595: OK, I agree. I've already documented CHANGES. I'll add to PerfTask a deprecated method checkObsoleteSettings which will throw an exception if it finds doc.add.log.step or doc.delete.log.step. doc.maker is still a valid setting, but when you try to cast the argument to a DocMaker, you'll get an exception, because it's now a concrete class and not an interface. Does this make sense? I'll post a patch soon. Split DocMaker into ContentSource and DocMaker -- Key: LUCENE-1595 URL: https://issues.apache.org/jira/browse/LUCENE-1595 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
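The checkObsoleteSettings idea discussed above can be sketched as a fail-fast scan of the benchmark config for retired property names. The property names come from the discussion; the class name, method body, and error message are illustrative assumptions, not the committed benchmark code.

```java
// Hedged sketch of a checkObsoleteSettings-style guard: reject retired
// benchmark properties with a clear error instead of silently ignoring them.
// Only the property names are from the issue; the rest is illustrative.
import java.util.Properties;

class ObsoleteSettingsSketch {
    private static final String[] OBSOLETE = { "doc.add.log.step", "doc.delete.log.step" };

    static void checkObsoleteSettings(Properties config) {
        for (String key : OBSOLETE) {
            if (config.containsKey(key)) {
                throw new IllegalArgumentException(key + " is no longer supported");
            }
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("doc.add.log.step", "1000");   // an obsolete setting
        try {
            checkObsoleteSettings(config);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```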
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719571#action_12719571 ] Mark Miller commented on LUCENE-1606: - I don't think there is a problem with BSD. I know Grant has committed a BSD-licensed stop word list in the past. I've asked explicitly about it before, but got no response. I'll try and dig a little, but Grant is the PMC head and he did it, so we wouldn't be in bad company... Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker
[ https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1595: --- Attachment: LUCENE-1595.patch The patch adds a checkObsoleteSettings method to PerfTask to alert on the use of doc.add.log.step and doc.delete.log.step, as well as documentation in CHANGES. All benchmark tests pass. Split DocMaker into ContentSource and DocMaker -- Key: LUCENE-1595 URL: https://issues.apache.org/jira/browse/LUCENE-1595 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Shai Erera Assignee: Mark Miller Fix For: 2.9 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1313: --- Fix Version/s: (was: 2.9) 3.1 OK, let's push it to 3.1. It's very much in progress, but 1) the iterations are slow (it's a big patch), and 2) it's a biggish change, so I'd prefer to do it shortly after a release, not shortly before, so it has plenty of time to bake on trunk. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/j
Super, thanks Uwe! Mike On Mon, Jun 15, 2009 at 8:46 AM, Uwe Schindleru...@thetaphi.de wrote: Committed into general site docs (developer-resources) and into trunk's docs (large patch, because navigation changed). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, June 15, 2009 11:40 AM To: java-dev@lucene.apache.org Subject: Re: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/java/org/apache/lucene/ contrib/remote/src/java/org/a On Mon, Jun 15, 2009 at 3:41 AM, Uwe Schindleru...@thetaphi.de wrote: Hi Mike, after adding a new contrib, I think we should also add this to the site docs and also the javadocs generation in the main build.xml. Woops, you're right. Should I prepare this? I have done this for spatial and trie in the past, too. Yes please? Thanks! Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719602#action_12719602 ] Uwe Schindler commented on LUCENE-1606: --- Robert: I applied the patch locally; one test was still using @Override, and I fixed that. I only downloaded automaton.jar, not the source package. Do you know if automaton.jar is compiled using -source 1.4 -target 1.4 (it was compiled using Ant 1.7 and Java 1.6)? If you are not sure, I will try to build it again from source with the correct compiler switches. The regex contrib module has been Java 1.4 until now; if automaton only works with 1.5, we should wait until 3.0 to release it. Automaton Query/Filter (scalable regex) --- Key: LUCENE-1606 URL: https://issues.apache.org/jira/browse/LUCENE-1606 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Robert Muir Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
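The bytecode-version question above can be checked directly: every .class file starts with the magic number 0xCAFEBABE followed by a 2-byte minor and 2-byte major version, where major 48 means a Java 1.4 target and 49 means Java 5. A minimal sketch; it synthesizes a header in memory rather than opening a real jar entry, and the class name is invented for the example.

```java
// Read the classfile major version, as a way to answer "was this jar built
// with -target 1.4?" (major 48 = 1.4, 49 = 5.0, 50 = 6).
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

class ClassVersionSketch {
    static int majorVersion(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            if (data.readInt() != 0xCAFEBABE)          // classfile magic number
                throw new IllegalArgumentException("not a class file");
            data.readUnsignedShort();                  // minor version, ignored here
            return data.readUnsignedShort();           // major version
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Header of a class compiled with -target 1.4: CAFEBABE, minor 0, major 48.
        byte[] header = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 48 };
        System.out.println(majorVersion(new ByteArrayInputStream(header)));  // prints 48
    }
}
```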
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719605#action_12719605 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Uwe, you are correct. I just took a glance at the automaton source code and saw StringBuilder, so I think it is safe to say it only works with 1.5...

> Automaton Query/Filter (scalable regex)
> ---------------------------------------
>
>                 Key: LUCENE-1606
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Robert Muir
>            Assignee: Uwe Schindler
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch
>
> Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable).
> While the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640 ms.
> Some use cases I envision:
> 1. lexicography/etc. on large text corpora
> 2. looking for things such as URLs where the prefix is not constant (http:// or ftp://)
> The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then the filter enumerates terms in a special way, using the underlying state machine. Here is my short description from the comments:
>   The algorithm here is pretty basic. Enumerate terms, but instead of a binary accept/reject do:
>   1. Look at the portion that is OK (did not enter a reject state in the DFA)
>   2. Generate the next possible String and seek to that.
> The Query simply wraps the filter with ConstantScoreQuery.
> I did not include the automaton.jar inside the patch, but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
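The enumeration idea from the description above can be sketched with a toy, hand-coded DFA standing in for the BRICS-generated one. All names here are illustrative, not the patch's actual classes; the point is the seek: when a term is rejected at character position `ok`, every term sharing the rejected prefix is also rejected, so the enumeration jumps past that whole range instead of testing each term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy sketch of DFA-guided term enumeration. The DFA below hand-codes the
// regex "ab*c"; the real patch builds the DFA from a regex via BRICS.
public class AutomatonEnumSketch {

    // Transition function: -1 is the reject state.
    static int step(int state, char ch) {
        switch (state) {
            case 0:  return ch == 'a' ? 1 : -1;
            case 1:  return ch == 'b' ? 1 : (ch == 'c' ? 2 : -1);
            default: return -1; // accept state 2 has no outgoing edges
        }
    }

    static boolean matches(String s) {
        int state = 0;
        for (int i = 0; i < s.length() && state >= 0; i++) {
            state = step(state, s.charAt(i));
        }
        return state == 2;
    }

    // Length of the longest prefix of s that never enters the reject state.
    static int okPrefixLen(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = step(state, s.charAt(i));
            if (state < 0) return i;
        }
        return s.length();
    }

    static List<String> enumerate(TreeSet<String> terms) {
        List<String> hits = new ArrayList<>();
        String t = terms.isEmpty() ? null : terms.first();
        while (t != null) {
            if (matches(t)) {
                hits.add(t);
                t = terms.higher(t);     // plain next term
            } else {
                int ok = okPrefixLen(t);
                if (ok == t.length()) {  // viable prefix, just not accepted yet
                    t = terms.higher(t);
                } else {
                    // All terms starting with the rejected prefix t[0..ok] are
                    // also rejected; seek past that whole range by bumping the
                    // first offending character. (A real implementation would
                    // consult the DFA's transitions to seek even further.)
                    String seekTo = t.substring(0, ok) + (char) (t.charAt(ok) + 1);
                    t = terms.ceiling(seekTo);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "aab", "abbc", "abc", "abzq", "ac", "bc"));
        System.out.println(enumerate(terms)); // [abbc, abc, ac]
    }
}
```

On the six-term dictionary above, only three terms are actually tested against the DFA as full strings before being skipped or matched; on a 100M-term index with no constant prefix, that skipping is where the reported 2-minutes-to-640ms difference would come from.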
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719606#action_12719606 ]

Uwe Schindler commented on LUCENE-1606:
---------------------------------------

Doesn't seem to work, I will check the sources:
{code}
compile-core:
    [javac] Compiling 12 source files to C:\Projects\lucene\trunk\build\contrib\regex\classes\java
    [javac] C:\Projects\lucene\trunk\contrib\regex\src\java\org\apache\lucene\search\regex\AutomatonFuzzyQuery.java:11: cannot access dk.brics.automaton.Automaton
    [javac] bad class file: C:\Projects\lucene\trunk\contrib\regex\lib\automaton.jar(dk/brics/automaton/Automaton.class)
    [javac] class file has wrong version 49.0, should be 48.0
    [javac] Please remove or make sure it appears in the correct subdirectory of the classpath.
    [javac] import dk.brics.automaton.Automaton;
    [javac]        ^
    [javac] 1 error
{code}
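The "class file has wrong version 49.0, should be 48.0" error above means the classes inside automaton.jar were compiled for Java 5 (class-file major version 49) while this build targets JDK 1.4 (major version 48). The offending version number lives in the class file header, so it can be checked directly; a minimal sketch (class and method names here are illustrative, not part of the Lucene build):

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Reads the class-file major version that javac is complaining about:
// bytes 0-3 are the 0xCAFEBABE magic, bytes 4-5 the minor version, and
// bytes 6-7 the major version (48 = JDK 1.4, 49 = Java 5, 50 = Java 6).
public class ClassVersionCheck {

    static int majorVersion(byte[] header) {
        if (header.length < 8
                || (header[0] & 0xFF) != 0xCA || (header[1] & 0xFF) != 0xFE
                || (header[2] & 0xFF) != 0xBA || (header[3] & 0xFF) != 0xBE) {
            throw new IllegalArgumentException("not a class file");
        }
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ClassVersionCheck path/to/Some.class
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        byte[] header = new byte[8];
        in.readFully(header);
        in.close();
        System.out.println(args[0] + ": major version " + majorVersion(header));
    }
}
```

The fix Robert mentions below — rebuilding automaton.jar with `javac -source 1.4 -target 1.4` (or for Java 5, `-target 1.5`) — would change exactly these two bytes.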
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719607#action_12719607 ]

Uwe Schindler commented on LUCENE-1606:
---------------------------------------

So I tend to move this to 3.0 or 3.1, because of the missing support in the regex contrib.
Core JDK 1.4 compatible.
By the way: I compiled the core and corresponding tests with an old JDK 1.4 version I found locally on my machine. Works fine!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
Re: Core JDK 1.4 compatible.
:) But those days are numbered!

Mike

On Mon, Jun 15, 2009 at 11:55 AM, Uwe Schindler <u...@thetaphi.de> wrote:
> By the way: I compiled the core and corresponding tests with an old JDK 1.4 version I found locally on my machine. Works fine!
>
> Uwe
Re: Core JDK 1.4 compatible.
It would help if we have a target date, then I'll know how many more X's I need to mark on the calendar :)

On Mon, Jun 15, 2009 at 6:56 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> :) But those days are numbered!
>
> Mike
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719612#action_12719612 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Uwe, sorry about this. I did just verify that automaton.jar can be compiled for Java 5 (at least it does not have Java 1.6 dependencies), so perhaps this can be integrated for a later release.
[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-1606:
----------------------------------

    Fix Version/s:     (was: 2.9)
                   3.0

I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1). Robert: you could supply a patch with StringBuilder toString() variants and all those @Override annotations uncommented. And it works correctly with 1.5 (I am working with 1.5 here locally - I hate 1.6...).
[jira] Commented: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719619#action_12719619 ]

Mark Miller commented on LUCENE-1599:
-------------------------------------

Something is modifying the original query itself. In MultiSearcher.rewrite:

{code}
public Query rewrite(Query original) throws IOException {
  Query[] queries = new Query[searchables.length];
  for (int i = 0; i < searchables.length; i++) {
    queries[i] = searchables[i].rewrite(original);
  }
  return queries[0].combine(queries);
}
{code}

On the first time through the loop, the SpanRegexQuery will contain the regex pattern, but the first time it hits rewrite, it will be changed to the expanded query. This shouldn't happen. On the next time through the loop, the original query will not contain a regex pattern, but will instead be the first iteration's rewritten query. Oddness. I'll dig in and try to fix for 2.9.

> SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
> ------------------------------------------------------------------
>
>                 Key: LUCENE-1599
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1599
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.4.1
>         Environment: lucene-core 2.4.1, lucene-regex 2.4.1
>            Reporter: Billow Gao
>             Fix For: 2.9
>         Attachments: TestSpanRegexBug.java
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> MultiSearcher is using:
>   queries[i] = searchables[i].rewrite(original);
> to rewrite the query and then uses combine to combine them.
> But SpanRegexQuery's rewrite is different from the others: after you call it on the same query, it always returns the same rewritten query. As a result, only the search on the first IndexSearcher works; all others are using the first IndexSearcher's rewritten query. So many terms are missing and unexpected results are returned.
> Billow
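The contract Mark describes — rewrite() must return a rewritten copy and leave the original query untouched, because MultiSearcher rewrites the same original against every sub-searcher — can be illustrated with toy stand-ins for the real Lucene classes (PatternQuery and the prefix expansion below are invented for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (NOT the real Lucene classes) of the LUCENE-1599 bug:
// a rewrite() that expands the query in place hands the first index's
// expansion to every subsequent searcher in MultiSearcher's loop.
public class RewriteContractSketch {

    public static class PatternQuery {
        final String prefix;      // stands in for the regex pattern
        List<String> expanded;    // terms after rewriting against one index

        PatternQuery(String prefix) { this.prefix = prefix; }

        // BUGGY: expands in place; a second rewrite sees the first expansion.
        PatternQuery rewriteInPlace(List<String> indexTerms) {
            if (expanded == null) {
                expanded = expand(indexTerms);
            }
            return this;
        }

        // FIXED: leaves the original untouched and returns a new query.
        PatternQuery rewriteCopy(List<String> indexTerms) {
            PatternQuery q = new PatternQuery(prefix);
            q.expanded = expand(indexTerms);
            return q;
        }

        private List<String> expand(List<String> indexTerms) {
            List<String> hits = new ArrayList<>();
            for (String t : indexTerms) {
                if (t.startsWith(prefix)) hits.add(t);
            }
            return hits;
        }
    }
}
```

With the buggy variant, rewriting the same query first against an index containing "abc" and then against one containing "abd" yields "abc" both times — exactly the missing-terms symptom Billow reported.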
[jira] Assigned: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-1599:
-----------------------------------

    Assignee: Mark Miller
[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)
[ https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719623#action_12719623 ]

Robert Muir commented on LUCENE-1606:
-------------------------------------

Uwe, ok. Not to complicate things, but related to LUCENE-1689 and Java 1.5: I could easily modify the wildcard functionality here to work correctly with supplementary characters. This could be an alternative to fixing the WildcardQuery ? operator in core.
Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)
Why do you hate 1.6, Uwe?

Mike

On Mon, Jun 15, 2009 at 12:10 PM, Uwe Schindler (JIRA) <j...@apache.org> wrote:
> I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1). Robert: you could supply a patch with StringBuilder toString() variants and all those @Override uncommented-in. And it works correct with 1.5 (I am working with 1.5 here locally - I hate 1.6...).
Re: New Token API was Re: Payloads and TrieRangeQuery
I thought the primary goal of switching to AttributeSource (yes, the name is very generic...) was to allow extensibility in what's created per-Token, so that an app could add its own attrs without costly subclassing/casting per Token, independent of other things adding their tokens, etc. E.g., trie* takes advantage of this extensibility by adding a ShiftAttribute. Subclassing Token in your app wasn't a good solution, for various reasons.

I do think the API is somewhat more cumbersome than before, and I don't like that about it (consumability!). But net/net I think the change is good, and it's one of the baby steps for flexible indexing (bullet #11):

http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

I.e., it addresses the flexibility during analysis. I don't think anything was held back in this effort.

Grant, are you referring to LUCENE-1458? That's held back simply because the only person working on it (me) got distracted by other things to work on. Flexible indexing (all of bullet #11) is a complex project, and we need to break it into baby steps like this one. We've already made good progress on it: you can already make custom attrs and a custom (but package-private) indexing chain if you want. Next step is pluggable codecs for writing index files (LUCENE-1458), and APIs for reading them (that generalize the Terms/TermDocs/TermPositions we have today).

Mike

On Sun, Jun 14, 2009 at 11:41 PM, Shai Erera <ser...@gmail.com> wrote:
> The old API is deprecated, and therefore when we release 2.9 there might be some people who'd think they should move away from it, to better prepare for 3.0 (while in fact this may not be the case). Also, we should make sure that when we remove all the deprecations, this will still exist (and therefore, why deprecate it now?), if we think this should indeed be kept around for at least a while longer.
I personally am all for keeping it around (it will save me a huge refactoring of an Analyzer package I wrote), but I have to admit it's only because I've got quite comfortable with the existing API, and did not have the time to try the new one yet. Shai On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller markrmil...@gmail.com wrote: Mark Miller wrote: I don't know how I feel about rolling the new token api back. I will say that I originally had no issue with it because I am very excited about Lucene-1458. At the same time though, I'm thinking Lucene-1458 is a very advanced issue that will likely be for really expert usage (though I can see benefits falling to general users). I'm slightly iffy about making an intuitive api much less intuitive for an expert future feature that hasn't fully materialized in Lucene yet. It almost seems like that fight should weigh towards general usage and standard users. I don't have a better proposal though, nor the time to consider it at the moment. I was just more curious if anyone else had any thoughts. I hadn't realized Grant had asked a similar question not long ago with no response. Not sure how to take that, but I'd think that would indicate fewer problems with people than more. On the other hand, you don't have to switch yet (with trunk) and we have yet to release it. I wonder how many non-dev, every day users have really had to tussle with the new API yet. Not many people complaining too loudly at the moment though. Asking for a roll back seems a bit extreme without a little more support behind it than we have seen. - Mark PS I know you didn't ask for a rollback Grant - just kind of talking in a general manner. I see your point on getting the search side in, I'm just not sure I agree that it really matters if one hits before the other. Like Mike says, you don't have to switch to the new API yet. 
-- - Mark http://www.lucidimagination.com
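The extensibility Mike describes above - attribute instances keyed by type and shared across the whole chain - can be modeled in a few lines. This is a toy stand-in for illustration only, not the real org.apache.lucene.util.AttributeSource:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the AttributeSource idea: producers and consumers share
// per-stream attribute instances keyed by type, so an app can bolt on new
// attributes (like trie's ShiftAttribute) without subclassing Token or
// casting per token.
class ToyAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Returns the existing instance for this type, creating it on first use,
    // so every filter in the chain sees the same object.
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> clazz) {
        return (T) attributes.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        });
    }
}

class TermAttribute { String term; }      // stand-in for a standard attribute
class ShiftAttribute { int shift; }       // stand-in for an app-specific one

public class AttributeDemo {
    public static void main(String[] args) {
        ToyAttributeSource stream = new ToyAttributeSource();
        TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
        ShiftAttribute shiftAtt = stream.addAttribute(ShiftAttribute.class);
        // A "tokenizer" fills the attributes in place; no Token object per token.
        termAtt.term = "lucene";
        shiftAtt.shift = 8;
        // A downstream consumer asks for, and gets, the very same instance:
        System.out.println(stream.addAttribute(TermAttribute.class) == termAtt); // prints true
    }
}
```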
[jira] Assigned: (LUCENE-1650) Small fix in CustomScoreQuery JavaDoc
[ https://issues.apache.org/jira/browse/LUCENE-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned LUCENE-1650: --- Assignee: Mark Miller Small fix in CustomScoreQuery JavaDoc - Key: LUCENE-1650 URL: https://issues.apache.org/jira/browse/LUCENE-1650 Project: Lucene - Java Issue Type: Improvement Components: Javadocs Affects Versions: 2.9, 3.0 Reporter: Simon Willnauer Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: customScoreQuery_CodeChange+JavaDoc.patch, customScoreQuery_JavaDoc.patch I have fixed the javadoc for the Modified Score formula in CustomScoreQuery. - Patch attached: customScoreQuery_JavaDoc.patch I'm quite curious why the method: public float customScore(int doc, float subQueryScore, float valSrcScores[]) calls public float customScore(int doc, float subQueryScore, float valSrcScore) only in 2 of the 3 cases, which makes the choice to override either one of the customScore methods dependent on the number of ValueSourceQuery passed to the constructor. I figure it would be more consistent if it would call the latter in all 3 cases. I also attached a patch which proposes a fix for that issue. The patch does also include the JavaDoc issue mentioned above. - customScoreQuery_CodeChange+JavaDoc.patch
[jira] Assigned: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards
[ https://issues.apache.org/jira/browse/LUCENE-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller reassigned LUCENE-1583: --- Assignee: Mark Miller I guess I'll do this one. You out there reading Paul Elschot? This look right to you? Any issues it might cause? Else I guess I'll have to put on my thinking cap and figure it out myself. SpanOrQuery skipTo() doesn't always move forwards - Key: LUCENE-1583 URL: https://issues.apache.org/jira/browse/LUCENE-1583 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1 Reporter: Moti Nisenson Assignee: Mark Miller Priority: Minor Fix For: 2.9 Attachments: LUCENE-1583.patch In SpanOrQuery the skipTo() method is improperly implemented if the target doc is less than or equal to the current doc, since skipTo() may not be called for any of the clauses' spans: public boolean skipTo(int target) throws IOException { if (queue == null) { return initSpanQueue(target); } while (queue.size() != 0 && top().doc() < target) { if (top().skipTo(target)) { queue.adjustTop(); } else { queue.pop(); } } return queue.size() != 0; } This violates the correct behavior (as described in the Spans interface documentation) that skipTo() should always move forwards; in other words, the correct implementation would be: public boolean skipTo(int target) throws IOException { if (queue == null) { return initSpanQueue(target); } boolean skipCalled = false; while (queue.size() != 0 && top().doc() < target) { if (top().skipTo(target)) { queue.adjustTop(); } else { queue.pop(); } skipCalled = true; } if (skipCalled) { return queue.size() != 0; } return next(); }
[jira] Commented: (LUCENE-1688) Deprecating StopAnalyzer ENGLISH_STOP_WORDS - General replacement with an immutable Set
[ https://issues.apache.org/jira/browse/LUCENE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719630#action_12719630 ] Mark Miller commented on LUCENE-1688: - If no one else claims this for 2.9, I guess I'll do it. Deprecating StopAnalyzer ENGLISH_STOP_WORDS - General replacement with an immutable Set --- Key: LUCENE-1688 URL: https://issues.apache.org/jira/browse/LUCENE-1688 Project: Lucene - Java Issue Type: Improvement Reporter: Simon Willnauer Priority: Minor Fix For: 2.9, 3.0 Attachments: StopWords.patch StopAnalyzer and StandardAnalyzer use the static final array ENGLISH_STOP_WORDS by default in various places. Internally this array is converted into a mutable set, which looks kind of weird to me. I think the way to go is to deprecate all use of the static final array and replace it with an immutable implementation of CharArraySet. Inside an analyzer it does not make sense to have a mutable set anyway, and we could prevent set creation each time an analyzer is created. In the case of an immutable set we won't have multithreading issues either. In essence we get rid of a fair bit of string-array-to-set conversion code, do not have a PUBLIC static reference to an array (which is mutable) and reduce the overhead of analyzer creation. Let me know what you think and I'll create a patch for it. Simon
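The proposal amounts to something like the following sketch. Plain java.util collections are used here to keep the example self-contained; the actual patch would use an immutable CharArraySet as described above:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch of the idea: build the default stop set once as an immutable Set,
// instead of exposing a public mutable String[] and converting it to a new
// mutable set on every Analyzer construction. (Stop-word list abbreviated.)
public final class StopWords {
    public static final Set<String> ENGLISH_STOP_WORDS_SET =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by")));

    public static void main(String[] args) {
        System.out.println(ENGLISH_STOP_WORDS_SET.contains("and")); // prints true
        try {
            ENGLISH_STOP_WORDS_SET.add("oops");                     // mutation is rejected
        } catch (UnsupportedOperationException e) {
            System.out.println("immutable");                        // prints immutable
        }
    }
}
```

Because the set can never change, a single instance is safely shared across analyzers and threads, which is exactly the multithreading point made above.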
[jira] Commented: (LUCENE-973) Token of returns in CJKTokenizer + new TestCJKTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719635#action_12719635 ] Mark Miller commented on LUCENE-973: You guys looking for this for 2.9? If so, any volunteers? If I assign myself any more, I won't likely get to them all. Token of returns in CJKTokenizer + new TestCJKTokenizer --- Key: LUCENE-973 URL: https://issues.apache.org/jira/browse/LUCENE-973 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.3 Reporter: Toru Matsuzawa Priority: Minor Fix For: 2.9 Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, LUCENE-973.patch, with-patch.jpg, without-patch.jpg A string is returned as a Token at the boundary between a two-byte character and a one-byte character. There is no problem in CJKAnalyzer, but it becomes a problem when CJKTokenizer is used on its own (for example, with Solr).
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote: I don't think anything was held back in this effort. Grant, are you referring to LUCENE-1458? That's held back simply because the only person working on it (me) got distracted by other things to work on. I'm sorry, I didn't mean to imply Michael B. was holding back on the work. The patch has always felt half done to me because what's the point of having all of these attributes in the index if you don't have any way of searching them; thus I was struck by the need to get it in prior to making it available in search. I realize it's complex, but here we are forcing people to upgrade for some future, long term goal.
[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1630: --- Attachment: LUCENE-1630.patch ok - let's start iterating on the patch. Anyone volunteer to accept it (and then I'll update CHANGES via ?)? Patch includes: * QueryWeight with the new scorer(IndexReader, scoreDocsInOrder, topScorer) and scoresOutOfOrder(). * Added methods to Searcher (this breaks back-compat, but it's already broken here because of 1575). * BooleanWeight now creates BS or BS2 up front, and therefore BS2's code is simplified. All tests pass. Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1630.patch This is a spin off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default to false, and override in BS to true. 
# Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activatable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. * Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. 
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight).
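The collector/scorer handshake proposed in the list above boils down to the following shape. All class names below are stand-ins for illustration, not the actual patch:

```java
// Toy sketch of the proposed handshake: the searcher asks the weight for a
// scorer that may deliver docs out of order, then picks a collector
// implementation based on what it actually got back.
interface Scorer2 { boolean isOutOfOrder(); }
interface Collector2 { boolean acceptsDocsOutOfOrder(); }

class InOrderTopScoreCollector implements Collector2 {
    public boolean acceptsDocsOutOfOrder() { return false; }
}

class OutOfOrderTopScoreCollector implements Collector2 {
    public boolean acceptsDocsOutOfOrder() { return true; }
}

public class OrdernessDemo {
    // Mirrors the proposed static create(...) on TFC/TSDC: match the
    // collector to the scorer's orderness.
    static Collector2 create(Scorer2 scorer) {
        return scorer.isOutOfOrder()
            ? new OutOfOrderTopScoreCollector()  // must tolerate out-of-order docs
            : new InOrderTopScoreCollector();    // can use cheaper in-order logic
    }

    public static void main(String[] args) {
        Scorer2 booleanScorer = () -> true;      // BS delivers docs out of order
        Scorer2 termScorer = () -> false;        // most scorers are in order
        System.out.println(create(booleanScorer).acceptsDocsOutOfOrder()); // prints true
        System.out.println(create(termScorer).acceptsDocsOutOfOrder());    // prints false
    }
}
```

The benefit is that an in-order collector can skip the bookkeeping needed to handle docs arriving behind the current competitive minimum, which is the optimization this issue is after.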
[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719639#action_12719639 ] Mark Miller commented on LUCENE-1486: - Should this go in contrib rather than core? That seems to have been the approach so far, any reason to vary it up here? Well, actually, looks like I see the multi field parser in core. Makes sense to put subclasses there I guess. You think this is ready to commit Mark? If so, I should be able to review it (unless you want to commit it yourself). Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax, e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches("\"j* smyth~\"", 1, 2); // wildcards and fuzzies are OK in phrases checkMatches("\"(jo* -john) smith\"", 2); // boolean logic works checkMatches("\"jo* smith\"~2", 1, 2, 3); // position logic works checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported Code plus Junit test to follow...
[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
[ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-1486: Attachment: LUCENE-1486.patch Reformatted to lucene formatting, removed author tag, removed a couple unused fields, changed to patch format. Tests don't pass because it doesn't work quite correctly with the new constantscore multi term queries yet. Wildcards, ORs etc inside Phrase queries Key: LUCENE-1486 URL: https://issues.apache.org/jira/browse/LUCENE-1486 Project: Lucene - Java Issue Type: Improvement Components: QueryParser Affects Versions: 2.4 Reporter: Mark Harwood Priority: Minor Fix For: 2.9 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, TestComplexPhraseQuery.java An extension to the default QueryParser that overrides the parsing of PhraseQueries to allow more complex syntax, e.g. wildcards in phrase queries. The implementation feels a little hacky - this is arguably better handled in QueryParser itself. This works as a proof of concept for much of the query parser syntax. Examples from the Junit test include: checkMatches("\"j* smyth~\"", 1, 2); // wildcards and fuzzies are OK in phrases checkMatches("\"(jo* -john) smith\"", 2); // boolean logic works checkMatches("\"jo* smith\"~2", 1, 2, 3); // position logic works checkBadQuery("\"jo* id:1 smith\""); // mixing fields in a phrase is bad checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases is bad checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries inside phrases not supported Code plus Junit test to follow...
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719653#action_12719653 ] Richard Marr commented on LUCENE-1690: -- Sounds reasonable although that'll take a little longer for me to do. I'll have a think about it. Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
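The opt-in cache described in this issue could be as simple as the following sketch. The names here are hypothetical (the real implementation is in the attached LUCENE-1690.patch); a tiny interface stands in for the IndexReader so the example is self-contained:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of memoizing docFreq lookups for the life of a MoreLikeThis-style
// object, so repeated queries skip the index round trip for terms already seen.
public class CachingTermStats {
    interface StatsSource { int docFreq(String term); }   // stand-in for IndexReader

    private final StatsSource reader;
    private final Map<String, Integer> cache = new HashMap<>();
    int misses = 0;                                       // exposed for the demo

    CachingTermStats(StatsSource reader) { this.reader = reader; }

    int docFreq(String term) {
        // Only consult the underlying reader on a cache miss.
        return cache.computeIfAbsent(term, t -> { misses++; return reader.docFreq(t); });
    }

    public static void main(String[] args) {
        CachingTermStats stats = new CachingTermStats(term -> term.length()); // fake reader
        stats.docFreq("lucene");
        stats.docFreq("lucene");      // served from the cache
        stats.docFreq("index");
        System.out.println(stats.misses); // prints 2
    }
}
```

The trade-off, as noted above, is staleness: cached statistics drift as the index changes, which is why the cache should be opt-in and scoped to the lifetime the application chooses.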
Re: New Token API was Re: Payloads and TrieRangeQuery
The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Yonik Seeley wrote: The high-level description of the new API looks good (being able to add arbitrary properties to tokens), unfortunately, I've never had the time to try and use it and give any constructive feedback. As far as difficulty of use, I assume this only applies to implementing your own TokenFilter? It seems like most standard users would be just stringing together existing TokenFilters to create custom Analyzers? -Yonik http://www.lucidimagination.com True - it's the implementation. And just trying to understand what's going on the first time you see it. It's not particularly difficult, but it's also not obvious like the previous API was. As a user, I would ask why that is so, and frankly the answer wouldn't do much for me (as a user). I don't know if most 'standard' users implement their own or not. I will say, and perhaps I was in a special situation, I was writing them and modifying them almost as soon as I started playing with Lucene. And even when I wasn't, I needed to understand the code to understand some of the complexities that could occur, and thankfully, that was breezy to do. Right now, if you told me to go convert all of Solr to the new API you would hear a mighty groan. As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. -- - Mark http://www.lucidimagination.com
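For readers who haven't seen the new API, the shape of the change for a filter author looks roughly like this. The classes below are simplified stand-ins for TokenStream/TokenFilter and TermAttribute so the example is self-contained, not the actual Lucene types:

```java
// Stand-in for TermAttribute: a mutable buffer shared along the chain.
class TermAtt { final StringBuilder buf = new StringBuilder(); }

interface TokenStream2 {
    boolean incrementToken();   // new style: advance in place, report if a token exists
    TermAtt termAttribute();
}

// A tokenizer that emits a fixed list of tokens by mutating its attribute.
class ListTokenizer implements TokenStream2 {
    private final String[] tokens;
    private int i = 0;
    private final TermAtt termAtt = new TermAtt();
    ListTokenizer(String... tokens) { this.tokens = tokens; }
    public TermAtt termAttribute() { return termAtt; }
    public boolean incrementToken() {
        if (i == tokens.length) return false;
        termAtt.buf.setLength(0);
        termAtt.buf.append(tokens[i++]);
        return true;
    }
}

// Under the old API a filter returned a Token object from next(); under the
// new style it mutates the shared attribute instance and returns a boolean.
class LowerCaseFilter2 implements TokenStream2 {
    private final TokenStream2 input;
    private final TermAtt termAtt;              // same instance as the input's
    LowerCaseFilter2(TokenStream2 input) {
        this.input = input;
        this.termAtt = input.termAttribute();
    }
    public TermAtt termAttribute() { return termAtt; }
    public boolean incrementToken() {
        if (!input.incrementToken()) return false;
        String lowered = termAtt.buf.toString().toLowerCase();
        termAtt.buf.setLength(0);
        termAtt.buf.append(lowered);
        return true;
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        TokenStream2 ts = new LowerCaseFilter2(new ListTokenizer("Hello", "WORLD"));
        while (ts.incrementToken()) System.out.println(ts.termAttribute().buf);
        // prints hello then world
    }
}
```

The unfamiliar part is exactly what the thread describes: no Token object flows through the chain; the filter and its input share one attribute instance and mutate it in place.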
Re: New Token API was Re: Payloads and TrieRangeQuery
As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. Hope this doesn't sound like a complaint, but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work, but if there are no tests, it's impossible to verify the conversion is correct. -- Robert Muir rcm...@gmail.com
[jira] Created: (LUCENE-1692) Contrib analyzers need tests
Contrib analyzers need tests Key: LUCENE-1692 URL: https://issues.apache.org/jira/browse/LUCENE-1692 Project: Lucene - Java Issue Type: Test Components: contrib/analyzers Reporter: Robert Muir The analyzers in contrib need tests, preferably ones that test the behavior of all the Token 'attributes' involved (offsets, type, etc) and not just what they do with token text. This way, they can be converted to the new api without breakage.
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719665#action_12719665 ] Jason Rutherglen commented on LUCENE-1313: -- Just wanted to give an update on this, I'm running the unit tests with flushToRAM=true, the ones that fail are (mostly) tests that look for files when they're now in RAM (temporarily) and the like. I'm not sure what to do with these tests, 1) ignore them (kind of hard to not run specific methods, I think) 2) or conditionalize them to run only if flushToRAM=false. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram. IW.optimize and IW.commit flush the ramdir to the primary directory, all other operations try to keep segments in ram until there is no more space. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
On Jun 14, 2009, at 8:05 PM, Michael Busch wrote: I'd be happy to discuss other API proposals that anybody brings up here, that have the same advantages and are more intuitive. We could also beef up the documentation and give a better example about how to convert a stream/filter from the old to the new API; a constructive suggestion that Uwe made at the ApacheCon. More questions: 1. What about Highlighter and MoreLikeThis? They have not been converted. Also, what are they going to do if the attributes they need are not available? Caveat emptor? 2. Same for TermVectors. What if the user specifies with positions and offsets, but the analyzer doesn't produce them? Caveat emptor? (BTW, this is also true for the new omit TF stuff) 3. Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. Also, can we add something like an AttributeTermQuery? Seems like it could work similar to the BoostingTermQuery. I'm sure more will come to me. -Grant
[jira] Updated: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1673: -- Attachment: LUCENE-1673.patch Updated patch: - now with extended JavaDocs - additional tests for float/doubles - additional tests for equals/hashcode - changes.txt - lot of reformatting The only open point is the name of TrieUtils, any idea for package and/or name? Changes to FieldCache and SortField to always require a parser (see discussion with Yonik), which is a new case to be opened after this. Move TrieRange to core -- Key: LUCENE-1673 URL: https://issues.apache.org/jira/browse/LUCENE-1673 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1673.patch, LUCENE-1673.patch TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940) and if possible I want to move it to core before release of 2.9. Before this can be done, there are some things to think about: # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how should they be called in core? I would suggest to leave it as it is. On the other hand, if this keeps our only numeric query implementation, we could call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here are problems). Same for the TokenStreams and Filters. # Maybe the pairs of classes for indexing and searching should be moved into one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to pass int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add a L to a long, it suddenly instantiates an int version of range query, hitting no results and so on. Same with other types. 
Maybe accept java.lang.Number as parameter (because nullable for half-open bounds) and one enum for the type. # TrieUtils move into o.a.l.util? or document or? # Move TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else? # If we rename the classes, should Solr stay with Trie (because there are different impls)? # Maybe add a subclass of AbstractField, that automatically creates these TokenStreams and omits norms/tf per default for easier addition to Document instances?
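For context, the trie idea these classes implement can be shown in miniature. This is an illustrative sketch with made-up term formatting, not the contrib TrieUtils code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy version of the trie encoding behind TrieRange: index each numeric value
// several times at decreasing precision by shifting off low-order bits, so a
// range query can cover big chunks of the range with a few coarse terms.
public class TrieEncodeDemo {
    // One indexed term per precision step; the shift amount is prepended so
    // terms from different precision levels cannot collide.
    static List<String> trieTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 0xCAFE = 51966; with an 8-bit step the first three terms are:
        System.out.println(trieTerms(0xCAFEL, 8).subList(0, 3));
        // prints [0:51966, 8:202, 16:0]
    }
}
```

The space cost (64/precisionStep terms per value here) is the price paid for turning a huge term-range scan into a handful of term lookups at query time.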
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes down to a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I created an issue for this. I just think, you know, converting an analyzer to the new api is really not that bad. Reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if it's gonna be a waste of time at the end of the day... -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719689#action_12719689 ] Michael McCandless commented on LUCENE-1673: bq. So one using new code must always specify the parser when using SortField.INT (SortField.AUTO is already deprecated so no problem). This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)? Other questions/comments: * Could we change ShiftAttribute -> NumericShiftAttribute? * How about oal.util.NumericUtils instead of TrieUtils? * Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order? * Should we support byte/short for trie indexed fields as well? (Since SortField, FieldCache support these numeric types too...). Move TrieRange to core -- Key: LUCENE-1673 URL: https://issues.apache.org/jira/browse/LUCENE-1673 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1673.patch, LUCENE-1673.patch TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940) and if possible I want to move it to core before release of 2.9. Before this can be done, there are some things to think about: # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how should they be called in core? I would suggest leaving it as it is. 
On the other hand, if this keeps our only numeric query implementation, we could call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below; there are problems here). Same for the TokenStreams and Filters. # Maybe the pairs of classes for indexing and searching should be moved into one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to pass int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add an L to a long, it suddenly instantiates an int version of the range query, hitting no results and so on. Same with the other types. Maybe accept java.lang.Number as parameter (because it is nullable for half-open bounds) and one enum for the type. # TrieUtils: move into o.a.l.util? or document or? # Move TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else? # If we rename the classes, should Solr stay with Trie (because there are different impls)? # Maybe add a subclass of AbstractField, that automatically creates these TokenStreams and omits norms/tf per default for easier addition to Document instances? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
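The "forgets to add a L" worry above is ordinary Java overload resolution: without the suffix, the literal is an int and the int constructor wins silently. A minimal sketch of that pitfall, with RangeSketch/newRange as made-up stand-ins for the per-type query constructors (not Lucene API):

```java
// Illustrative only: RangeSketch/newRange are hypothetical names standing
// in for the per-type range-query constructors discussed above.
public class RangeSketch {
    static String newRange(int lower, int upper)   { return "int";  }
    static String newRange(long lower, long upper) { return "long"; }

    public static void main(String[] args) {
        // Without the L suffix, overload resolution quietly picks the int variant:
        System.out.println(newRange(42, 100));   // int
        System.out.println(newRange(42L, 100L)); // long
    }
}
```

This is exactly why the issue proposes a single class taking java.lang.Number plus a type enum: the type is then stated explicitly instead of being inferred from the literal.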
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719692#action_12719692 ] Michael McCandless commented on LUCENE-1673: bq. The only open point is the name of TrieUtils, any idea for package and/or name? I think NumericUtils? (I'd like the naming to be consistent w/ NumericRangeQuery, NumericTokenStream, since it's very much a public API, ie users must interact directly with it to get their SortField (maybe) and FieldCache parser). Leaving it in util seems OK, since it's used by both analysis and searching.
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719699#action_12719699 ] Yonik Seeley commented on LUCENE-1673: -- bq. This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. Allow passing parser==null to get the default? bq. We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? A factory method TrieUtils.getSortField() could also return the right SortField.
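Yonik's factory idea could look roughly like the sketch below; getSortField/parserFor and the parser names here are assumptions for illustration only, not committed Lucene API:

```java
// Hypothetical sketch of a type-dispatching SortField factory;
// none of these names are committed Lucene API.
public class SortFactory {
    enum Type { INT, LONG, FLOAT, DOUBLE }

    // Stand-in for picking the matching trie-aware FieldCache parser:
    static String parserFor(Type t) {
        switch (t) {
            case INT:   return "TRIE_INT_PARSER";
            case LONG:  return "TRIE_LONG_PARSER";
            case FLOAT: return "TRIE_FLOAT_PARSER";
            default:    return "TRIE_DOUBLE_PARSER";
        }
    }

    // The factory hides the parser choice from the caller entirely:
    static String getSortField(String field, Type t) {
        return field + ":" + parserFor(t);
    }
}
```

The design point is that callers never see (or forget) the parser: they name the numeric type once and the factory wires the rest.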
Re: New Token API was Re: Payloads and TrieRangeQuery
Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know what's up, it's not too much different. I just have found myself having to re-figure out what's up (a short task, to be sure) all over again after I leave it for a while. With the old one, everything was just kind of immediately self-evident. That makes me think new users might be a little more confused when they first meet it. I'm not a new user though, so it's only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if it's gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldn't let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller <markrmil...@gmail.com> wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good. but in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. 
We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
RE: New Token API was Re: Payloads and TrieRangeQuery
Also, what about the case where one might have attributes that are meant for downstream TokenFilters, but not necessarily for indexing? Offsets and type come to mind. Is it the case now that those attributes are not automatically added to the index? If they are ignored now, what if I want to add them? I admit, I'm having a hard time finding the code that specifically loops over the Attributes. I recall seeing it, but can no longer find it. There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix-encoded value during trie indexing. The idea is to e.g. have TokenFilters that may add additional payloads or other attributes to trie values, but only do this for specific precisions. In future, it may also be interesting to automatically add this attribute to the index. Maybe we should add a read/store method to attributes, that adds an attribute to the Posting using an IndexOutput/IndexInput (like the serialization methods). Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler <u...@thetaphi.de> wrote: There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix-encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719726#action_12719726 ] Uwe Schindler commented on LUCENE-1673: --- {quote} Mike: This will apply to int/long/float/double as well right? How would you do this (require a parser for only numeric sorts) back-compatibly? EG, the others (String, DOC, etc.) don't require a parser. Yonik: Allow passing parser==null to get the default? Mike: We could alternatively make NumericSortField (subclassing SortField), that just uses the right parser? Yonik: A factory method TrieUtils.getSortField() could also return the right SortField. {quote} I want to move this into a new issue; I will open one. Nevertheless, I would like to remove emphasis from NumericUtils (which is in reality a helper class). So I want to make the current human-readable numeric parsers public and also add the trie parsers to FieldCache. The SortField factory is then the only part really needed in NumericUtils, but not really: the parser is a singleton, works for all trie fields, and could also live somewhere else or nowhere at all, if the parsers all stay in FieldCache. bq. Should we support byte/short for trie indexed fields as well? (Since SortField, FieldCache support these numeric types too...). For bytes, TrieRange is not very interesting; for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speed up searching, but you would limit index size a little bit. bq. Could we change ShiftAttribute -> NumericShiftAttribute? No problem, I'll do this. There is also a missing link from the TokenStream javadocs to this; see also my reply in java-dev to Grant's mail. bq. Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order. We can do this and deprecate the old one, but I added a note to the javadocs (see patch). I would do this outside of this issue. bq. How about oal.util.NumericUtils instead of TrieUtils? That was my first idea, too. What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really good NumberUtils from Solr (Yonik said it was written at a very early stage, and is not effective with UTF-8 encoding and the TermEnum positioning with the term prefixes). It would be an index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all.
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719729#action_12719729 ] Uwe Schindler commented on LUCENE-1673: --- bq. Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)? The problem currently is: - Field is final, so I must extend AbstractField. But some methods of Document return Field and not AbstractField. - NumericField would only work for indexing; when retrieving from the index (stored fields), it would change to Field. Maybe we should postpone this until after index-specific schemas and so on. Or document that it can only be used for indexing. By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index. The only good thing about NumericField would be the possibility to automatically disable TF and Norms per default when indexing.
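The StringBuffer.append()-style setValue chaining mentioned above can be sketched in isolation; FluentStream is an illustrative stand-in, not the actual NumericTokenStream:

```java
// Illustrative stand-in for the reusable, fluent setValue pattern:
// the setter returns this, so one stream instance can be re-filled
// for each document rather than allocated per document.
public class FluentStream {
    private long value;

    public FluentStream setLongValue(long v) {
        this.value = v;
        return this; // enables chaining, like StringBuffer.append()
    }

    public long getValue() { return value; }
}
```

In use, the same instance is reset per document, e.g. stream.setLongValue(x) passed inline to the field, which is what makes indexing "really easy" here.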
RE: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindler <u...@thetaphi.de> wrote: There is a new Attribute called ShiftAttribute (or NumericShiftAttribute), when trie range is moved to core. This attribute contains the shifted-away bits from the prefix-encoded value during trie indexing. I was wondering about this. To make use of ShiftAttribute, you need to understand the trie encoding scheme itself. If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? The idea was to make the indexing process controllable. You were the one who asked e.g. for the possibility to add payloads to trie fields and so on. Using the shift attribute, you have full control of the token types. OK, it's a little bit redundant; you could also use the TypeAttribute (which is already used to mark highest-precision and lower-precision values). One question about the whole TokenStream: In the original case we discussed Payloads/Position and TrieRange. If this were implemented in future versions, the question is: how should I set the PositionIncrement/Offsets in the token stream to create a position of 0 in the index? I do not understand the indexing process here, especially this deprecated boolean flag about something negative (not sure what the name was). Should I set PositionIncrement to 0 for all Trie fields per default? How about PositionIncrementGap, when indexing more than one field? All not really clear. The position would be simpler to implement, but doing this with an attribute that is indexed together with the other attributes, like a payload, would be the most ideal solution for future versions of TrieRange. 
(Maybe we could also use the Offset attribute for the highest precision bits) Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
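The PositionIncrement question above comes down to simple bookkeeping: the indexer advances the position by each token's increment, so an increment of 0 stacks a token on the previous position. A self-contained sketch of that rule (not Lucene's actual indexing code):

```java
import java.util.Arrays;

// Illustrative bookkeeping only: the position starts at -1 and each token
// advances it by its increment, so increment 0 stacks tokens together.
public class Positions {
    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1;
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        // First token increment 1, lower-precision trie tokens increment 0:
        System.out.println(Arrays.toString(positions(new int[]{1, 0, 0}))); // [0, 0, 0]
    }
}
```

So setting increment 1 on the first token and 0 on the lower-precision tokens would put every trie term of a value at position 0, which is one plausible answer to the question posed above.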
RE: New Token API was Re: Payloads and TrieRangeQuery
If you understood that, you'd be able to look at the actual token value if you were interested in what shift was used. So it's redundant, has a runtime cost, it's not currently used anywhere, and it's not useful to fields other than Trie. Perhaps it shouldn't exist (yet)? You are right, you could also decode the shift value from the first char of the token... I think I will remove the ShiftAttribute and only set the term type to mark highest vs. lower precisions. That way, one could easily add a payload to the real numeric value using a TokenFilter. Uwe - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
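The point above (the shift is recoverable from the token itself) can be sketched as follows; this is an illustration of the idea, not the actual TrieUtils code, and SHIFT_START = 0x20 is an assumed base offset:

```java
// Illustrative sketch, not actual TrieUtils encoding: assume the encoder
// prepends (char)(SHIFT_START + shift) to each prefix-coded token, so a
// filter can recover the shift from the first char alone.
public class ShiftDecode {
    static final char SHIFT_START = 0x20; // assumed base offset

    static String withPrefix(int shift, String body) {
        return (char) (SHIFT_START + shift) + body;
    }

    static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }
}
```

With a helper like this (or a getShift() on the utils class), a ShiftAttribute carried through the stream is indeed redundant.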
Re: New Token API was Re: Payloads and TrieRangeQuery
Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Millermarkrmil...@gmail.com wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. I just have found myself having to refigure out whats up (a short task to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self evident. That makes me think new users might be a little more confused when they first meet again. I'm not a new user though, so its only a guess really. reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this but I don't want to really work the issue if its gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldnt let that stop you ... On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and its been quite some time now), someone has probably heard that groan before. hope this doesn't sound like a complaint, Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesnt like to rant and rave now and then. As long as thoughts and opinions are coming out in a non negative way (which certainly includes complaints), I think its all good. but in my opinion this is because many do not have any tests. 
I converted a few of these and its just grunt work but if there are no tests, its impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point - we should probably push for tests or write them before committing more often. Sometimes I'm sure it just comes downto a tradeoff though - no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it. -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Robert Muir rcm...@gmail.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719738#action_12719738 ] Uwe Schindler commented on LUCENE-1673: --- I think I'll remove the ShiftAttribute completely; it's really useless. Maybe I'll add a getShift() method to NumericUtils that returns the shift value of a Token/String. See the java-dev thread with Yonik.
Re: New Token API was Re: Payloads and TrieRangeQuery
let me try some slightly more constructive feedback: new user looks at TokenStream javadocs: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html immediately they see deprecated, text in red with the words experimental, warnings in bold, the whole thing is scary! due to the use of 'e.g.' the javadoc for .incrementToken() is cut off in a bad way, and it's probably the most important method to a new user! there's also a stray bold tag gone haywire somewhere, possibly .incrementToken(). from a technical perspective, the documentation is excellent! but for a new user unfamiliar with lucene, it's unclear exactly what steps to take: use the scary red experimental api or the old deprecated one? there's also some fairly advanced stuff such as .captureState and .restoreState that might be better in a different place. finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing [one is static, one is not], especially because it states all streams and filters in one chain must use the same API; is there a way to simplify this? I'm really terrible with javadocs myself, but perhaps we can come up with a way to improve the presentation... maybe that will make the difference. On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir <rcm...@gmail.com> wrote: Mark, I'll see if I can get tests produced for some of those analyzers. as a new user of the new api myself, I think I can safely say the most confusing thing about it is having the old deprecated API mixed in the javadocs with it :) On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller <markrmil...@gmail.com> wrote: Robert Muir wrote: Mark, I created an issue for this. Thanks Robert, great idea. I just think you know, converting an analyzer to the new api is really not that bad. I don't either. I'm really just complaining about the initial readability. Once you know whats up, its not too much different. 
I have just found myself having to re-figure out what's up (a short task, to be sure) over again after I leave it for a while. With the old one, everything was just kind of immediately self-evident. That makes me think new users might be a little more confused when they first meet it. I'm not a new user though, so it's only a guess really.

Reverse engineering what one of them does is not necessarily obvious, and is completely unrelated but necessary if they are to be migrated. I'd be willing to assist with some of this, but I don't want to really work the issue if it's gonna be a waste of time at the end of the day... The chances of this issue being fully reverted are so remote that I really wouldn't let that stop you ...

On Mon, Jun 15, 2009 at 1:55 PM, Mark Miller markrmil...@gmail.com wrote: Robert Muir wrote: As Lucene's contrib hasn't been fully converted either (and it's been quite some time now), someone has probably heard that groan before. Hope this doesn't sound like a complaint. Complaints are fine in any case. Every now and then, it might cause a little rant from me or something, but please don't let that dissuade you :) Who doesn't like to rant and rave now and then. As long as thoughts and opinions are coming out in a non-negative way (which certainly includes complaints), I think it's all good.

But in my opinion this is because many do not have any tests. I converted a few of these and it's just grunt work, but if there are no tests, it's impossible to verify the conversion is correct. Thanks for pointing that out. We probably get lazy with tests, especially in contrib, and this brings up a good point: we should probably push for tests, or write them before committing, more often. Sometimes I'm sure it just comes down to a tradeoff though: no resources at the time, the class looked clear cut, and it was just contrib anyway. But then here we are ... a healthy dose of grunt work is bad enough when you have tests to check it.
-- Robert Muir rcm...@gmail.com
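For readers trying to picture what the thread is arguing about, the new API's consumption pattern can be sketched in plain Java. This is a simplified stand-in, not Lucene's real classes (the actual TokenStream and attribute types live in org.apache.lucene.analysis and carry more machinery); only the shape of the contract is the point: the consumer fetches an attribute once, then incrementToken() advances the stream in place instead of returning a Token object per call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a term attribute.
class TermAttribute {
    private String term;
    public void setTerm(String t) { term = t; }
    public String term() { return term; }
}

// Hypothetical, simplified stand-in for a new-style token stream.
class SimpleTokenStream {
    private final String[] words;
    private int pos = 0;
    // The stream owns a single reusable attribute instance.
    private final TermAttribute termAtt = new TermAttribute();

    SimpleTokenStream(String text) { words = text.split("\\s+"); }

    // Consumers fetch the attribute once, up front.
    public TermAttribute addAttribute() { return termAtt; }

    // New-style contract: mutate the attribute in place, return false at end.
    public boolean incrementToken() {
        if (pos >= words.length) return false;
        termAtt.setTerm(words[pos++]);
        return true;
    }
}

public class Demo {
    public static void main(String[] args) {
        SimpleTokenStream ts = new SimpleTokenStream("payloads and trie range");
        TermAttribute termAtt = ts.addAttribute(); // once, not per token
        List<String> terms = new ArrayList<>();
        while (ts.incrementToken()) {
            terms.add(termAtt.term());
        }
        System.out.println(terms); // [payloads, and, trie, range]
    }
}
```

The old API instead passed a reusable Token into next() and got a (possibly different) Token back, which is exactly the single-instance assumption discussed later in this thread.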
RE: New Token API was Re: Payloads and TrieRangeQuery
there's also a stray bold tag gone haywire somewhere, possibly .incrementToken() I fixed this. It was getting on my nerves the whole day while I was writing javadocs for NumericTokenStream... Uwe
Re: New Token API was Re: Payloads and TrieRangeQuery
Some great points - especially the decision between a deprecated API, and a new experimental one subject to change. Bit of a rock and a hard place for a new user. Perhaps we should add a little note with some guidance. - Mark

Robert Muir wrote: [snip - quoted feedback identical to the message above]
Re: New Token API was Re: Payloads and TrieRangeQuery
This is excellent feedback, Robert! I agree this is confusing, especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time on backwards compatibility than on the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions: a) do we want to release this new API with 2.9, and b) if yes to a), do we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need a better solution for the backwards compatibility here. -Michael

On 6/15/09 1:09 PM, Robert Muir wrote: [snip - quoted feedback identical to the message above]
RE: New Token API was Re: Payloads and TrieRangeQuery
By the way, there is an empty de subdir in SVN inside analysis. Can this be removed? And, in tests: test/o/a/l/index/store is misplaced; the class inside should be in test/o/a/l/store. Should I move it? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-Original Message- From: Uwe Schindler [mailto:u...@thetaphi.de] Sent: Monday, June 15, 2009 10:18 PM To: java-dev@lucene.apache.org Subject: RE: New Token API was Re: Payloads and TrieRangeQuery [snip - quoted message identical to the one above]
RE: New Token API was Re: Payloads and TrieRangeQuery
And I don't like the *useNewAPI*() methods either. [...] I'll try to think about how to simplify this confusing old/new API mix.

One solution to fix this useNewAPI problem would be to change the AttributeSource in such a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource would then have to implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which implements all 4 attributes). The reusable token is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
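Uwe's registry idea can be sketched in plain Java. Everything here is a hypothetical simplification (the interfaces, the Token class, and the addAttribute signature only loosely mirror the real Lucene names); the point is just that one Token instance can stand behind every attribute interface that a consumer asks for:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical attribute interfaces; the real ones carry more methods.
interface TermAttribute { String term(); void setTerm(String t); }
interface OffsetAttribute { int startOffset(); }

// The "good old" Token implements every attribute interface at once.
class Token implements TermAttribute, OffsetAttribute {
    private String term = "";
    private int start;
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
    public int startOffset() { return start; }
}

// AttributeSource keeps a registry: interface -> the single shared instance.
class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();
    private final Token token = new Token(); // per-instance reusable Token

    @SuppressWarnings("unchecked")
    public <T> T addAttribute(Class<T> iface) {
        // Whoever asks for TermAttribute (or any interface Token implements)
        // always gets the same Token instance, as Uwe proposes.
        return (T) attributes.computeIfAbsent(iface, k -> {
            if (k.isInstance(token)) return token;
            throw new IllegalArgumentException("no impl registered for " + k);
        });
    }
}

public class RegistryDemo {
    public static void main(String[] args) {
        AttributeSource src = new AttributeSource();
        TermAttribute termAtt = src.addAttribute(TermAttribute.class);
        OffsetAttribute offAtt = src.addAttribute(OffsetAttribute.class);
        termAtt.setTerm("trie");
        // Both views are backed by the one Token instance:
        System.out.println(((Object) termAtt) == ((Object) offAtt)); // true
        System.out.println(termAtt.term()); // trie
    }
}
```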
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat; even if you don't change any code at all, I think some reformatting of the javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use the new API is that, since all filters etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch busch...@gmail.com wrote: [snip - quoted reply identical to the message above]
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719761#action_12719761 ] Michael McCandless commented on LUCENE-1673: OK let's open a new issue for how to best integrate/default SortField and FieldCache.

bq. Nevertheless, I would like to remove emphasis from NumericUtils (which is in reality a helper class).

+1

bq. For bytes, TrieRange is not very interesting, for shorts, maybe, but I would subsume them during indexing as simple integers. You could not speed up searching, but limit index size a little bit.

Well, a RangeQuery on a plain text byte or short field requires sneakiness (knowing that you must zero-pad; keeping document.NumberUtils around); I think it's best if NumericXXX in Lucene handles all of Java's native numeric types. And you want a byte[] or short[] out of FieldCache (to not waste RAM having to upgrade to an int[]). We can do this under the (a?) new issue too...

bq. The SortField factory is then the only part really needed in NumericUtils, but not really. The parser is a singleton, works for all trie fields and could also live somewhere else or nowhere at all, if the Parsers all stay in FieldCache.

(Under a new issue, but...) I'm not really a fan of leaving the parser in FieldCache and expecting a user to know to create the SortField with that parser. NumericSortField would make it much more consumable to direct Lucene users.

{quote} bq. Can we rename RangeQuery - TextRangeQuery (TermRangeQuery), to make it clear that its range checking is by Term sort order.

We can do this and deprecate the old one, but I added a note to the Javadocs (see patch). I would do this outside of this issue. {quote}

OK. One benefit of a rename is it's a reminder to users on upgrading to consider whether they should in fact switch to NumericRangeQuery.

{quote} bq. How about oal.util.NumericUtils instead of TrieUtils?

That was my first idea, too.
What to do with o.a.l.doc.NumberTools (deprecate?). And also update contrib/spatial to use NumericUtils instead of the copied and not really good NumberUtils from Solr (Yonik said it was written at a very early stage, and is not effective with the UTF-8 encoding and the TermEnum positioning with the term prefixes). It would be an index-format change for spatial, but as the code was not yet released (in Lucene), the Lucene version should not use NumberUtils at all. {quote}

+1 on both (if we can add byte/short to trie*); we should do this before 2.9 since we can still change locallucene's format. Maybe open a new issue for that, too? We're forking off new 2.9 issues left and right here!!

bq. I think I'll remove the ShiftAttribute completely, it's really useless. Maybe I'll add a getShift() method to NumericUtils, that returns the shift value of a Token/String. See the java-dev mailing with Yonik.

OK

{quote} bq. Did you think about / decide against making a NumericField (that'd set the right tokenStream itself)?

Field is final and so I must extend AbstractField. But some methods of Document return Field and not AbstractField. {quote}

Can we just un-final Field?

{quote} NumericField would only work for indexing, but when retrieving from the index (stored fields), it would change to Field. Maybe we should move this after the index-specific schemas and so on. Or document that it can only be used for indexing. {quote}

True, but we already have such challenges between index vs search time Document; documenting it seems fine.

bq. By the way: How do you like the factories in NumericRangeQuery and the setValue methods, working like StringBuffer.append() in NumericTokenStream? This makes it really easy to index.

I think this is great! I like that you return NumericTokenStream :)

bq. The only good thing of NumericField would be the possibility to automatically disable TF and Norms per default when indexing.

Consumability (good defaults)!
(And also not having to know that you must go and get a tokenStream from NumericUtils). Move TrieRange to core -- Key: LUCENE-1673 URL: https://issues.apache.org/jira/browse/LUCENE-1673 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1673.patch, LUCENE-1673.patch TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940) and if possible I want to move it to core before release of 2.9. Before this can be done, there are some things to think about: # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how should they be called in core? I would suggest to leave it as
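The StringBuffer.append()-style chaining praised in the comment above can be illustrated with a stand-in class. This is hypothetical code, not the real NumericTokenStream API; only the fluent pattern is the point: setLongValue(...) returns the stream itself, so constructing and configuring it for indexing is a single expression.

```java
// Hypothetical stand-in demonstrating fluent setValue chaining.
class FluentNumericStream {
    private final int precisionStep;
    private long value;
    private boolean valueSet = false;

    FluentNumericStream(int precisionStep) { this.precisionStep = precisionStep; }

    // Returns 'this' so the call chains, like StringBuffer.append().
    public FluentNumericStream setLongValue(long v) {
        value = v;
        valueSet = true;
        return this;
    }

    public long value() {
        if (!valueSet) throw new IllegalStateException("call setLongValue first");
        return value;
    }

    public int precisionStep() { return precisionStep; }
}

public class FluentDemo {
    public static void main(String[] args) {
        // One expression: construct, set the value, hand off for indexing.
        FluentNumericStream stream = new FluentNumericStream(4).setLongValue(1234L);
        System.out.println(stream.value()); // 1234
    }
}
```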
Re: New Token API was Re: Payloads and TrieRangeQuery
I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; it shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all the Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote: [snip - quoted message identical to the one above]
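Michael's performance worry can be made concrete with stand-in code (hypothetical classes, not real Lucene types): if the single-instance guarantee holds, the consumer fetches each attribute once before the loop; without it, every token costs an extra map lookup per attribute.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical attribute lookup table keyed by interface class.
class AttrSource {
    private final Map<Class<?>, Object> attrs = new HashMap<>();
    <T> void register(Class<T> iface, T impl) { attrs.put(iface, impl); }
    @SuppressWarnings("unchecked")
    <T> T getAttribute(Class<T> iface) { return (T) attrs.get(iface); }
}

// Hypothetical one-method attribute interface (implementable by a lambda).
interface Term { String term(); }

public class LookupCostDemo {
    public static void main(String[] args) {
        AttrSource src = new AttrSource();
        src.register(Term.class, () -> "foo");

        // Single-instance assumption (new API): fetch once, reuse in the loop.
        Term cached = src.getAttribute(Term.class);
        for (int i = 0; i < 3; i++) {
            consume(cached.term()); // no map lookup per token
        }

        // Without that guarantee (old API allows a fresh Token per next()):
        for (int i = 0; i < 3; i++) {
            // a hash lookup on every token - the overhead Michael describes
            consume(src.getAttribute(Term.class).term());
        }
    }
    static void consume(String t) { System.out.println(t); }
}
```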
Re: New Token API was Re: Payloads and TrieRangeQuery
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler u...@thetaphi.de wrote: And, in tests: test/o/a/l/index/store is misplaced. The class inside should be in test/o/a/l/store. Should I move? Please do! Mike
Re: New Token API was Re: Payloads and TrieRangeQuery
I agree. It's my fault; the task of changing the contribs (LUCENE-1460) has been assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael

On 6/15/09 1:35 PM, Robert Muir wrote: [snip - quoted reply identical to the message above]
Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core
Michael McCandless (JIRA) wrote: We're forking off new 2.9 issues left and right here!! Evil :) You guys are like a small team working against me. We still have 29 +/- issues to wrap up though, so probably plenty of time. I hope we can set a rough target date soon though - it really feels like we could drag on for quite a bit longer if we wanted to. Remember the last time we started to push for 2.9, in Dec/Jan :) -- Mark http://www.lucidimagination.com
Re: New Token API was Re: Payloads and TrieRangeQuery
Michael OK, I plan on adding some tests for the analyzers that don't have any. I didn't try to migrate things such as the highlighter, which are definitely just as important, only because I'm not familiar with that territory. But I think I can figure out what the various language analyzers are trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch busch...@gmail.com wrote: I agree. It's my fault; the task of changing the contribs (LUCENE-1460) has been assigned to me for a while now - I just haven't found the time to do it yet. It's great that you started the work on that! I'll try to review the patch in the next couple of days and help with fixing the remaining ones. I'd like to get the AttributeSource improvements patch out first. I'll try that tonight. Michael

On 6/15/09 1:35 PM, Robert Muir wrote: Michael, again I am terrible with such things myself... Personally I am impressed that you have the back compat. Even if you don't change any code at all, I think some reformatting of the javadocs might make the situation a lot friendlier. I just listed everything that came to my mind immediately. I guess I will also mention that one of the reasons I might not use the new API is that, since all filters, etc. on the same chain must use the same API, it's discouraging if all the contrib stuff doesn't support the new API; it makes me want to just stick with the old so everything will work. So I think contribs being on the new API is really important, otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch busch...@gmail.com wrote: This is excellent feedback, Robert! I agree this is confusing; especially having a deprecated API and only an experimental one that replaces the old one. We need to change that. And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance.
In API patches I find myself spending more time on backwards compatibility than on the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix. However, we need to make the decisions: a) do we want to release this new API with 2.9, and b) if yes to a), do we want to remove the old API in 3.0? If yes to a) and no to b), then we'll have to support both APIs for a presumably very long time, so we then need a better solution for the backwards compatibility here. -Michael

On 6/15/09 1:09 PM, Robert Muir wrote: [...]
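The single-instance Attribute contract debated in this thread can be sketched in miniature. The classes below (SketchTermAttribute, SketchStream, NewApiSketch) are hypothetical stand-ins, not Lucene's real TokenStream/TermAttribute; they only illustrate why a consumer looks the attribute up once and then just calls incrementToken() in a loop, instead of receiving a (possibly fresh) Token from each next() call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's TermAttribute.
class SketchTermAttribute {
    private String term = "";
    void setTerm(String term) { this.term = term; }
    String term() { return term; }
}

// Hypothetical stand-in for a TokenStream using the new-style contract.
class SketchStream {
    private final String[] words;
    private int pos = 0;
    // The single shared attribute instance: incrementToken() refills this
    // same object on every call. This is the single-instance assumption
    // that lets the consumer avoid per-token getAttribute() lookups.
    final SketchTermAttribute termAtt = new SketchTermAttribute();

    SketchStream(String text) { this.words = text.split("\\s+"); }

    boolean incrementToken() {
        if (pos >= words.length) return false;
        termAtt.setTerm(words[pos++]);
        return true;
    }
}

public class NewApiSketch {
    static List<String> tokenize(String text) {
        SketchStream stream = new SketchStream(text);
        SketchTermAttribute termAtt = stream.termAtt; // looked up once, not per token
        List<String> terms = new ArrayList<String>();
        while (stream.incrementToken()) {
            terms.add(termAtt.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("the quick brown fox"));
    }
}
```

If a filter in the chain returned a different attribute instance per token, the one-time lookup above would silently see stale data, which is exactly the back-compat hazard discussed in the thread.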
RE: New Token API was Re: Payloads and TrieRangeQuery
Maybe change the deprecation wrapper around next() and next(Token) [the default impl of incrementToken()] to check if the retrieved token is not identical to the attribute, and then just copy the contents into the instance Token? This would be a slowdown, but only for the very rare TokenStreams that did not reuse tokens before (and were slow before, too). - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de _ From: Michael Busch [mailto:busch...@gmail.com] Sent: Monday, June 15, 2009 10:39 PM To: java-dev@lucene.apache.org Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

I have implemented most of that actually (the interface part and Token implementing all of them). The problem is a paradigm change with the new API: the assumption is that there is always only one single instance of an Attribute. With the old API, it is recommended to reuse the passed-in token, but you don't have to; you can also return a new one with every call of next(). Now with this change the indexer classes should only know about the interfaces; they shouldn't know Token anymore, which seems fine when Token implements all those interfaces. BUT, since there can be more than one instance of Token, the indexer would have to call getAttribute() for all Attributes it needs after each call of next(). I haven't measured how expensive that is, but it seems like a severe performance hit. That's basically the main reason why the backwards compatibility is ensured in such a goofy way right now. Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote: And I don't like the *useNewAPI*() methods either. I spent a lot of time thinking about backwards compatibility for this API. It's tricky to do without sacrificing performance. In API patches I find myself spending more time on backwards compatibility than on the actual new feature! :( I'll try to think about how to simplify this confusing old/new API mix.
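Uwe's wrapper idea can be sketched as follows, using hypothetical stand-in types (LegacyToken, LegacyStream) rather than Lucene's real Token/TokenStream: the default incrementToken() delegates to the deprecated next(Token), and only pays a copy cost when the stream ignored the reusable instance and returned a fresh token.

```java
// Hypothetical stand-in for Lucene's Token; names here are made up.
class LegacyToken {
    String term;
    int startOffset, endOffset;

    void copyFrom(LegacyToken other) {
        // Copy contents so consumers holding a reference to the shared
        // instance still observe the right values.
        this.term = other.term;
        this.startOffset = other.startOffset;
        this.endOffset = other.endOffset;
    }
}

// Hypothetical stand-in for a TokenStream still implementing the old API.
abstract class LegacyStream {
    // The single shared instance the new API hands to consumers.
    final LegacyToken sharedToken = new LegacyToken();

    // Old-style API: may reuse the passed-in token, or return a fresh one.
    abstract LegacyToken next(LegacyToken reusableToken);

    // Deprecation wrapper: copies only when the stream ignored the
    // reusable token, i.e. only the rare non-reusing (already slow) streams
    // pay for the copy.
    boolean incrementToken() {
        LegacyToken returned = next(sharedToken);
        if (returned == null) return false;
        if (returned != sharedToken) {
            sharedToken.copyFrom(returned);
        }
        return true;
    }
}
```

A stream that reuses the passed-in token takes the fast path (no copy); one that allocates a new Token per call still works, because its contents end up in the shared instance.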
One solution to fix this useNewAPI problem would be to change the AttributeSource in a way that it returns classes that implement interfaces (as you proposed some weeks ago). The good old Token class then does not need to be deprecated; it could simply implement all 4 interfaces. AttributeSource would then have to implement a registry of which classes implement which interfaces. So if somebody wants a TermAttribute, he always gets the Token. In all other cases, the default could be a *Impl default class. In this case, next() could simply return this Token (which is all 4 attributes). The reusableToken is simply thrown away in the deprecated API; the reusable Token comes from the AttributeSource and is per-instance. Is this an idea? Uwe
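A minimal sketch of the registry idea, with made-up names (TermAttr, OffsetAttr, SketchToken, AttributeRegistry) rather than Lucene's real classes: registering one Token instance under every attribute interface it implements guarantees that a consumer gets that same single instance whichever interface it asks for.

```java
import java.util.HashMap;
import java.util.Map;

// Made-up attribute interfaces; only two of the four are shown.
interface TermAttr { String term(); }
interface OffsetAttr { int startOffset(); }

// The good old Token simply implements all the attribute interfaces.
class SketchToken implements TermAttr, OffsetAttr {
    String term = "";
    int startOffset = 0;
    public String term() { return term; }
    public int startOffset() { return startOffset; }
}

class AttributeRegistry {
    // One instance per attribute interface. Registering the same SketchToken
    // under every interface it implements means whichever interface a
    // consumer asks for, the single shared instance comes back.
    private final Map<Class<?>, Object> attributes = new HashMap<Class<?>, Object>();

    <T> T addAttribute(Class<T> iface, T impl) {
        attributes.put(iface, impl);
        return impl;
    }

    <T> T getAttribute(Class<T> iface) {
        return iface.cast(attributes.get(iface));
    }
}
```

With this, next() can hand out the registered Token, and the new-API indexer only ever sees interfaces, never the concrete Token class.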
[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719766#action_12719766 ] Michael McCandless commented on LUCENE-1541: Uwe, what's the plan on this issue...? Should it wait until 3.1? Trie range - make trie range indexing more flexible --- Key: LUCENE-1541 URL: https://issues.apache.org/jira/browse/LUCENE-1541 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Ning Li Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1541.patch, LUCENE-1541.patch In the current trie range implementation, a single precision step is specified. With a large precision step (say 8), a value is indexed in fewer terms (8) but the number of terms for a range can be large. With a small precision step (say 2), the number of terms for a range is smaller but a value is indexed in more terms (32). We want to add an option that different precision steps can be set for different precisions. An expert can use this option to keep the number of terms for a range small and at the same time index a value in a small number of terms. See the discussion in LUCENE-1470 that results in this issue. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
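The tradeoff in the issue description can be made concrete for a 64-bit value. In the sketch below, termsPerValue is exact (one indexed term per precision level); roughMaxTermsPerRange is only a back-of-the-envelope worst-case bound (about 2 * (2^step - 1) terms at each level), quoted to show the direction of the tradeoff, not Lucene's exact term count.

```java
public class TrieTradeoff {
    static int termsPerValue(int precisionStep) {
        // One indexed term per precision level of a 64-bit value.
        return 64 / precisionStep;
    }

    static long roughMaxTermsPerRange(int precisionStep) {
        // Rough worst case: about 2 * (2^step - 1) terms per level,
        // times 64/step levels. Illustrative only.
        return (64L / precisionStep) * 2 * ((1L << precisionStep) - 1);
    }

    public static void main(String[] args) {
        for (int step : new int[] {2, 4, 8}) {
            System.out.println("step=" + step
                + " termsPerValue=" + termsPerValue(step)
                + " roughMaxTermsPerRange=" + roughMaxTermsPerRange(step));
        }
    }
}
```

This matches the numbers in the description: step 8 indexes a value in 8 terms but lets a range expand to thousands of terms, while step 2 indexes a value in 32 terms but keeps ranges small; per-precision steps would let an expert tune both ends at once.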
Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core
On Mon, Jun 15, 2009 at 4:42 PM, Mark Miller markrmil...@gmail.com wrote: Remember the last time we started to push for 2.9 in Dec/Jan :) Yes this is very much on my mind too!! So maybe, it's a race between the trie* group of issues, and the other 28 ;) Mike
[jira] Commented: (LUCENE-1313) Near Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719767#action_12719767 ] Jason Rutherglen commented on LUCENE-1313: -- TestThreadedOptimize is throwing an ensureContiguousMerge exception. I think this highlights that the change to merging all ram segments into a single primaryDir segment can sometimes lead to choosing segments that are non-contiguous? I'm not sure of the best way to handle this. Near Realtime Search Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram. IW.optimize and IW.commit flush the ramdir to the primary directory; all other operations try to keep segments in ram until there is no more space.
[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719769#action_12719769 ] Uwe Schindler commented on LUCENE-1541: --- I see no real use in it; it does not affect query performance, only index size. Maybe we should move it to 3.1 until I have some time, but the Payload thing is more interesting and maybe this can be combined.
RE: [jira] Commented: (LUCENE-1673) Move TrieRange to core
Sorry, I think these new issues may also be in 3.1 (not all), but I want to have this trie stuff with a clean API before 2.9 and not deprecate parts of it again in 3.1, shortly after release :-( These issues are no hard changes; it's just a little bit of API cleanup you can do in your free time :-] -- I know I am a little bit late, but I am working hard on this :) Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, June 15, 2009 10:51 PM To: java-dev@lucene.apache.org Subject: Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core On Mon, Jun 15, 2009 at 4:42 PM, Mark Miller markrmil...@gmail.com wrote: Remember the last time we started to push for 2.9 in Dec/Jan :) Yes this is very much on my mind too!! So maybe, it's a race between the trie* group of issues, and the other 28 ;) Mike
[jira] Updated: (LUCENE-1541) Trie range - make trie range indexing more flexible
[ https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1541: --- Fix Version/s: (was: 2.9) 3.1 OK, moving out to 3.1.
Re: New Token API was Re: Payloads and TrieRangeQuery
I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc). Robert Muir wrote: [...]
[jira] Commented: (LUCENE-973) Token of returns in CJKTokenizer + new TestCJKTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719776#action_12719776 ] Steven Rowe commented on LUCENE-973: +1 from me for inclusion in 2.9. Mark, as you wrote a couple of hours ago on java-dev, in response to Robert Muir's complaint about the lack of tests in contrib: bq. we should probably push for tests or write them before committing more often. Here's a chance to improve the situation: this issue adds a test to a contrib module where there currently are none! Token of returns in CJKTokenizer + new TestCJKTokenizer --- Key: LUCENE-973 URL: https://issues.apache.org/jira/browse/LUCENE-973 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.3 Reporter: Toru Matsuzawa Priority: Minor Fix For: 2.9 Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, LUCENE-973.patch, with-patch.jpg, without-patch.jpg The string returns as Token in the boundary of two byte character and one byte character. There is no problem in CJKAnalyzer. When CJKTokenizer is used with the unit, it becomes a problem. (Use it with Solr etc.)
Some SVN cleanup, was: New Token API was Re: Payloads and TrieRangeQuery
Done, tests pass. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, June 15, 2009 10:40 PM To: java-dev@lucene.apache.org Subject: Re: New Token API was Re: Payloads and TrieRangeQuery On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler u...@thetaphi.de wrote: And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class inside should be in test/o/a/l/store. Should I move? Please do! Mike
Re: New Token API was Re: Payloads and TrieRangeQuery
yeah about 5 seconds in I saw that and decided to stick with what I know :) On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller markrmil...@gmail.com wrote: I may do the Highlighter. It's annoying though - I'll have to break back compat because Token is part of the public API (Fragmenter, etc). Robert Muir wrote: [...]
[jira] Commented: (LUCENE-973) Token of returns in CJKTokenizer + new TestCJKTokenizer
[ https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719781#action_12719781 ] Robert Muir commented on LUCENE-973: very nice. although it might be a tad trickier to convert to the new API, anything with tests is easier! in other words, i have the existing cjktokenizer converted, but who's to say I did it right :)