[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719422#action_12719422
 ] 

Shai Erera commented on LUCENE-1677:


I think test-core is broken too ...

 Remove GCJ IndexReader specializations
 --

 Key: LUCENE-1677
 URL: https://issues.apache.org/jira/browse/LUCENE-1677
 Project: Lucene - Java
  Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
 Fix For: 2.9


 These specializations are outdated, unsupported, most probably pointless due 
 to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you 
 are going to ask people on java-user, anybody replied that they need it?). 
 While giving nothing, they make SegmentReader instantiation code look real 
 ugly.
 If nobody objects, I'm going to post a patch that removes these from Lucene.
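The commit log later in this thread notes that, along with the GCJ specializations, the system properties for choosing a SegmentReader implementation class are being deprecated. As a hypothetical sketch (the property name and fallback class below are made up for illustration, not Lucene's actual ones), that kind of hook boils down to a Class.forName lookup on a system property:

```java
// Hypothetical sketch of a system-property-driven impl-class hook; the
// property name and fallback are illustrative, not Lucene's actual ones.
public class ReaderImplLookup {
    static Class<?> readerClass(String propName, String defaultImpl) {
        String impl = System.getProperty(propName, defaultImpl);
        try {
            return Class.forName(impl);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("unknown reader impl: " + impl, e);
        }
    }

    public static void main(String[] args) {
        // with the property unset, the fallback class is loaded
        System.out.println(readerClass("reader.impl.sketch", "java.lang.Object").getName());
    }
}
```

Removing such a hook simplifies instantiation: the reader class is then known at compile time instead of being loaded reflectively.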

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1474) Incorrect SegmentInfo.delCount when IndexReader.flush() is used

2009-06-15 Thread Adrian Hempel (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719435#action_12719435
 ] 

Adrian Hempel commented on LUCENE-1474:
---

Hi Michael,

The index that Erik was working with contained segments created with a 
pre-2.4.1 version of Lucene, so we don't believe this is a regression.

Regards,
Adrian

 Incorrect SegmentInfo.delCount when IndexReader.flush() is used
 ---

 Key: LUCENE-1474
 URL: https://issues.apache.org/jira/browse/LUCENE-1474
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.4
Reporter: Marcel Reutegger
Assignee: Michael McCandless
 Fix For: 2.4.1, 2.9

 Attachments: CheckIndex.txt, IndexReaderTest.java


 When deleted documents are flushed using IndexReader.flush() the delCount in 
 SegmentInfo is updated based on the current value and 
 SegmentReader.pendingDeleteCount (introduced by LUCENE-1267). It seems that 
 pendingDeleteCount is not reset after the commit, which means after a second 
 flush() or close() of an index reader the delCount in SegmentInfo is 
 incorrect. A subsequent IndexReader.open() call will fail with an error when 
 assertions are enabled. E.g.:
 java.lang.AssertionError: delete count mismatch: info=3 vs BitVector=2
   at 
 org.apache.lucene.index.SegmentReader.loadDeletedDocs(SegmentReader.java:405)
 [...]
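The accounting described above can be sketched in a few lines (field names are borrowed from the report; this is not Lucene's actual code): if pendingDeleteCount is not zeroed when the deletes are committed, every subsequent flush() or close() adds the same deletes to delCount again.

```java
// Minimal sketch of the delCount bookkeeping described in this issue.
// Not Lucene's code: just the invariant that pendingDeleteCount must be
// consumed exactly once per commit.
public class DelCountSketch {
    int delCount;            // the count persisted in SegmentInfo
    int pendingDeleteCount;  // deletes buffered since the last commit

    void delete(int n) { pendingDeleteCount += n; }

    void flush() {
        delCount += pendingDeleteCount;
        pendingDeleteCount = 0;  // the missing reset: without this line a
                                 // second flush() double-counts the deletes
    }

    public static void main(String[] args) {
        DelCountSketch s = new DelCountSketch();
        s.delete(2);
        s.flush();
        s.flush();                       // a second commit with no new deletes
        System.out.println(s.delCount);  // prints 2; without the reset it would be 4
    }
}
```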




Re: Build failed in Hudson: Lucene-trunk #859

2009-06-15 Thread Simon Willnauer
There is the wrong name in the pom.xml.template for contrib/remote
Here is a diff with a patch:

Index: contrib/remote/pom.xml.template
===
--- contrib/remote/pom.xml.template (revision 784550)
+++ contrib/remote/pom.xml.template (working copy)
@@ -28,7 +28,7 @@
     <version>@version@</version>
   </parent>
   <groupId>org.apache.lucene</groupId>
-  <artifactId>lucene-regex</artifactId>
+  <artifactId>lucene-remote</artifactId>
   <name>Lucene Remote</name>
   <version>@version@</version>
   <description>Remote Searchable based on RMI</description>



simon
On Mon, Jun 15, 2009 at 4:19 AM, Apache Hudson
Server <hud...@hudson.zones.apache.org> wrote:
 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/859/changes

 Changes:

 [mikemccand] LUCENE-1539: add DeleteByPercent, FlushReader tasks, and ability 
 to open reader on a labelled commit point

 [mikemccand] LUCENE-1571: fix LatLongDistanceFilter to respect deleted docs

 [mikemccand] LUCENE-979: remove a few more old benchmark things

 [mikemccand] revert accidental commit

 [mikemccand] LUCENE-1677: deprecate gcj specializations, and the system 
 properties that let you specify which SegmentReader impl class to use

 [mikemccand] LUCENE-1407: move RemoteSearchable out of core into 
 contrib/remote

 --
 [...truncated 6620 lines...]
 build-lucene:

 build-lucene-tests:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:

 jar-src:
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc/lucene-misc-2.9-SNAPSHOT-src.jar

 dist-maven:
     [copy] Copying 1 file to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/misc
 [artifact:install-provider] Installing provider: 
 org.apache.maven.wagon:wagon-ssh:jar:1.0-beta-2:runtime
 [artifact:pom] Error downloading parent pom 
 org.apache.lucene:lucene-contrib::2.9-SNAPSHOT: Missing:
 [artifact:pom] --
 [artifact:pom] 1) org.apache.lucene:lucene-contrib:pom:2.9-SNAPSHOT
 [artifact:pom]   Path to dependency:
 [artifact:pom]          1) unspecified:unspecified:jar:0.0
 [artifact:pom]          2) org.apache.lucene:lucene-contrib:pom:2.9-SNAPSHOT
 [artifact:pom]
 [artifact:pom] --
 [artifact:pom] 1 required artifact is missing.
 [artifact:pom]
 [artifact:pom] for artifact:
 [artifact:pom]   unspecified:unspecified:jar:0.0
 [artifact:pom]
 [artifact:pom] from the specified remote repositories:
 [artifact:pom]   central (http://repo1.maven.org/maven2)
 [artifact:deploy] Deploying to 
 file://http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/dist/maven
 [artifact:deploy] [INFO] Retrieving previous build number from remote
 [artifact:deploy] [INFO] repository metadata for: 'snapshot 
 org.apache.lucene:lucene-misc:2.9-SNAPSHOT' could not be found on repository: 
 remote, so will be created
 [artifact:deploy] Uploading: 
 org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1.jar
  to remote
 [artifact:deploy] Uploaded 52K
 [artifact:deploy] [INFO] Retrieving previous metadata from remote
 [artifact:deploy] [INFO] repository metadata for: 'artifact 
 org.apache.lucene:lucene-misc' could not be found on repository: remote, so 
 will be created
 [artifact:deploy] [INFO] Uploading repository metadata for: 'artifact 
 org.apache.lucene:lucene-misc'
 [artifact:deploy] [INFO] Uploading project information for lucene-misc 
 2.9-20090615.021803-1
 [artifact:deploy] [INFO] Retrieving previous metadata from remote
 [artifact:deploy] [INFO] repository metadata for: 'snapshot 
 org.apache.lucene:lucene-misc:2.9-SNAPSHOT' could not be found on repository: 
 remote, so will be created
 [artifact:deploy] [INFO] Uploading repository metadata for: 'snapshot 
 org.apache.lucene:lucene-misc:2.9-SNAPSHOT'
 [artifact:deploy] [INFO] Retrieving previous build number from remote
 [artifact:deploy] Uploading: 
 org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1-sources.jar
  to remote
 [artifact:deploy] Uploaded 53K
 [artifact:deploy] [INFO] Retrieving previous build number from remote
 [artifact:deploy] Uploading: 
 org/apache/lucene/lucene-misc/2.9-SNAPSHOT/lucene-misc-2.9-20090615.021803-1-javadoc.jar
  to remote
 [artifact:deploy] Uploaded 142K
     [echo] Building queries...

 javacc-uptodate-check:

 javacc-notice:

 jflex-uptodate-check:

 jflex-notice:

 common.init:

 build-lucene:

 build-lucene-tests:

 init:

 clover.setup:

 clover.info:

 clover:

 compile-core:

 jar-src:
      [jar] Building jar: 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries/lucene-queries-2.9-SNAPSHOT-src.jar

 dist-maven:
     [copy] Copying 1 file to 
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/queries
 [artifact:install-provider] Installing provider: 
 org.apache.maven.wagon:wagon-ssh:jar:1.0-beta-2:runtime
 [artifact:pom] 

[jira] Created: (LUCENE-1691) An index copied over another index can result in corruption

2009-06-15 Thread Adrian Hempel (JIRA)
An index copied over another index can result in corruption
---

 Key: LUCENE-1691
 URL: https://issues.apache.org/jira/browse/LUCENE-1691
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Adrian Hempel
Priority: Minor
 Fix For: 2.4.1


After restoring an older backup of an index over the top of a newer version of 
the index, attempts to open the index can result in CorruptIndexExceptions, 
such as:

{noformat}
Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ for 
segment _ed: fieldsReader shows 1137 but segmentInfo shows 1389
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:362)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:228)
at 
org.apache.lucene.index.MultiSegmentReader.init(MultiSegmentReader.java:55)
at 
org.apache.lucene.index.ReadOnlyMultiSegmentReader.init(ReadOnlyMultiSegmentReader.java:27)
at 
org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:102)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
at 
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:237)
{noformat}

The apparent cause is the strategy of taking the maximum of the ID in the 
segments.gen file, and the IDs of the apparently valid segment files (See lines 
523-593 
[here|http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/index/SegmentInfos.java?annotate=751393]),
 and using this as the current generation of the index.  This will include 
stale segments that existed before the backup was restored.
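The max-of-generations strategy described above can be illustrated with a toy sketch (not Lucene's code): taking the maximum over the segments.gen value and the segments_N files found in the directory lets a stale, higher-numbered file left over from before the restore win over the restored index's own, older generation.

```java
import java.util.List;

// Toy illustration of the generation choice described above (not Lucene's
// code). A stale segments_N file that survived the restore has a higher
// generation than the restored backup, so the max picks the stale one.
public class GenChoiceSketch {
    static long pickGeneration(long genFromSegmentsGen, List<Long> gensFromDirListing) {
        long max = genFromSegmentsGen;
        for (long g : gensFromDirListing) {
            max = Math.max(max, g);
        }
        return max;
    }

    public static void main(String[] args) {
        // the restored backup is at generation 3, but a stale segments_7 survived
        System.out.println(pickGeneration(3L, List.of(2L, 7L))); // prints 7: the stale generation wins
    }
}
```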




Re: Build failed in Hudson: Lucene-trunk #859

2009-06-15 Thread Michael Busch

Thanks, Simon!

I just committed the fix.

 Michael

On 6/15/09 12:20 AM, Simon Willnauer wrote:

There is the wrong name in the pom.xml.template for contrib/remote
Here is a diff with a patch:

Index: contrib/remote/pom.xml.template
===
--- contrib/remote/pom.xml.template (revision 784550)
+++ contrib/remote/pom.xml.template (working copy)
@@ -28,7 +28,7 @@
     <version>@version@</version>
   </parent>
   <groupId>org.apache.lucene</groupId>
-  <artifactId>lucene-regex</artifactId>
+  <artifactId>lucene-remote</artifactId>
   <name>Lucene Remote</name>
   <version>@version@</version>
   <description>Remote Searchable based on RMI</description>



simon
On Mon, Jun 15, 2009 at 4:19 AM, Apache Hudson
Server <hud...@hudson.zones.apache.org> wrote:
   

See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/859/changes

Changes:

[mikemccand] LUCENE-1539: add DeleteByPercent, FlushReader tasks, and ability 
to open reader on a labelled commit point

[mikemccand] LUCENE-1571: fix LatLongDistanceFilter to respect deleted docs

[mikemccand] LUCENE-979: remove a few more old benchmark things

[mikemccand] revert accidental commit

[mikemccand] LUCENE-1677: deprecate gcj specializations, and the system 
properties that let you specify which SegmentReader impl class to use

[mikemccand] LUCENE-1407: move RemoteSearchable out of core into contrib/remote


RE: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/jav

2009-06-15 Thread Uwe Schindler
Hi Mike,

After adding a new contrib, I think we should also add it to the site docs
and to the javadocs generation in the main build.xml.

Should I prepare this? I have done this for spatial and trie in the past,
too.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: mikemcc...@apache.org [mailto:mikemcc...@apache.org]
 Sent: Sunday, June 14, 2009 1:13 PM
 To: java-comm...@lucene.apache.org
 Subject: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/
 contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/
 contrib/remote/src/java/org/apache/
 contrib/remote/src/java/org/apache/lucene/
 contrib/remote/src/java/org/apache/...
 
 Author: mikemccand
 Date: Sun Jun 14 11:13:04 2009
 New Revision: 784540
 
 URL: http://svn.apache.org/viewvc?rev=784540&view=rev
 Log:
 LUCENE-1407: move RemoteSearchable out of core into contrib/remote
 
 Added:
 lucene/java/trunk/contrib/remote/
 lucene/java/trunk/contrib/remote/build.xml
 lucene/java/trunk/contrib/remote/pom.xml.template
 lucene/java/trunk/contrib/remote/src/
 lucene/java/trunk/contrib/remote/src/java/
 lucene/java/trunk/contrib/remote/src/java/org/
 lucene/java/trunk/contrib/remote/src/java/org/apache/
 lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/
 lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/
 lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RMIRemoteSearchable.java
 lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
   - copied, changed from r784216, lucene/java/trunk/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
 lucene/java/trunk/contrib/remote/src/java/org/apache/lucene/search/RemoteSearchable.java
   - copied, changed from r784216, lucene/java/trunk/src/java/org/apache/lucene/search/RemoteSearchable.java
 lucene/java/trunk/contrib/remote/src/test/
 lucene/java/trunk/contrib/remote/src/test/org/
 lucene/java/trunk/contrib/remote/src/test/org/apache/
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
   - copied unchanged from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
   - copied, changed from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteSearchable.java
   - copied, changed from r784216, lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteSearchable.java
 lucene/java/trunk/contrib/remote/src/test/org/apache/lucene/search/TestRemoteSort.java
 Removed:
 lucene/java/trunk/src/java/org/apache/lucene/search/RemoteCachingWrapperFilter.java
 lucene/java/trunk/src/java/org/apache/lucene/search/RemoteSearchable.java
 lucene/java/trunk/src/test/org/apache/lucene/search/RemoteCachingWrapperFilterHelper.java
 lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteCachingWrapperFilter.java
 lucene/java/trunk/src/test/org/apache/lucene/search/TestRemoteSearchable.java
 Modified:
 lucene/java/trunk/CHANGES.txt
 lucene/java/trunk/build.xml
 lucene/java/trunk/common-build.xml
 lucene/java/trunk/src/java/org/apache/lucene/search/CachingSpanFilter.java
 lucene/java/trunk/src/java/org/apache/lucene/search/CachingWrapperFilter.java
 lucene/java/trunk/src/java/org/apache/lucene/search/FilterManager.java
 lucene/java/trunk/src/java/org/apache/lucene/search/Searchable.java
 lucene/java/trunk/src/test/org/apache/lucene/search/TestSort.java
 
 Modified: lucene/java/trunk/CHANGES.txt
 URL:
 http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?rev=784540&r1=784539&r2=784540&view=diff
 ==
 
 --- lucene/java/trunk/CHANGES.txt (original)
 +++ lucene/java/trunk/CHANGES.txt Sun Jun 14 11:13:04 2009
 @@ -196,6 +196,11 @@
  were deprecated. You should instantiate the Directory manually before
  and pass it to these classes (LUCENE-1451, LUCENE-1658).
  (Uwe Schindler)
 +
 +21. LUCENE-1407: Move RemoteSearchable, RemoteCachingWrapperFilter out
 +of Lucene's core into new contrib/remote package.  Searchable no
 +longer extends java.rmi.Remote (Simon Willnauer via Mike
 +McCandless)
 
  Bug fixes
 
 
 Modified: lucene/java/trunk/build.xml
 URL:
 http://svn.apache.org/viewvc/lucene/java/trunk/build.xml?rev=784540&r1=784539&r2=784540&view=diff
 

[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719491#action_12719491
 ] 

Michael McCandless commented on LUCENE-1677:


bq. I think test-core is broken too ...

It should be fixed now?  (I reverted it).





[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations

2009-06-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719504#action_12719504
 ] 

Shai Erera commented on LUCENE-1677:


You're right. I updated build.xml, but the change for test-core was actually in 
common-build.xml. Sorry for the false alarm.




Re: Build failed in Hudson: Lucene-trunk #859

2009-06-15 Thread Grant Ingersoll

FYI, Simon, you are still a contrib committer ;-)


On Jun 15, 2009, at 3:20 AM, Simon Willnauer wrote:


There is the wrong name in the pom.xml.template for contrib/remote
Here is a diff with a patch:

Index: contrib/remote/pom.xml.template
===
--- contrib/remote/pom.xml.template (revision 784550)
+++ contrib/remote/pom.xml.template (working copy)
@@ -28,7 +28,7 @@
     <version>@version@</version>
   </parent>
   <groupId>org.apache.lucene</groupId>
-  <artifactId>lucene-regex</artifactId>
+  <artifactId>lucene-remote</artifactId>
   <name>Lucene Remote</name>
   <version>@version@</version>
   <description>Remote Searchable based on RMI</description>



simon
On Mon, Jun 15, 2009 at 4:19 AM, Apache Hudson
Server <hud...@hudson.zones.apache.org> wrote:
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/859/ 
changes


Changes:

[mikemccand] LUCENE-1539: add DeleteByPercent, FlushReader tasks,  
and ability to open reader on a labelled commit point


[mikemccand] LUCENE-1571: fix LatLongDistanceFilter to respect  
deleted docs


[mikemccand] LUCENE-979: remove a few more old benchmark things

[mikemccand] revert accidental commit

[mikemccand] LUCENE-1677: deprecate gcj specializations, and the  
system properties that let you specify which SegmentReader impl  
class to use


[mikemccand] LUCENE-1407: move RemoteSearchable out of core into  
contrib/remote



Re: Build failed in Hudson: Lucene-trunk #859

2009-06-15 Thread Simon Willnauer
Uh! I didn't know that I could commit to all contribs.
Good to know, but I have been inactive for a while, so I want to use my
power with care!

simon


On Mon, Jun 15, 2009 at 12:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
 FYI, Simon, you are still a contrib committer ;-)


 On Jun 15, 2009, at 3:20 AM, Simon Willnauer wrote:

 There is the wrong name in the pom.xml.template for contrib/remote
 Here is a diff with a patch:

 Index: contrib/remote/pom.xml.template
 ===
 --- contrib/remote/pom.xml.template     (revision 784550)
 +++ contrib/remote/pom.xml.template     (working copy)
 @@ -28,7 +28,7 @@
     <version>@version@</version>
   </parent>
   <groupId>org.apache.lucene</groupId>
-  <artifactId>lucene-regex</artifactId>
+  <artifactId>lucene-remote</artifactId>
   <name>Lucene Remote</name>
   <version>@version@</version>
   <description>Remote Searchable based on RMI</description>



 simon
 On Mon, Jun 15, 2009 at 4:19 AM, Apache Hudson
Server <hud...@hudson.zones.apache.org> wrote:

 See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/859/changes

 Changes:

 [mikemccand] LUCENE-1539: add DeleteByPercent, FlushReader tasks, and
 ability to open reader on a labelled commit point

 [mikemccand] LUCENE-1571: fix LatLongDistanceFilter to respect deleted
 docs

 [mikemccand] LUCENE-979: remove a few more old benchmark things

 [mikemccand] revert accidental commit

 [mikemccand] LUCENE-1677: deprecate gcj specializations, and the system
 properties that let you specify which SegmentReader impl class to use

 [mikemccand] LUCENE-1407: move RemoteSearchable out of core into
 contrib/remote


[jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719513#action_12719513
 ] 

Michael McCandless commented on LUCENE-1691:


Copying over an existing index, without first removing all files in that index, 
is not a supported use case for Lucene.

I.e., to restore from a backup, you should make an empty directory and copy your 
index files back into it.
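The restore procedure Mike describes can be sketched as a shell session (all paths and file names here are made up for illustration; the setup lines exist only so the sketch runs stand-alone):

```shell
# Hypothetical demo paths, not real index locations.
BACKUP=/tmp/demo-index-backup
INDEX=/tmp/demo-index

# Setup only: fake a backup and a live index with a stale segment file.
mkdir -p "$BACKUP" "$INDEX"
touch "$BACKUP/segments_1" "$INDEX/segments_2" "$INDEX/_old.cfs"

# 1. Remove the live index entirely -- never copy a backup over it.
rm -rf "$INDEX"
mkdir -p "$INDEX"

# 2. Copy the backup into the now-empty directory.
cp -r "$BACKUP/." "$INDEX/"
ls "$INDEX"
```

The key point is step 1: wiping the target first means no segment files from a newer generation survive to confuse SegmentInfos when the index is next opened.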

 An index copied over another index can result in corruption
 ---

 Key: LUCENE-1691
 URL: https://issues.apache.org/jira/browse/LUCENE-1691
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Adrian Hempel
Priority: Minor
 Fix For: 2.4.1


 After restoring an older backup of an index over the top of a newer version 
 of the index, attempts to open the index can result in 
 CorruptIndexExceptions, such as:
 {noformat}
 Caused by: org.apache.lucene.index.CorruptIndexException: doc counts differ 
 for segment _ed: fieldsReader shows 1137 but segmentInfo shows 1389
 at 
 org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:362)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
 at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:228)
 at 
 org.apache.lucene.index.MultiSegmentReader.init(MultiSegmentReader.java:55)
 at 
 org.apache.lucene.index.ReadOnlyMultiSegmentReader.init(ReadOnlyMultiSegmentReader.java:27)
 at 
 org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:102)
 at 
 org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
 at 
 org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:237)
 {noformat}
 The apparent cause is the strategy of taking the maximum of the ID in the 
 segments.gen file, and the IDs of the apparently valid segment files (See 
 lines 523-593 
 [here|http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_4_1/src/java/org/apache/lucene/index/SegmentInfos.java?annotate=751393]),
  and using this as the current generation of the index.  This will include 
 stale segments that existed before the backup was restored.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption

2009-06-15 Thread Adrian Hempel (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719520#action_12719520
 ] 

Adrian Hempel commented on LUCENE-1691:
---

I realised that would probably be the case, but in the real world, this will be 
a common occurrence.

Hence my raising this issue as an Improvement rather than a Bug.

 An index copied over another index can result in corruption
 ---

 Key: LUCENE-1691
 URL: https://issues.apache.org/jira/browse/LUCENE-1691
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Adrian Hempel
Priority: Minor
 Fix For: 2.4.1






Re: [jira] Commented: (LUCENE-1691) An index copied over another index can result in corruption

2009-06-15 Thread Mark Miller

Adrian Hempel (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719520#action_12719520 ] 


Adrian Hempel commented on LUCENE-1691:
---

I realised that would probably be the case, but in the real world, this will be 
a common occurrence.
  

Delete the index you are copying over first?

Hence my raising this issue as an Improvement rather than a Bug.

  






[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719535#action_12719535
 ] 

Shai Erera commented on LUCENE-1630:


Ok, I was just about to post the patch when the Spatial tests failed. After 
some investigation, I found out the following, and would appreciate your 
suggestions. In IndexSearcher.search(QueryWeight weight, Filter filter, final int 
nDocs, Sort sort, boolean fillFields) I wrote the following code:

{code}
// try to create a Scorer in out-of-order mode, just to know which TFC
// version to instantiate.
boolean docsScoredInOrder = false;
if (subReaders.length > 0) {
  docsScoredInOrder = !weight.scorer(subReaders[0], false, 
false).scoresOutOfOrder();
}
TopFieldCollector collector = TopFieldCollector.create(sort, nDocs,
fillFields, fieldSortDoTrackScores, fieldSortDoMaxScore, 
docsScoredInOrder);
search(weight, filter, collector);
{code}

For clarification - I need to know which TFC instance to create (in-order / 
out-of-order). For that, I need to first create a Scorer, asking for an 
out-of-order one, and then check whether the Scorer is indeed out-of-order 
or not. That's a dummy Scorer, as I never use it afterwards, but since we 
didn't want to add scoresOutOfOrder to Weight but rather to Scorer, I don't have 
any other choice.

For Spatial, this creates a problem. One of the tests uses ConstantScoreQuery 
and passes in a Filter. CSQ.scorer() creates a new Scorer and uses the given 
Filter as reference. In Spatial, every time Filter.getDocIdSet() is called, the 
internal filter populates a WeakHashMap of distances (with the doc id as key), 
and doesn't clear it between invocations. It also updates the base of the key 
to handle multiple readers. Therefore the docs of the first reader are added 
twice - once for the dummy invocation and the second time since the base is 
updated (LatLongDistanceFilter.java, line 222) to reader.maxDoc().
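The double-counting described above can be reduced to a minimal sketch (class and field names are hypothetical, not the actual Spatial code): a filter that keys its distance map by base + docId and advances the base on every getDocIdSet() call will re-add the first reader's docs when a dummy call precedes the real one:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of the pattern in the Spatial distance filter.
class DistanceFilterSketch {
    final Map<Integer, Double> distances = new HashMap<Integer, Double>();
    int nextOffset = 0; // base added to doc ids, to handle multiple readers

    // Stand-in for Filter.getDocIdSet(reader): records a distance per doc.
    void getDocIdSet(int maxDoc) {
        for (int doc = 0; doc < maxDoc; doc++) {
            distances.put(nextOffset + doc, 1.0); // pretend distance value
        }
        nextOffset += maxDoc; // base advances even for a dummy invocation
    }
}
```

Calling getDocIdSet(3) twice for the same 3-doc reader leaves 6 entries in the map instead of 3, since the second call files the same docs under new keys; that is the shape of the failure the dummy-Scorer call triggers.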

I tried to create a new distances map on every invocation, but then another 
test fails. I don't know this code very well, and I don't know which is the 
best solution:

* Complicate the code in IndexSearcher to create a Scorer, then collect it, and 
then proceed with iterating over the readers from the 2nd onward. This is a really 
ugly change; I tried it and quickly reverted it. It also breaks the current beauty 
of having all the search methods call search(Weight, Filter, Collector).

* Fix the LatLongDistanceFilter code to check if reader.maxDoc() == nextOffset, 
and if so do nextOffset -= reader.maxDoc(). This is not pretty either, since it 
assumes a certain implementation and usage, which I don't like.

* Add scoresOutOfOrder to Weight, though I don't know if we want to add this 
knowledge to Weight, as it fits nicely in Scorer.

Any suggestions? Perhaps a different fix to Spatial?
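For what the third option might look like, here is a minimal sketch (class names hypothetical, not a patch): the weight itself reports orderness, so the searcher can pick the right collector without building a throw-away Scorer:

```java
// Hypothetical sketch of orderness reported by the weight, not the scorer.
abstract class QueryWeightSketch {
    // Default: documents are scored in doc-id order.
    public boolean scoresDocsOutOfOrder() { return false; }
}

// A boolean-query weight could answer based on its out-of-order setting,
// mirroring the BS vs. BS2 decision.
class BooleanWeightSketch extends QueryWeightSketch {
    private final boolean allowOutOfOrder;

    BooleanWeightSketch(boolean allowOutOfOrder) {
        this.allowOutOfOrder = allowOutOfOrder;
    }

    @Override
    public boolean scoresDocsOutOfOrder() { return allowOutOfOrder; }
}
```

IndexSearcher could then compute `boolean docsScoredInOrder = !weight.scoresDocsOutOfOrder();` and create the matching TopFieldCollector, with no dummy Scorer and therefore no extra call into Filter.getDocIdSet().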

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer 

[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539
 ] 

Earwin Burrfoot commented on LUCENE-1630:
-

I like the last option most. Creating a dummy scorer looks ugly to me, and looks 
like it will cause more problems of the same kind in the future.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).




[jira] Issue Comment Edited: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539
 ] 

Earwin Burrfoot edited comment on LUCENE-1630 at 6/15/09 5:36 AM:
--

I like the last option (move scoresOutOfOrder to Weight) most. Creating a dummy 
scorer looks ugly to me, and looks like it will cause more problems of the same 
kind in the future.


  was (Author: earwin):
I like the last option most. Creating dummy scorer looks ugly to me, and 
looks like it will cause more problems of the same kind in the future.
  
 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9






Re: svn commit: r784758 - in /lucene/java/trunk: ./ docs/ docs/lucene-sandbox/ src/site/src/documentation/content/xdocs/

2009-06-15 Thread Simon Willnauer
Thanks, uwe!

simon

On Mon, Jun 15, 2009 at 2:45 PM, uschind...@apache.org wrote:
 Author: uschindler
 Date: Mon Jun 15 12:45:05 2009
 New Revision: 784758

 URL: http://svn.apache.org/viewvc?rev=784758view=rev
 Log:
 LUCENE-1407: move RemoteSearchable out of core into contrib/remote (add 
 javadocs to developer resources)

 Modified:
    lucene/java/trunk/build.xml
    lucene/java/trunk/docs/benchmarks.html
    lucene/java/trunk/docs/contributions.html
    lucene/java/trunk/docs/demo.html
    lucene/java/trunk/docs/demo2.html
    lucene/java/trunk/docs/demo3.html
    lucene/java/trunk/docs/demo4.html
    lucene/java/trunk/docs/fileformats.html
    lucene/java/trunk/docs/gettingstarted.html
    lucene/java/trunk/docs/index.html
    lucene/java/trunk/docs/linkmap.html
    lucene/java/trunk/docs/linkmap.pdf
    lucene/java/trunk/docs/lucene-sandbox/index.html
    lucene/java/trunk/docs/queryparsersyntax.html
    lucene/java/trunk/docs/scoring.html
    lucene/java/trunk/src/site/src/documentation/content/xdocs/site.xml

 Modified: lucene/java/trunk/build.xml
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/build.xml?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/build.xml (original)
 +++ lucene/java/trunk/build.xml Mon Jun 15 12:45:05 2009
  @@ -309,6 +309,7 @@
             <packageset dir="contrib/miscellaneous/src/java"/>
             <packageset dir="contrib/queries/src/java"/>
             <packageset dir="contrib/regex/src/java"/>
  +          <packageset dir="contrib/remote/src/java"/>
             <packageset dir="contrib/snowball/src/java"/>
             <packageset dir="contrib/spatial/src/java"/>
             <packageset dir="contrib/spellchecker/src/java"/>

 Modified: lucene/java/trunk/docs/benchmarks.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/benchmarks.html?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/docs/benchmarks.html (original)
 +++ lucene/java/trunk/docs/benchmarks.html Mon Jun 15 12:45:05 2009
  @@ -161,6 +161,9 @@
   <a href="api/contrib-regex/index.html">Regex</a>
   </div>
   <div class="menuitem">
  +<a href="api/contrib-remote/index.html">Remote</a>
  +</div>
  +<div class="menuitem">
   <a href="api/contrib-snowball/index.html">Snowball</a>
   </div>
   <div class="menuitem">

 Modified: lucene/java/trunk/docs/contributions.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/contributions.html?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/docs/contributions.html (original)
 +++ lucene/java/trunk/docs/contributions.html Mon Jun 15 12:45:05 2009
  @@ -163,6 +163,9 @@
   <a href="api/contrib-regex/index.html">Regex</a>
   </div>
   <div class="menuitem">
  +<a href="api/contrib-remote/index.html">Remote</a>
  +</div>
  +<div class="menuitem">
   <a href="api/contrib-snowball/index.html">Snowball</a>
   </div>
   <div class="menuitem">

 Modified: lucene/java/trunk/docs/demo.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/demo.html?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/docs/demo.html (original)
 +++ lucene/java/trunk/docs/demo.html Mon Jun 15 12:45:05 2009
  @@ -163,6 +163,9 @@
   <a href="api/contrib-regex/index.html">Regex</a>
   </div>
   <div class="menuitem">
  +<a href="api/contrib-remote/index.html">Remote</a>
  +</div>
  +<div class="menuitem">
   <a href="api/contrib-snowball/index.html">Snowball</a>
   </div>
   <div class="menuitem">

 Modified: lucene/java/trunk/docs/demo2.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/demo2.html?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/docs/demo2.html (original)
 +++ lucene/java/trunk/docs/demo2.html Mon Jun 15 12:45:05 2009
  @@ -163,6 +163,9 @@
   <a href="api/contrib-regex/index.html">Regex</a>
   </div>
   <div class="menuitem">
  +<a href="api/contrib-remote/index.html">Remote</a>
  +</div>
  +<div class="menuitem">
   <a href="api/contrib-snowball/index.html">Snowball</a>
   </div>
   <div class="menuitem">

 Modified: lucene/java/trunk/docs/demo3.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/demo3.html?rev=784758r1=784757r2=784758view=diff
 ==
 --- lucene/java/trunk/docs/demo3.html (original)
 +++ lucene/java/trunk/docs/demo3.html Mon Jun 15 12:45:05 2009
  @@ -163,6 +163,9 @@
   <a href="api/contrib-regex/index.html">Regex</a>
   </div>
   <div class="menuitem">
  +<a href="api/contrib-remote/index.html">Remote</a>
  +</div>
  +<div class="menuitem">
   <a href="api/contrib-snowball/index.html">Snowball</a>
   </div>
   <div class="menuitem">

 Modified: lucene/java/trunk/docs/demo4.html
 URL: 
 http://svn.apache.org/viewvc/lucene/java/trunk/docs/demo4.html?rev=784758r1=784757r2=784758view=diff
 

[jira] Updated: (LUCENE-1691) An index copied over another index can result in corruption

2009-06-15 Thread Adrian Hempel (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Hempel updated LUCENE-1691:
--

Fix Version/s: (was: 2.4.1)
Affects Version/s: 2.4.1

 An index copied over another index can result in corruption
 ---

 Key: LUCENE-1691
 URL: https://issues.apache.org/jira/browse/LUCENE-1691
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Affects Versions: 2.4.1
Reporter: Adrian Hempel
Priority: Minor





[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719555#action_12719555
 ] 

Mark Miller commented on LUCENE-1595:
-

Okay, how about something like this:

we document the changes and the conversion process in the benchmark 
CHANGES file, and then maybe check for removed alg properties in the algorithms and 
throw an exception pointing people to the CHANGES file if we find one? Or 
something along those lines?

I'd like to make the transition as smooth as possible.
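A sketch of the property check being suggested (the removed property names below are placeholders, not the actual list; the class name is hypothetical):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

class AlgPropertyChecker {
    // Placeholder names standing in for properties removed by the refactoring.
    private static final Set<String> REMOVED =
        new HashSet<String>(Arrays.asList("doc.maker", "doc.add.log.step"));

    // Fail fast with a pointer to CHANGES instead of silently ignoring
    // a property that no longer does anything.
    static void check(Properties algProps) {
        for (Object key : algProps.keySet()) {
            if (REMOVED.contains(key)) {
                throw new IllegalArgumentException("Property '" + key
                    + "' was removed; see contrib/benchmark CHANGES for the conversion.");
            }
        }
    }
}
```

Run at .alg parse time, this would turn a silently-ignored stale property into an immediate, self-explanatory failure.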

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split up to ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
 (just the reuse of DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with this refactoring 
 proposed.
 So by having a EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accepts all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.




[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719558#action_12719558
 ] 

Mark Miller commented on LUCENE-1518:
-

This issue is marked as part of LUCENE-1345, which has been pushed to 3.1. 
Also, it has not yet found an assignee. Speak out, or I will push this to 3.1.

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do and can 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage that he can construct his query 
 using a simplified API without thinking about Filters or Queries; you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API, just like the query 
 optimizer of an RDB.
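A minimal self-contained sketch of the idea (class names mirror Lucene's but this is not the Lucene API; the Map-based scoring stands in for the real Weight/Scorer machinery):

```java
import java.util.*;

// Stand-in Query: something that can score documents.
abstract class Query {
    /** Returns docId -> score for all matching docs in [0, maxDoc). */
    abstract Map<Integer, Float> search(int maxDoc);
}

// The proposed abstract Filter extends Query, so any Filter can be used
// wherever a Query is expected.
abstract class Filter extends Query {
    /** The classic Filter contract: which docs pass. */
    abstract BitSet getDocIdSet(int maxDoc);

    /** Default Query behavior: constant score 1.0 for every allowed doc
     *  (the ConstantScoreQuery logic folded into Filter itself). */
    @Override
    Map<Integer, Float> search(int maxDoc) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        BitSet bits = getDocIdSet(maxDoc);
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1))
            scores.put(doc, 1.0f);
        return scores;
    }
}

// A concrete filter author implements only getDocIdSet()...
class EvenDocsFilter extends Filter {
    @Override
    BitSet getDocIdSet(int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int i = 0; i < maxDoc; i += 2) bits.set(i);
        return bits;
    }
}
```

...and gets query behavior for free: the filter can be handed directly to any code that expects a Query.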




[jira] Commented: (LUCENE-1313) Realtime Search

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719561#action_12719561
 ] 

Mark Miller commented on LUCENE-1313:
-

What's the verdict on this one, Mike? I got the impression this was likely 3.1 ...

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.
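The flush/commit behavior described above can be caricatured in a few lines. This is a deliberately naive sketch with a byte budget standing in for "available ram"; the real patch deals with merges, deletes and concurrency:

```java
import java.util.*;

// Toy model of the tiered write path: flushed segments stay in a RAM tier
// while the budget allows, otherwise they spill to the primary directory;
// commit() migrates everything in RAM to the primary tier.
class TieredWriter {
    private final long ramBudget;
    private long ramUsed = 0;
    private final List<byte[]> ramSegments = new ArrayList<>();
    private final List<byte[]> primarySegments = new ArrayList<>();

    TieredWriter(long ramBudget) { this.ramBudget = ramBudget; }

    /** A flush lands in the RAM tier unless the budget is exhausted. */
    void flushSegment(byte[] segment) {
        if (ramUsed + segment.length <= ramBudget) {
            ramSegments.add(segment);
            ramUsed += segment.length;
        } else {
            primarySegments.add(segment); // no room: go to the primary dir
        }
    }

    /** Like IW.commit()/optimize(): flush the RAM tier to the primary. */
    void commit() {
        primarySegments.addAll(ramSegments);
        ramSegments.clear();
        ramUsed = 0;
    }

    int ramCount()     { return ramSegments.size(); }
    int primaryCount() { return primarySegments.size(); }
}
```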




[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719565#action_12719565
 ] 

Mark Miller commented on LUCENE-1595:
-

bq. Does this make sense?

Okay, sounds good.

Silence is consent around here, so I think we are good to go with this patch as 
soon as I go over it a bit. I'll wait till you post this last one.

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split into ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
 (just to reuse DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with the refactoring 
 proposed here.
 So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accepts all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.




[jira] Commented: (LUCENE-1518) Merge Query and Filter classes

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719567#action_12719567
 ] 

Uwe Schindler commented on LUCENE-1518:
---

Push to 3.1! -- Uwe

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do and can 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage that he can construct his query 
 using a simplified API without thinking about Filters or Queries; you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API, just like the query 
 optimizer of an RDB.




[jira] Assigned: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-1606:
-

Assignee: Uwe Schindler

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
 The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.
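The skip-ahead enumeration can be illustrated with a toy stand-in. Here a set of legal first characters plays the role of the DFA's start-state transitions and a sorted set plays the term dictionary; none of this is the BRICS or Lucene code, but it shows why seeking beats a per-term accept/reject test:

```java
import java.util.*;
import java.util.regex.*;

class SeekingTermEnum {
    /** Enumerates matching terms, but after a term whose first character
     *  can never start a match, seeks straight past that whole character
     *  range instead of regex-testing every term inside it. */
    static List<String> matching(NavigableSet<String> terms, String regex,
                                 Set<Character> legalFirstChars) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        String t = terms.isEmpty() ? null : terms.first();
        while (t != null) {
            if (t.isEmpty()) { t = terms.higher(t); continue; }
            char c = t.charAt(0);
            if (!legalFirstChars.contains(c)) {
                // "Reject state" immediately: generate the next possible
                // String and seek to it, skipping all terms starting with c.
                t = terms.ceiling(String.valueOf((char) (c + 1)));
                continue;
            }
            if (p.matcher(t).matches()) hits.add(t);
            t = terms.higher(t);
        }
        return hits;
    }
}
```

For a dictionary {apple, banana, cherry, http://a, http://b} and the pattern `http://.*`, the apple..cherry range is skipped by seeking instead of being matched term by term; on a 100M-term dictionary the same idea skips whole subranges at once.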




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719570#action_12719570
 ] 

Uwe Schindler commented on LUCENE-1606:
---

I'll take it; I think it is almost finished. The only open question at the 
moment is bundling the external library in contrib: it is BSD licensed, are 
there any problems with that?

If not, I can manage the inclusion into the regex contrib.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
 The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Updated: (LUCENE-1518) Merge Query and Filter classes

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1518:


Fix Version/s: (was: 2.9)
   3.1

 Merge Query and Filter classes
 --

 Key: LUCENE-1518
 URL: https://issues.apache.org/jira/browse/LUCENE-1518
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.4
Reporter: Uwe Schindler
 Fix For: 3.1

 Attachments: LUCENE-1518.patch


 This issue presents a patch that merges Queries and Filters in such a way 
 that the new Filter class extends Query. This would make it possible to use 
 every filter as a query.
 The new abstract Filter class would contain all methods of 
 ConstantScoreQuery and deprecate ConstantScoreQuery. If somebody implements 
 the Filter's getDocIdSet()/bits() methods, he has nothing more to do and can 
 just use the filter as a normal query.
 I do not want to completely convert Filters to ConstantScoreQueries. The idea 
 is to combine Queries and Filters in such a way, that every Filter can 
 automatically be used at all places where a Query can be used (e.g. also 
 alone a search query without any other constraint). For that, the abstract 
 Query methods must be implemented and return a default weight for Filters 
 which is the current ConstantScore Logic. If the filter is used as a real 
 filter (where the API wants a Filter), the getDocIdSet part could be directly 
 used, the weight is useless (as it is currently, too). The constant score 
 default implementation is only used when the Filter is used as a Query (e.g. 
 as direct parameter to Searcher.search()). For the special case of 
 BooleanQueries combining Filters and Queries the idea is, to optimize the 
 BooleanQuery logic in such a way, that it detects if a BooleanClause is a 
 Filter (using instanceof) and then directly uses the Filter API and not take 
 the burden of the ConstantScoreQuery (see LUCENE-1345).
 Here some ideas how to implement Searcher.search() with Query and Filter:
 - User runs Searcher.search() using a Filter as the only parameter. As every 
 Filter is also a ConstantScoreQuery, the query can be executed and returns 
 score 1.0 for all matching documents.
 - User runs Searcher.search() using a Query as the only parameter: No change, 
 all is the same as before
 - User runs Searcher.search() using a BooleanQuery as parameter: If the 
 BooleanQuery does not contain a Query that is subclass of Filter (the new 
 Filter) everything as usual. If the BooleanQuery only contains exactly one 
 Filter and nothing else the Filter is used as a constant score query. If 
 BooleanQuery contains clauses with Queries and Filters the new algorithm 
 could be used: The queries are executed and the results filtered with the 
 filters.
 For the user this has the main advantage that he can construct his query 
 using a simplified API without thinking about Filters or Queries; you can 
 just combine clauses together. The scorer/weight logic then identifies the 
 cases to use the filter or the query weight API, just like the query 
 optimizer of an RDB.




[jira] Commented: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-15 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719563#action_12719563
 ] 

Shai Erera commented on LUCENE-1595:


OK, I agree. I've already documented CHANGES. I'll add to PerfTask a deprecated 
method checkObsoleteSettings which will throw an exception if it finds 
doc.add.log.step or doc.delete.log.step. doc.maker is still a valid setting, 
but when you try to cast the argument to a DocMaker, you'll get an 
exception, because it's now a concrete class and not an interface.

Does this make sense?

I'll post a patch soon.
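The check described above might look roughly like this. The method name and property keys come from the comment; the surrounding class and the exception choice are invented for illustration:

```java
import java.util.*;

class BenchmarkConfig {
    private final Properties props;
    BenchmarkConfig(Properties props) { this.props = props; }

    /** Fails fast when a removed setting is still present, instead of
     *  silently ignoring it. */
    void checkObsoleteSettings() {
        for (String key : new String[] {"doc.add.log.step", "doc.delete.log.step"}) {
            if (props.containsKey(key)) {
                throw new IllegalArgumentException(
                    key + " is an obsolete setting; see CHANGES for details");
            }
        }
    }
}
```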

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split into ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
 (just to reuse DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with the refactoring 
 proposed here.
 So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accepts all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719571#action_12719571
 ] 

Mark Miller commented on LUCENE-1606:
-

I don't think there is a problem with BSD. I know Grant has committed a BSD 
licensed stop word list in the past.

I've asked explicitly about it before, but got no response.

I'll try and dig a little, but Grant is the PMC head and he did it, so we 
wouldn't be in bad company...

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's 
 not suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
 The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA).
  2. Generate the next possible String and seek to that.
 The Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.




[jira] Updated: (LUCENE-1595) Split DocMaker into ContentSource and DocMaker

2009-06-15 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1595:
---

Attachment: LUCENE-1595.patch

Patch adds a checkObsoleteSettings to PerfTask to alert on the use of 
doc.add.log.step and doc.delete.log.step, as well as documentation in CHANGES.

All benchmark tests pass.

 Split DocMaker into ContentSource and DocMaker
 --

 Key: LUCENE-1595
 URL: https://issues.apache.org/jira/browse/LUCENE-1595
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Shai Erera
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: LUCENE-1595.patch, LUCENE-1595.patch, LUCENE-1595.patch, 
 LUCENE-1595.patch


 This issue proposes some refactoring to the benchmark package. Today, 
 DocMaker has two roles: collecting documents from a collection and preparing 
 a Document object. These two should actually be split into ContentSource and 
 DocMaker, which will use a ContentSource instance.
 ContentSource will implement all the methods of DocMaker, like 
 getNextDocData, raw size in bytes tracking etc. This can actually fit well w/ 
 1591, by having a basic ContentSource that offers input stream services, and 
 wraps a file (for example) with a bzip or gzip streams etc.
 DocMaker will implement the makeDocument methods, reusing DocState etc.
 The idea is that collecting the Enwiki documents, for example, should be the 
 same whether I create documents using DocState, add payloads or index 
 additional metadata. Same goes for Trec and Reuters collections, as well as 
 LineDocMaker.
 In fact, if one inspects EnwikiDocMaker and LineDocMaker closely, they are 
 99% the same and 99% different. Most of their differences lie in the way they 
 read the data, while most of the similarity lies in the way they create 
 documents (using DocState).
 That led to a somewhat bizarre extension of LineDocMaker by EnwikiDocMaker 
 (just to reuse DocState). Also, other DocMakers do not use that DocState 
 today, something they could have gotten for free with the refactoring 
 proposed here.
 So by having an EnwikiContentSource, ReutersContentSource and others (TREC, 
 Line, Simple), I can write several DocMakers, such as DocStateMaker, 
 ConfigurableDocMaker (one which accepts all kinds of config options) and 
 custom DocMakers (payload, facets, sorting), passing to them a ContentSource 
 instance and reuse the same DocMaking algorithm with many content sources, as 
 well as the same ContentSource algorithm with many DocMaker implementations.
 This will also give us the opportunity to perf test content sources alone 
 (i.e., compare bzip, gzip and regular input streams), w/o the overhead of 
 creating a Document object.
 I've already done so in my code environment (I extend the benchmark package 
 for my application's purposes) and I like the flexibility I have. I think 
 this can be a nice contribution to the benchmark package, which can result in 
 some code cleanup as well.




[jira] Updated: (LUCENE-1313) Realtime Search

2009-06-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1313:
---

Fix Version/s: (was: 2.9)
   3.1

OK, let's push it to 3.1. It's very much in progress, but 1) the iterations are 
slow (it's a big patch), and 2) it's a biggish change, so I'd prefer to do it 
shortly after a release, not shortly before, so it has plenty of time to bake 
on trunk.

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.




Re: svn commit: r784540 - in /lucene/java/trunk: ./ contrib/remote/ contrib/remote/src/ contrib/remote/src/java/ contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/ contrib/remote/src/j

2009-06-15 Thread Michael McCandless
Super, thanks Uwe!

Mike

On Mon, Jun 15, 2009 at 8:46 AM, Uwe Schindleru...@thetaphi.de wrote:
 Committed into general site docs (developer-resources) and into trunk's docs
 (large patch, because navigation changed).

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, June 15, 2009 11:40 AM
 To: java-dev@lucene.apache.org
 Subject: Re: svn commit: r784540 - in /lucene/java/trunk: ./
 contrib/remote/ contrib/remote/src/ contrib/remote/src/java/
 contrib/remote/src/java/org/ contrib/remote/src/java/org/apache/
 contrib/remote/src/java/org/apache/lucene/ contrib/remote/src/java/org/a

 On Mon, Jun 15, 2009 at 3:41 AM, Uwe Schindleru...@thetaphi.de wrote:
  Hi Mike,
 
  after adding a new contrib, I think we should also add this to the site
 docs
  and also the javadocs generation in the main build.xml.

 Woops, you're right.

  Should I prepare this? I have done this for spatial and trie in the
 past,
  too.

 Yes please?  Thanks!

 Mike




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719602#action_12719602
 ] 

Uwe Schindler commented on LUCENE-1606:
---

Robert: I applied the patch locally; one test was still using @Override, fixed 
that. I only downloaded automaton.jar, not the source package.

Do you know if automaton.jar is compiled using -source 1.4 -target 1.4? (It 
was compiled using Ant 1.7 and Java 1.6.) If you're not sure, I will try to build it 
again from source and use the correct compiler switches. The regex contrib 
module has been Java 1.4 until now. If automaton only works with 1.5, we should wait 
until 3.0 to release it.
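For reference, forcing 1.4-compatible class files from a newer JDK would look roughly like this in an Ant build (a hypothetical build.xml fragment; the paths and classpath reference are made up, and note that -source/-target alone do not guard against accidentally using 1.5-only library classes):

```xml
<!-- Hypothetical fragment: emit JDK 1.4 (class-file major version 48) classes. -->
<javac srcdir="src/java"
       destdir="build/classes"
       source="1.4"
       target="1.4"
       debug="on">
  <classpath refid="contrib.classpath"/>
</javac>
```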

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. 
 Additionally, all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon a constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as URLs where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms, but instead of a 
 binary accept/reject do:
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 The Query simply wraps the Filter with ConstantScoreQuery.
 I did not include automaton.jar inside the patch, but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.
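The two-step enumeration above can be illustrated with a toy, self-contained DFA (this is not the BRICS API; the DFA for the pattern "ab*" is hard-coded and all names are made up):

```java
public class DfaSeek {
    // Tiny hard-coded DFA for the regex "ab*": state 0 --a--> 1,
    // state 1 --b--> 1, anything else enters the reject state (-1).
    static int step(int state, char c) {
        if (state == 0) return c == 'a' ? 1 : -1;
        if (state == 1) return c == 'b' ? 1 : -1;
        return -1;
    }

    // Step 1: length of the longest prefix of term that has NOT entered
    // the reject state.
    static int okPrefixLen(String term) {
        int state = 0;
        for (int i = 0; i < term.length(); i++) {
            state = step(state, term.charAt(i));
            if (state == -1) return i;
        }
        return term.length();
    }

    // Step 2: next possible term to seek to - keep the OK prefix, bump the
    // first rejected character, drop the rest. In a sorted term dictionary
    // this skips every term sharing the rejected prefix in a single seek.
    static String nextCandidate(String term) {
        int ok = okPrefixLen(term);
        if (ok == term.length()) return term; // term itself is still viable
        return term.substring(0, ok) + (char) (term.charAt(ok) + 1);
    }

    public static void main(String[] args) {
        System.out.println(nextCandidate("accc")); // "a" is OK, bump 'c': seek to "ad"
        System.out.println(nextCandidate("abb"));  // fully accepted so far: "abb"
    }
}
```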




[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719605#action_12719605
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, you are correct. I just took a glance at the automaton source code and saw 
StringBuilder, so I think it is safe to say it only works with 1.5...

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9





[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719606#action_12719606
 ] 

Uwe Schindler commented on LUCENE-1606:
---

Doesn't seem to work; I will check the sources:

{code}
compile-core:
[javac] Compiling 12 source files to 
C:\Projects\lucene\trunk\build\contrib\regex\classes\java
[javac] 
C:\Projects\lucene\trunk\contrib\regex\src\java\org\apache\lucene\search\regex\AutomatonFuzzyQuery.java:11:
 cannot access dk.brics.automaton.Automaton
[javac] bad class file: C:\Projects\lucene\trunk\contrib\regex\lib\automaton
.jar(dk/brics/automaton/Automaton.class)
[javac] class file has wrong version 49.0, should be 48.0
[javac] Please remove or make sure it appears in the correct subdirectory of
 the classpath.
[javac] import dk.brics.automaton.Automaton;
[javac]   ^
[javac] 1 error
{code}
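For context on the version numbers in that error: the class-file header carries a major version that encodes the compiler's target (48 = Java 1.4, 49 = Java 5, 50 = Java 6), which is why a 1.4 build rejects a jar compiled for 1.5. A small self-contained check (hypothetical helper, not part of Lucene or the automaton package):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ClassVersion {
    // Reads the class-file header of a loaded class: the 0xCAFEBABE magic,
    // then the minor and major version (major 48 = Java 1.4, 49 = Java 5).
    public static int majorVersion(Class<?> c) throws IOException {
        InputStream in = c.getResourceAsStream(
                "/" + c.getName().replace('.', '/') + ".class");
        DataInputStream data = new DataInputStream(in);
        try {
            if (data.readInt() != 0xCAFEBABE) {
                throw new IOException("not a class file");
            }
            data.readUnsignedShort();        // minor version
            return data.readUnsignedShort(); // major version
        } finally {
            data.close();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("major version: " + majorVersion(ClassVersion.class));
    }
}
```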

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9





[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719607#action_12719607
 ] 

Uwe Schindler commented on LUCENE-1606:
---

So I tend to move this to 3.0 or 3.1, because of the missing Java 1.4 support in 
the regex contrib.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9





Core JDK 1.4 compatible.

2009-06-15 Thread Uwe Schindler
By the way:
I compiled core and corresponding tests with an old JDK 1.4 version, I found
locally on my machine. Works fine!

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de







Re: Core JDK 1.4 compatible.

2009-06-15 Thread Michael McCandless
:)

But those days are numbered!

Mike

On Mon, Jun 15, 2009 at 11:55 AM, Uwe Schindleru...@thetaphi.de wrote:
 By the way:
 I compiled core and corresponding tests with an old JDK 1.4 version, I found
 locally on my machine. Works fine!

 Uwe




Re: Core JDK 1.4 compatible.

2009-06-15 Thread Shai Erera
It would help if we had a target date - then I'd know how many more X's I
need to mark on the calendar :)

On Mon, Jun 15, 2009 at 6:56 PM, Michael McCandless 
luc...@mikemccandless.com wrote:

 :)

 But those days are numbered!

 Mike





[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719612#action_12719612
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, sorry about this.

I did just verify that automaton.jar can be compiled for Java 5 (at least it does 
not have Java 1.6 dependencies), so perhaps this can be integrated in a later 
release.

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9





[jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1606:
--

Fix Version/s: (was: 2.9)
   3.0

I am moving this to 3.0 (and not 3.1), because it can be released together with 3.0 
(contrib modules do not need to wait until 3.1).

Robert: you could supply a patch with the StringBuilder toString() variants and all 
those @Override annotations uncommented. And it works correctly with 1.5 (I am 
working with 1.5 here locally - I hate 1.6...).

 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0





[jira] Commented: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719619#action_12719619
 ] 

Mark Miller commented on LUCENE-1599:
-

Something is modifying the original query itself.

In MultiSearcher.rewrite:

  public Query rewrite(Query original) throws IOException {
    Query[] queries = new Query[searchables.length];
    for (int i = 0; i < searchables.length; i++) {
      queries[i] = searchables[i].rewrite(original);
    }
    return queries[0].combine(queries);
  }

The first time through the loop, the SpanRegexQuery will contain the regex 
pattern, but as soon as it hits rewrite, it is changed to the expanded query. 
This shouldn't happen.
On the next time through the loop, the original query no longer contains a regex 
pattern; instead it is the first iteration's rewritten query. Oddness.

I'll dig in and try to fix it for 2.9.
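The effect can be reproduced with a minimal self-contained model of a query whose first rewrite caches its expansion on the query object (`PatternQuery` and its method names are made up; `rewriteFixed` sketches the obvious remedy of never mutating the original query - in real Lucene code that would mean rewriting a clone):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RewriteBug {
    public static class PatternQuery {
        private final String pattern;
        private List<String> expanded; // buggy: cached by the first rewrite

        public PatternQuery(String pattern) { this.pattern = pattern; }

        // Models the reported behavior: the first rewrite stores the expansion
        // on the query itself, so every later rewrite returns that same result
        // regardless of which searcher's terms are passed in.
        public List<String> rewriteBuggy(List<String> indexTerms) {
            if (expanded == null) {
                expanded = new ArrayList<String>();
                for (String t : indexTerms) {
                    if (t.matches(pattern)) expanded.add(t);
                }
            }
            return expanded;
        }

        // Fix sketch: always expand against the searcher at hand and never
        // mutate the original query.
        public List<String> rewriteFixed(List<String> indexTerms) {
            List<String> out = new ArrayList<String>();
            for (String t : indexTerms) {
                if (t.matches(pattern)) out.add(t);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        List<String> idx1 = Arrays.asList("foo", "bar");
        List<String> idx2 = Arrays.asList("fob", "baz");

        PatternQuery q = new PatternQuery("fo.");
        System.out.println(q.rewriteBuggy(idx1)); // [foo]
        System.out.println(q.rewriteBuggy(idx2)); // [foo] - idx2's "fob" is lost

        PatternQuery q2 = new PatternQuery("fo.");
        System.out.println(q2.rewriteFixed(idx1)); // [foo]
        System.out.println(q2.rewriteFixed(idx2)); // [fob]
    }
}
```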

 SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
 --

 Key: LUCENE-1599
 URL: https://issues.apache.org/jira/browse/LUCENE-1599
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.4.1
 Environment: lucene-core 2.4.1, lucene-regex 2.4.1
Reporter: Billow Gao
 Fix For: 2.9

 Attachments: TestSpanRegexBug.java

   Original Estimate: 2h
  Remaining Estimate: 2h

 MultiSearcher uses:
 queries[i] = searchables[i].rewrite(original);
 to rewrite the query, and then combine to combine the results.
 But SpanRegexQuery's rewrite is different from the others:
 after you call it on the same query, it always returns the same rewritten 
 query.
 As a result, only the search on the first IndexSearcher works. All the others 
 use the first IndexSearcher's rewritten query, so many terms are missing and 
 the search returns unexpected results.
 Billow




[jira] Assigned: (LUCENE-1599) SpanRegexQuery and SpanNearQuery is not working with MultiSearcher

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1599:
---

Assignee: Mark Miller

 SpanRegexQuery and SpanNearQuery is not working with MultiSearcher
 --

 Key: LUCENE-1599
 URL: https://issues.apache.org/jira/browse/LUCENE-1599
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.4.1
 Environment: lucene-core 2.4.1, lucene-regex 2.4.1
Reporter: Billow Gao
Assignee: Mark Miller
 Fix For: 2.9

 Attachments: TestSpanRegexBug.java

   Original Estimate: 2h
  Remaining Estimate: 2h

 MultiSearcher is using:
 queries[i] = searchables[i].rewrite(original);
 to rewrite query and then use combine to combine them.
 But SpanRegexQuery's rewrite is different from the others.
 After you call it on the same query, it always returns the same rewritten 
 query.
 As a result, only the search on the first IndexSearcher works. All others are 
 using the first IndexSearcher's rewritten queries.
 So many terms are missing and unexpected results are returned.
 Billow

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719623#action_12719623
 ] 

Robert Muir commented on LUCENE-1606:
-

Uwe, ok.

Not to try to complicate things, but related to LUCENE-1689 and java 1.5, I 
could easily modify the Wildcard functionality here to work correctly with 
suppl. characters

This could be an alternative to fixing the WildcardQuery ? operator in core.


 Automaton Query/Filter (scalable regex)
 ---

 Key: LUCENE-1606
 URL: https://issues.apache.org/jira/browse/LUCENE-1606
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Robert Muir
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.0

 Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, > 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// 
 or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert 
 regular expressions into a DFA. Then, the filter enumerates terms in a 
 special way, by using the underlying state machine. Here is my short 
 description from the comments:
  The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:
   
  1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
  2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be downloaded 
 from http://www.brics.dk/automaton/ and is BSD-licensed.
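The seek-driven enumeration can be illustrated, under simplifying assumptions, with the URL use case above: a TreeSet stands in for the sorted term dictionary, and a fixed prefix list stands in for the DFA's accepting prefix regions (the real patch handles arbitrary regular expressions via BRICS):

```java
// Sketch of "seek instead of scan": rather than testing every term
// against the pattern, jump the sorted dictionary to each viable prefix.
import java.util.*;

public class SeekDemo {
    // One tailSet call = one seek into the sorted dictionary; we then read
    // terms only while they stay inside the accepting prefix region.
    static List<String> matching(TreeSet<String> terms, List<String> prefixes) {
        List<String> matches = new ArrayList<>();
        for (String prefix : prefixes) {
            for (String t : terms.tailSet(prefix)) {
                if (!t.startsWith(prefix)) break; // left the accepting region
                matches.add(t);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(List.of(
            "apple", "ftp://a.example", "http://x.example",
            "http://y.example", "zebra"));
        System.out.println(matching(terms, List.of("ftp://", "http://")));
        // [ftp://a.example, http://x.example, http://y.example]
    }
}
```

Terms outside the accepting regions ("apple", "zebra") are never touched, which is why this scales where a term-by-term regex test does not.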

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Updated: (LUCENE-1606) Automaton Query/Filter (scalable regex)

2009-06-15 Thread Michael McCandless
Why do you hate 1.6 Uwe?

Mike

On Mon, Jun 15, 2009 at 12:10 PM, Uwe Schindler (JIRA)j...@apache.org wrote:

     [ 
 https://issues.apache.org/jira/browse/LUCENE-1606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
  ]

 Uwe Schindler updated LUCENE-1606:
 --

    Fix Version/s:     (was: 2.9)
                   3.0

 I move this to 3.0 (and not 3.1), because it can be released together with 
 3.0 (contrib modules do not need to wait until 3.1).

 Robert: you could supply a patch with StringBuilder toString() variants and 
 all those @Override annotations re-enabled. And it works correctly with 1.5 (I am 
 working with 1.5 here locally - I hate 1.6...).

 Automaton Query/Filter (scalable regex)
 ---

                 Key: LUCENE-1606
                 URL: https://issues.apache.org/jira/browse/LUCENE-1606
             Project: Lucene - Java
          Issue Type: New Feature
          Components: contrib/*
            Reporter: Robert Muir
            Assignee: Uwe Schindler
            Priority: Minor
             Fix For: 3.0

         Attachments: automaton.patch, automatonMultiQuery.patch, 
 automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, 
 automatonWithWildCard.patch, automatonWithWildCard2.patch, LUCENE-1606.patch


 Attached is a patch for an AutomatonQuery/Filter (name can change if it's not 
 suitable).
 Whereas the out-of-box contrib RegexQuery is nice, I have some very large 
 indexes (100M+ unique tokens) where queries are quite slow, > 2 minutes, etc. 
 Additionally all of the existing RegexQuery implementations in Lucene are 
 really slow if there is no constant prefix. This implementation does not 
 depend upon constant prefix, and runs the same query in 640ms.
 Some use cases I envision:
  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant 
 (http:// or ftp://)
 The Filter uses the BRICS package (http://www.brics.dk/automaton/) to 
 convert regular expressions into a DFA. Then, the filter enumerates terms 
 in a special way, by using the underlying state machine. Here is my short 
 description from the comments:
      The algorithm here is pretty basic. Enumerate terms but instead of a 
 binary accept/reject do:

      1. Look at the portion that is OK (did not enter a reject state in the 
 DFA)
      2. Generate the next possible String and seek to that.
 the Query simply wraps the filter with ConstantScoreQuery.
 I did not include the automaton.jar inside the patch but it can be 
 downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.


 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Michael McCandless
I thought the primary goal of switching to AttributeSource (yes, the
name is very generic...) was to allow extensibility to what's created
per-Token, so that an app could add their own attrs without costly
subclassing/casting per Token, independent of other other things
adding their tokens, etc.

EG, trie* takes advantage of this extensibility by adding a
ShiftAttribute.

Subclassing Token in your app wasn't a good solution for various
reasons.

I do think the API is somewhat more cumbersome than before, and I
don't like that about it (consumability!).

But net/net I think the change is good, and it's one of the
baby steps for flexible indexing (bullet #11):

  http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard

Ie it addresses the flexibility during analysis.

I don't think anything was held back in this effort. Grant, are you
referring to LUCENE-1458?  That's held back simply because the only
person working on it (me) got distracted by other things to work on.

Flexible indexing (all of bullet #11) is a complex project, and we
need to break it into baby steps like this one.  We've already made
good progress on it: you can already make custom attrs and a custom
(but, package private) indexing chain if you want.

Next step is pluggable codecs for writing index files (LUCENE-1458),
and APIs for reading them (that generalize Terms/TermDoc/TermPositions
we have today).

Mike

On Sun, Jun 14, 2009 at 11:41 PM, Shai Ereraser...@gmail.com wrote:
 The old API is deprecated, and therefore when we release 2.9 there might
 be some people who'd think they should move away from it, to better prepare
 for 3.0 (while in fact this may not be the case). Also, we should make sure
 that when we remove all the deprecations, this will still exist (and
 therefore, why deprecate it now?), if we think this should indeed be kept
 around for at least a while longer.

 I personally am all for keeping it around (it will save me a huge
 refactoring of an Analyzer package I wrote), but I have to admit it's only
 because I've got quite comfortable with the existing API, and did not have
 the time to try the new one yet.

 Shai

 On Mon, Jun 15, 2009 at 3:49 AM, Mark Miller markrmil...@gmail.com wrote:

 Mark Miller wrote:

 I don't know how I feel about rolling the new token api back.

 I will say that I originally had no issue with it because I am very
 excited about Lucene-1458.

 At the same time though, I'm thinking Lucene-1458 is a very advanced
 issue that will likely be for really expert usage (though I can see benefits
 falling to general users).

 I'm slightly iffy about making an intuitive api much less intuitive for
 an expert future feature that hasn't fully materialized in Lucene yet. It
 almost seems like that fight should weigh towards general usage and standard
 users.

 I don't have a better proposal though, nor the time to consider it at the
 moment. I was just more curious if anyone else had any thoughts. I hadn't
 realized Grant had asked a similar question not long ago
 with no response. Not sure how to take that, but I'd think that would
 indicate less problems with people than more. On the other hand, you don't
 have to switch yet (with trunk) and we have yet to release it. I wonder how
 many non dev, every day users have really had to tussle with the new API
 yet. Not many people complaining too loudly at the moment though.

 Asking for a roll back seems a bit extreme without a little more support
 behind it than we have seen.

 - Mark

 PS

 I know you didnt ask for a rollback Grant - just kind of talking in a
 general manner. I see your point on getting the search side in, I'm just not
 sure I agree that it really matters if one hits before the other. Like Mike
 says, you don't
 have to switch to the new API yet.

 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1650) Small fix in CustomScoreQuery JavaDoc

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1650:
---

Assignee: Mark Miller

 Small fix in CustomScoreQuery JavaDoc
 -

 Key: LUCENE-1650
 URL: https://issues.apache.org/jira/browse/LUCENE-1650
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Javadocs
Affects Versions: 2.9, 3.0
Reporter: Simon Willnauer
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: customScoreQuery_CodeChange+JavaDoc.patch, 
 customScoreQuery_JavaDoc.patch


 I have fixed the javadoc for the "Modified Score" formula in CustomScoreQuery. 
 - Patch attached: customScoreQuery_JavaDoc.patch 
 I'm quite curious why the method:
  public float customScore(int doc, float subQueryScore, float valSrcScores[]) 
 calls public float customScore(int doc, float subQueryScore, float 
 valSrcScore) only in 2 of the 3 cases, which makes the choice to override 
 either one of the customScore methods dependent on the number of 
 ValueSourceQuerys passed to the constructor. I figure it would be more 
 consistent if it called the latter in all 3 cases.
 I also attached a patch which proposes a fix for that issue. The patch does 
 also include the JavaDoc issue mentioned above.
 - customScoreQuery_CodeChange+JavaDoc.patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller reassigned LUCENE-1583:
---

Assignee: Mark Miller

I guess I'll do this one.

You out there reading Paul Elschot? This look right to you? Any issues it might 
cause?

Else I guess I'll have to put on my thinking cap and figure it out myself.

 SpanOrQuery skipTo() doesn't always move forwards
 -

 Key: LUCENE-1583
 URL: https://issues.apache.org/jira/browse/LUCENE-1583
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1
Reporter: Moti Nisenson
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1583.patch


 In SpanOrQuery the skipTo() method is improperly implemented if the target 
 doc is less than or equal to the current doc, since skipTo() may not be 
 called for any of the clauses' spans:
 public boolean skipTo(int target) throws IOException {
   if (queue == null) {
 return initSpanQueue(target);
   }
  while (queue.size() != 0 && top().doc() < target) {
 if (top().skipTo(target)) {
   queue.adjustTop();
 } else {
   queue.pop();
 }
   }
   
   return queue.size() != 0;
 }
 This violates the correct behavior (as described in the Spans interface 
 documentation), that skipTo() should always move forwards, in other words the 
 correct implementation would be:
 public boolean skipTo(int target) throws IOException {
   if (queue == null) {
 return initSpanQueue(target);
   }
   boolean skipCalled = false;
  while (queue.size() != 0 && top().doc() < target) {
 if (top().skipTo(target)) {
   queue.adjustTop();
 } else {
   queue.pop();
 }
 skipCalled = true;
   }
   
   if (skipCalled) {
   return queue.size() != 0;
   }
   return next();
 }
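The contract at stake can be seen with a minimal stand-alone model (a plain int array instead of Lucene's Spans queue; names are illustrative, not the patch's):

```java
// Models "skipTo(target) must always move forwards", even when
// target <= the current doc.
public class SkipToDemo {
    static int[] docs = {3, 7, 12};

    // Buggy shape: when target <= current doc, the loop body never runs
    // and the cursor stays put, so the caller can loop forever.
    static int skipToBuggy(int pos, int target) {
        while (pos < docs.length && docs[pos] < target) pos++;
        return pos;
    }

    // Fixed shape (mirrors the proposed patch): if no skipping happened,
    // advance by one position, like calling next().
    static int skipToFixed(int pos, int target) {
        boolean skipCalled = false;
        while (pos < docs.length && docs[pos] < target) { pos++; skipCalled = true; }
        return skipCalled ? pos : pos + 1;
    }

    public static void main(String[] args) {
        // Cursor on doc 7 (index 1); skipTo(5) must still move forwards.
        System.out.println(skipToBuggy(1, 5)); // 1 -- stuck on the same doc
        System.out.println(skipToFixed(1, 5)); // 2 -- advanced to doc 12
    }
}
```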

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1688) Deprecating StopAnalyzer ENGLISH_STOP_WORDS - General replacement with an immutable Set

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719630#action_12719630
 ] 

Mark Miller commented on LUCENE-1688:
-

If no one else claims this for 2.9, I guess I'll do it.

 Deprecating StopAnalyzer ENGLISH_STOP_WORDS - General replacement with an 
 immutable Set
 ---

 Key: LUCENE-1688
 URL: https://issues.apache.org/jira/browse/LUCENE-1688
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 2.9, 3.0

 Attachments: StopWords.patch


 StopAnalyzer and StandardAnalyzer are using the static final array 
 ENGLISH_STOP_WORDS by default in various places. Internally this array is 
 converted into a mutable set which looks kind of weird to me. 
 I think the way to go is to deprecate all use of the static final array and 
 replace it with an immutable implementation of CharArraySet. Inside an 
 analyzer it does not make sense to have a mutable set anyway and we could 
 prevent set creation each time an analyzer is created. In the case of an 
 immutable set we won't have multithreading issues either. 
 in essence we get rid of a fair bit of converting string array to set code, 
 do not have a PUBLIC static reference to an array (which is mutable) and 
 reduce the overhead of analyzer creation.
 let me know what you think and I create a patch for it.
 simon
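A rough sketch of the idea using plain JDK collections (the real patch would presumably use an unmodifiable CharArraySet; all names here are illustrative):

```java
// Build the default stop words once as a shared, immutable Set instead of
// converting a mutable String[] every time an Analyzer is constructed.
import java.util.*;

public class StopWordsDemo {
    // Shared and immutable: safe to expose publicly, safe across threads.
    public static final Set<String> ENGLISH_STOP_WORDS_SET =
        Collections.unmodifiableSet(new HashSet<>(
            Arrays.asList("a", "an", "and", "are", "as", "at", "the")));

    public static void main(String[] args) {
        System.out.println(ENGLISH_STOP_WORDS_SET.contains("the")); // true
        try {
            ENGLISH_STOP_WORDS_SET.add("oops");
        } catch (UnsupportedOperationException e) {
            System.out.println("immutable"); // mutation is rejected
        }
    }
}
```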

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-973) Token of "" returns in CJKTokenizer + new TestCJKTokenizer

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719635#action_12719635
 ] 

Mark Miller commented on LUCENE-973:


You guys looking for this for 2.9?

If so, any volunteers? If I assign myself any more, I won't likely get to them 
all.

 Token of "" returns in CJKTokenizer + new TestCJKTokenizer
 ---

 Key: LUCENE-973
 URL: https://issues.apache.org/jira/browse/LUCENE-973
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3
Reporter: Toru Matsuzawa
Priority: Minor
 Fix For: 2.9

 Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, 
 LUCENE-973.patch, with-patch.jpg, without-patch.jpg


 The "" (empty) string is returned as a Token at the boundary between a 
 two-byte character and a one-byte character. 
 There is no problem in CJKAnalyzer. 
 When CJKTokenizer is used on its own, it becomes a problem. (Use it with 
 Solr etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Grant Ingersoll


On Jun 15, 2009, at 12:19 PM, Michael McCandless wrote:


I don't think anything was held back in this effort. Grant, are you
referring to LUCENE-1458?  That's held back simply because the only
person working on it (me) got distracted by other things to work on.


I'm sorry, I didn't mean to imply Michael B. was holding back on the  
work.  The patch has always felt half done to me because what's the  
point of having all of these attributes in the index if you don't have  
any way of searching them, thus I was struck by the need to get it in  
prior to making it available in search. I realize it's complex, but  
here we are forcing people to upgrade for some future, long-term goal.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness

2009-06-15 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1630:
---

Attachment: LUCENE-1630.patch

ok - let's start iterating on the patch. Anyone volunteer to accept it (and 
then I'll update CHANGES via ?)?

Patch includes:
* QueryWeight with the new scorer(IndexReader, scoreDocsInOrder, topScorer) and 
scoresOutOfOrder().
* Added methods to Searcher (this breaks back-compat, but it's already broken 
here because of 1575).
* BooleanWeight now creates BS or BS2 up front, and therefore BS2's code is 
simplified.

All tests pass.

 Mating Collector and Scorer on doc Id orderness
 ---

 Key: LUCENE-1630
 URL: https://issues.apache.org/jira/browse/LUCENE-1630
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9

 Attachments: LUCENE-1630.patch


 This is a spin off of LUCENE-1593. This issue proposes to expose appropriate 
 API on Scorer and Collector such that one can create an optimized Collector 
 based on a given Scorer's doc-id orderness and vice versa. Copied from 
 LUCENE-1593, here is the list of changes:
 # Deprecate Weight and create QueryWeight (abstract class) with a new 
 scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) 
 method. QueryWeight implements Weight, while score(reader) calls 
 score(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) 
 is defined abstract.
 #* Also add QueryWeightWrapper to wrap a given Weight implementation. This 
 one will also be deprecated, as well as package-private.
 #* Add to Query variants of createWeight and weight which return QueryWeight. 
 For now, I prefer to add a default impl which wraps the Weight variant 
 instead of overriding in all Query extensions, and in 3.0 when we remove the 
 Weight variants - override in all extending classes.
 # Add to Scorer isOutOfOrder with a default to false, and override in BS to 
 true.
 # Modify BooleanWeight to extend QueryWeight and implement the new scorer 
 method to return BS2 or BS based on the number of required scorers and 
 setAllowOutOfOrder.
 # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns 
 true/false.
 #* Use it in IndexSearcher.search methods, that accept a Collector, in order 
 to create the appropriate Scorer, using the new QueryWeight.
 #* Provide a static create method to TFC and TSDC which accept this as an 
 argument and creates the proper instance.
 #* Wherever we create a Collector (TSDC or TFC), always ask for out-of-order 
 Scorer and check on the resulting Scorer isOutOfOrder(), so that we can 
 create the optimized Collector instance.
 # Modify IndexSearcher to use all of the above logic.
 The only class I'm worried about, and would like to verify with you, is 
 Searchable. If we want to deprecate all the search methods on IndexSearcher, 
 Searcher and Searchable which accept Weight and add new ones which accept 
 QueryWeight, we must do the following:
 * Deprecate Searchable in favor of Searcher.
 * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) 
 break back-compat and add them as abstract (like we've done with the new 
 Collector method) or (2) add them with a default impl to call the Weight 
 versions, documenting these will become abstract in 3.0.
 * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend 
 Searcher. That's the part I'm a little bit worried about - Searchable 
 implements java.rmi.Remote, which means there could be an implementation out 
 there which implements Searchable and extends something different than 
 UnicastRemoteObject, like Activeable. I think there is very small chance this 
 has actually happened, but would like to confirm with you guys first.
 * Add a deprecated, package-private, SearchableWrapper which extends Searcher 
 and delegates all calls to the Searchable member.
 * Deprecate all uses of Searchable and add Searcher instead, defaulting the 
 old ones to use SearchableWrapper.
 * Make all the necessary changes to IndexSearcher, MultiSearcher etc. 
 regarding overriding these new methods.
 One other optimization that was discussed in LUCENE-1593 is to expose a 
 topScorer() API (on Weight) which returns a Scorer that its score(Collector) 
 will be called, and additionally add a start() method to DISI. That will 
 allow Scorers to initialize either on start() or score(Collector). This was 
 proposed mainly because of BS and BS2 which check if they are initialized in 
 every call to next(), skipTo() and score(). Personally I prefer to see that 
 in a separate issue, following that one (as it might add methods to 
 QueryWeight).
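 A toy model of the collector/scorer handshake being proposed (hypothetical names and signatures, not the patch's actual API):

```java
// The searcher asks the collector whether out-of-order docs are OK,
// passes that to the weight, then inspects what kind of scorer it got.
interface ModelCollector { boolean acceptsDocsOutOfOrder(); }
interface ModelScorer { boolean isOutOfOrder(); }

class ModelWeight {
    ModelScorer scorer(boolean scoreDocsInOrder) {
        // e.g. pick BooleanScorer (out-of-order) vs BooleanScorer2 (in-order)
        boolean outOfOrder = !scoreDocsInOrder;
        return () -> outOfOrder;
    }
}

public class HandshakeDemo {
    public static void main(String[] args) {
        ModelCollector collector = () -> true; // tolerates out-of-order docs
        ModelScorer scorer = new ModelWeight()
            .scorer(!collector.acceptsDocsOutOfOrder());
        // The searcher can now check isOutOfOrder() and build the
        // matching collector variant.
        System.out.println(scorer.isOutOfOrder()); // true
    }
}
```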

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the 

[jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-15 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719639#action_12719639
 ] 

Mark Miller commented on LUCENE-1486:
-

Should this go in contrib rather than core? That seems to have been the 
approach so far, any reason to vary it up here?

Well, actually, looks like I see the multi field parser in core. Makes sense to 
put subclasses there I guess.

You think this is ready to commit Mark? If so, I should be able to review it 
(unless you want to commit it yourself).

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
    checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies 
  are OK in phrases
    checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
  works
    checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
  works.
    
    checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a 
  phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases 
  is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries 
  inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2009-06-15 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-1486:


Attachment: LUCENE-1486.patch

Reformatted to lucene formatting, removed author tag, removed a couple unused 
fields, changed to patch format

Tests don't pass because it doesn't work quite correctly with the new 
constant-score multi-term queries yet.

 Wildcards, ORs etc inside Phrase queries
 

 Key: LUCENE-1486
 URL: https://issues.apache.org/jira/browse/LUCENE-1486
 Project: Lucene - Java
  Issue Type: Improvement
  Components: QueryParser
Affects Versions: 2.4
Reporter: Mark Harwood
Priority: Minor
 Fix For: 2.9

 Attachments: ComplexPhraseQueryParser.java, LUCENE-1486.patch, 
 TestComplexPhraseQuery.java


 An extension to the default QueryParser that overrides the parsing of 
 PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
 The implementation feels a little hacky - this is arguably better handled in 
 QueryParser itself. This works as a proof of concept  for much of the query 
 parser syntax. Examples from the Junit test include:
    checkMatches("\"j*   smyth~\"", "1,2"); // wildcards and fuzzies 
  are OK in phrases
    checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
  works
    checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
  works.
    
    checkBadQuery("\"jo*  id:1 smith\""); // mixing fields in a 
  phrase is bad
    checkBadQuery("\"jo* \"smith\" \""); // phrases inside phrases 
  is bad
    checkBadQuery("\"jo* [sma TO smZ]\" \""); // range queries 
  inside phrases not supported
 Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-15 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719653#action_12719653
 ] 

Richard Marr commented on LUCENE-1690:
--

Sounds reasonable although that'll take a little longer for me to do. I'll have 
a think about it.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.
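 A sketch of the kind of opt-in cache described, tied to the life of one object (hypothetical shape; MoreLikeThis' real API differs, and the lookup function stands in for something like IndexReader.docFreq):

```java
// Memoize term-statistics lookups so the index is consulted at most
// once per term for the lifetime of this object.
import java.util.*;
import java.util.function.ToIntFunction;

public class CachedTermStats {
    private final ToIntFunction<String> docFreqLookup; // the expensive call
    private final Map<String, Integer> cache = new HashMap<>();

    public CachedTermStats(ToIntFunction<String> docFreqLookup) {
        this.docFreqLookup = docFreqLookup;
    }

    public int docFreq(String term) {
        // computeIfAbsent: hit the underlying index only on first sight
        return cache.computeIfAbsent(term, docFreqLookup::applyAsInt);
    }

    public static void main(String[] args) {
        int[] lookups = {0};
        CachedTermStats stats =
            new CachedTermStats(t -> { lookups[0]++; return t.length(); });
        stats.docFreq("lucene");
        stats.docFreq("lucene"); // served from the cache
        System.out.println(lookups[0]); // 1
    }
}
```

The trade-off, as the description implies, is that cached statistics can go stale as the index changes, which is why the caching is opt-in.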

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Yonik Seeley
The high-level description of the new API looks good (being able to
add arbitrary properties to tokens), unfortunately, I've never had the
time to try and use it and give any constructive feedback.

As far as difficulty of use, I assume this only applies to
implementing your own TokenFilter? It seems like most standard users
would be just stringing together existing TokenFilters to create
custom Analyzers?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Mark Miller

Yonik Seeley wrote:

The high-level description of the new API looks good (being able to
add arbitrary properties to tokens), unfortunately, I've never had the
time to try and use it and give any constructive feedback.

As far as difficulty of use, I assume this only applies to
implementing your own TokenFilter? It seems like most standard users
would be just stringing together existing TokenFilters to create
custom Analyzers?

-Yonik
http://www.lucidimagination.com

  
True - it's the implementation. And just trying to understand what's going 
on the first time you see it.


It's not particularly difficult, but it's also not obvious like the 
previous API was. As a user, I would ask why that is so, and frankly the 
answer wouldn't do much for me (as a user).


I don't know if most 'standard' users implement their own or not. I will 
say, and perhaps I was in a special situation, I was writing them and 
modifying them almost as soon
as I started playing with Lucene. And even when I wasn't, I needed to 
understand the code to understand some of the complexities that could 
occur, and thankfully, that was breezy to do.


Right now, if you told me to go convert all of Solr to the new API you 
would hear a mighty groan.


As Lucene's contrib hasn't been fully converted either (and it's been 
quite some time now), someone has probably heard that groan before.


--
- Mark

http://www.lucidimagination.com







Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir

 As Lucene's contrib hasn't been fully converted either (and its been quite
 some time now), someone has probably heard that groan before.

hope this doesn't sound like a complaint, but in my opinion this is
because many do not have any tests.
I converted a few of these and it's just grunt work, but if there are no
tests, it's impossible to verify the conversion is correct.

-- 
Robert Muir
rcm...@gmail.com




[jira] Created: (LUCENE-1692) Contrib analyzers need tests

2009-06-15 Thread Robert Muir (JIRA)
Contrib analyzers need tests


 Key: LUCENE-1692
 URL: https://issues.apache.org/jira/browse/LUCENE-1692
 Project: Lucene - Java
  Issue Type: Test
  Components: contrib/analyzers
Reporter: Robert Muir


The analyzers in contrib need tests, preferably ones that test the behavior of 
all the Token 'attributes' involved (offsets, type, etc) and not just what they 
do with token text.

This way, they can be converted to the new api without breakage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Commented: (LUCENE-1313) Realtime Search

2009-06-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719665#action_12719665
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

Just wanted to give an update on this: I'm running the unit
tests with flushToRAM=true, and the ones that fail are (mostly)
tests that look for files when they're now in RAM (temporarily)
and the like. I'm not sure what to do with these tests: 1)
ignore them (it's kind of hard to not run specific methods, I think),
or 2) conditionalize them to run only if flushToRAM=false. 

 Realtime Search
 ---

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.






Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Grant Ingersoll


On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:


I'd be happy to discuss other API proposals that anybody brings up  
here, that have the same advantages and are more intuitive. We could  
also beef up the documentation and give a better example about how  
to convert a stream/filter from the old to the new API; a  
constructive suggestion that Uwe made at the ApacheCon.


More questions:

1. What about Highlighter and MoreLikeThis?  They have not been  
converted.  Also, what are they going to do if the attributes they  
need are not available?  Caveat emptor?
2. Same for TermVectors.  What if the user specifies with positions  
and offsets, but the analyzer doesn't produce them?  Caveat emptor?  
(BTW, this is also true for the new omit TF stuff)
3. Also, what about the case where one might have attributes that are  
meant for downstream TokenFilters, but not necessarily for indexing?   
Offsets and type come to mind.  Is it the case now that those  
attributes are not automatically added to the index?   If they are  
ignored now, what if I want to add them?  I admit, I'm having a hard  
time finding the code that specifically loops over the Attributes.  I  
recall seeing it, but can no longer find it.



Also, can we add something like an AttributeTermQuery?  Seems like it  
could work similar to the BoostingTermQuery.


I'm sure more will come to me.

-Grant

[jira] Updated: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1673:
--

Attachment: LUCENE-1673.patch

Updated patch:
- now with extended JavaDocs
- additional tests for float/doubles
- additional tests for equals/hashcode
- changes.txt
- lots of reformatting

The only open point is the name of TrieUtils; any ideas for the package and/or 
name?

Changes to FieldCache and SortField to always require a parser (see discussion 
with Yonik) will be a new issue, to be opened after this.

 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch


 TrieRange was iterated many times and seems stable now (LUCENE-1470, 
 LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to 
 its default FieldTypes (SOLR-940) and if possible I want to move it to core 
 before release of 2.9.
 Before this can be done, there are some things to think about:
 # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how 
 should they be called in core? I would suggest to leave it as it is. On the 
 other hand, if this keeps our only numeric query implementation, we could 
 call it LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below, here 
 are problems). Same for the TokenStreams and Filters.
 # Maybe the pairs of classes for indexing and searching should be moved into 
 one class: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The 
 problem here: ctors must be able to pass int, long, double, float as range 
 parameters. For the end user, mixing these 4 types in one class is hard to 
 handle. If somebody forgets to add a L to a long, it suddenly instantiates a 
 int version of range query, hitting no results and so on. Same with other 
 types. Maybe accept java.lang.Number as parameter (because nullable for 
 half-open bounds) and one enum for the type.
 # TrieUtils move into o.a.l.util? or document or?
 # Move TokenStreams into o.a.l.analysis, ShiftAttribute into 
 o.a.l.analysis.tokenattributes? Somewhere else?
 # If we rename the classes, should Solr stay with Trie (because there are 
 different impls)?
 # Maybe add a subclass of AbstractField, that automatically creates these 
 TokenStreams and omits norms/tf per default for easier addition to Document 
 instances?
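The "forgotten L suffix" pitfall described in point 2 is ordinary Java overload resolution - the literal's type, not the field's type, selects the overload - and is easy to demonstrate. The class and method names here are hypothetical, not the actual proposed API:

```java
// Demonstrates why mixing int and long ctors/factories in one query class is
// risky: overload resolution on the literal's type decides which "query" gets
// built, silently, even if the field was indexed with a different type.
public class OverloadPitfall {
    public static String newRangeQuery(int lower, int upper)   { return "int-range"; }
    public static String newRangeQuery(long lower, long upper) { return "long-range"; }

    public static void main(String[] args) {
        // Field was indexed as long, but the caller forgot the L suffix:
        System.out.println(newRangeQuery(0, 1000));   // int-range  (wrong type, zero hits)
        System.out.println(newRangeQuery(0L, 1000L)); // long-range (what was intended)
    }
}
```

Accepting java.lang.Number plus an explicit type enum, as the description suggests, sidesteps this because the intended type is stated rather than inferred.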






Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Mark Miller

Robert Muir wrote:

As Lucene's contrib hasn't been fully converted either (and its been quite
some time now), someone has probably heard that groan before.



hope this doesn't sound like a complaint,
Complaints are fine in any case. Every now and then, it might cause a 
little rant from me or something, but please don't let that dissuade you :)
Who doesn't like to rant and rave now and then? As long as thoughts and 
opinions are coming out in a non-negative way (which certainly includes 
complaints),

I think it's all good.

 but in my opinion this is
because many do not have any tests.
I converted a few of these and its just grunt work but if there are no
tests, its impossible to verify the conversion is correct.
  
Thanks for pointing that out. We probably get lazy with tests, 
especially in contrib, and this brings up a good point - we should 
probably push
for tests, or write them before committing, more often. Sometimes I'm sure 
it just comes down to a tradeoff though - no resources at the time,
the class looked clear cut, and it was just contrib anyway. But then 
here we are ... a healthy dose of grunt work is bad enough even when you 
have tests to check it.


--
- Mark

http://www.lucidimagination.com







Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
Mark, I created an issue for this.

I just think, you know, converting an analyzer to the new API is really
not that bad.

Reverse engineering what one of them does is not necessarily obvious,
and is completely unrelated, but necessary if they are to be migrated.

I'd be willing to assist with some of this, but I don't want to really
work the issue if it's gonna be a waste of time at the end of the
day...

On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com wrote:
 Robert Muir wrote:

 As Lucene's contrib hasn't been fully converted either (and its been
 quite
 some time now), someone has probably heard that groan before.


 hope this doesn't sound like a complaint,

 Complaints are fine in any case. Every now and then, it might cause a little
 rant from me or something, but please don't let that dissuade you :)
 Who doesnt like to rant and rave now and then. As long as thoughts and
 opinions are coming out in a non negative way (which certainly includes
 complaints),
 I think its all good.

  but in my opinion this is
 because many do not have any tests.
 I converted a few of these and its just grunt work but if there are no
 tests, its impossible to verify the conversion is correct.


 Thanks for pointing that out. We probably get lazy with tests, especially in
 contrib, and this brings up a good point - we should probably push
 for tests or write them before committing more often. Sometimes I'm sure it
 just comes downto a tradeoff though - no resources at the time,
 the class looked clear cut, and it was just contrib anyway. But then here we
 are ... a healthy dose of grunt work is bad enough when you have tests to
 check it.

 --
 - Mark

 http://www.lucidimagination.com









-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719689#action_12719689
 ] 

Michael McCandless commented on LUCENE-1673:


bq. So one using new code must always specify the parser when using 
SortField.INT (SortField.AUTO is already deprecated, so no problem). 

This will apply to int/long/float/double as well right?  How would you
do this (require a parser for only numeric sorts) back-compatibly?  EG,
the others (String, DOC, etc.) don't require a parser.

We could alternatively make NumericSortField (subclassing SortField),
that just uses the right parser?

Did you think about / decide against making a NumericField (that'd set
the right tokenStream itself)?

Other questions/comments:

  * Could we change ShiftAttribute -> NumericShiftAttribute?

  * How about oal.util.NumericUtils instead of TrieUtils?

  * Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to
make it clear that its range checking is by Term sort order?

  * Should we support byte/short for trie indexed fields as well?
(Since SortField, FieldCache support these numeric types too...).








[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719692#action_12719692
 ] 

Michael McCandless commented on LUCENE-1673:


bq. The only open point is the name of TrieUtils, any idea for package and/or 
name?

I think NumericUtils?  (I'd like the naming to be consistent w/
NumericRangeQuery, NumericTokenStream, since it's very much a public
API, ie users must interact directly with it to get their SortField
(maybe) and FieldCache parser).

Leaving it in util seems OK, since it's used by analysis & searching.








[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719699#action_12719699
 ] 

Yonik Seeley commented on LUCENE-1673:
--

bq. This will apply to int/long/float/double as well right? How would you do 
this (require a parser for only numeric sorts) back-compatibly? EG, the others 
(String, DOC, etc.) don't require a parser.

Allow passing parser==null to get the default?
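Yonik's suggestion - treat a null parser as "use the default for this type" - keeps old call sites working while letting new code pass an explicit (e.g. trie-decoding) parser. A minimal sketch of that back-compat pattern, with hypothetical names:

```java
// Sketch of the "null parser means default" back-compat idea for SortField:
// old callers pass null (or never see the parameter) and get today's
// behavior; trie fields pass their own parser explicitly.
public class ParserDefaulting {
    interface IntParser { int parseInt(String term); }

    static final IntParser DEFAULT_INT_PARSER = new IntParser() {
        public int parseInt(String term) { return Integer.parseInt(term); }
    };

    public static int parse(String term, IntParser parser) {
        IntParser p = (parser != null) ? parser : DEFAULT_INT_PARSER; // null -> default
        return p.parseInt(term);
    }

    public static void main(String[] args) {
        System.out.println(parse("42", null));           // default parser: 42
        System.out.println(parse("2a", new IntParser() { // custom parser, e.g. hex-encoded terms
            public int parseInt(String term) { return Integer.parseInt(term, 16); }
        }));                                             // 42
    }
}
```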

bq. We could alternatively make NumericSortField (subclassing SortField), that 
just uses the right parser?

A factory method TrieUtils.getSortField() could also return the right SortField.









Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Mark Miller

Robert Muir wrote:

Mark, I created an issue for this.
  

Thanks Robert, great idea.

I just think you know, converting an analyzer to the new api is really
not that bad.
  
I don't either. I'm really just complaining about the initial 
readability. Once you know what's up, it's not too much different. I just 
have found myself
having to re-figure out what's up (a short task, to be sure) over again 
after I leave it for a while. With the old one, everything was just kind 
of immediately self-evident.


That makes me think new users might be a little more confused when they 
first meet it. I'm not a new user though, so it's only a guess really.

reverse engineering what one of them does is not necessarily obvious,
and is completely unrelated but necessary if they are to be migrated.

I'd be willing to assist with some of this but I don't want to really
work the issue if its gonna be a waste of time at the end of the
day...
  
The chances of this issue being fully reverted are so remote that I 
really wouldn't let that stop you ...

On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com wrote:
  

Robert Muir wrote:


As Lucene's contrib hasn't been fully converted either (and its been
quite
some time now), someone has probably heard that groan before.



hope this doesn't sound like a complaint,
  

Complaints are fine in any case. Every now and then, it might cause a little
rant from me or something, but please don't let that dissuade you :)
Who doesnt like to rant and rave now and then. As long as thoughts and
opinions are coming out in a non negative way (which certainly includes
complaints),
I think its all good.


 but in my opinion this is
because many do not have any tests.
I converted a few of these and its just grunt work but if there are no
tests, its impossible to verify the conversion is correct.

  

Thanks for pointing that out. We probably get lazy with tests, especially in
contrib, and this brings up a good point - we should probably push
for tests or write them before committing more often. Sometimes I'm sure it
just comes downto a tradeoff though - no resources at the time,
the class looked clear cut, and it was just contrib anyway. But then here we
are ... a healthy dose of grunt work is bad enough when you have tests to
check it.

--
- Mark

http://www.lucidimagination.com











  



--
- Mark

http://www.lucidimagination.com







RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
 Also, what about the case where one might have attributes that are meant
 for downstream TokenFilters, but not necessarily for indexing?  Offsets 
 and type come to mind.  Is it the case now that those attributes are not 
 automatically added to the index?   If they are ignored now, what if I 
 want to add them?  I admit, I'm having a hard time finding the code that 
 specifically loops over the Attributes.  I recall seeing it, but can no 
 longer find it.

There is a new Attribute called ShiftAttribute (or NumericShiftAttribute,
once trie range is moved to core). This attribute contains the shifted-away
bits from the prefix-encoded value during trie indexing. The idea is to e.g.
have TokenFilters that add additional payloads or other attributes to trie
values, but only do this for specific precisions. In the future, it may also
be interesting to automatically add this attribute to the index.
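To make the shift concrete: trie indexing emits, besides the full-precision term, extra terms with the lowest bits shifted away, and the shift count is what the attribute exposes per token. A self-contained sketch of that precision-stepping idea follows - illustrative only, not Lucene's actual prefix-coded term format:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of trie-style indexing: for one long value, emit one term per
// precision step, each tagged with how many low bits were shifted away.
// Range queries can then match a few coarse (high-shift) terms instead of
// enumerating many full-precision ones.
public class TrieSketch {
    public static List<String> trieTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            // Each term records its shift, so a downstream filter (the
            // ShiftAttribute consumer) knows which precision level it is at.
            terms.add("shift=" + shift + ",value=" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // With precisionStep=16, a long produces 4 terms: shifts 0, 16, 32, 48.
        System.out.println(trieTerms(1234567890L, 16));
    }
}
```

A filter that only wants to add payloads at, say, full precision would act only on tokens whose shift is 0 - which is exactly the per-precision filtering use case described above.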

Maybe we should add a read/store method to attributes that adds an
attribute to the Posting using an IndexOutput/IndexInput (like the
serialization methods).

Uwe





Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Yonik Seeley
On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindleru...@thetaphi.de wrote:
 There is a new Attribute called ShiftAttribute (or NumericShiftAttribute),
 when trie range is moved to core. This attribute contains the shifted-away
 bits from the prefix encoded value during trie indexing.

I was wondering about this...
To make use of ShiftAttribute, you need to understand the trie
encoding scheme itself.  If you understood that, you'd be able to look
at the actual token value if you were interested in what shift was
used.  So it's redundant, has a runtime cost, it's not currently used
anywhere, and it's not useful to fields other than Trie.  Perhaps it
shouldn't exist (yet)?

-Yonik
http://www.lucidimagination.com




[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719726#action_12719726
 ] 

Uwe Schindler commented on LUCENE-1673:
---

{quote}
Mike: This will apply to int/long/float/double as well right? How would you
do this (require a parser for only numeric sorts) back-compatibly? EG,
the others (String, DOC, etc.) don't require a parser.

Yonik: Allow passing parser==null to get the default?

Mike: We could alternatively make NumericSortField (subclassing SortField),
that just uses the right parser?

Yonik: A factory method TrieUtils.getSortField() could also return the right
SortField.
{quote}

I want to move this into a new issue afterwards; I will open one.

Nevertheless, I would like to remove emphasis from NumericUtils (which is in 
reality a helper class). So I want to make the current human-readable numeric 
parsers public and also add the trie parsers to FieldCache.

The SortField factory is then the only part really needed in NumericUtils - but 
not really: the parser is a singleton, works for all trie fields, and could also 
live somewhere else (or nowhere at all, if the parsers all stay in FieldCache).

bq. Should we support byte/short for trie indexed fields as well? (Since 
SortField, FieldCache support these numeric types too...). 

For bytes, TrieRange is not very interesting; for shorts, maybe, but I would 
subsume them during indexing as simple integers. You could not speed up 
searching, but you could limit index size a little bit.

bq. Could we change ShiftAttribute -> NumericShiftAttribute?

No problem, I will do this. The link from the TokenStream javadocs to this is 
also missing; see also my reply on java-dev to Grant's mail.

bq. Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it 
clear that its range checking is by Term sort order?

We can do this and deprecate the old one, but I added a note to the Javadocs 
(see patch). I would do this outside of this issue.

bq. How about oal.util.NumericUtils instead of TrieUtils?

That was my first idea, too. What to do with o.a.l.doc.NumberTools 
(deprecate?). And also update contrib/spatial to use NumericUtils instead of 
the copied and not really good NumberUtils from Solr (Yonik said it was written 
at a very early stage, and is not effective with UTF-8 encoding and the 
TermEnum positioning with the term prefixes). It would be an index-format 
change for spatial, but as the code was not yet released (in Lucene), the 
Lucene version should not use NumberUtils at all.



[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719729#action_12719729
 ] 

Uwe Schindler commented on LUCENE-1673:
---

bq. Did you think about / decide against making a NumericField (that'd set the 
right tokenStream itself)?

The problem currently is:
- Field is final, so I must extend AbstractField. But some methods of 
Document return Field and not AbstractField.
- NumericField would only work for indexing; when retrieving from the index 
(stored fields), it would change to Field.

Maybe we should postpone this until after the index-specific schemas and so 
on. Or document that it can only be used for indexing.

By the way: how do you like the factories in NumericRangeQuery, and the 
setValue methods in NumericTokenStream, which work like StringBuffer.append()? 
This makes it really easy to index.
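The StringBuffer.append() comparison refers to setters that mutate and return this, so one stream instance can be re-filled for each document instead of being re-allocated. A generic sketch of that fluent-reuse pattern (the real NumericTokenStream signatures may differ):

```java
// Fluent-setter sketch: setValue() mutates the (reusable) stream and returns
// this, StringBuffer.append()-style, so indexing code can write something like
//   writer.addDocument(doc(stream.setLongValue(next)));
// without allocating a new stream per document.
public class FluentStream {
    private long value;
    private int valueSize; // bits of precision: 64 for long, 32 for int

    public FluentStream setLongValue(long v) { value = v; valueSize = 64; return this; }
    public FluentStream setIntValue(int v)   { value = v; valueSize = 32; return this; }

    public long value()    { return value; }
    public int valueSize() { return valueSize; }

    public static void main(String[] args) {
        FluentStream stream = new FluentStream();
        // Same instance, re-filled for each "document":
        System.out.println(stream.setLongValue(1234L).value()); // 1234
        System.out.println(stream.setIntValue(7).valueSize());  // 32
    }
}
```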

The only good thing about NumericField would be the possibility of 
automatically disabling TF and norms by default when indexing.




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
 On Mon, Jun 15, 2009 at 3:00 PM, Uwe Schindleru...@thetaphi.de wrote:
  There is a new Attribute called ShiftAttribute (or
 NumericShiftAttribute),
  when trie range is moved to core. This attribute contains the shifted-
 away
  bits from the prefix encoded value during trie indexing.
 
 I was wondering about this
 To make use of ShiftAttribute, you need to understand the trie
 encoding scheme itself.  If you understood that, you'd be able to look
 at the actual token value if you were interested in what shift was
 used.  So it's redundant, has a runtime cost, it's not currently used
 anywhere, and it's not useful to fields other than Trie.  Perhaps it
 shouldn't exist (yet)?

The idea was to make the indexing process controllable. You were the one
who asked, e.g., for the possibility to add payloads to trie fields and so on.
Using the shift attribute, you have full control over the token types. OK,
it's a little bit redundant; you could also use the TypeAttribute (which is
already used to mark highest-precision and lower-precision values).

One question about the whole TokenStream: originally we discussed
Payloads/Positions and TrieRange. If that is implemented in future versions,
the question is how I should set the position increments/offsets in the token
stream to create a position of 0 in the index. I do not understand the
indexing process here, especially the deprecated boolean flag about something
negative (not sure what the name was). Should I set the position increment to
0 for all trie fields by default? And what about the positionIncrementGap when
indexing more than one field? None of this is really clear to me. The position
would be simpler to implement, but doing it with an attribute that is indexed
together with the other attributes, like a payload, would be the ideal
solution for future versions of TrieRange.

(Maybe we could also use the Offset attribute for the highest precision
bits)
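For reference, the stacking behavior in question - a position increment of 0 places a token at the same position as the previous one, the way synonyms are indexed - can be illustrated with a small standalone sketch of the bookkeeping (not Lucene's indexing code):

```java
import java.util.Arrays;

// Sketch: absolute positions are the running sum of position increments.
// A lower-precision trie term given increment 0 would land at the same
// position as the full-precision term before it (like a stacked synonym).
public class PositionSketch {
    public static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int pos = -1; // position counter starts before the first token
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        // full-precision term (incr 1), three lower-precision terms (incr 0),
        // then the next field value (incr 1)
        int[] incr = {1, 0, 0, 0, 1};
        System.out.println(Arrays.toString(absolutePositions(incr))); // [0, 0, 0, 0, 1]
    }
}
```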

Uwe





RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
 If you understood that, you'd be able to look
 at the actual token value if you were interested in what shift was
 used.  So it's redundant, has a runtime cost, it's not currently used
 anywhere, and it's not useful to fields other than Trie.  Perhaps it
 shouldn't exist (yet)?

You are right - one could also decode the shift value from the first char of
the token. I think I will remove the ShiftAttribute and only set the token
type to mark highest vs. lower precision. That way, one could easily add a
payload to the real numeric value using a TokenFilter.
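A minimal sketch of the decode-from-first-char idea (a simplified scheme for illustration, not the exact trie encoding Lucene uses):

```java
// Simplified sketch: store the shift in the token's first char, followed by
// the shifted value. decodeShift() then recovers the shift from the token
// itself, without needing a separate ShiftAttribute.
public class ShiftSketch {
    static final char SHIFT_START = 0x20; // offset added to the shift value

    public static String encode(long value, int shift) {
        // first char carries the shift; the rest is the shifted value
        return (char) (SHIFT_START + shift) + Long.toHexString(value >>> shift);
    }

    public static int decodeShift(String token) {
        return token.charAt(0) - SHIFT_START;
    }

    public static void main(String[] args) {
        String token = encode(0x1234L, 8);
        System.out.println(decodeShift(token)); // prints 8
    }
}
```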

Uwe





Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
Mark, I'll see if I can get tests produced for some of those analyzers.

as a new user of the new api myself, I think I can safely say the most
confusing thing about it is having the old deprecated API mixed in the
javadocs with it :)

On Mon, Jun 15, 2009 at 2:53 PM, Mark Millermarkrmil...@gmail.com wrote:
 Robert Muir wrote:

 Mark, I created an issue for this.


 Thanks Robert, great idea.

 I just think you know, converting an analyzer to the new api is really
 not that bad.


 I don't either. I'm really just complaining about the initial readability.
 Once you know whats up, its not too much different. I just have found myself
 having to refigure out whats up (a short task to be sure) over again after I
 leave it for a while. With the old one, everything was just kind of
 immediately self evident.

 That makes me think new users might be a little more confused when they
 first meet again. I'm not a new user though, so its only a guess really.

 reverse engineering what one of them does is not necessarily obvious,
 and is completely unrelated but necessary if they are to be migrated.

 I'd be willing to assist with some of this but I don't want to really
 work the issue if its gonna be a waste of time at the end of the
 day...


 The chances of this issue being fully reverted are so remote that I really
 wouldnt let that stop you ...

 On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com wrote:


 Robert Muir wrote:


 As Lucene's contrib hasn't been fully converted either (and its been
 quite
 some time now), someone has probably heard that groan before.



 hope this doesn't sound like a complaint,


 Complaints are fine in any case. Every now and then, it might cause a
 little
 rant from me or something, but please don't let that dissuade you :)
 Who doesnt like to rant and rave now and then. As long as thoughts and
 opinions are coming out in a non negative way (which certainly includes
 complaints),
 I think its all good.


  but in my opinion this is
 because many do not have any tests.
 I converted a few of these and its just grunt work but if there are no
 tests, its impossible to verify the conversion is correct.



 Thanks for pointing that out. We probably get lazy with tests, especially in
 contrib, and this brings up a good point - we should probably push for tests
 or write them before committing more often. Sometimes I'm sure it just comes
 down to a tradeoff though - no resources at the time, the class looked clear
 cut, and it was just contrib anyway. But then here we are ... a healthy dose
 of grunt work is bad enough when you have tests to check it.

 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org









 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org





-- 
Robert Muir
rcm...@gmail.com




[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719738#action_12719738
 ] 

Uwe Schindler commented on LUCENE-1673:
---

I think I'll remove the ShiftAttribute completely; it's really useless. Maybe 
I'll add a getShift() method to NumericUtils that returns the shift value of a 
Token/String. See the java-dev mailing list discussion with Yonik.

 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch








Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
let me try some slightly more constructive feedback:

new user looks at TokenStream javadocs:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
Immediately they see deprecation notices, text in red with the word
experimental, and warnings in bold - the whole thing is scary!
Due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
in a bad way, and it's probably the most important method to a new
user!
There's also a stray bold tag gone haywire somewhere, possibly .incrementToken().

From a technical perspective, the documentation is excellent! But for
a new user unfamiliar with Lucene, it's unclear exactly what steps to
take: use the scary red experimental API or the old deprecated one?

There's also some fairly advanced stuff such as .captureState and
.restoreState that might be better in a different place.

Finally, the .setUseNewAPI() and .setUseNewAPIDefault() methods are confusing
(one is static, one is not), especially because the docs state that all
streams and filters in one chain must use the same API. Is there a way to
simplify this?

I'm really terrible with javadocs myself, but perhaps we can come up
with a way to improve the presentation... maybe that will make the
difference.


-- 
Robert Muir
rcm...@gmail.com




RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
 there's also a stray bold tag gone haywire somewhere, possibly
 .incrementToken()

I fixed this. It was getting on my nerves the whole day while I was writing
the javadocs for NumericTokenStream...

Uwe





Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Mark Miller
Some great points - especially the dilemma between a deprecated API and a 
new experimental one subject to change. A bit of a rock and a hard place 
for a new user.


Perhaps we should add a little note with some guidance.


- Mark


Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Michael Busch

This is excellent feedback, Robert!

I agree this is confusing, especially having a deprecated API alongside an 
experimental one that replaces it. We need to change that.
And I don't like the *useNewAPI*() methods either. I spent a lot of time 
thinking about backwards compatibility for this API. It's tricky to do 
without sacrificing performance. In API patches I find myself spending 
more time on backwards compatibility than on the actual new feature! :(


I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions:
a) if we want to release this new API with 2.9,
b) if yes to a), if we want to remove the old API in 3.0?

If yes to a) and no to b), then we'll have to support both APIs for a 
presumably very long time, so we then need to have a better solution for 
the backwards-compatibility here.


-Michael


RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
By the way, there is an empty de subdir in SVN inside analysis. Can it
be removed?

And in the tests: test/o/a/l/index/store is wrongly placed. The class
inside should be in test/o/a/l/store. Should I move it?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Monday, June 15, 2009 10:18 PM
 To: java-dev@lucene.apache.org
 Subject: RE: New Token API was Re: Payloads and TrieRangeQuery
 
  there's also a stray bold tag gone haywire somewhere, possibly
  .incrementToken()
 
 I fixed this. This was going me on my nerves the whole day when I wrote
 javadocs for NumericTokenStream...
 
 Uwe
 
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org






RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
 And I don't like the *useNewAPI*() methods either. I spent a lot of time 
 thinking about backwards compatibility for this API. It's tricky to do 
 without sacrificing performance. In API patches I find myself spending 
 more time for backwards-compatibility than for the actual new feature! :(

 I'll try to think about how to simplify this confusing old/new API mix.

One solution to the useNewAPI problem would be to change AttributeSource so
that it returns classes that implement interfaces (as you proposed some weeks
ago). The good old Token class would then not need to be deprecated; it could
simply implement all 4 interfaces. AttributeSource would then have to
implement a registry of which classes implement which interfaces. So if
somebody wants a TermAttribute, he always gets the Token; in all other cases
the default could be a default *Impl class.

In this case, next() could simply return this Token (which is all 4
attributes in one). The reusableToken is simply thrown away in the deprecated
API; the reusable Token comes from the AttributeSource and is per-instance.

Is this an idea?
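The shape of that proposal can be sketched with plain Java (the interface and class names are illustrative, not the real Lucene attribute interfaces):

```java
// Sketch: one "Token" class implementing several attribute interfaces, plus
// a minimal registry that hands out the same instance for every interface it
// implements -- the pattern proposed for Token/AttributeSource above.
import java.util.HashMap;
import java.util.Map;

public class AttributeSketch {
    interface TermAttr { String term(); }
    interface OffsetAttr { int startOffset(); }

    // The "Token" implements both interfaces at once
    static class Token implements TermAttr, OffsetAttr {
        String term = "abc";
        int start = 0;
        public String term() { return term; }
        public int startOffset() { return start; }
    }

    static class Registry {
        private final Map<Class<?>, Object> attrs = new HashMap<>();

        void register(Object o) { attrs.put(o.getClass(), o); }

        // Return a registered instance implementing iface, else the fallback
        <T> T get(Class<T> iface, T fallback) {
            for (Object o : attrs.values()) {
                if (iface.isInstance(o)) return iface.cast(o);
            }
            attrs.put(iface, fallback);
            return fallback;
        }
    }

    public static void main(String[] args) {
        Registry reg = new Registry();
        Token token = new Token();
        reg.register(token);
        // Asking for either interface yields the same Token instance:
        TermAttr t = reg.get(TermAttr.class, null);
        OffsetAttr o = reg.get(OffsetAttr.class, null);
        System.out.println(t == token && o == token); // prints true
    }
}
```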

Uwe





Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
Michael, again I am terrible with such things myself...

Personally I am impressed that you kept the back compat. Even if you
don't change any code at all, I think some reformatting of the javadocs
might make the situation a lot friendlier. I just listed everything
that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use
the new API is that, since all filters etc. on the same chain must use
the same API, it's discouraging if the contrib stuff doesn't support
the new API; it makes me want to just stick with the old one so
everything will work. So I think getting the contribs onto the new API is
really important, otherwise no one will want to use it.
On Mon, Jun 15, 2009 at 4:21 PM, Michael Buschbusch...@gmail.com wrote:
 This is excellent feedback, Robert!

 I agree this is confusing; especially having a deprecated API and only a
 experimental one that replaces the old one. We need to change that.
 And I don't like the *useNewAPI*() methods either. I spent a lot of time
 thinking about backwards compatibility for this API. It's tricky to do
 without sacrificing performance. In API patches I find myself spending more
 time for backwards-compatibility than for the actual new feature! :(

 I'll try to think about how to simplify this confusing old/new API mix.

 However, we need to make the decisions:
 a) if we want to release this new API with 2.9,
 b) if yes to a), if we want to remove the old API in 3.0?

 If yes to a) and no to b), then we'll have to support both APIs for a
 presumably very long time, so we then need to have a better solution for the
 backwards-compatibility here.

 -Michael

 On 6/15/09 1:09 PM, Robert Muir wrote:

 let me try some slightly more constructive feedback:

 new user looks at TokenStream javadocs:
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
 immediately they see deprecated, text in red with the words
 experimental, warnings in bold, the whole thing is scary!
 due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
 in a bad way, and its probably the most important method to a new
 user!
 there's also a stray bold tag gone haywire somewhere, possibly
 .incrementToken()

 from a technical perspective, the documentation is excellent! but for
 a new user unfamiliar with lucene, its unclear exactly what steps to
 take: use the scary red experimental api or the old deprecated one?

 theres also some fairly advanced stuff such as .captureState and
 .restoreState that might be better in a different place.

 finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing
 [one is static, one is not], especially because it states all streams
 and filters in one chain must use the same API, is there a way to
 simplify this?

 i'm really terrible with javadocs myself, but perhaps we can come up
 with a way to improve the presentation... maybe that will make the
 difference.

 On Mon, Jun 15, 2009 at 3:45 PM, Robert Muirrcm...@gmail.com wrote:


 Mark, I'll see if I can get tests produced for some of those analyzers.

 as a new user of the new api myself, I think I can safely say the most
 confusing thing about it is having the old deprecated API mixed in the
 javadocs with it :)

 On Mon, Jun 15, 2009 at 2:53 PM, Mark Millermarkrmil...@gmail.com wrote:


 Robert Muir wrote:


 Mark, I created an issue for this.



 Thanks Robert, great idea.


 I just think you know, converting an analyzer to the new api is really
 not that bad.



 I don't either. I'm really just complaining about the initial readability.
 Once you know whats up, its not too much different. I just have found myself
 having to refigure out whats up (a short task to be sure) over again after I
 leave it for a while. With the old one, everything was just kind of
 immediately self evident.

 That makes me think new users might be a little more confused when they
 first meet again. I'm not a new user though, so its only a guess really.


 reverse engineering what one of them does is not necessarily obvious,
 and is completely unrelated but necessary if they are to be migrated.

 I'd be willing to assist with some of this but I don't want to really
 work the issue if its gonna be a waste of time at the end of the
 day...



 The chances of this issue being fully reverted are so remote that I really
wouldn't let that stop you ...


 On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com wrote:



 Robert Muir wrote:



 As Lucene's contrib hasn't been fully converted either (and its been
 quite
 some time now), someone has probably heard that groan before.




 hope this doesn't sound like a complaint,



 Complaints are fine in any case. Every now and then, it might cause a
 little
 rant from me or something, but please don't let that dissuade you :)
 Who doesnt like to rant and rave now and then. As long as thoughts and
 opinions are coming out in a non negative way (which certainly includes
 

[jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719761#action_12719761
 ] 

Michael McCandless commented on LUCENE-1673:


OK let's open a new issue for how to best integrate/default SortField
and FieldCache.

bq. Nevertheless, I would like to remove emphasis from NumericUtils (which is 
in reality a helper class).

+1

bq. For bytes, TrieRange is not very interesting, for shorts, maybe, but I 
would subsume them during indexing as simple integers. You could not speed up 
searching, but limit index size a little bit.

Well, a RangeQuery on a plain text byte or short field requires
sneakiness (knowing that you must zero-pad; keeping
document.NumberUtils around); I think it's best if NumericXXX in
Lucene handles all of java's native numeric types.  And you want a
byte[] or short[] out of FieldCache (to not waste RAM having to
upgrade to an int[]).
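That zero-padding sneakiness can be shown outside Lucene; here is a minimal plain-Java sketch (not Lucene code; `pad` is an illustrative helper, not an actual API):

```java
// Why a plain-text numeric field needs zero-padding: Term-order range
// checks compare strings lexicographically, not numerically.
public class ZeroPadDemo {
    // Pad a non-negative number to a fixed width so that lexicographic
    // (String) order agrees with numeric order.
    static String pad(long value, int width) {
        return String.format("%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Without padding, String order disagrees with numeric order:
        System.out.println("9".compareTo("10") > 0);             // true: "9" sorts after "10"
        // With padding, the two orders agree:
        System.out.println(pad(9, 5).compareTo(pad(10, 5)) < 0); // true: "00009" < "00010"
    }
}
```

This is the kind of trap that a NumericXXX family (and a byte[]/short[] FieldCache) would hide from users.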

We can do this under the (a?) new issue too...

bq. The SortField factory is then the only parts really needed in NumericUtils, 
but not really. The parser is a singleton, works for all trie fields and could 
also live somewhere else or nowhere at all, if the Parsers all stay in 
FieldCache.

(Under a new issue, but...) I'm not really a fan of leaving the parser
in FieldCache and expecting a user to know to create the SortField
with that parser.  NumericSortField would make it much more consumable
to direct Lucene users.

{quote}
bq. Can we rename RangeQuery -> TextRangeQuery (TermRangeQuery), to make it 
clear that its range checking is by Term sort order.

We can do this and deprecate the old one, but I added a note to Javadocs (see 
patch). I would do this outside of this issue.
{quote}

OK.

One benefit of a rename is it's a reminder to users on upgrading to
consider whether they should in fact switch to NumericRangeQuery.

{quote}
bq. How about oal.util.NumericUtils instead of TrieUtils?

That was my first idea, too. What to do with o.a.l.doc.NumberTools 
(deprecate?). And also update contrib/spatial to use NumericUtils instead of 
the copied and not really good NumberUtils from Solr (Yonik said it was written 
at a very early stage, and is not effective with UTF-8 encoding and the 
TermEnum positioning with the term prefixes). It would be an index-format change 
for spatial, but as the code was not yet released (in Lucene), the Lucene 
version should not use NumberUtils at all.
{quote}

+1 on both (if we can add byte/short to trie*); we should do this
before 2.9 since we can still change locallucene's format.  Maybe open
a new issue for that, too?  We're forking off new 2.9 issues left and
right here!!

bq. I think I'll remove the ShiftAttribute completely; it's really useless. 
Maybe I'll add a getShift() method to NumericUtils that returns the shift value 
of a Token/String. See the java-dev thread with Yonik.

OK

{quote}
bq. Did you think about / decide against making a NumericField (that'd set the 
right tokenStream itself)?

Field is final and so I must extend AbstractField. But some methods of Document 
return Field and not AbstractField.
{quote}

Can we just un-final Field?

{quote}
NumericField would only work for indexing, but when retrieving from index 
(stored fields), it would change to Field.

Maybe we should move this after the index-specific schemas and so on. Or 
document that it can only be used for indexing.
{quote}

True, but we already have such challenges between index vs search
time Document; documenting it seems fine.

bq. By the way: How do you like the factories in NumericRangeQuery and the 
setValue methods, working like StringBuffer.append() in NumericTokenStream? 
This makes it really easy to index.

I think this is great!  I like that you return NumericTokenStream :)

bq. The only good thing about NumericField would be the possibility to 
automatically disable TF and Norms by default when indexing.

Consumability (good defaults)!  (And also not having to know that you
must go and get a tokenStream from NumericUtils).


 Move TrieRange to core
 --

 Key: LUCENE-1673
 URL: https://issues.apache.org/jira/browse/LUCENE-1673
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 2.9

 Attachments: LUCENE-1673.patch, LUCENE-1673.patch


 TrieRange was iterated many times and seems stable now (LUCENE-1470, 
 LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to 
 its default FieldTypes (SOLR-940) and if possible I want to move it to core 
 before release of 2.9.
 Before this can be done, there are some things to think about:
 # There are now classes called LongTrieRangeQuery, IntTrieRangeQuery, how 
 should they be called in core? I would suggest to leave it as 

Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Michael Busch
I have implemented most of that actually (the interface part and Token 
implementing all of them).


The problem is a paradigm change with the new API: the assumption is 
that there is always only one single instance of an Attribute. With the 
old API, it is recommended to reuse the passed-in token, but you don't 
have to, you can also return a new one with every call of next().


Now with this change the indexer classes should only know about the 
interfaces; it shouldn't know Token anymore, which seems fine when Token 
implements all those interfaces. BUT, since there can be more than one 
instance of Token, the indexer would have to call getAttribute() for all 
Attributes it needs after each call of next(). I haven't measured how 
expensive that is, but it seems like a severe performance hit.


That's basically the main reason why the backwards compatibility is 
ensured in such a goofy way right now.


 Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote:

And I don't like the *useNewAPI*() methods either. I spent a lot of time
thinking about backwards compatibility for this API. It's tricky to do
without sacrificing performance. In API patches I find myself spending
more time for backwards-compatibility than for the actual new feature! :(

I'll try to think about how to simplify this confusing old/new API mix.
 


One solution to fix this useNewAPI problem would be to change the
AttributeSource in a way, to return classes that implement interfaces (as
you proposed some weeks ago). The good old Token class then does not need to
be deprecated; it could simply implement all 4 interfaces. AttributeSource
then must implement a registry of which classes implement which interfaces. So
if somebody wants a TermAttribute, he always gets the Token. In all other
cases, the default could be a *Impl default class.

In this case, next() could simply return this Token (which is all 4
attributes). The passed-in reusable Token is simply thrown away in the
deprecated API; the reusable Token comes from the AttributeSource and is
per-instance.

Is this an idea?

Uwe
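What such a single-instance registry could look like, as a rough self-contained sketch (the class and method names mirror the discussion but are illustrative, not the actual Lucene API; the real attribute interfaces have more methods):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: Token implements the attribute interfaces, and the
// AttributeSource registry hands out its single Token instance whenever
// one of those interfaces is requested.
interface TermAttribute { String term(); void setTerm(String t); }
interface OffsetAttribute { /* startOffset(), endOffset(), ... */ }

class Token implements TermAttribute, OffsetAttribute {
    private String term = "";
    public String term() { return term; }
    public void setTerm(String t) { term = t; }
}

class AttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();
    private final Token token = new Token(); // one reusable per-instance Token

    AttributeSource() {
        // Registry: requests for any interface Token implements
        // resolve to the same Token instance.
        attributes.put(TermAttribute.class, token);
        attributes.put(OffsetAttribute.class, token);
    }

    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> attClass) {
        return (T) attributes.get(attClass);
    }
}
```

Because both lookups return the identical object, a consumer (e.g. the indexer) could cache its interface references once instead of re-fetching attributes after every call of next().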


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


   




Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Michael McCandless
On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindleru...@thetaphi.de wrote:

 And, in tests: test/o/a/l/index/store is somehow wrongly placed. The class
 inside should be in test/o/a/l/store. Should I move?

Please do!

Mike




Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Michael Busch
I agree. It's my fault, the task of changing the contribs (LUCENE-1460) 
is assigned to me for a while now - I just haven't found the time to do 
it yet.


It's great that you started the work on that! I'll try to review the 
patch in the next couple of days and help with fixing the remaining 
ones. I'd like to get the AttributeSource improvements patch out first. 
I'll try that tonight.


 Michael

On 6/15/09 1:35 PM, Robert Muir wrote:

Michael, again I am terrible with such things myself...

Personally I am impressed that you have the back compat, even if you
don't change any code at all I think some reformatting of javadocs
might make the situation a lot friendlier. I just listed everything
that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use
the new API is that since all filters, etc on the same chain must use
the same API, its discouraging if all the contrib stuff doesn't
support the new API, it makes me want to just stick with the old so
everything will work. So I think contribs being on the new API is
really important otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Buschbusch...@gmail.com  wrote:
   

This is excellent feedback, Robert!

I agree this is confusing; especially having a deprecated API and only an
experimental one that replaces the old one. We need to change that.
And I don't like the *useNewAPI*() methods either. I spent a lot of time
thinking about backwards compatibility for this API. It's tricky to do
without sacrificing performance. In API patches I find myself spending more
time for backwards-compatibility than for the actual new feature! :(

I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions:
a) if we want to release this new API with 2.9,
b) if yes to a), if we want to remove the old API in 3.0?

If yes to a) and no to b), then we'll have to support both APIs for a
presumably very long time, so we then need to have a better solution for the
backwards-compatibility here.

-Michael

On 6/15/09 1:09 PM, Robert Muir wrote:

let me try some slightly more constructive feedback:

new user looks at TokenStream javadocs:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
immediately they see deprecated, text in red with the words
experimental, warnings in bold, the whole thing is scary!
due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
in a bad way, and its probably the most important method to a new
user!
there's also a stray bold tag gone haywire somewhere, possibly
.incrementToken()

from a technical perspective, the documentation is excellent! but for
a new user unfamiliar with lucene, its unclear exactly what steps to
take: use the scary red experimental api or the old deprecated one?

theres also some fairly advanced stuff such as .captureState and
.restoreState that might be better in a different place.

finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing
[one is static, one is not], especially because it states all streams
and filters in one chain must use the same API, is there a way to
simplify this?

i'm really terrible with javadocs myself, but perhaps we can come up
with a way to improve the presentation... maybe that will make the
difference.

On Mon, Jun 15, 2009 at 3:45 PM, Robert Muirrcm...@gmail.com  wrote:


Mark, I'll see if I can get tests produced for some of those analyzers.

as a new user of the new api myself, I think I can safely say the most
confusing thing about it is having the old deprecated API mixed in the
javadocs with it :)

On Mon, Jun 15, 2009 at 2:53 PM, Mark Millermarkrmil...@gmail.com  wrote:


Robert Muir wrote:


Mark, I created an issue for this.



Thanks Robert, great idea.


I just think you know, converting an analyzer to the new api is really
not that bad.



I don't either. I'm really just complaining about the initial readability.
Once you know whats up, its not too much different. I just have found myself
having to refigure out whats up (a short task to be sure) over again after I
leave it for a while. With the old one, everything was just kind of
immediately self evident.

That makes me think new users might be a little more confused when they
first meet again. I'm not a new user though, so its only a guess really.


reverse engineering what one of them does is not necessarily obvious,
and is completely unrelated but necessary if they are to be migrated.

I'd be willing to assist with some of this but I don't want to really
work the issue if its gonna be a waste of time at the end of the
day...



The chances of this issue being fully reverted are so remote that I really
wouldn't let that stop you ...


On Mon, Jun 15, 2009 at 1:55 PM, Mark Millermarkrmil...@gmail.com  wrote:



Robert Muir wrote:



As Lucene's contrib hasn't been fully converted either (and its been
quite
some time now), someone has 

Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Mark Miller

Michael McCandless (JIRA) wrote:

 We're forking off new 2.9 issues left and
right here!!
  

Evil :) You guys are like a small team working against me.

We still have 29+- issues to wrap up though, so probably plenty of time.

I hope we can set a rough target date soon though - it really feels like 
we could drag for quite a bit longer if we wanted to.

Remember the last time we started to push for 2.9 in Dec/Jan :)

--
- Mark

http://www.lucidimagination.com







Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are
definitely just as important, only because I'm not familiar with that
territory.

But I think I can figure out what the various language analyzers are
trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Buschbusch...@gmail.com wrote:
 I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is
 assigned to me for a while now - I just haven't found the time to do it yet.

 It's great that you started the work on that! I'll try to review the patch
 in the next couple of days and help with fixing the remaining ones. I'd like
 to get the AttributeSource improvements patch out first. I'll try that
 tonight.

  Michael

 On 6/15/09 1:35 PM, Robert Muir wrote:

 Michael, again I am terrible with such things myself...

 Personally I am impressed that you have the back compat, even if you
 don't change any code at all I think some reformatting of javadocs
 might make the situation a lot friendlier. I just listed everything
 that came to my mind immediately.

 I guess I will also mention that one of the reasons I might not use
 the new API is that since all filters, etc on the same chain must use
 the same API, its discouraging if all the contrib stuff doesn't
 support the new API, it makes me want to just stick with the old so
 everything will work. So I think contribs being on the new API is
 really important otherwise no one will want to use it.

 On Mon, Jun 15, 2009 at 4:21 PM, Michael Buschbusch...@gmail.com wrote:


 This is excellent feedback, Robert!

 I agree this is confusing; especially having a deprecated API and only an
 experimental one that replaces the old one. We need to change that.
 And I don't like the *useNewAPI*() methods either. I spent a lot of time
 thinking about backwards compatibility for this API. It's tricky to do
 without sacrificing performance. In API patches I find myself spending more
 time for backwards-compatibility than for the actual new feature! :(

 I'll try to think about how to simplify this confusing old/new API mix.

 However, we need to make the decisions:
 a) if we want to release this new API with 2.9,
 b) if yes to a), if we want to remove the old API in 3.0?

 If yes to a) and no to b), then we'll have to support both APIs for a
 presumably very long time, so we then need to have a better solution for the
 backwards-compatibility here.

 -Michael

 On 6/15/09 1:09 PM, Robert Muir wrote:

 let me try some slightly more constructive feedback:

 new user looks at TokenStream javadocs:
 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
 immediately they see deprecated, text in red with the words
 experimental, warnings in bold, the whole thing is scary!
 due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
 in a bad way, and its probably the most important method to a new
 user!
 there's also a stray bold tag gone haywire somewhere, possibly
 .incrementToken()

 from a technical perspective, the documentation is excellent! but for
 a new user unfamiliar with lucene, its unclear exactly what steps to
 take: use the scary red experimental api or the old deprecated one?

 theres also some fairly advanced stuff such as .captureState and
 .restoreState that might be better in a different place.

 finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing
 [one is static, one is not], especially because it states all streams
 and filters in one chain must use the same API, is there a way to
 simplify this?

 i'm really terrible with javadocs myself, but perhaps we can come up
 with a way to improve the presentation... maybe that will make the
 difference.

 On Mon, Jun 15, 2009 at 3:45 PM, Robert Muirrcm...@gmail.com wrote:


 Mark, I'll see if I can get tests produced for some of those analyzers.

 as a new user of the new api myself, I think I can safely say the most
 confusing thing about it is having the old deprecated API mixed in the
 javadocs with it :)

 On Mon, Jun 15, 2009 at 2:53 PM, Mark Millermarkrmil...@gmail.com wrote:


 Robert Muir wrote:


 Mark, I created an issue for this.



 Thanks Robert, great idea.


 I just think you know, converting an analyzer to the new api is really
 not that bad.



 I don't either. I'm really just complaining about the initial readability.
 Once you know whats up, its not too much different. I just have found myself
 having to refigure out whats up (a short task to be sure) over again after I
 leave it for a while. With the old one, everything was just kind of
 immediately self evident.

 That makes me think new users might be a little more confused when they
 first meet again. I'm not a new user though, so its only a guess really.


 reverse engineering what one of them does is not necessarily obvious,
 and is completely unrelated but 

RE: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
Maybe change the deprecation wrapper around next() and next(Token) [the
default impl of incrementToken()] to check if the retrieved token is not
identical to the attribute, and then just copy the contents to the
instance Token? This would be a slowdown, but only for very rare
TokenStreams that did not reuse the token before (and were slow before, too).
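A self-contained sketch of that wrapper idea (all names are illustrative, not the actual Lucene classes, and the Token fields are trimmed down to a minimum):

```java
// Hypothetical compatibility wrapper: if a legacy next() returns a Token
// other than the stream's own instance, copy its contents into the
// instance token so the new API's single-instance assumption still holds.
class Token {
    String term = "";
    int startOffset, endOffset;

    void copyFrom(Token other) {
        this.term = other.term;
        this.startOffset = other.startOffset;
        this.endOffset = other.endOffset;
    }
}

class CompatTokenStream {
    private final Token instanceToken = new Token();

    // Simulates the old API: streams were recommended to reuse the
    // passed-in token, but could legally return a fresh one. This demo
    // stream deliberately does NOT reuse it (the rare slow path).
    protected Token next(Token reusable) {
        Token fresh = new Token();
        fresh.term = "example";
        return fresh;
    }

    // New-style incrementToken() layered on top of the deprecated next().
    public boolean incrementToken() {
        Token returned = next(instanceToken);
        if (returned == null) return false;       // end of stream
        if (returned != instanceToken) {
            instanceToken.copyFrom(returned);     // the proposed copy step
        }
        return true;
    }

    public Token token() { return instanceToken; }
}
```

Only streams that ignore the reusable token pay the copy; well-behaved reusing streams take the `returned == instanceToken` fast path.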

 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  _  

From: Michael Busch [mailto:busch...@gmail.com] 
Sent: Monday, June 15, 2009 10:39 PM
To: java-dev@lucene.apache.org
Subject: Re: New Token API was Re: Payloads and TrieRangeQuery

 

I have implemented most of that actually (the interface part and Token
implementing all of them).

The problem is a paradigm change with the new API: the assumption is that
there is always only one single instance of an Attribute. With the old API,
it is recommended to reuse the passed-in token, but you don't have to, you
can also return a new one with every call of next().

Now with this change the indexer classes should only know about the
interfaces; it shouldn't know Token anymore, which seems fine when Token
implements all those interfaces. BUT, since there can be more than one
instance of Token, the indexer would have to call getAttribute() for all
Attributes it needs after each call of next(). I haven't measured how
expensive that is, but it seems like a severe performance hit.

That's basically the main reason why the backwards compatibility is ensured
in such a goofy way right now.

 Michael

On 6/15/09 1:28 PM, Uwe Schindler wrote: 

And I don't like the *useNewAPI*() methods either. I spent a lot of time 
thinking about backwards compatibility for this API. It's tricky to do 
without sacrificing performance. In API patches I find myself spending 
more time for backwards-compatibility than for the actual new feature! :(
 
I'll try to think about how to simplify this confusing old/new API mix.


 
One solution to fix this useNewAPI problem would be to change the
AttributeSource in a way, to return classes that implement interfaces (as
you proposed some weeks ago). The good old Token class then does not need to
be deprecated; it could simply implement all 4 interfaces. AttributeSource
then must implement a registry of which classes implement which interfaces. So
if somebody wants a TermAttribute, he always gets the Token. In all other
cases, the default could be a *Impl default class.
 
In this case, next() could simply return this Token (which is all 4
attributes). The passed-in reusable Token is simply thrown away in the
deprecated API; the reusable Token comes from the AttributeSource and is
per-instance.
 
Is this an idea?
 
Uwe
 
 
 
 
  

 



[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-06-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719766#action_12719766
 ] 

Michael McCandless commented on LUCENE-1541:


Uwe, what's the plan on this issue...?  Should it wait until 3.1?

 Trie range - make trie range indexing more flexible
 ---

 Key: LUCENE-1541
 URL: https://issues.apache.org/jira/browse/LUCENE-1541
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Ning Li
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1541.patch, LUCENE-1541.patch


 In the current trie range implementation, a single precision step is 
 specified. With a large precision step (say 8), a value is indexed in fewer 
 terms (8) but the number of terms for a range can be large. With a small 
 precision step (say 2), the number of terms for a range is smaller but a 
 value is indexed in more terms (32).
 We want to add an option that different precision steps can be set for 
 different precisions. An expert can use this option to keep the number of 
 terms for a range small and at the same time index a value in a small number 
 of terms. See the discussion in LUCENE-1470 that results in this issue.
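The tradeoff in the description above can be sketched with simple arithmetic (illustrative code, not part of TrieRange; only the terms-per-value count is computed here, since the per-range term count depends on the concrete range):

```java
// Back-of-the-envelope view of the precision-step tradeoff for a
// 64-bit long value: one indexed term per precision level.
public class PrecisionStepDemo {
    static int termsPerValue(int bits, int precisionStep) {
        // ceil(bits / precisionStep) precision levels
        return (bits + precisionStep - 1) / precisionStep;
    }

    public static void main(String[] args) {
        // Large step: few terms per value, but ranges may expand to many terms.
        System.out.println(termsPerValue(64, 8)); // 8
        // Small step: ranges need fewer terms, but each value costs more terms.
        System.out.println(termsPerValue(64, 2)); // 32
    }
}
```

A per-precision step schedule, as proposed in this issue, would let an expert pick a point between these two extremes for each level.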

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Michael McCandless
On Mon, Jun 15, 2009 at 4:42 PM, Mark Millermarkrmil...@gmail.com wrote:

 Remember the last time we started to push for 2.9 in Dec/Jan :)

Yes this is very much on my mind too!!

So maybe, it's a race between the trie* group of issues, and the other 28 ;)

Mike




[jira] Commented: (LUCENE-1313) Near Realtime Search

2009-06-15 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719767#action_12719767
 ] 

Jason Rutherglen commented on LUCENE-1313:
--

TestThreadedOptimize is throwing an ensureContiguousMerge
exception. I think this highlights that the change to merging
all ram segments into a single primaryDir segment can sometimes
lead to choosing segments that are non-contiguous? I'm not sure
of the best way to handle this.

 Near Realtime Search
 

 Key: LUCENE-1313
 URL: https://issues.apache.org/jira/browse/LUCENE-1313
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, 
 LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, 
 lucene-1313.patch, lucene-1313.patch


 Enable near realtime search in Lucene without external
 dependencies. When RAM NRT is enabled, the implementation adds a
 RAMDirectory to IndexWriter. Flushes go to the ramdir unless
 there is no available space. Merges are completed in the ram
 dir until there is no more available ram. 
 IW.optimize and IW.commit flush the ramdir to the primary
 directory, all other operations try to keep segments in ram
 until there is no more space.






[jira] Commented: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-06-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719769#action_12719769
 ] 

Uwe Schindler commented on LUCENE-1541:
---

I see no real use in it; it does not affect query performance, only index size. 
Maybe we should move it to 3.1 until I have some time, but the Payload thing is 
more interesting and maybe this can be combined.

 Trie range - make trie range indexing more flexible
 ---

 Key: LUCENE-1541
 URL: https://issues.apache.org/jira/browse/LUCENE-1541
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Ning Li
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1541.patch, LUCENE-1541.patch


 In the current trie range implementation, a single precision step is 
 specified. With a large precision step (say 8), a value is indexed in fewer 
 terms (8) but the number of terms for a range can be large. With a small 
 precision step (say 2), the number of terms for a range is smaller but a 
 value is indexed in more terms (32).
 We want to add an option that different precision steps can be set for 
 different precisions. An expert can use this option to keep the number of 
 terms for a range small and at the same time index a value in a small number 
 of terms. See the discussion in LUCENE-1470 that results in this issue.






RE: [jira] Commented: (LUCENE-1673) Move TrieRange to core

2009-06-15 Thread Uwe Schindler
Sorry,

I think these new issues may also be in 3.1 (not all), but I want to have
this trie stuff with a clean API before 2.9 and not deprecate parts of it
again in 3.1, shortly after release :-(

These issues are no hard changes; it's just a little bit of API cleanup you can
do in your free time :-] -- I know I am a little bit late, but I am working
hard on this :)

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, June 15, 2009 10:51 PM
 To: java-dev@lucene.apache.org
 Subject: Re: [jira] Commented: (LUCENE-1673) Move TrieRange to core
 
 On Mon, Jun 15, 2009 at 4:42 PM, Mark Millermarkrmil...@gmail.com wrote:
 
  Remember the last time we started to push for 2.9 in Dec/Jan :)
 
 Yes this is very much on my mind too!!
 
 So maybe, it's a race between the trie* group of issues, and the other
 28 ;)
 
 Mike
 






[jira] Updated: (LUCENE-1541) Trie range - make trie range indexing more flexible

2009-06-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1541:
---

Fix Version/s: (was: 2.9)
   3.1

OK, moving out to 3.1.

 Trie range - make trie range indexing more flexible
 ---

 Key: LUCENE-1541
 URL: https://issues.apache.org/jira/browse/LUCENE-1541
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Ning Li
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.1

 Attachments: LUCENE-1541.patch, LUCENE-1541.patch


 In the current trie range implementation, a single precision step is 
 specified. With a large precision step (say 8), a value is indexed in fewer 
 terms (8) but the number of terms for a range can be large. With a small 
 precision step (say 2), the number of terms for a range is smaller but a 
 value is indexed in more terms (32).
 We want to add an option that different precision steps can be set for 
 different precisions. An expert can use this option to keep the number of 
 terms for a range small and at the same time index a value in a small number 
 of terms. See the discussion in LUCENE-1470 that results in this issue.






Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Mark Miller
I may do the Highlighter. It's annoying though - I'll have to break back 
compat because Token is part of the public API (Fragmenter, etc).


Robert Muir wrote:

Michael OK, I plan on adding some tests for the analyzers that don't have any.

I didn't try to migrate things such as highlighter, which are
definitely just as important, only because I'm not familiar with that
territory.

But I think I can figure out what the various language analyzers are
trying to do and add tests / convert the remaining ones.

On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch busch...@gmail.com wrote:
  

I agree. It's my fault, the task of changing the contribs (LUCENE-1460) is
assigned to me for a while now - I just haven't found the time to do it yet.

It's great that you started the work on that! I'll try to review the patch
in the next couple of days and help with fixing the remaining ones. I'd like
to get the AttributeSource improvements patch out first. I'll try that
tonight.

 Michael

On 6/15/09 1:35 PM, Robert Muir wrote:

Michael, again I am terrible with such things myself...

Personally I am impressed that you have the back compat, even if you
don't change any code at all I think some reformatting of javadocs
might make the situation a lot friendlier. I just listed everything
that came to my mind immediately.

I guess I will also mention that one of the reasons I might not use
the new API is that since all filters, etc on the same chain must use
the same API, it's discouraging if all the contrib stuff doesn't
support the new API, it makes me want to just stick with the old so
everything will work. So I think contribs being on the new API is
really important otherwise no one will want to use it.

On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch busch...@gmail.com wrote:


This is excellent feedback, Robert!

I agree this is confusing; especially having a deprecated API and only an
experimental one that replaces the old one. We need to change that.
And I don't like the *useNewAPI*() methods either. I spent a lot of time
thinking about backwards compatibility for this API. It's tricky to do
without sacrificing performance. In API patches I find myself spending more
time for backwards-compatibility than for the actual new feature! :(

I'll try to think about how to simplify this confusing old/new API mix.

However, we need to make the decisions:
a) if we want to release this new API with 2.9,
b) if yes to a), if we want to remove the old API in 3.0?

If yes to a) and no to b), then we'll have to support both APIs for a
presumably very long time, so we then need to have a better solution for the
backwards-compatibility here.

-Michael

On 6/15/09 1:09 PM, Robert Muir wrote:

let me try some slightly more constructive feedback:

new user looks at TokenStream javadocs:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
immediately they see deprecated, text in red with the words
experimental, warnings in bold, the whole thing is scary!
due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
in a bad way, and it's probably the most important method to a new
user!
there's also a stray bold tag gone haywire somewhere, possibly
.incrementToken()

from a technical perspective, the documentation is excellent! but for
a new user unfamiliar with lucene, its unclear exactly what steps to
take: use the scary red experimental api or the old deprecated one?

there's also some fairly advanced stuff such as .captureState and
.restoreState that might be better in a different place.

finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing
[one is static, one is not], especially because it states all streams
and filters in one chain must use the same API, is there a way to
simplify this?

i'm really terrible with javadocs myself, but perhaps we can come up
with a way to improve the presentation... maybe that will make the
difference.

On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir rcm...@gmail.com wrote:


Mark, I'll see if I can get tests produced for some of those analyzers.

as a new user of the new api myself, I think I can safely say the most
confusing thing about it is having the old deprecated API mixed in the
javadocs with it :)

On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller markrmil...@gmail.com wrote:


Robert Muir wrote:


Mark, I created an issue for this.



Thanks Robert, great idea.


I just think you know, converting an analyzer to the new api is really
not that bad.



I don't either. I'm really just complaining about the initial readability.
Once you know what's up, it's not too much different. I just have found myself
having to re-figure out what's up (a short task to be sure) over again after I
leave it for a while. With the old one, everything was just kind of
immediately self-evident.

That makes me think new users might be a little more confused when they
first meet it. I'm not a new user though, so it's only a guess really.



[jira] Commented: (LUCENE-973) Token of returns in CJKTokenizer + new TestCJKTokenizer

2009-06-15 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719776#action_12719776
 ] 

Steven Rowe commented on LUCENE-973:


+1 from me for inclusion in 2.9.

Mark, as you wrote a couple of hours ago on java-dev, in response to Robert 
Muir's complaint about the lack of tests in contrib:

bq. we should probably push for tests or write them before committing more 
often.

Here's a chance to improve the situation: this issue adds a test to a contrib 
module where there currently are none!

 Token of   returns in CJKTokenizer + new TestCJKTokenizer
 ---

 Key: LUCENE-973
 URL: https://issues.apache.org/jira/browse/LUCENE-973
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3
Reporter: Toru Matsuzawa
Priority: Minor
 Fix For: 2.9

 Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, 
 LUCENE-973.patch, with-patch.jpg, without-patch.jpg


 The  string returns as Token in the boundary of two byte character and one 
 byte character. 
 There is no problem in CJKAnalyzer. 
 When CJKTokenizer is used with the unit, it becomes a problem. (Use it with 
 Solr etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Some SVN cleanup, was: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Uwe Schindler
Done, tests pass. 

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Monday, June 15, 2009 10:40 PM
 To: java-dev@lucene.apache.org
 Subject: Re: New Token API was Re: Payloads and TrieRangeQuery
 
  On Mon, Jun 15, 2009 at 4:21 PM, Uwe Schindler u...@thetaphi.de wrote:
 
  And, in tests: test/o/a/l/index/store is somehow wrong placed. The class
  inside should be in test/o/a/l/store. Should I move?
 
 Please do!
 
 Mike
 
 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: New Token API was Re: Payloads and TrieRangeQuery

2009-06-15 Thread Robert Muir
yeah about 5 seconds in I saw that and decided to stick with what I know :)

On Mon, Jun 15, 2009 at 5:10 PM, Mark Miller markrmil...@gmail.com wrote:
 I may do the Highlighter. It's annoying though - I'll have to break back
 compat because Token is part of the public API (Fragmenter, etc.).

 Robert Muir wrote:

 Michael OK, I plan on adding some tests for the analyzers that don't have
 any.

 I didn't try to migrate things such as highlighter, which are
 definitely just as important, only because I'm not familiar with that
 territory.

 But I think I can figure out what the various language analyzers are
 trying to do and add tests / convert the remaining ones.

 On Mon, Jun 15, 2009 at 4:42 PM, Michael Busch busch...@gmail.com wrote:


 I agree. It's my fault, the task of changing the contribs (LUCENE-1460)
 is
 assigned to me for a while now - I just haven't found the time to do it
 yet.

 It's great that you started the work on that! I'll try to review the
 patch
 in the next couple of days and help with fixing the remaining ones. I'd
 like
 to get the AttributeSource improvements patch out first. I'll try that
 tonight.

  Michael

 On 6/15/09 1:35 PM, Robert Muir wrote:

 Michael, again I am terrible with such things myself...

 Personally I am impressed that you have the back compat, even if you
 don't change any code at all I think some reformatting of javadocs
 might make the situation a lot friendlier. I just listed everything
 that came to my mind immediately.

 I guess I will also mention that one of the reasons I might not use
 the new API is that since all filters, etc on the same chain must use
 the same API, it's discouraging if all the contrib stuff doesn't
 support the new API, it makes me want to just stick with the old so
 everything will work. So I think contribs being on the new API is
 really important otherwise no one will want to use it.

 On Mon, Jun 15, 2009 at 4:21 PM, Michael Busch busch...@gmail.com wrote:


 This is excellent feedback, Robert!

 I agree this is confusing; especially having a deprecated API and only an
 experimental one that replaces the old one. We need to change that.
 And I don't like the *useNewAPI*() methods either. I spent a lot of time
 thinking about backwards compatibility for this API. It's tricky to do
 without sacrificing performance. In API patches I find myself spending
 more
 time for backwards-compatibility than for the actual new feature! :(

 I'll try to think about how to simplify this confusing old/new API mix.

 However, we need to make the decisions:
 a) if we want to release this new API with 2.9,
 b) if yes to a), if we want to remove the old API in 3.0?

 If yes to a) and no to b), then we'll have to support both APIs for a
 presumably very long time, so we then need to have a better solution for
 the
 backwards-compatibility here.

 -Michael

 On 6/15/09 1:09 PM, Robert Muir wrote:

 let me try some slightly more constructive feedback:

 new user looks at TokenStream javadocs:

 http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/analysis/TokenStream.html
 immediately they see deprecated, text in red with the words
 experimental, warnings in bold, the whole thing is scary!
 due to the use of 'e.g.' the javadoc for .incrementToken() is cut off
 in a bad way, and it's probably the most important method to a new
 user!
 there's also a stray bold tag gone haywire somewhere, possibly
 .incrementToken()

 from a technical perspective, the documentation is excellent! but for
 a new user unfamiliar with lucene, its unclear exactly what steps to
 take: use the scary red experimental api or the old deprecated one?

 there's also some fairly advanced stuff such as .captureState and
 .restoreState that might be better in a different place.

 finally, the .setUseNewAPI() and .setUseNewAPIDefault() are confusing
 [one is static, one is not], especially because it states all streams
 and filters in one chain must use the same API, is there a way to
 simplify this?

 i'm really terrible with javadocs myself, but perhaps we can come up
 with a way to improve the presentation... maybe that will make the
 difference.

 On Mon, Jun 15, 2009 at 3:45 PM, Robert Muir rcm...@gmail.com wrote:


 Mark, I'll see if I can get tests produced for some of those analyzers.

 as a new user of the new api myself, I think I can safely say the most
 confusing thing about it is having the old deprecated API mixed in the
 javadocs with it :)

 On Mon, Jun 15, 2009 at 2:53 PM, Mark Miller markrmil...@gmail.com
 wrote:


 Robert Muir wrote:


 Mark, I created an issue for this.



 Thanks Robert, great idea.


 I just think you know, converting an analyzer to the new api is really
 not that bad.



 I don't either. I'm really just complaining about the initial
 readability.
 Once you know what's up, it's not too much different. I just have found
 myself
 having to re-figure out what's up (a short task to be sure) over again
 after I
 leave it for a while.

[jira] Commented: (LUCENE-973) Token of returns in CJKTokenizer + new TestCJKTokenizer

2009-06-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719781#action_12719781
 ] 

Robert Muir commented on LUCENE-973:


Very nice. Although it might be a tad trickier to convert to the new API, 
anything with tests is easier!

In other words, I have the existing CJKTokenizer converted, but who's to say I 
did it right :)


 Token of   returns in CJKTokenizer + new TestCJKTokenizer
 ---

 Key: LUCENE-973
 URL: https://issues.apache.org/jira/browse/LUCENE-973
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3
Reporter: Toru Matsuzawa
Priority: Minor
 Fix For: 2.9

 Attachments: CJKTokenizer20070807.patch, LUCENE-973.patch, 
 LUCENE-973.patch, with-patch.jpg, without-patch.jpg


 The  string returns as Token in the boundary of two byte character and one 
 byte character. 
 There is no problem in CJKAnalyzer. 
 When CJKTokenizer is used with the unit, it becomes a problem. (Use it with 
 Solr etc.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


