Re: [jira] Created: (LUCENE-1257) Port to Java5
On Tue, 2008-04-08 at 18:48 -0500, robert engels wrote: That is opposite of my testing: ... The 'foreach' is consistently faster. The time difference is independent of the size of the array. From what I know about JVM implementations, the foreach version SHOULD always be faster - because no bounds checking needs to be done on the element access... That's interesting. Even if it doesn't show in a performance test right now, it might do so in later Java versions. As for your test code, it does not measure performance in a fair way, as the foreach runs after the old-style loop. I'm sure you'll see different results if you switch the order of the two tests. I'm a big fan of foreach, but I'll have to admit that Steven's observations seem to be correct. I hope I'll find the time to take Yonik's advice and make my own test sometime soon. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
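Toke's point about ordering can be made concrete: warming both loop styles up before timing removes the JIT-compilation bias he describes. A minimal sketch (not the test code from this thread; class and method names are made up):

```java
// Minimal sketch of a fairer loop microbenchmark: warm up BOTH variants
// before timing either, so whichever runs first doesn't pay the JIT cost.
public class LoopBench {
    static long indexed(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i];
        return sum;
    }

    static long foreachLoop(int[] a) {
        long sum = 0;
        for (int v : a) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[1_000_000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        // Warm-up: let HotSpot compile both methods before any timing.
        for (int i = 0; i < 100; i++) { indexed(a); foreachLoop(a); }
        long t0 = System.nanoTime();
        long s1 = indexed(a);
        long t1 = System.nanoTime();
        long s2 = foreachLoop(a);
        long t2 = System.nanoTime();
        System.out.println("indexed: " + (t1 - t0) + " ns, foreach: "
                + (t2 - t1) + " ns, equal=" + (s1 == s2));
    }
}
```

Even with a warm-up, separate JVM invocations per variant (as Yonik suggests elsewhere in the thread) remain the safer methodology.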
Re: Optimise Indexing time using lucene..
lucene4varma wrote: Hi all, I am new to Lucene and am using it for text search in my web application, and for that I need to index records from a database. We are using a JDBC directory to store the indexes. The problem is that when I start indexing the records for the first time, it takes a huge amount of time. Following is the code for indexing: rs = st.executeQuery(); // returns 2 million records while(rs.next()) { create java object ...; index java record into JDBC directory ...; } The above process takes a huge amount of time for 2 million records - approximately 3-4 business days. Can anyone please suggest an approach by which I could cut this time down? A JDBC directory is not a good idea. It's only useful when you need a central repository. Use a large maxBufferedDocs in your IndexWriter. With a large amount of data, you'll hit bottlenecks: database reading, index writing, RAM for buffered docs, maybe CPU. If your database reading is huge and you are in a hurry, you can shard the index between multiple computers, and when that's finished, merge all the indexes - with champagne. M.
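The maxBufferedDocs advice boils down to amortizing expensive flushes over many documents. A plain-Java illustration of that buffering idea, with no Lucene dependency (the class and field names below are made up; in real code IndexWriter.setMaxBufferedDocs plays the role of MAX_BUFFERED_DOCS):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the buffering pattern: accumulate documents in RAM
// and flush in large batches, so one expensive write replaces thousands of
// tiny ones. This is the idea behind IndexWriter's maxBufferedDocs setting.
public class BatchingSketch {
    static final int MAX_BUFFERED_DOCS = 10_000; // larger buffer => fewer flushes

    static int flushes = 0;
    static List<String> buffer = new ArrayList<>();

    static void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= MAX_BUFFERED_DOCS) flush();
    }

    static void flush() {
        if (buffer.isEmpty()) return;
        // In real indexing this is where the costly disk/JDBC write happens.
        flushes++;
        buffer.clear();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 2_000_000; i++) addDocument("record-" + i);
        flush(); // flush any tail
        System.out.println("flushes = " + flushes); // 200 flushes for 2M docs
    }
}
```

With a buffer of one document, the same run would pay 2 million flushes; the batch size is the whole game.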
[jira] Commented: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index
[ https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587117#action_12587117 ] Michael McCandless commented on LUCENE-1262: Those stack traces look like 2.1, not 2.3.1. Is that right? Can you post the index that you are using and the code that results in the 2nd exception? I can't get the 2nd exception to happen in a test case...

NullPointerException from FieldsReader after problem reading the index
Key: LUCENE-1262
URL: https://issues.apache.org/jira/browse/LUCENE-1262
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.3.1
Reporter: Trejkaz

There is a situation where there is an IOException reading from Hits, and then the next time you get a NullPointerException instead of an IOException. Example stack traces:

java.io.IOException: The specified network name is no longer available
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
    at org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

That error is fine. 
The problem is the next call to doc generates:

java.lang.NullPointerException
    at org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
    at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

Presumably FieldsReader is caching partially-initialised data somewhere. I would normally expect the exact same IOException to be thrown for subsequent calls to the method. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1150) The token types of the standard tokenizer is not accessible
[ https://issues.apache.org/jira/browse/LUCENE-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1150: --- Fix Version/s: 2.3.2 Backported fix to 2.3.2.

The token types of the standard tokenizer is not accessible
Key: LUCENE-1150
URL: https://issues.apache.org/jira/browse/LUCENE-1150
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Affects Versions: 2.3
Reporter: Nicolas Lalevée
Assignee: Michael McCandless
Fix For: 2.3.2, 2.4
Attachments: LUCENE-1150.patch, LUCENE-1150.take2.patch

StandardTokenizerImpl not being public, these token types are not accessible:

{code:java}
public static final int ALPHANUM = 0;
public static final int APOSTROPHE = 1;
public static final int ACRONYM = 2;
public static final int COMPANY = 3;
public static final int EMAIL = 4;
public static final int HOST = 5;
public static final int NUM = 6;
public static final int CJ = 7;

/**
 * @deprecated this solves a bug where HOSTs that end with '.' are identified
 *             as ACRONYMs. It is deprecated and will be removed in the next
 *             release.
 */
public static final int ACRONYM_DEP = 8;

public static final String [] TOKEN_TYPES = new String [] {
  "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>",
  "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"
};
{code}

So no custom TokenFilter can be based on the token type. Actually even the StandardFilter cannot be written outside the org.apache.lucene.analysis.standard package.
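The kind of type-based filter the issue asks for can be sketched without Lucene's API. Below, a hypothetical Token class stands in for Lucene's Token; a real TokenFilter would read the token's type() string the same way, which is exactly why the type constants need to be public:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: filter tokens by their type string, the way a custom
// TokenFilter would once StandardTokenizer's type constants are accessible.
// The Token class here is a stand-in, not Lucene's.
public class TypeFilterSketch {
    static class Token {
        final String text, type;
        Token(String text, String type) { this.text = text; this.type = type; }
    }

    // Keep only tokens whose type matches, e.g. "<ALPHANUM>".
    static List<Token> keepType(List<Token> in, String type) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (t.type.equals(type)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("lucene", "<ALPHANUM>"));
        tokens.add(new Token("foo@example.com", "<EMAIL>"));
        tokens.add(new Token("2.3", "<NUM>"));
        System.out.println(keepType(tokens, "<ALPHANUM>").size()); // 1
    }
}
```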
Re: StandardTokenizerConstants in 2.3
Thanks Mike/Hoss for the clarification. Antony Michael McCandless wrote: Chris Hostetter wrote: : But, StandardTokenizer is public? It exports those constants for you? : : Really? Sorry, but I can't find them - in 2.3.1 sources, there are no : references to those statics. Javadocs have no reference to them in : StandardTokenizer I think Michael is forgetting that he re-added those constants to the trunk after 2.3.1 was released... https://issues.apache.org/jira/browse/LUCENE-1150 Woops! I'm sorry Antony -- Hoss is correct. I didn't realize this missed 2.3. I'll backport this fix to the 2.3 branch so it'll be included when we release 2.3.2 (which I think we should do soon -- a lot of little fixes have been backported). Mike
Re: StandardTokenizerConstants in 2.3
Chris Hostetter wrote: : But, StandardTokenizer is public? It exports those constants for you? : : Really? Sorry, but I can't find them - in 2.3.1 sources, there are no : references to those statics. Javadocs have no reference to them in : StandardTokenizer I think Michael is forgetting that he re-added those constants to the trunk after 2.3.1 was released... https://issues.apache.org/jira/browse/LUCENE-1150 Woops! I'm sorry Antony -- Hoss is correct. I didn't realize this missed 2.3. I'll backport this fix to the 2.3 branch so it'll be included when we release 2.3.2 (which I think we should do soon -- a lot of little fixes have been backported). Mike
RE: [jira] Created: (LUCENE-1257) Port to Java5
Hi Toke, On 04/09/2008 at 2:43 AM, Toke Eskildsen wrote: On Tue, 2008-04-08 at 18:48 -0500, robert engels wrote: That is opposite of my testing: ... The 'foreach' is consistently faster. The time difference is independent of the size of the array. From what I know about JVM implementations, the foreach version SHOULD always be faster - because no bounds checking needs to be done on the element access... As for your test code, it does not measure performance in a fair way, as the foreach runs after the old-style loop. I'm sure you'll see different results if you switch the order of the two tests. My first try at a test looked like Robert's, and exactly as you say, Toke, when operating on the same array, the first loop is slower and the second one is faster. Steve
Storing phrases in index
Hello all. I have a question for advanced Lucene users. I have a set of phrases which I need to store in an index. Is there a way of storing phrases as terms in the index? What is the best way of writing such an index? Should this field be tokenized? What is the best way of searching phrases by mask in such an index? Should I use BooleanQuery, WildcardQuery or SpanQuery? What is the best way to avoid the maxClauseCount exception when searching for something like a*?
Re: Storing phrases in index
palexv wrote: Hello all. I have a question for advanced Lucene users. I have a set of phrases which I need to store in an index. Is there a way of storing phrases as terms in the index? What is the best way of writing such an index? Should this field be tokenized? not tokenized What is the best way of searching phrases by mask in such an index? Should I use BooleanQuery, WildcardQuery or SpanQuery? if you search for the complete phrase, just use a TermQuery; if you search for part of a phrase, use ShingleFilter. What is the best way to avoid the maxClauseCount exception when searching for something like a*? index the indexed terms. M.
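The a* advice can be illustrated without Lucene: an untokenized phrase field stores one term per phrase in a sorted term dictionary, and a prefix query is then a range scan over that dictionary rather than a BooleanQuery with one clause per matching term (which is what trips maxClauseCount). A plain-Java sketch, with TreeSet standing in for the term dictionary (names are made up; the prefix trick assumes the last prefix character is not Character.MAX_VALUE):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// The term dictionary is sorted, so matching the prefix "apache" is a range
// scan from "apache" (inclusive) to "apachf" (exclusive) - no per-term query
// clause, hence no maxClauseCount problem.
public class PhraseDictSketch {
    public static List<String> prefixScan(TreeSet<String> dict, String prefix) {
        // Upper bound of the range: the prefix with its last char incremented.
        String end = prefix.substring(0, prefix.length() - 1)
                   + (char) (prefix.charAt(prefix.length() - 1) + 1);
        return new ArrayList<>(dict.subSet(prefix, end));
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>();
        dict.add("apache lucene");  // each phrase indexed as a single term
        dict.add("apache ant");
        dict.add("boolean query");
        System.out.println(prefixScan(dict, "apache")); // [apache ant, apache lucene]
    }
}
```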
Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)
Thanks for your quick answers. Michael McCandless wrote: Hi Michael, I've actually been working on factoring DocumentsWriter, as a first step towards flexible indexing. Cool, yeah separating the DocumentsWriter into multiple classes certainly helped understanding the complex code better. I agree we would have an abstract base Posting class that just tracks the term text. Then, DocumentsWriter manages inverting each field, maintaining the per-field hash of term text -> abstract Posting instances, exposing the methods to write bytes into multiple streams for a Posting in the RAM byte slices, and then read them back when flushing, etc. And then the code that writes the current index format would plug into this and should be fairly small and easy to understand. For example, frq/prx postings and term vectors writing would be two plugins to the inverted terms API; it's just that term vectors flush after every document and frq/prx flush when RAM is full. I think this makes sense. We also need to come up with a good solution for the dictionary, because a term with frq/prx postings needs to store two (or three for skiplist) file pointers in the dictionary, whereas e.g. a binary posting list only needs one pointer. Then there would also be plugins that just tap into the entire document (don't need inversion), like FieldsWriter. There are still a lot of details to work out... Definitely. For example, we should think about the Field APIs. Since we don't have global field semantics in Lucene I wonder how to handle conflict cases, e.g. when a document specifies a different posting list format than a previous one for the same field. The easiest way would be to not allow it and throw an exception. But this is kind of against Lucene's way of dealing with fields currently. But I'm scared of the complicated code to handle conflicts of all the possible combinations of posting list formats. 
KinoSearch doesn't have to worry about this, because it has a static schema (I think?), but isn't as flexible as Lucene. The DocumentsWriter does pooling of the Posting instances and I'm wondering how much this improves performance. We should retest this. I think it was a decent difference in performance but I don't remember how much. I think the pooling can also be made generic (handled by DocumentsWriter). EG the plugin could expose a newPosting() method. Yeah, but for code simplicity let's really figure out first how much pooling helps at all. Mike
RE: [jira] Created: (LUCENE-1257) Port to Java5
Hi, I confirm your results. I didn't think there could be a difference using foreach constructs... Cedric Steven A Rowe wrote: On 04/04/2008 at 4:40 AM, Toke Eskildsen wrote: On Wed, 2008-04-02 at 09:30 -0400, Mark Miller wrote: - replacement of indexed for loops with for each constructs Is this always the best idea? Doesn't the for loop construct make an iterator, which can be much slower than an indexed for loop? Only in the case of iterations over collections. For arrays, the foreach is syntactic sugar for indexed for-loop. http://java.sun.com/docs/books/jls/third_edition/html/statements.html#14.14.2 I don't think this is actually true. The text at the above-linked page simply says that for-each over an array means the same as an indexed loop over the same array. Syntactic sugar, OTOH, implies that the resulting opcode is exactly the same. When I look at the byte code (using javap) for the comparison test I include below, I can see that the indexed and for-each loops do not generate the same byte code. I constructed a simple program to compare the runtime length of the two loop control mechanisms, while varying the size of the array. The test program takes command line parameters to control which loop control mechanism to use, the size of the array (#elems), and the number of times to execute the loop (#iters). I used a Bash shell script to invoke the test program. Summary of the results: over int[] arrays, indexed loops are faster on arrays with fewer than about a million elements. The fewer the elements, the faster indexed loops are relative to for-each loops. This could be explained by a higher one-time setup cost for the for-each loop - above a certain array size, the for-each setup cost is lost in the noise. It should be noted, however, that this one-time setup cost is quite small, and might be worth the increased code clarity. 
Here are the results for three different platforms: - Best of five iterations for each combination - All using the -server JVM option - Holding constant #iters * #elems = 10^10 - Rounding the reported real time to the nearest tenth of a second - % Slower = 100 * ((For-each - Indexed) / Indexed)

Platform #1: Windows XP SP2; Intel Core 2 Duo [EMAIL PROTECTED]; Java 1.5.0_13

#iters  #elems  For-each  Indexed  % Slower
10^9    10^1     22.3s     13.8s      62%
10^8    10^2     16.0s     13.6s      18%
10^6    10^4     14.8s     13.0s      14%
10^4    10^6     12.9s     12.9s       0%
10^3    10^7     13.4s     13.3s       1%

Platform #2: Debian Linux, 2.6.21.7 kernel; Intel Xeon [EMAIL PROTECTED]; Java 1.5.0_14

#iters  #elems  For-each  Indexed  % Slower
10^9    10^1     33.6s     14.2s     137%
10^8    10^2     20.4s     13.9s      47%
10^6    10^4     19.0s     12.7s      50%
10^4    10^6     12.7s     12.8s      -1%
10^3    10^7     13.2s     13.2s       0%

Platform #3: Debian Linux, 2.6.21.7 kernel; Intel Xeon [EMAIL PROTECTED]; Java 1.5.0_10

#iters  #elems  For-each  Indexed  % Slower
10^9    10^1    102.7s     73.6s      40%
10^8    10^2    107.8s     60.0s      80%
10^6    10^4    105.2s     58.6s      80%
10^4    10^6     58.8s     53.0s      11%
10^3    10^7     60.0s     54.1s      11%

- ForEachTest.java follows -

import java.util.Date;
import java.util.Random;

/**
 * This is meant to be called from a shell script that varies the loop style,
 * the number of iterations over the loop, and the number of elements in the
 * array over which the loop iterates, e.g.:
 *
 *   cmd="java -server -cp . ForEachTest"
 *   for elems in 10 100 10000 1000000 10000000 ; do
 *     iters=$((10000000000/${elems}))
 *     for run in 1 2 3 4 5 ; do
 *       time $cmd --indexed --arraysize $elems --iterations $iters
 *       time $cmd --foreach --arraysize $elems --iterations $iters
 *     done
 *   done
 */
public class ForEachTest {
  static String NL = System.getProperty("line.separator");
  static String usage = "Usage: java -server -cp . ForEachTest [ --indexed | --foreach ]" + NL
                      + "\t--iterations num-iterations --arraysize array-size";

  public static void main(String[] args) {
    boolean useIndexedLoop = false;
    int size = 0;
    int iterations = 0;
    try {
      for (int argnum = 0 ; argnum < args.length ; ++argnum) {
        if (args[argnum].equals("--indexed")) {
          useIndexedLoop = true;
        } else if (args[argnum].equals("--foreach")) {
          useIndexedLoop = false;
        } else if (args[argnum].equals("--iterations")) {
          iterations = Integer.parseInt(args[++argnum]);
        } else if (args[argnum].equals("--arraysize")) {
          size = Integer.parseInt(args[++argnum]);
Re: [jira] Created: (LUCENE-1257) Port to Java5
I think it is going to be highly JVM dependent. I reworked it to call each twice (and reordered the tests)... the foreach is still faster. I also ran it on Windows (under Parallels) and got similar results, but in some cases the indexed was faster. Server times are tough to judge because normally the server VM is not going to compile a method until it has been hit 10k times, but this can be configured... I think this is a case where you need to make a judgement based on expected behavior, as there are probably too many variables. The 'foreach' should be faster in the general case for arrays, as the bounds checking can be avoided. But I doubt the speed difference is going to matter much either way, and eventually the JVM impl will converge to near equal performance.
index reopen question
Hi: I have been reading the 2.3.1 release code and have a few questions regarding IndexReader reopen: 1) Looking at the code:

if (this.hasChanges || this.isCurrent()) {
  // the index hasn't changed - nothing to do here
  return this;
}

Shouldn't it be !this.hasChanges? 2) FilterIndexReader calls the ensureOpen() method from the super class instead of overriding the method and calling the inner reader's ensureOpen; is that expected? 3) When you reopen an index, the inner reference count is not updated; is that ok? Thanks -John
Re: [jira] Created: (LUCENE-1257) Port to Java5
Just for kicks, I tried it on a 64 bit Athlon, linux_x86_64, jvm=64 bit Sun 1.6 -server. The explicit loop counter was 50% faster (for N=10... the inner loop) -Yonik On Tue, Apr 8, 2008 at 8:21 PM, Yonik Seeley [EMAIL PROTECTED] wrote: On Tue, Apr 8, 2008 at 7:48 PM, robert engels [EMAIL PROTECTED] wrote: That is opposite of my testing: ... The 'foreach' is consistently faster. It's consistently slower for me (I tested java5 and java6 both with -server on a P4). I'm a big fan of testing different methods in different test runs (because of hotspot, gc, etc). Example results:

$ c:/opt/jdk16/bin/java -server t 1 10 foreach
N = 10
method=foreach len=10 indexed time = 8734

[EMAIL PROTECTED] /cygdrive/h/tmp
$ c:/opt/jdk16/bin/java -server t 1 10 iter
N = 10
method=iter len=10 indexed time = 7062

Here's my test code (a modified version of yours):

public class t {
  public static void main(String[] args) {
    int I = Integer.parseInt(args[0]); // 100
    int N = Integer.parseInt(args[1]); // 10
    String method = args[2].intern(); // "foreach" or "iter"
    String[] strings = new String[N];
    for (int i = 0; i < N; i++) {
      strings[i] = Integer.toString(i);
    }
    System.out.println("N = " + N);
    long len = 0;
    long start = System.currentTimeMillis();
    if (method == "foreach") {
      for (int i = 0; i < I; i++) {
        for (String s : strings) {
          len += s.length();
        }
      }
    } else {
      for (int i = 0; i < I; i++) {
        for (int j = 0; j < N; j++) {
          len += strings[j].length();
        }
      }
    }
    System.out.println("method=" + method + " len=" + len
        + " indexed time = " + (System.currentTimeMillis() - start));
  }
}
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12587290#action_12587290 ] Karl Wettin commented on LUCENE-1260: - {quote} As long as the norm remains a fixed size (1 byte) then it doesn't really matter whether it's tied to Similarity or the store itself - it would be nice if the Index could tell you which normDecoder to use, but it's not any more unreasonable to expect the application to keep track of this (if it's not the default encoding) since applications already have to keep track of things like which Analyzer is compatible with querying this index. If we want norms to be more flexible, so that apps can pick not only the encoding but also the size... then things get more interesting, but it's still feasible to say if you customize this, you have to make your reading apps and your writing apps smart enough to know about your customization. {quote} I like the idea of an index that is completely self-aware of norm encoding, what payloads mean, etc. {quote} I also want to move it to the instance scope so I can have multiple indices with unique norm span/resolutions created from the same classloader. {quote} My use case is really about document boost and not normalization. So another solution to this is to introduce a (variable bit sized?) document boost file and completely separate it from the norms, instead of as now where normalization and document boost are baked into the same thing. I think there would be no need to touch the norms encoding then, that the default resolution is good enough for /normalization/. It would fix several caveats with norms as I see it. 
Norm codec strategy in Similarity - Key: LUCENE-1260 URL: https://issues.apache.org/jira/browse/LUCENE-1260 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.3.1 Reporter: Karl Wettin Attachments: LUCENE-1260.txt The static span and resolution of the 8 bit norms codec might not fit with all applications. My use case requires that 100f-250f is discretized in 60 bags instead of the default.. 10?
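Karl's use case - discretizing 100f-250f into 60 bags within a single byte - can be sketched in plain Java. This is a hypothetical codec, not Lucene's Similarity.encodeNorm/decodeNorm; all names are made up:

```java
// Hypothetical custom norm codec: quantize floats in [100, 250] into 60
// evenly sized "bags", storing one byte per value, as the issue requests.
public class RangeNormCodec {
    static final float MIN = 100f, MAX = 250f;
    static final int BAGS = 60;

    // Clamp out-of-range values to the first/last bag.
    static byte encode(float f) {
        if (f <= MIN) return 0;
        if (f >= MAX) return (byte) (BAGS - 1);
        return (byte) ((f - MIN) / (MAX - MIN) * BAGS);
    }

    // Decode to the midpoint of the bag; max error is half a bag width.
    static float decode(byte b) {
        return MIN + (b + 0.5f) * (MAX - MIN) / BAGS;
    }

    public static void main(String[] args) {
        System.out.println(encode(100f)); // 0
        System.out.println(encode(250f)); // 59
        System.out.println(decode(encode(175f))); // within one bag width (2.5) of 175
    }
}
```

With 60 bags over a 150-point span, each bag is 2.5 wide - far finer than the default codec's resolution over that range, which is the point of the issue.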
Re: Flexible indexing design
On Apr 9, 2008, at 6:35 AM, Michael Busch wrote: We also need to come up with a good solution for the dictionary, because a term with frq/prx postings needs to store two (or three for skiplist) file pointers in the dictionary, whereas e.g. a binary posting list only needs one pointer. This is something I'm working on as well, and I hope we can solve a couple of design problems I've been turning over in my mind for some time. In KS, the information Lucene stores in the frq/prx files is carried in one postings file per field, as discussed previously. However, I made the additional change of breaking out skip data into a separate file (shared across all fields). Isolating skip data sacrifices some locality of reference, but buys substantial gains in simplicity and compartmentalization. Individual Posting subclasses, each of which defines a file format, don't have to know about skip algorithms at all. :) Further, improvements in the skip algorithm only require changes to the .skip file, and falling back to PostingList_Next still works if the .skip file becomes corrupted, since .skip carries only optimization info and no real data. For reasons I won't go into here, KS doesn't need to put a field number in its TermInfo, but it does need doc freq, plus file positions for the postings file, the skip file, and the primary Lexicon file. (Lexicon is the KS term dictionary class, akin to Lucene's TermEnum.)

struct kino_TermInfo {
    kino_VirtualTable* _;
    kino_ref_t ref;
    chy_i32_t doc_freq;
    chy_u64_t post_filepos;
    chy_u64_t skip_filepos;
    chy_u64_t lex_filepos;
};

There are two problems. First is that I'd like to extend indexing with arbitrary subclasses of SegDataWriter, and I'd like these classes to be able to put their own file position bookmarks (or possibly other data) into TermInfo. Making TermInfo hash-based would probably do it, but there would be nasty performance and memory penalties since TermInfo objects are numerous. 
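One hash-free alternative is to break the TermInfo data up "horizontally": each writer registers a named column of 64-bit file positions, and a term is just a row index across all columns. A plain-Java sketch of that layout (all names here are hypothetical, neither KS nor Lucene API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical column-oriented term dictionary: instead of one TermInfo
// object per term (or a per-term hash), each extension owns a named column
// of 64-bit file positions, so per-term overhead stays flat.
public class TermInfoColumns {
    private final Map<String, List<Long>> columns = new HashMap<>();
    private int numTerms = 0;

    void addColumn(String name) {
        columns.put(name, new ArrayList<>());
    }

    // Append one term's row; values supplies one entry per registered column.
    void addTerm(Map<String, Long> values) {
        for (Map.Entry<String, List<Long>> e : columns.entrySet()) {
            e.getValue().add(values.get(e.getKey()));
        }
        numTerms++;
    }

    long get(String column, int termNum) {
        return columns.get(column).get(termNum);
    }

    int size() {
        return numTerms;
    }
}
```

In a real implementation the List<Long> columns would be primitive long arrays (or mmapped files), which is where the memory win over a per-term hash comes from.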
So, what's the best way to allow multiple, unrelated classes to extend TermInfo and the term dictionary file format? Is it to break up TermInfo information horizontally rather than vertically, so that instead of a single array of TermInfo objects, we have a flexible stack of arrays of 64-bit integers representing file positions? The second problem is how to share a term dictionary over a cluster. It would be nice to be able to plug modules into IndexReader that represent clusters of machines but that are dedicated to specific tasks: one cluster could be dedicated to fetching full documents and applying highlighting; another cluster could be dedicated to scanning through postings and finding/scoring hits; a third cluster could store the entire term dictionary in RAM. A centralized term dictionary held in RAM would be particularly handy for sorting purposes. The problem is that the file pointers of a term dictionary are specific to indexes on individual machines. A shared dictionary in RAM would have to contain pointers for *all* clients, which isn't really workable. So, just how do you go about assembling task specific clusters? The stored documents cluster is easy, but the term dictionary and the postings are hard. For example, we should think about the Field APIs. Since we don't have global field semantics in Lucene I wonder how to handle conflict cases, e. g. when a document specifies a different posting list format than a previous one for the same field. The easiest way would be to not allow it and throw an exception. But this is kind of against Lucene's way of dealing with fields currently. But I'm scared of the complicated code to handle conflicts of all the possible combinations of posting list formats. Yeah. Lucene's field definition conflict-resolution code is gnarly already. :( KinoSearch doesn't have to worry about this, because it has a static schema (I think?), but isn't as flexible as Lucene. 
Earlier versions of KS did not allow the addition of new fields on the fly, but this has been changed. You can now add fields to an existing Schema object like so:

for my $doc (@docs) {
    # Dynamically define any new fields as 'text'.
    for my $field ( keys %$doc ) {
        $schema->add_field( $field => 'text' );
    }
    $invindexer->add_doc($doc);
}

See the attached sample app for that snippet in context. Here are some current differences between KS and Lucene: * KS doesn't yet purge *old* dynamic field definitions which have become obsolete. However, that should be possible to add later, as a sweep triggered during full optimization. * You can't change the definition of an existing field. * Documents are hash-based, so you can't have multiple fields with the same name within one document object. However, I consider that capability a misfeature of
[jira] Updated: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index
[ https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1262: Affects Version/s: (was: 2.3.1) 2.2 Whoops. I don't think it's 2.1 but it must be 2.2. I'll try and reproduce this standalone, but first I need a way to have readInternal throw an exception. I presume you were using some kind of custom store implementation to do that. I'll see if I can make it happen under 2.2 and then try the same thing under 2.3.1 to confirm whether it still breaks.

NullPointerException from FieldsReader after problem reading the index
Key: LUCENE-1262
URL: https://issues.apache.org/jira/browse/LUCENE-1262
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.2
Reporter: Trejkaz

There is a situation where there is an IOException reading from Hits, and then the next time you get a NullPointerException instead of an IOException. Example stack traces:

java.io.IOException: The specified network name is no longer available
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
    at org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

That error is fine. 
The problem is the next call to doc generates:

java.lang.NullPointerException
    at org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
    at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

Presumably FieldsReader is caching partially-initialised data somewhere. I would normally expect the exact same IOException to be thrown for subsequent calls to the method.
[jira] Updated: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index
[ https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Trejkaz updated LUCENE-1262: Affects Version/s: (was: 2.2) 2.1 Okay I'll eat my words now, it is indeed 2.1 as the version doesn't have openInput(String,int) in it. Anyway an update: I've managed to reproduce it on any text index by simulating random network outage. I'm keeping a flag which I set to true. The trick is that the wrapping IndexInput implementation *randomly* throws IOException if the flag is true -- if it always throws IOException the problem doesn't occur. If it randomly throws it then it occurs occasionally, and it always seems to be for larger queries (I'm using MatchAllDocsQuery now.) I'll see if I can tweak the code to make it more likely to happen and then start working up to each version of Lucene to see if it stops happening somewhere. NullPointerException from FieldsReader after problem reading the index -- Key: LUCENE-1262 URL: https://issues.apache.org/jira/browse/LUCENE-1262 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.1 Reporter: Trejkaz There is a situation where there is an IOException reading from Hits, and then the next time you get a NullPointerException instead of an IOException. 
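The fault-injection trick Trejkaz describes - a flag-controlled wrapper whose reads *randomly* throw IOException - can be sketched in plain Java. The class below is hypothetical; a real reproduction would wrap Lucene's IndexInput the same way:

```java
import java.io.IOException;
import java.util.Random;

// Hypothetical fault-injecting reader: while 'failing' is set, each read
// randomly throws IOException, simulating an intermittent network outage.
// Throwing only *sometimes* is the key detail from the bug report - an
// always-failing store never triggered the NullPointerException.
public class FlakyInput {
    private final byte[] data;
    private int pos = 0;
    private final Random random;
    boolean failing = false;

    FlakyInput(byte[] data, long seed) {
        this.data = data;
        this.random = new Random(seed); // seeded, so a failure is replayable
    }

    byte readByte() throws IOException {
        if (failing && random.nextInt(4) == 0) { // fail ~25% of reads
            throw new IOException("The specified network name is no longer available");
        }
        return data[pos++];
    }
}
```

A test harness flips `failing` on mid-query and then checks what the second read attempt throws, which is exactly the sequence that exposes the cached partially-initialised state in FieldsReader.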
Example stack traces:

java.io.IOException: The specified network name is no longer available
    at java.io.RandomAccessFile.readBytes(Native Method)
    at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
    at org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
    at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

That error is fine. The problem is the next call to doc generates:

java.lang.NullPointerException
    at org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
    at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
    at org.apache.lucene.search.Hits.doc(Hits.java:104)

Presumably FieldsReader is caching partially-initialised data somewhere. I would normally expect the exact same IOException to be thrown for subsequent calls to the method.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
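The fault-injection approach described in the update above can be sketched roughly as follows. This is a hypothetical illustration only: a plain FilterInputStream stands in for Lucene's IndexInput, and the names (FlakyInputStream, the faulty flag, the one-in-four failure rate) are made up for the sketch, not taken from the actual reproduction code.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;

// A wrapper stream that, once the `faulty` flag is set, *randomly* throws
// IOException from read() to simulate intermittent network failure.
// Throwing on every call would mask the bug described in the issue, so the
// failures must be probabilistic. A real test against Lucene would instead
// wrap IndexInput and override readByte()/readBytes().
class FlakyInputStream extends FilterInputStream {
    private final Random random;
    volatile boolean faulty = false; // flip to true to start the simulated outage

    FlakyInputStream(InputStream in, long seed) {
        super(in);
        this.random = new Random(seed); // fixed seed keeps runs repeatable
    }

    @Override
    public int read() throws IOException {
        // Fail roughly one call in four while the outage is active.
        if (faulty && random.nextInt(4) == 0) {
            throw new IOException("simulated: the specified network name is no longer available");
        }
        return super.read();
    }
}
```

Flipping `faulty` partway through a sequence of reads then exercises the caller's recovery path, which is exactly where the stale-state problem above shows up.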
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587435#action_12587435 ]

Hoss Man commented on LUCENE-1260:

bq. My use case is really about document boost and not normalization.
bq. So another solution to this is to introduce a (variable bit sized?) document boost file and completely separate it from the norms instead...

1) "norms" is a vague term. Currently lengthNorm is folded in with field boosts and doc boosts to form a generic fieldNorm ... I assumed you were interested in a more general way to improve the resolution of fieldNorm.

2) Your description of general-purpose, variable-sized document boosting sounds exactly like LUCENE-1231 ... in the long run, utilities using LUCENE-1231 (or something like it) to replace field boosts and length norms might make the most sense as a way to eliminate the current static norm encoding and put more flexibility in the hands of users.

Norm codec strategy in Similarity
---------------------------------
    Key: LUCENE-1260
    URL: https://issues.apache.org/jira/browse/LUCENE-1260
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 2.3.1
    Reporter: Karl Wettin
    Attachments: LUCENE-1260.txt

The static span and resolution of the 8-bit norms codec might not fit with all applications. My use case requires that 100f-250f is discretized into 60 bags instead of the default.. 10?

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
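For context, the kind of pluggable codec the issue asks for could look something like the following sketch: a linear quantizer that maps a configurable range (here the 100f-250f from the issue description) into a configurable number of bags, rather than Lucene's fixed 8-bit float norm encoding. The class and method names are illustrative assumptions, not the API of the attached patch.

```java
// Hypothetical norm codec: discretize [min, max] linearly into `bags` byte
// values. Lucene's default Similarity instead uses a fixed 8-bit float format
// with a predetermined span and resolution, which is what the issue objects to.
class LinearNormCodec {
    private final float min, max;
    private final int bags; // must be >= 2 and <= 256 to fit in one byte

    LinearNormCodec(float min, float max, int bags) {
        this.min = min;
        this.max = max;
        this.bags = bags;
    }

    /** Map a norm to one of `bags` byte values, clamping values outside [min, max]. */
    byte encode(float norm) {
        float clamped = Math.min(max, Math.max(min, norm));
        int bag = Math.round((clamped - min) / (max - min) * (bags - 1));
        return (byte) bag;
    }

    /** Recover the representative value of a bag index. */
    float decode(byte b) {
        return min + (b & 0xFF) * (max - min) / (bags - 1);
    }
}
```

With 60 bags over 100f-250f, each bag spans about 2.5, versus the much coarser resolution the default codec gives in that part of its range; the trade-off is that anything outside [min, max] collapses to the endpoints.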
[jira] Updated: (LUCENE-1262) IndexOutOfBoundsException from FieldsReader after problem reading the index
[ https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trejkaz updated LUCENE-1262:
    Affects Version/s: (was: 2.1) 2.3.1
    Summary: IndexOutOfBoundsException from FieldsReader after problem reading the index (was: NullPointerException from FieldsReader after problem reading the index)

I managed to reproduce the problem as-is under version 2.2. For 2.3 the problem has changed: instead of a NullPointerException it is now an IndexOutOfBoundsException:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 52, Size: 34
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:154)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:525)
    at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.doc(Hits.java:167)
    at Test.main(Test.java:24)

Will attach my test program in a moment.

IndexOutOfBoundsException from FieldsReader after problem reading the index
---------------------------------------------------------------------------
    Key: LUCENE-1262
    URL: https://issues.apache.org/jira/browse/LUCENE-1262
    Project: Lucene - Java
    Issue Type: Bug
    Components: Index
    Affects Versions: 2.3.1
    Reporter: Trejkaz

There is a situation where there is an IOException reading from Hits, and then the next time you get a NullPointerException instead of an IOException.
-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1262) IndexOutOfBoundsException from FieldsReader after problem reading the index
[ https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Trejkaz updated LUCENE-1262:
    Attachment: Test.java

Attaching a test program to reproduce the problem under 2.3.1. It occurs in approximately 1 of every 4 executions for any reasonably large text index (really small ones don't seem to trigger it, so I couldn't attach a text index with it). The number of fields may be related; looking at the IndexOutOfBoundsException numbers, it seems that the indexes we have happen to have a large number of fields.

IndexOutOfBoundsException from FieldsReader after problem reading the index
---------------------------------------------------------------------------
    Key: LUCENE-1262
    URL: https://issues.apache.org/jira/browse/LUCENE-1262
    Project: Lucene - Java
    Issue Type: Bug
    Components: Index
    Affects Versions: 2.3.1
    Reporter: Trejkaz
    Attachments: Test.java

There is a situation where there is an IOException reading from Hits, and then the next time you get a NullPointerException instead of an IOException.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587445#action_12587445 ]

Karl Wettin commented on LUCENE-1260:

{quote}
1) norms is a vague term. currently lengthNorm is folded in with field boosts and doc boosts to form a generic fieldNorm ... I assumed you were interested in a more general way to improve the resolution of fieldNorm
{quote}

I still am, but mainly because it is the simplest and only way to get better document boost resolution at the moment.

Norm codec strategy in Similarity
---------------------------------
    Key: LUCENE-1260
    URL: https://issues.apache.org/jira/browse/LUCENE-1260
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 2.3.1
    Reporter: Karl Wettin
    Attachments: LUCENE-1260.txt

The static span and resolution of the 8-bit norms codec might not fit with all applications. My use case requires that 100f-250f is discretized into 60 bags instead of the default.. 10?

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587446#action_12587446 ]

Karl Wettin commented on LUCENE-1260:

I notice there is a typo in the patch. And there is no test case for SimpleNormCodec. I'll come up with that too.

Norm codec strategy in Similarity
---------------------------------
    Key: LUCENE-1260
    URL: https://issues.apache.org/jira/browse/LUCENE-1260
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 2.3.1
    Reporter: Karl Wettin
    Attachments: LUCENE-1260.txt

The static span and resolution of the 8-bit norms codec might not fit with all applications. My use case requires that 100f-250f is discretized into 60 bags instead of the default.. 10?

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.