Re: Determining NumericType for a field
On Wed, 2014-12-10 at 15:27 +0100, Michael McCandless wrote:
> No, Lucene does not store numeric type nor multi-valued-ness today; it's frustrating.

At least I now know not to dig too deep for non-existing answers, thanks. Our current code requires the user to be explicit about how the content of the fields should be treated. Until a more fundamental change, such as LUCENE-6005, we will leave it at that.

> In the meantime, maybe you could model your tool after UninvertingReader? It faces the same issue (lack of schema) and lets the user specify the type.

Yes, that is what we're doing. Unfortunately we cannot use the UninvertingReader directly due to its restrictions on facet structure size: We have too many references in our shards, so it hits an internal 16M(?) limit.

Unfortunately our current mapping code from stored multi-value String to DocValues seems to be very slow: It took nearly 2 days to convert a single-segment 900GB index, where a standard optimize takes only 8 hours.

> Also, see (the confusingly named) TestDemoParallelLeafReader? It lets you partially reindex, e.g. derive new indexed fields or DV fields, etc., from existing stored/DV fields, in an NRT manner.

Thanks for the pointer. As far as I can see, the demo is very explicit about the type of DocValues being long, so no auto-guessing there. It's a very interesting idea though, with seamless DV-enabling.

Thank you,
Toke Eskildsen, State and University Library, Denmark

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
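For illustration, here is a minimal sketch of what "requiring the user to be explicit" could look like in a command-line tool: the user passes a field=TYPE list and anything unrecognised is rejected rather than guessed. The class name FieldTypeSpec, the DvKind enum and the spec syntax are all assumptions made up for this example; they are not taken from dvenabler, UninvertingReader or any Lucene API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Lucene or dvenabler. */
class FieldTypeSpec {
    /** Assumed set of target DocValues kinds; the names are invented for this sketch. */
    enum DvKind { INT, LONG, FLOAT, DOUBLE, SORTED, SORTED_SET }

    /** Parses a user-supplied spec such as "year=INT,price=DOUBLE,tags=SORTED_SET". */
    static Map<String, DvKind> parse(String spec) {
        Map<String, DvKind> result = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String entry = part.trim();
            if (entry.isEmpty()) {
                continue; // tolerate empty input and stray commas
            }
            int eq = entry.indexOf('=');
            if (eq <= 0 || eq == entry.length() - 1) {
                throw new IllegalArgumentException("Expected field=TYPE but got '" + entry + "'");
            }
            String field = entry.substring(0, eq).trim();
            String type = entry.substring(eq + 1).trim().toUpperCase(Locale.ROOT);
            // Enum.valueOf throws IllegalArgumentException for unknown names,
            // so a typo fails loudly instead of being silently guessed at.
            result.put(field, DvKind.valueOf(type));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("year=INT, price=DOUBLE, tags=SORTED_SET"));
    }
}
```

The point of the design is that the tool never infers a type on its own; a field that is missing from the map is simply left alone.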
Re: Determining NumericType for a field
On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
>> In the meantime, maybe you could model your tool after UninvertingReader? It faces the same issue (lack of schema) and lets the user specify the type.
>
> Yes, that is what we're doing. Unfortunately we cannot use the UninvertingReader directly due to its restrictions on facet structure size: We have too many references in our shards, so it hits an internal 16M(?) limit.

Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

> Unfortunately our current mapping code from stored multi-value String to DocValues seems to be very slow: It took nearly 2 days to convert a single-segment 900GB index, where a standard optimize takes only 8 hours.

That's awful. Profile it? But, how long did it take to index in the first place?

>> Also, see (the confusingly named) TestDemoParallelLeafReader? It lets you partially reindex, e.g. derive new indexed fields or DV fields, etc., from existing stored/DV fields, in an NRT manner.
>
> Thanks for the pointer. As far as I can see, the demo is very explicit about the type of DocValues being long, so no auto-guessing there. It's a very interesting idea though, with seamless DV-enabling.

The DVs can be arbitrary (not just long); it's only that the test case focuses on long.

Have a look @ the LUCENE-6005 branch: I broke this test out as a separate ReindexingReader + test. I think we could do a better integration between that and the schema...

I also added a simpler testSwitchToDocValues test case. It still uses only long DVs but you can easily see how you could do other types too ... I'll add an example of SortedSet.

Mike McCandless
http://blog.mikemccandless.com
Re: Determining NumericType for a field
On Mon, 2014-12-15 at 11:33 +0100, Michael McCandless wrote:
> On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

[Toke: Limit on faceting with many references]

> Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

Yes, we've hit that one before. If we did not have DocValues, I would consider it a serious deficiency of Solr. For one of the fields in the shard I tested, we had 675M references from 256M documents to 3M unique values, with the most popular value having 18M references. (All of which works perfectly fine and fast with DocValues, yay!)

[2 days for conversion of 900GB index]

> That's awful. Profile it? But, how long did it take to index in the first place?

A full index takes 8 days with 24 CPUs going full tilt, i.e. about 192 CPU days. Conversion is (sadly) single-threaded, so measured in total CPU time it is just the 2 days. Still, we can't scale parallel conversions of multiple shards very high due to limited local storage space.

I'll put a lot more timing debug logging into the code to investigate where the time is spent.

[TestDemoParallelLeafReader]

> The DVs can be arbitrary (not just long); it's only that the test cases focuses on long.

My point was that there does not seem to be any auto-guessing of field type (especially NumericType for numeric values) in the code. Since guessing would not guarantee correct results anyway, it seems better to require the user to be specific about what should happen.

> Have a look @ the LUCENE-6005 branch: I broke this test out as a separate ReindexingReader + test. I think we could do a better integration between that and the schema...

Down to practicalities, we need Lucene 4.8 as our DocValues are Disk based and that support was removed in 4.9. I hope to find the time to look at your better solution in January.

Regards,
Toke Eskildsen, State and University Library, Denmark
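The figures quoted in this message can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the thread (8 days on 24 CPUs, a 2-day single-threaded conversion, an 8-hour optimize, 675M references from 256M documents to 3M unique values); the class and method names are made up for the example.

```java
import java.util.Locale;

/** Back-of-the-envelope check of the numbers quoted in the thread. */
class ConversionCost {
    /** CPU days consumed by a job that keeps `cpus` cores busy for `wallClockDays`. */
    static double cpuDays(double wallClockDays, int cpus) {
        return wallClockDays * cpus;
    }

    public static void main(String[] args) {
        double indexing = cpuDays(8, 24);   // full index build
        double conversion = cpuDays(2, 1);  // single-threaded DV conversion
        double optimizeHours = 8;

        System.out.printf(Locale.ROOT, "indexing:   %.0f CPU days%n", indexing);
        System.out.printf(Locale.ROOT, "conversion: %.0f CPU days (%.1f%% of indexing)%n",
                conversion, 100 * conversion / indexing);
        System.out.printf(Locale.ROOT, "conversion wall clock vs optimize: %.0fx%n",
                2 * 24 / optimizeHours);

        double refs = 675e6, docs = 256e6, uniqueValues = 3e6;
        System.out.printf(Locale.ROOT, "references per document: %.2f%n", refs / docs);
        System.out.printf(Locale.ROOT, "references per unique value: %.0f%n", refs / uniqueValues);
    }
}
```

In CPU time the conversion is roughly 1% of the cost of a full reindex, even though in wall-clock time it is six times slower than an optimize.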
Re: Determining NumericType for a field
> Down to practicalities, we need Lucene 4.8 as our DocValues are Disk based and that support was removed in 4.9.

I assume you’re referring to the “Disk” DV format/Codec? The standard format has the data on disk too; it’s just that there are some “small” (relative to the disk data) lookup references in heap/memory, whereas the codec you’re using doesn’t have them. Are you sure the standard codec isn’t sufficient?

If your use-case shows that there’s a need for the disk codec, I think it could be brought back, perhaps into the codecs module. You could copy the code too to use newer Lucene versions… although I recall some push vs pull API changes so I don’t know what it would take to bring it up to date.

I’m curious what Rob Muir says about this.

~ David
Re: Determining NumericType for a field
On Mon, 2014-12-15 at 14:23 +0100, david.w.smi...@gmail.com wrote:
> Toke:
>> Down to practicalities, we need Lucene 4.8 as our DocValues are Disk based and that support was removed in 4.9.
>
> I assume you’re referring to the “Disk” DV format/Codec? The standard format has the data on disk too, it’s just that there’s some “small” (relative to the disk data) lookup references in heap/memory whereas the codec you’re using doesn’t. Are you sure the standard codec isn’t sufficient?

As we have not tried anything other than Disk for our Net Archive index, we have no comparison with standard (or whatever it is called). We have no real preference and our next shards will be built with standard. The only reason for Disk is that it seemed like a good idea at the time, and now we have 20TB of index with it.

We would like to convert away from Disk too, but we would like to avoid having to do a two-pass upgrade (Disk -> standard followed by non-DV -> DV), so the DVEnabling code should preferably support Disk for reading and do it all as a single pass.

> If your use-case shows that there’s a need for the disk codec, I think it could be brought back, perhaps into the codecs module.

I think the removal of Disk during a minor version increase was not in line with the backwards compatibility spirit of Solr. But I am sure it was marked Experimental somewhere in the code and that the removal obeyed the stated rules. Anyway, done is done, and we have no future need for Disk. But thanks for the suggested fix.

> You could copy the code too to use newer Lucene versions…

We looked at that some time back and the code tentacles reached too far for us to dare grapple with.

Regards,
Toke Eskildsen, State and University Library, Denmark
Determining NumericType for a field
I am attempting to write some code for removing or adding DocValues for an existing Lucene index: https://github.com/netarchivesuite/dvenabler

I have a proof of concept running, but it is not very user friendly. Ideally the user should be presented with a list of fields and simply select which ones should have DocValues. However, in order to do so, I need to determine whether a NumericField was indexed as INT, LONG, FLOAT or DOUBLE. That information is present in FieldType at index time, but I cannot figure out whether it is possible to extract it from an existing index. If it is not possible to determine with certainty, I could use a way of performing a best guess.

On a similar note, does Lucene have a concept of single- and multi-value stored fields, or do I have to infer that by iterating over all the documents and checking each one?

- Toke Eskildsen, State and University Library, Denmark
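To make the "best guess" idea concrete, here is a minimal sketch of one possible heuristic. It assumes that a sample of the field's values can be read back as decimal strings (for example from stored fields), and it picks the narrowest of INT, LONG, FLOAT and DOUBLE that represents every sample. NumericTypeGuesser and its Guess enum are hypothetical names for this example and are not Lucene classes. The result is only a guess: a LONG field whose sampled values all happen to fit in 32 bits would be reported as INT.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

/** Hypothetical best-guess helper; not part of Lucene or dvenabler. */
class NumericTypeGuesser {
    /** The four options named in the question, plus UNKNOWN for non-numeric input. */
    enum Guess { INT, LONG, FLOAT, DOUBLE, UNKNOWN }

    /** Returns the narrowest type that can hold every sample, or UNKNOWN. */
    static Guess guess(List<String> samples) {
        if (samples.isEmpty()) {
            return Guess.UNKNOWN;
        }
        boolean allIntegral = true; // every sample is a whole number within long range
        boolean fitsInt = true;     // every whole number also fits in 32 bits
        boolean fitsFloat = true;   // float precision is enough for every sample
        for (String raw : samples) {
            String s = raw.trim();
            BigInteger whole = parseWhole(s);
            if (whole != null && whole.bitLength() <= 63) {
                if (whole.bitLength() > 31) {
                    fitsInt = false;
                }
            } else {
                allIntegral = false;
            }
            try {
                // If the shortest text form of the value is the same for float and
                // double, float precision is plausibly enough for this sample.
                String asDouble = Double.toString(Double.parseDouble(s));
                String asFloat = Float.toString(Float.parseFloat(s));
                if (!asDouble.equals(asFloat)) {
                    fitsFloat = false;
                }
            } catch (NumberFormatException e) {
                return Guess.UNKNOWN; // not a number at all
            }
        }
        if (allIntegral) {
            return fitsInt ? Guess.INT : Guess.LONG;
        }
        return fitsFloat ? Guess.FLOAT : Guess.DOUBLE;
    }

    /** Parses a whole number of any size, or returns null if s is not one. */
    private static BigInteger parseWhole(String s) {
        try {
            return new BigInteger(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess(Arrays.asList("1", "42", "-7")));
        System.out.println(guess(Arrays.asList("1", "3000000000")));
        System.out.println(guess(Arrays.asList("0.1", "2.5")));
        System.out.println(guess(Arrays.asList("3", "0.1234567890123")));
    }
}
```

Because a sample can never prove the type, a tool would probably want to present such a guess as a default that the user can override.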
Re: Determining NumericType for a field
No, Lucene does not store numeric type nor multi-valued-ness today; it's frustrating.

In the LUCENE-6005 branch I'm exploring fixing that, and it's going well, but there are many challenges/nocommits.

In the meantime, maybe you could model your tool after UninvertingReader? It faces the same issue (lack of schema) and lets the user specify the type.

Also, see (the confusingly named) TestDemoParallelLeafReader? It lets you partially reindex, e.g. derive new indexed fields or DV fields, etc., from existing stored/DV fields, in an NRT manner.

Mike McCandless
http://blog.mikemccandless.com

On Wed, Dec 10, 2014 at 9:12 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
> I am attempting to write some code for removing or adding DocValues for an existing Lucene index: https://github.com/netarchivesuite/dvenabler
>
> I have a proof of concept running, but it is not very user friendly. Ideally the user should be presented with a list of fields and simply select which ones should have DocValues. However, in order to do so, I need to determine whether a NumericField was indexed as INT, LONG, FLOAT or DOUBLE. That information is present in FieldType at index time, but I cannot figure out whether it is possible to extract it from an existing index. If it is not possible to determine with certainty, I could use a way of performing a best guess.
>
> On a similar note, does Lucene have a concept of single- and multi-value stored fields, or do I have to infer that by iterating over all the documents and checking each one?
>
> - Toke Eskildsen, State and University Library, Denmark
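Since multi-valued-ness is not recorded, inferring it means a full pass over the documents, as the question suspects. The following is a minimal sketch of such a pass. To stay self-contained it models each document as a plain map from field name to the list of stored values; that document model and the MultiValueScanner name are assumptions for this example, and a real tool would read the stored fields through Lucene's reader API instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical full-scan check; documents are modelled as field -> stored values. */
class MultiValueScanner {
    /** Returns, per field, whether any document holds more than one value for it. */
    static Map<String, Boolean> scan(Iterable<Map<String, List<String>>> docs) {
        Map<String, Boolean> multiValued = new TreeMap<>();
        for (Map<String, List<String>> doc : docs) {
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                boolean manyInThisDoc = field.getValue().size() > 1;
                // Once a field has been seen with several values it stays multi-valued.
                multiValued.merge(field.getKey(), manyInThisDoc, Boolean::logicalOr);
            }
        }
        return multiValued;
    }

    public static void main(String[] args) {
        Map<String, List<String>> first = new HashMap<>();
        first.put("id", Arrays.asList("1"));
        first.put("tag", Arrays.asList("a"));
        Map<String, List<String>> second = new HashMap<>();
        second.put("id", Arrays.asList("2"));
        second.put("tag", Arrays.asList("a", "b"));

        List<Map<String, List<String>>> docs = new ArrayList<>();
        docs.add(first);
        docs.add(second);
        System.out.println(scan(docs));
    }
}
```

Note that a "single-valued" result only describes the documents scanned so far; nothing stops a later document from adding a second value to the same field.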