
Re: Determining NumericType for a field

2014-12-15 Thread Toke Eskildsen

On Wed, 2014-12-10 at 15:27 +0100, Michael McCandless wrote:
 No, Lucene does not store numeric type nor multi-valued-ness today;
 it's frustrating.

At least I now know not to dig too deep for non-existing answers,
thanks. Our current code requires the user to be explicit about how the
content of the fields should be treated. Until a more fundamental
change, such as LUCENE-6005, we will leave it at that.

 In the meantime, maybe you could model your tool after
 UninvertingReader?  It faces the same issue (lack of schema) and lets
 the user specify the type.

Yes, that is what we're doing. Unfortunately we cannot use the
UninvertingReader directly due to its restrictions on facet structure
size: We have too many references in our shards so it hits an internal
16M(?) limit. 

Unfortunately our current mapping code from stored multi-value String to
DocValues seems to be very slow: it took nearly 2 days to convert a
single-segment 900GB index, whereas a standard optimize takes only 8 hours.

 Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
 you partially reindex, e.g. derive new indexed fields or DV fields,
 etc., from existing stored/DV fields, in an NRT manner.

Thanks for the pointer. As far as I can see, the demo is very explicit
about the type of DocValues being long, so no auto-guessing there. It's
a very interesting idea though, with seamless DV-enabling.

Thank you,
Toke Eskildsen, State and University Library, Denmark



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Re: Determining NumericType for a field

2014-12-15 Thread Michael McCandless

On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen t...@statsbiblioteket.dk 
wrote:

 In the meantime, maybe you could model your tool after
 UninvertingReader?  It faces the same issue (lack of schema) and lets
 the user specify the type.

 Yes, that is what we're doing. Unfortunately we cannot use the
 UninvertingReader directly due to its restrictions on facet structure
 size: We have too many references in our shards so it hits an internal
 16M(?) limit.

Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

 Unfortunately our current mapping code from stored multi-value String to
 DocValues seems to be very slow: it took nearly 2 days to convert a
 single-segment 900GB index, whereas a standard optimize takes only 8 hours.

That's awful.  Profile it?  But, how long did it take to index in the
first place?

 Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
 you partially reindex, e.g. derive new indexed fields or DV fields,
 etc., from existing stored/DV fields, in an NRT manner.

 Thanks for the pointer. As far as I can see, the demo is very explicit
 about the type of DocValues being long, so no auto-guessing there. It's
 a very interesting idea though, with seamless DV-enabling.

The DVs can be arbitrary (not just long); it's only that the test
cases focus on long.

Have a look @ the LUCENE-6005 branch: I broke this test out as a
separate ReindexingReader + test.  I think we could do a better
integration between that and the schema...

I also added a simpler testSwitchToDocValues test case.  It still
uses only long DVs but you can easily see how you could do other types
too ... I'll add an example of SortedSet.

Mike McCandless

http://blog.mikemccandless.com


Re: Determining NumericType for a field

2014-12-15 Thread Toke Eskildsen

On Mon, 2014-12-15 at 11:33 +0100, Michael McCandless wrote:
 On Mon, Dec 15, 2014 at 4:53 AM, Toke Eskildsen t...@statsbiblioteket.dk 
 wrote:

[Toke: Limit on faceting with many references]

 Hmm that's probably the DocTermOrds 16 MB internal addressing limit?

Yes, we've hit that one before. If we did not have DocValues, I would
consider it a serious deficiency of Solr.

For one of the fields in the shard I tested, we had 675M references from
256M documents to 3M unique values, with the most popular value having
18M references.

(all of which works perfectly fine & fast with DocValues, yay!)

[2 days for conversion of 900GB index]

 That's awful.  Profile it?  But, how long did it take to index in the
 first place?

Full index takes 8 days with 24 CPUs going full tilt ~=192 CPU days.
Conversion is (sadly) single threaded, so measured in total CPU time, it
is just the 2 days. Still, we can't scale parallel conversions of
multiple shards very high due to limited local storage space.
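
The comparison above is in CPU days, i.e. wall-clock days multiplied by the
number of busy CPUs. A minimal sketch of that arithmetic follows; the class
and method names are made up for illustration and the figures are the ones
quoted in this mail.

```java
/** Hypothetical helper for the CPU-day comparison; not part of any library. */
final class CpuDays {
    /** CPU days = wall-clock days multiplied by the number of fully busy CPUs. */
    static double cpuDays(double wallClockDays, int busyCpus) {
        return wallClockDays * busyCpus;
    }

    public static void main(String[] args) {
        double fullIndex = cpuDays(8, 24);  // full index: 8 days on 24 CPUs
        double conversion = cpuDays(2, 1);  // conversion: 2 days, single threaded
        System.out.println("Full index: " + fullIndex + " CPU days");
        System.out.println("Conversion: " + conversion + " CPU days");
    }
}
```

Measured this way the conversion is cheap; the 2 days only hurt as
wall-clock time because the work is not spread over several threads.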

I'll put a lot more timing debug logging into the code to investigate
where the time is spent.

[TestDemoParallelLeafReader]

 The DVs can be arbitrary (not just long); it's only that the test
 cases focus on long.

My point was that there does not seem to be any auto-guessing of field
type (especially NumericType for numeric values) in the code. Since
guessing would not guarantee correct results anyway, it seems better to
require the user to be specific about what should happen.
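
As an illustration of requiring the user to be specific, here is a minimal
sketch that parses an explicit "field:TYPE" list. The spec format, the
DvType enum and the class name are all assumptions made up for this
example; they are neither Lucene API nor necessarily what dvenabler does.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical parser for a user-supplied "field:TYPE,field:TYPE" spec. */
final class FieldSpecParser {
    /** Local enum for illustration only; not a Lucene type. */
    enum DvType { INT, LONG, FLOAT, DOUBLE, SORTED, SORTED_SET }

    static Map<String, DvType> parse(String spec) {
        Map<String, DvType> result = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2 || parts[0].isEmpty()) {
                throw new IllegalArgumentException(
                        "Expected field:TYPE but got '" + entry + "'");
            }
            // valueOf throws IllegalArgumentException for an unknown type name,
            // so a typo fails loudly instead of being guessed at.
            result.put(parts[0], DvType.valueOf(parts[1].toUpperCase(Locale.ROOT)));
        }
        return result;
    }
}
```

A call such as FieldSpecParser.parse("year:INT, size:LONG") yields a map
from field name to the requested type, and anything malformed is rejected.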

 Have a look @ the LUCENE-6005 branch: I broke this test out as a
 separate ReindexingReader + test.  I think we could do a better
 integration between that and the schema...

Down to practicalities, we need Lucene 4.8 as our DocValues are Disk
based and that support was removed in 4.9. I hope to find the time to
look at your better solution in January.

Regards,
Toke Eskildsen, State and University Library, Denmark




Re: Determining NumericType for a field

2014-12-15 Thread david.w.smi...@gmail.com

 Down to practicalities, we need Lucene 4.8 as our DocValues are Disk
 based and that support was removed in 4.9.


I assume you’re referring to the “Disk” DV format/Codec?  The standard
format has the data on disk too, it’s just that there’s some “small”
(relative to the disk data) lookup references in heap/memory whereas the
codec you’re using doesn’t.  Are you sure the standard codec isn’t
sufficient?  If your use-case shows that there’s a need for the disk codec,
I think it could be brought back, perhaps into the codecs module.  You
could copy the code too to use newer Lucene versions… although I recall
some push vs pull API changes so I don’t know what it would take to bring
it up to date.  I’m curious what Rob Muir says about this.

~ David

Re: Determining NumericType for a field

2014-12-15 Thread Toke Eskildsen

On Mon, 2014-12-15 at 14:23 +0100, david.w.smi...@gmail.com wrote:

Toke:
 Down to practicalities, we need Lucene 4.8 as our DocValues
 are Disk
 based and that support was removed in 4.9.

 I assume you’re referring to the “Disk” DV format/Codec?  The standard
 format has the data on disk too, it’s just that there’s some
 “small” (relative to the disk data) lookup references in heap/memory
 whereas the codec you’re using doesn’t.  Are you sure the standard
 codec isn’t sufficient?

As we have not tried anything other than Disk for our Net Archive
index, we have no comparison with standard (or whatever it is called).
We have no real preference and our next shards will be built with
standard. The only reason for Disk is that it seemed like a good idea at
the time and now we have 20TB of index with it.

We would like to convert away from Disk too, but we would like to
avoid having to do a two-pass upgrade (Disk -> standard followed by
non-DV -> DV), so the DVEnabling code should preferably support
Disk for reading and do it all as single-pass.

   If your use-case shows that there’s a need for the disk codec, I
 think it could be brought back, perhaps into the codecs module.

I think the removal of Disk during a minor version increase was not in
line with the backwards compatibility spirit of Solr. But I am sure it
was marked Experimental somewhere in the code and that the removal
obeyed the stated rules.

Anyway, done is done, and we have no future need for Disk. But
thanks for the suggested fix.

   You could copy the code too to use newer Lucene versions…

We looked at that some time back and the code tentacles reached too far
for us to dare grapple with.

Regards,
Toke Eskildsen, State and University Library, Denmark





Determining NumericType for a field

2014-12-10 Thread Toke Eskildsen

I am attempting to write some code for removing or adding DocValues for
an existing Lucene index: https://github.com/netarchivesuite/dvenabler
I have a proof of concept running, but it is not very user friendly.

Ideally the user should be presented with a list of fields and simply
select which ones should have DocValues. However, in order to do so, I
need to determine whether a NumericField was indexed as INT, LONG, FLOAT or
DOUBLE.

That information is present in FieldType at index time, but I cannot
figure out whether it is possible to extract it from an existing index.
If it is not possible to determine with certainty, I could use a way of
performing a best-guess.
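
As a rough illustration of such a best-guess, the sketch below looks only
at the textual stored values of a field and picks the narrowest type that
fits all of them. It is plain Java and uses no Lucene API; the class name
and the Guess enum are made up for this example. The result is a heuristic:
a LONG field that happens to hold only small values will be reported as
INT, so the guess should be confirmed by the user.

```java
import java.util.List;

/** Hypothetical helper, not part of Lucene: guesses a numeric type from stored values. */
final class NumericTypeGuesser {
    /** Local enum mirroring the four kinds named above; not Lucene's NumericType. */
    enum Guess { INT, LONG, FLOAT, DOUBLE, UNKNOWN }

    static Guess guess(List<String> storedValues) {
        if (storedValues.isEmpty()) return Guess.UNKNOWN;
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean fitsFloat = true;
        for (String raw : storedValues) {
            String value = raw.trim();
            try {
                long l = Long.parseLong(value);
                if (l < Integer.MIN_VALUE || l > Integer.MAX_VALUE) fitsInt = false;
                continue; // integral value handled, move on to the next one
            } catch (NumberFormatException e) {
                allIntegral = false;
            }
            try {
                Double.parseDouble(value); // throws if the value is not numeric at all
                // Treat it as a float only if the text survives a float round trip.
                if (!Float.toString(Float.parseFloat(value)).equals(value)) fitsFloat = false;
            } catch (NumberFormatException e) {
                return Guess.UNKNOWN;
            }
        }
        if (allIntegral) return fitsInt ? Guess.INT : Guess.LONG;
        return fitsFloat ? Guess.FLOAT : Guess.DOUBLE;
    }
}
```

The float check assumes values were stored in the form Float.toString
produces; values written in another textual form fall back to DOUBLE.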

On a similar note, does Lucene have a concept of single and multi-value
stored fields or do I have to infer that by iterating all the documents
and checking each one?
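
The "iterate and check" fallback can be sketched as follows. Documents are
modelled here as plain lists of {fieldName, value} pairs rather than real
Lucene documents, and the class and method names are hypothetical. A field
counts as multi-valued as soon as any single document repeats its name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch, not Lucene API: finds fields repeated within a document. */
final class MultiValueScanner {
    /** Each document is a list of {fieldName, value} pairs; names may repeat. */
    static Set<String> multiValuedFields(List<List<String[]>> documents) {
        Set<String> multiValued = new HashSet<>();
        for (List<String[]> document : documents) {
            Map<String, Integer> countsInDoc = new HashMap<>();
            for (String[] field : document) {
                // Count occurrences of this field name within the current document.
                int count = countsInDoc.merge(field[0], 1, Integer::sum);
                if (count > 1) multiValued.add(field[0]);
            }
        }
        return multiValued;
    }
}
```

This touches every stored document once, so on a large index it is a full
scan; it can only prove a field multi-valued, never guarantee that future
documents stay single-valued.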

- Toke Eskildsen, State and University Library, Denmark




Re: Determining NumericType for a field

2014-12-10 Thread Michael McCandless

No, Lucene does not store numeric type nor multi-valued-ness today;
it's frustrating.

In LUCENE-6005 branch I'm exploring fixing that, and it's going well,
but there are many challenges/nocommits.

In the meantime, maybe you could model your tool after
UninvertingReader?  It faces the same issue (lack of schema) and lets
the user specify the type.

Also, see (the confusingly named) TestDemoParallelLeafReader?  It lets
you partially reindex, e.g. derive new indexed fields or DV fields,
etc., from existing stored/DV fields, in an NRT manner.



Mike McCandless

http://blog.mikemccandless.com



