Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-17 Thread alessandro.benedetti
Shawn Heisey wrote
> If the data for a field in the results comes from docValues instead of 
> stored fields, I don't think it is compressed, which hopefully means 
> that if a field is NOT requested, the corresponding docValues data is 
> never read. 

I think one point needs clarifying here.
DocValues is a per-field (column-oriented) data structure,
and it is compressed on disk (
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene70/Lucene70DocValuesFormat.java
).
I took a brief look at the code, and I was wondering whether the
index reader reads the entire segment content for all the docValues of each
type (all the numeric docValues, all the binary ones, etc.) and then puts
into a map in the Solr process heap only the docValues for the fields
requested.
In that case the OS would memory-map the whole content for the segment, and
when Solr requests a specific field, it would access the entry in the map on
the heap (this was a really brief look at Lucene70DocValuesProducer, so I may
be completely wrong).
If this is correct, it means that we read the entire docValues content from
the segment, even the portion belonging to a field that is not
requested.
The only difference would be that the content of a docValues field that is
not requested would not be stored in the Solr process heap.
Is there anywhere to read more about this (apart from the source code)?
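
To make the question concrete, here is a minimal Python sketch of the general
mechanism being asked about. It is not a model of Lucene70DocValuesProducer;
the file layout, the field names and the field_slices table are all invented
for illustration. It only shows that mapping a whole file and reading the
whole file are different operations: the mapping covers every byte, but only
the slice that is touched is copied out, and the OS normally pages data in on
access (read-ahead aside).

```python
import mmap
import os
import tempfile

# Hypothetical layout: two "columns" stored back to back in one data file.
small = b"s" * 64
large = b"L" * (4 * 1024 * 1024)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(small + large)
    path = f.name

# Stand-in for per-field metadata: field name -> (offset, length) in the file.
field_slices = {
    "small_field": (0, len(small)),
    "large_field": (len(small), len(large)),
}

def read_field(path, name):
    """Map the whole file, but touch only the byte range of one field."""
    offset, length = field_slices[name]
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]

value = read_field(path, "small_field")
os.remove(path)
```

Whether Lucene actually slices per field like this is exactly the open
question, so treat the sketch as a way to phrase it rather than an answer.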

Cheers








Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Erick Erickson
bq: Is my understanding about stored fields correct, that even if excluded
from fl, the data on the disk for a given field would still be read as
part of decompression..

Assuming any stored field (NOT docvalues) was read then this is, indeed,
correct. To be pedantic about it, enough 16K blocks will be read/decompressed
to get all the fields of the doc, then the necessary fields will be extracted.

I gather it's kind of hard to index into a compressed blob and extract just a
specific field.
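
A rough Python sketch of why that is, under loud assumptions: zlib and JSON
are stand-ins chosen for convenience, not what Lucene uses, and the documents
and the fetch_fields helper are made up. The point is only that when several
fields are compressed together as one block, asking for two small fields
still inflates the large one.

```python
import json
import zlib

# Hypothetical block: a few documents serialized together, compressed as a unit.
docs = [
    {"id": str(i), "title": f"doc {i}", "body": "x" * 5000}
    for i in range(3)
]
block = zlib.compress(json.dumps(docs).encode("utf-8"))

def fetch_fields(block, doc_index, wanted):
    """Return only the wanted fields, plus how many bytes had to be inflated."""
    raw = zlib.decompress(block)  # the whole block is decompressed, body included
    doc = json.loads(raw)[doc_index]
    picked = {name: value for name, value in doc.items() if name in wanted}
    return picked, len(raw)

fields, bytes_inflated = fetch_fields(block, 1, {"id", "title"})
```

Here fields holds just id and title, while bytes_inflated covers every body
in the block, which is the cost Shawn is asking about.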

Alexandre:

I think Shawn's looking for the opposite. If useDocValuesAsStored="true"
do _not_ fetch any fields where docValues=false and stored=true for
fl=*.

Best,
Erick





Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Alexandre Rafalovitch
On 13 January 2017 at 14:40, Shawn Heisey  wrote:
> What if there were a schema option that would skip docValue retrieval
> for a field unless the fl parameter were to *explicitly* ask for that
> field?  With a typical wildcard value in fl, fields with this option
> enabled would not be retrieved.

Isn't that what useDocValuesAsStored="false" does? As per Ref Guide:
When useDocValuesAsStored="false", non-stored DocValues fields can
still be explicitly requested by name in the fl param, but will not
match glob patterns ("*").
https://cwiki.apache.org/confluence/display/solr/DocValues
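
As a toy model of that Ref Guide sentence (not Solr's implementation), the
Python sketch below resolves an fl list against a made-up schema. The field
names, the SCHEMA dict and resolve_fl are all hypothetical; the only behavior
taken from the Ref Guide is that a non-stored docValues field with
useDocValuesAsStored="false" is returned when named explicitly but does not
match a glob.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: field name -> properties (names invented for illustration).
SCHEMA = {
    "id":       {"stored": True,  "docValues": False, "useDocValuesAsStored": True},
    "price":    {"stored": False, "docValues": True,  "useDocValuesAsStored": True},
    "raw_json": {"stored": False, "docValues": True,  "useDocValuesAsStored": False},
}

def resolve_fl(fl, schema=SCHEMA):
    """Return the field names a given fl list would select in this toy model."""
    selected = []
    for name, props in schema.items():
        if not (props["stored"] or props["docValues"]):
            continue  # nothing to retrieve the value from
        # Globs only see stored fields and docValues fields that act as stored.
        glob_visible = props["stored"] or props["useDocValuesAsStored"]
        for pattern in fl:
            if pattern == name:
                selected.append(name)  # explicit request always wins
                break
            if "*" in pattern and glob_visible and fnmatchcase(name, pattern):
                selected.append(name)
                break
    return selected

wildcard_only = resolve_fl(["*"])
wildcard_plus_explicit = resolve_fl(["*", "raw_json"])
```

In this model raw_json is absent from wildcard_only and present in
wildcard_plus_explicit.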

Regards,
   Alex.


http://www.solr-start.com/ - Resources for Solr users, new and experienced


Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Shawn Heisey
On 1/13/2017 1:02 PM, Erick Erickson wrote:
> What about using the defaults in requestHandlers along with SOLR-3191
> to accomplish this? Let's say that there was an fl-exclusion
> parameter. Now you'd be able to define an exclusion default that would
> exclude your field(s) unless overridden in your request handler. This
> could be either a default or invariant depending on how strictly you
> wanted to enforce not being able to retrieve the field. 

If it's done with a parameter, I would want the parameter to work
correctly if included multiple times, then add an exclusion default to
the appends section rather than defaults or invariants.
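
To spell out why appends is the better fit, here is a small Python model of
how the three handler sections combine with request parameters. It is a
sketch of the documented semantics, not Solr code, and "fl.exclude" is a
hypothetical parameter name standing in for the exclusion parameter being
discussed, which does not exist.

```python
def merge_params(request, defaults=None, appends=None, invariants=None):
    """Combine multi-valued request params with handler-level sections."""
    merged = {key: list(values) for key, values in (defaults or {}).items()}
    for key, values in request.items():
        merged[key] = list(values)                 # request replaces defaults
    for key, values in (appends or {}).items():
        merged.setdefault(key, []).extend(values)  # appends add to the request
    for key, values in (invariants or {}).items():
        merged[key] = list(values)                 # invariants override everything
    return merged

# A client that sends its own exclusion alongside the handler-level one.
request = {"fl": ["*"], "fl.exclude": ["debug_blob"]}
handler_exclusion = {"fl.exclude": ["raw_json"]}

with_appends = merge_params(request, appends=handler_exclusion)
with_defaults = merge_params(request, defaults=handler_exclusion)
with_invariants = merge_params(request, invariants=handler_exclusion)
```

With defaults the client's own value silently drops the handler exclusion,
with invariants the client's value is discarded, and only appends keeps both,
which is why the parameter would need to behave correctly when it appears
more than once.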

> And one thing about your notion. docValues are only primitive types,
> i.e. string in this case. There's a limit I believe on how big these
> can be, 32K? Which seems rather restrictive in this case so we're back
> to stored.

Oh, fun.  32K might be enough for my index, but it is not enough for
general usage.

Is my understanding about stored fields correct, that even if excluded
from fl, the data on the disk for a given field would still be read as
part of decompression?  That's what I was hoping to avoid by using 
docValues.  Just how much pain would be involved in implementing an
option to disable stored field compression, and if that became possible,
would it avoid the need to read field data that isn't used?

Thanks,
Shawn



Re: A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Erick Erickson
What about using the defaults in requestHandlers
along with SOLR-3191 to accomplish this? Let's
say that there was an fl-exclusion parameter. Now
you'd be able to define an exclusion default that
would exclude your field(s) unless overridden in your
request handler. This could be either a default or
invariant depending on how strictly you wanted to
enforce not being able to retrieve the field.

I'm not entirely sure how I feel about this option, but
wanted to throw it out for discussion. It does seem
easier to keep track of than another schema field
option.

I see no reason to make a distinction between
docValues only and stored-only though.

And one thing about your notion. docValues are only
primitive types, i.e. string in this case. There's a limit
I believe on how big these can be, 32K? Which seems
rather restrictive in this case so we're back to stored.

Not sure if that limit is configurable or not.
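
For anyone wanting to check their own data, a minimal Python sketch follows.
It assumes the limit I'm half-remembering is Lucene's 32766-byte maximum term
length and that it is counted in UTF-8 bytes rather than characters; both are
assumptions here, and fits_in_docvalues is a made-up helper.

```python
# Assumption: the limit is 32766 bytes, measured on the UTF-8 encoding.
ASSUMED_MAX_BYTES = 32766

def fits_in_docvalues(value, limit=ASSUMED_MAX_BYTES):
    """Check whether a string value stays within the assumed byte limit."""
    return len(value.encode("utf-8")) <= limit

ascii_ok = fits_in_docvalues("a" * 32766)
ascii_too_long = fits_in_docvalues("a" * 32767)
# 20000 two-byte characters are 40000 bytes, over the limit despite the length.
accented = fits_in_docvalues("\u00e9" * 20000)
```

Counting bytes rather than characters matters for non-ASCII text, where a
value can exceed the limit well before it reaches 32K characters.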

Erick





A feature idea for discussion -- fields that can only be explicitly retrieved

2017-01-13 Thread Shawn Heisey
I've got an idea for a feature that I think could be very useful.  I'd
like to get some community feedback about it, see whether it's worth
opening an issue for discussion.

First, some background info:

As I understand it, the fact that stored fields are compressed means
that even if a particular stored field is not requested in the fl
parameter, the data on disk for that field must still be read, in order
to decompress the data and find the fields that ARE desired.  If one of
the stored fields that's NOT requested is really large, that would
pollute the OS disk cache with useless data.

If the data for a field in the results comes from docValues instead of
stored fields, I don't think it is compressed, which hopefully means
that if a field is NOT requested, the corresponding docValues data is
never read.

And now for the idea:

What if there were a schema option that would skip docValue retrieval
for a field unless the fl parameter were to *explicitly* ask for that
field?  With a typical wildcard value in fl, fields with this option
enabled would not be retrieved.  If the field is not stored, not
indexed, but has docValues, I *think* its presence on the disk would not
affect performance (OS disk cache efficiency) unless its data is
returned in results.

One practical application, should my theory about docValues prove to be
accurate:  Implementing a field that contains all the data sent for
indexing, which could then be used for completely internal reindexing. 
A field like this would probably be detrimental to performance unless it
could be automatically excluded without the client asking for the exclusion.

SOLR-3191 is a sort-of related issue.  This links to SOLR-9467, which
made me think of another potential use -- making it so certain fields
are semi-secure because they aren't returned unless they are explicitly
requested.  It wouldn't be TRULY secure, of course.

Thanks,
Shawn