Re: No or limited use of FieldCache

2013-09-12 Thread Per Steffensen

On 9/12/13 3:28 PM, Toke Eskildsen wrote:

> On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote:
>
>> Actually some months back I made a PoC of a FieldCache that could expand
>> beyond the heap. Basically imagine a FieldCache with room for
>> "unlimited" data-arrays, that just behind the scenes goes to
>> memory-mapped files when there is no more room on heap.
>
> That sounds a lot like disk-based DocValues.

He he

>> But that solution will also have the "running out of swap space"-problems.
>
> Not really. Memory mapping works like the disk cache: There is no
> requirement that a certain amount of physical memory needs to be
> available, it just takes what it can get. If there is not a lot of
> physical memory, it will require a lot of storage access, but it will
> not over-allocate swap space.

That was also my impression, but during the work I experienced some
problems around swap space. I do not remember exactly what I saw, or
how I concluded that everything in mm-files actually has to fit in
physical mem + swap. I might very well have been wrong in that
conclusion.

> It seems that different setups vary quite a lot in this area and some
> systems are prone to aggressive use of the swap file, which can severely
> harm responsiveness of applications with out-swapped data.
>
> However, this should still not result in any OOM's, as the system can
> always discard some of the memory mapped data if it needs more physical
> memory.

I saw no OOMs.

> - Toke Eskildsen, State and University Library, Denmark





Re: No or limited use of FieldCache

2013-09-12 Thread Toke Eskildsen
On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote:
> Actually some months back I made a PoC of a FieldCache that could expand
> beyond the heap. Basically imagine a FieldCache with room for 
> "unlimited" data-arrays, that just behind the scenes goes to 
> memory-mapped files when there is no more room on heap.

That sounds a lot like disk-based DocValues.

[...]

> But that solution will also have the "running out of swap space"-problems.

Not really. Memory mapping works like the disk cache: There is no
requirement that a certain amount of physical memory needs to be
available, it just takes what it can get. If there is not a lot of
physical memory, it will require a lot of storage access, but it will
not over-allocate swap space.
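
To make the memory-mapping point concrete, here is a minimal Java sketch.
It is not Lucene code: the file layout (one 4-byte int per document) and the
names MappedColumn, writeColumn, mapColumn and valueAt are assumptions made
purely for illustration. The values are read through a MappedByteBuffer, so
they live in the OS page cache rather than on the Java heap.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration: one 4-byte int per document, addressed by docId.
public class MappedColumn {

    // Writes the values to a file, 4 bytes each (big-endian, the ByteBuffer default).
    public static void writeColumn(Path file, int[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int v : values) {
            buf.putInt(v);
        }
        Files.write(file, buf.array());
    }

    // Maps the file read-only. The mapping is backed by the file itself, so the
    // OS can drop clean pages under memory pressure and re-read them later.
    public static MappedByteBuffer mapColumn(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Random access by document id; may cause a page fault and a disk read.
    public static int valueAt(MappedByteBuffer column, int docId) {
        return column.getInt(docId * Integer.BYTES);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("column", ".bin");
        file.toFile().deleteOnExit();
        writeColumn(file, new int[] {7, 42, 13});
        MappedByteBuffer column = mapColumn(file);
        System.out.println(valueAt(column, 1)); // prints 42
    }
}
```

Nothing in this sketch allocates heap proportional to the file size; how
eagerly the OS keeps or evicts the mapped pages is system-dependent.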


It seems that different setups vary quite a lot in this area and some
systems are prone to aggressive use of the swap file, which can severely
harm responsiveness of applications with out-swapped data.

However, this should still not result in any OOM's, as the system can
always discard some of the memory mapped data if it needs more physical
memory.

- Toke Eskildsen, State and University Library, Denmark




Re: No or limited use of FieldCache

2013-09-12 Thread Per Steffensen

Yes, thanks.

Actually some months back I made a PoC of a FieldCache that could expand
beyond the heap. Basically imagine a FieldCache with room for
"unlimited" data-arrays, that just behind the scenes goes to
memory-mapped files when there is no more room on heap. I never finished
it, and it might be kinda stupid because you actually just go read the
data from the Lucene indices and write them to memory-mapped files in
order to use them. It is better to just use the data in the Lucene
indices directly. But it had some nice features. That solution will
also have the "running out of swap space"-problems, though.


Regards, Per Steffensen

On 9/12/13 12:48 PM, Erick Erickson wrote:

> Per:
>
> One thing I'll be curious about. From my reading of DocValues, it uses
> little or no heap. But it _will_ use memory from the OS if I followed
> Simon's slides correctly. So I wonder if you'll hit swapping issues...
> Which are better than OOMs, certainly...
>
> Thanks,
> Erick




Re: No or limited use of FieldCache

2013-09-12 Thread Erick Erickson
Per:

One thing I'll be curious about. From my reading of DocValues, it uses
little or no heap. But it _will_ use memory from the OS if I followed
Simon's slides correctly. So I wonder if you'll hit swapping issues...
Which are better than OOMs, certainly...

Thanks,
Erick


On Thu, Sep 12, 2013 at 2:07 AM, Per Steffensen  wrote:

> Thanks, guys. Now I know a little more about DocValues and realize that
> they will do the job wrt FieldCache.
>
> Regards, Per Steffensen
>
>
> On 9/12/13 3:11 AM, Otis Gospodnetic wrote:
>
>> Per,  check zee Wiki, there is a page describing docvalues. We used them
>> successfully in a solr for analytics scenario.
>>
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>> On Sep 11, 2013 9:15 AM, "Michael Sokolov" wrote:
>>
>>> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>>>
>>>> The reason I mention sort is that we in my project, half a year ago,
>>>> have dealt with the FieldCache->OOM-problem when doing sort-requests.
>>>> We basically just reject sort-requests unless they hit below X
>>>> documents - in case they do we just find them without sorting and
>>>> sort them ourselves afterwards.
>>>>
>>>> Currently our problem is, that we have to do a group/distinct (in
>>>> SQL-language) query and we have found that we can do what we want to
>>>> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
>>>> - either will work for us. Problem is that they both use FieldCache
>>>> and we "know" that using FieldCache will lead to OOM-exceptions with
>>>> the amount of data each of our Solr-nodes administers. This time we
>>>> have really no option of just "limit" usage as we did with sort.
>>>> Therefore we need a group/distinct-functionality that works even on
>>>> huge data-amounts (and an algorithm using FieldCache will not)
>>>>
>>>> I believe setting facet.method=enum will actually make facet not use
>>>> the FieldCache. Is that true? Is it a bad idea?
>>>>
>>>> I do not know much about DocValues, but I do not believe that you will
>>>> avoid FieldCache by using DocValues? Please elaborate, or point to
>>>> documentation where I will be able to read that I am wrong. Thanks!
>>>
>>> There is Simon Willnauer's presentation
>>> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>>>
>>> and this blog post
>>> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>>>
>>> and this one that shows some performance comparisons:
>>> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>


Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
Thanks, guys. Now I know a little more about DocValues and realize that 
they will do the job wrt FieldCache.


Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:

> Per, check zee Wiki, there is a page describing docvalues. We used them
> successfully in a solr for analytics scenario.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Sep 11, 2013 9:15 AM, "Michael Sokolov" wrote:
>
>> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>>
>>> The reason I mention sort is that we in my project, half a year ago,
>>> have dealt with the FieldCache->OOM-problem when doing sort-requests.
>>> We basically just reject sort-requests unless they hit below X
>>> documents - in case they do we just find them without sorting and
>>> sort them ourselves afterwards.
>>>
>>> Currently our problem is, that we have to do a group/distinct (in
>>> SQL-language) query and we have found that we can do what we want to
>>> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
>>> - either will work for us. Problem is that they both use FieldCache
>>> and we "know" that using FieldCache will lead to OOM-exceptions with
>>> the amount of data each of our Solr-nodes administers. This time we
>>> have really no option of just "limit" usage as we did with sort.
>>> Therefore we need a group/distinct-functionality that works even on
>>> huge data-amounts (and an algorithm using FieldCache will not)
>>>
>>> I believe setting facet.method=enum will actually make facet not use
>>> the FieldCache. Is that true? Is it a bad idea?
>>>
>>> I do not know much about DocValues, but I do not believe that you will
>>> avoid FieldCache by using DocValues? Please elaborate, or point to
>>> documentation where I will be able to read that I am wrong. Thanks!
>>
>> There is Simon Willnauer's presentation
>> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>>
>> and this blog post
>> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>>
>> and this one that shows some performance comparisons:
>> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/







Re: No or limited use of FieldCache

2013-09-11 Thread Otis Gospodnetic
Per,  check zee Wiki, there is a page describing docvalues. We used them
successfully in a solr for analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 9:15 AM, "Michael Sokolov" 
wrote:

> On 09/11/2013 08:40 AM, Per Steffensen wrote:
>
>> The reason I mention sort is that we in my project, half a year ago, have
>> dealt with the FieldCache->OOM-problem when doing sort-requests. We
>> basically just reject sort-requests unless they hit below X documents - in
>> case they do we just find them without sorting and sort them ourselves
>> afterwards.
>>
>> Currently our problem is, that we have to do a group/distinct (in
>> SQL-language) query and we have found that we can do what we want to do
>> using group (http://wiki.apache.org/solr/FieldCollapsing)
>> or facet - either will work for us. Problem is that they both use
>> FieldCache and we "know" that using FieldCache will lead to OOM-exceptions
>> with the amount of data each of our Solr-nodes administers. This time we
>> have really no option of just "limit" usage as we did with sort. Therefore
>> we need a group/distinct-functionality that works even on huge data-amounts
>> (and an algorithm using FieldCache will not)
>>
>> I believe setting facet.method=enum will actually make facet not use the
>> FieldCache. Is that true? Is it a bad idea?
>>
>> I do not know much about DocValues, but I do not believe that you will
>> avoid FieldCache by using DocValues? Please elaborate, or point to
>> documentation where I will be able to read that I am wrong. Thanks!
>>
> There is Simon Willnauer's presentation
> http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene
>
> and this blog post
> http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/
>
> and this one that shows some performance comparisons:
> http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
>
>
>
>


Re: No or limited use of FieldCache

2013-09-11 Thread Michael Sokolov

On 09/11/2013 08:40 AM, Per Steffensen wrote:
> The reason I mention sort is that we in my project, half a year ago,
> have dealt with the FieldCache->OOM-problem when doing sort-requests.
> We basically just reject sort-requests unless they hit below X
> documents - in case they do we just find them without sorting and sort
> them ourselves afterwards.
>
> Currently our problem is, that we have to do a group/distinct (in
> SQL-language) query and we have found that we can do what we want to
> do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet
> - either will work for us. Problem is that they both use FieldCache
> and we "know" that using FieldCache will lead to OOM-exceptions with
> the amount of data each of our Solr-nodes administers. This time we
> have really no option of just "limit" usage as we did with sort.
> Therefore we need a group/distinct-functionality that works even on
> huge data-amounts (and an algorithm using FieldCache will not)
>
> I believe setting facet.method=enum will actually make facet not use
> the FieldCache. Is that true? Is it a bad idea?
>
> I do not know much about DocValues, but I do not believe that you will
> avoid FieldCache by using DocValues? Please elaborate, or point to
> documentation where I will be able to read that I am wrong. Thanks!

There is Simon Willnauer's presentation
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

and this blog post
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

and this one that shows some performance comparisons:
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/






Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
The reason I mention sort is that we in my project, half a year ago, 
have dealt with the FieldCache->OOM-problem when doing sort-requests. We 
basically just reject sort-requests unless they hit below X documents - 
in case they do we just find them without sorting and sort them 
ourselves afterwards.
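
The sort workaround described above can be sketched in plain Java, with no
Solr calls. The names SortGuard, Doc and MAX_SORTABLE_HITS are invented for
this example, and the threshold stands in for the "X" in the text; its real
value is deployment-specific.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortGuard {

    // Hypothetical threshold: the "X" from the text.
    public static final int MAX_SORTABLE_HITS = 10_000;

    // Hypothetical minimal document: an id plus the field to sort on.
    public static final class Doc {
        public final String id;
        public final long timestamp;

        public Doc(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // Rejects the request if the unsorted search matched too many documents;
    // otherwise sorts a copy of the (small) result outside the search engine.
    public static List<Doc> sortIfSmallEnough(long numFound, List<Doc> unsortedHits) {
        if (numFound >= MAX_SORTABLE_HITS) {
            throw new IllegalArgumentException(
                "Sort rejected: " + numFound + " hits, limit is " + MAX_SORTABLE_HITS);
        }
        List<Doc> sorted = new ArrayList<>(unsortedHits);
        sorted.sort(Comparator.comparingLong((Doc d) -> d.timestamp));
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> hits = new ArrayList<>();
        hits.add(new Doc("b", 20L));
        hits.add(new Doc("a", 10L));
        List<Doc> sorted = sortIfSmallEnough(hits.size(), hits);
        System.out.println(sorted.get(0).id); // prints a
    }
}
```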


Currently our problem is, that we have to do a group/distinct (in
SQL-language) query and we have found that we can do what we want to do
using group (http://wiki.apache.org/solr/FieldCollapsing) or facet -
either will work for us. Problem is that they both use FieldCache and we
"know" that using FieldCache will lead to OOM-exceptions with the amount
of data each of our Solr-nodes administers. This time we have really no
option of just "limit" usage as we did with sort. Therefore we need a
group/distinct-functionality that works even on huge data-amounts (and an
algorithm using FieldCache will not)


I believe setting facet.method=enum will actually make facet not use the 
FieldCache. Is that true? Is it a bad idea?
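
For reference, a facet request of that kind could be built as in the sketch
below. The parameters q, rows, facet, facet.field and facet.method are
standard Solr request parameters; the class name EnumFacetQuery and the
field name "category" are placeholders. To the best of my understanding,
facet.method=enum walks the terms of the field and intersects per-term
document sets (via the filterCache) instead of building a per-document
array, so it tends to suit fields with few distinct values; treat that as
an assumption to verify against the Solr documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class EnumFacetQuery {

    // Builds the query string for a facet request that asks for facet.method=enum.
    // The facet field is supplied by the caller; no field name comes from the thread.
    public static String buildQuery(String facetField) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("rows", "0");            // only facet counts, no documents
        params.put("facet", "true");
        params.put("facet.field", facetField);
        params.put("facet.method", "enum"); // enumerate terms instead of un-inverting the field

        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "category" is a placeholder field name.
        System.out.println(buildQuery("category"));
    }
}
```

The resulting string would be appended to the select handler URL of the
collection being queried.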


I do not know much about DocValues, but I do not believe that you will 
avoid FieldCache by using DocValues? Please elaborate, or point to 
documentation where I will be able to read that I am wrong. Thanks!


Regards, Per Steffensen

On 9/11/13 1:38 PM, Erick Erickson wrote:

> I don't know any more than Michael, but I'd _love_ some reports from the
> field.
>
> There are some restrictions on DocValues though, I believe one of them
> is that they don't really work on analyzed data
>
> FWIW,
> Erick




Re: No or limited use of FieldCache

2013-09-11 Thread Erick Erickson
I don't know any more than Michael, but I'd _love_ some reports from the
field.

There are some restrictions on DocValues though, I believe one of them
is that they don't really work on analyzed data

FWIW,
Erick


On Wed, Sep 11, 2013 at 7:00 AM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> On 9/11/13 3:11 AM, Per Steffensen wrote:
>
>> Hi
>>
>> We have a SolrCloud setup handling huge amounts of data. When we do
>> group, facet or sort searches Solr will use its FieldCache, and add data to
>> it for every single document we have. For us it is not realistic that this
>> will ever fit in memory and we get OOM exceptions. Is there some way of
>> disabling the FieldCache (taking the performance penalty of course) or making
>> it behave in a nicer way where it only uses up to e.g. 80% of the memory
>> available to the JVM? Or other suggestions?
>>
>> Regards, Per Steffensen
>>
> I think you might want to look into using DocValues fields, which are
> column-stride fields stored as compressed arrays - one value per document
> -- for the fields on which you are sorting and faceting. My understanding
> (which is limited) is that these avoid the use of the field cache, and I
> believe you have the option to control whether they are held in memory or
> on disk.  I hope someone who knows more will elaborate...
>
> -Mike
>


Re: No or limited use of FieldCache

2013-09-11 Thread Michael Sokolov

On 9/11/13 3:11 AM, Per Steffensen wrote:

> Hi
>
> We have a SolrCloud setup handling huge amounts of data. When we do
> group, facet or sort searches Solr will use its FieldCache, and add
> data to it for every single document we have. For us it is not
> realistic that this will ever fit in memory and we get OOM exceptions.
> Is there some way of disabling the FieldCache (taking the performance
> penalty of course) or making it behave in a nicer way where it only uses
> up to e.g. 80% of the memory available to the JVM? Or other suggestions?
>
> Regards, Per Steffensen

I think you might want to look into using DocValues fields, which are
column-stride fields stored as compressed arrays - one value per
document - for the fields on which you are sorting and faceting. My
understanding (which is limited) is that these avoid the use of the
field cache, and I believe you have the option to control whether they
are held in memory or on disk.  I hope someone who knows more will
elaborate...
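
To illustrate what "column-stride" means here, a conceptual Java sketch
follows. It is not Lucene's actual encoding (which is compressed and can
live on disk); the class ColumnStrideSketch and its fields are made up for
this example. Each distinct value gets an ordinal, and one ordinal is
stored per document, so faceting becomes one array lookup per matching
document.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class ColumnStrideSketch {

    final String[] terms;   // sorted distinct values of the field
    final int[] ordByDoc;   // one ordinal per docId, pointing into terms

    public ColumnStrideSketch(String[] valueByDoc) {
        terms = new TreeSet<>(Arrays.asList(valueByDoc)).toArray(new String[0]);
        ordByDoc = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            // terms is sorted and contains every value, so the search always hits.
            ordByDoc[doc] = Arrays.binarySearch(terms, valueByDoc[doc]);
        }
    }

    // Facet counts for a set of matching docIds, indexed by ordinal.
    public int[] countByOrd(int[] matchingDocs) {
        int[] counts = new int[terms.length];
        for (int doc : matchingDocs) {
            counts[ordByDoc[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        ColumnStrideSketch column =
            new ColumnStrideSketch(new String[] {"red", "blue", "red", "green"});
        int[] counts = column.countByOrd(new int[] {0, 2, 3});
        for (int ord = 0; ord < counts.length; ord++) {
            System.out.println(column.terms[ord] + "=" + counts[ord]);
        }
        // prints blue=0, green=1, red=2 (one per line)
    }
}
```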


-Mike


No or limited use of FieldCache

2013-09-11 Thread Per Steffensen

Hi

We have a SolrCloud setup handling huge amounts of data. When we do
group, facet or sort searches Solr will use its FieldCache, and add data
to it for every single document we have. For us it is not realistic that
this will ever fit in memory and we get OOM exceptions. Is there some
way of disabling the FieldCache (taking the performance penalty of
course) or making it behave in a nicer way where it only uses up to e.g.
80% of the memory available to the JVM? Or other suggestions?
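
A back-of-the-envelope estimate shows why this does not fit. The sketch
below assumes the simplest case, one fixed-width value per document for
each cached field; string fields would need ordinals plus the term bytes
on top. The document count and the number of fields are hypothetical, not
figures from our setup.

```java
public class FieldCacheEstimate {

    // Rough lower bound: maxDoc * bytesPerValue for each cached field.
    // Assumes single-valued, fixed-width numeric fields only.
    public static long estimateBytes(long maxDoc, int bytesPerValue, int cachedFields) {
        long perField = Math.multiplyExact(maxDoc, (long) bytesPerValue);
        return Math.multiplyExact(perField, (long) cachedFields);
    }

    public static void main(String[] args) {
        // Hypothetical node: 1 billion documents, three long fields used for
        // sort, facet and group.
        long bytes = estimateBytes(1_000_000_000L, Long.BYTES, 3);
        System.out.println(bytes + " bytes"); // prints 24000000000 bytes
    }
}
```

That is about 24 GB of heap for the arrays alone, before any other Solr
caches or per-request memory.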


Regards, Per Steffensen