Re: scan query that returns document values only is heavily accessing the *.FDT file .

2014-11-24 Thread joergpra...@gmail.com
Doc values are stored in the .fdt files.

Jörg

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEsDnXCbmV0tGmNwuYvAwdW-t%2BYJhf6mYmbN4ZVf3fMrQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: scan query that returns document values only is heavily accessing the *.FDT file .

2014-11-24 Thread Tzahi jakubovitz
Thanks.
Sorry, I did not stress that these are *document* values and not *field* values.
Document values are stored in the .dvd file, which is a small, compressed format.
I defined the field this way to avoid having to access and parse the Lucene document from
the huge .fdt file (in my test, the .fdt file is 1000 times bigger than the .dvd file).
See
https://lucene.apache.org/core/4_3_1/core/org/apache/lucene/codecs/lucene42/Lucene42DocValuesFormat.html

I am still trying to avoid accessing the .fdt file; it makes my query too slow.

Thanks again.
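
For anyone who wants to reproduce the size comparison between the .fdt and .dvd files, here is a minimal sketch that sums file sizes by extension under a shard's Lucene index directory. The function name size_by_extension and the directory path are assumptions made for illustration; the real path depends on the cluster name, node, and shard number.

```python
import os
from collections import defaultdict


def size_by_extension(index_dir, extensions=(".fdt", ".dvd")):
    """Sum the sizes in bytes of files under index_dir, grouped by extension."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            if ext in extensions:
                totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical path; substitute the shard directory of your own node.
    shard_dir = "/var/lib/elasticsearch/mycluster/nodes/0/indices/myindex/0/index"
    for ext, size in sorted(size_by_extension(shard_dir).items()):
        print("%s: %.1f MB" % (ext, size / 1e6))
```

Running this once per shard gives a rough idea of how much more data sits in stored fields than in doc values.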





Re: scan query that returns document values only is heavily accessing the *.FDT file .

2014-11-24 Thread joergpra...@gmail.com
Oh, sorry. Yes, doc values are in .dvd files.

I assume that ES still stores the hidden _type and _uid fields in .fdt. But I'm
surprised too; that should not cause much disk access.

Jörg
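
One thing that might be worth trying, as an untested sketch: the request in the original post uses "fields", which asks for stored fields (or values pulled from _source), and both live in .fdt. Elasticsearch 1.x also accepts a "fielddata_fields" option in the search body, which returns the field data / doc values representation instead. The helper name build_doc_values_body is made up for illustration, and setting "_source" to false is an assumption about what reduces stored-field reads. ES may still read _uid from .fdt for every hit, so this may not remove the .fdt access entirely.

```python
def build_doc_values_body(doc_values_field, term_field, term_value, size=1000):
    """Build a scan request body that asks for doc values rather than stored fields."""
    return {
        # Assumption: skipping _source leaves less for the fetch phase to read from .fdt.
        "_source": False,
        # fielddata_fields returns the field data / doc values representation.
        "fielddata_fields": [doc_values_field],
        "size": size,
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {term_field: term_value}},
            }
        },
    }


body = build_doc_values_body("doc_value", "r_int3", 929)
# With a live cluster and the 1.x Python client this could be sent as:
# es.search(index="myindex", doc_type="num_type", body=body,
#           scroll="10s", search_type="scan")
```

Comparing disk reads on the .fdt file with this body against the original "fields" request would show whether the stored-field lookup is the real cost.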



scan query that returns document values only is heavily accessing the *.FDT file .

2014-11-23 Thread Tzahi jakubovitz


Hi all,

I have a test index with 43 million documents. Each document has a string
document value (about 5-10 characters long).

Mapping is:

{
  "myindex" : {
    "mappings" : {
      "num_type" : {
        "_type" : {
          "store" : true
        },
        "properties" : {
          "doc_value" : {
            "type" : "string",
            "doc_values_format" : "default"
          },
          "int1" : {
            "type" : "integer",
            "index" : "analyzed",
            "store" : true
          },
          "int2" : {
            ...

I need to retrieve only the document values, for queries whose result set
may contain about 100,000 documents. I do not need ranking or anything else
that would slow this down.

 

My understanding is that if the query is only a filter – ranking is not 
computed, and it is faster.

Here is a small Python program to test it:


import elasticsearch

es = elasticsearch.Elasticsearch()

# Open a scan-type scroll; the filter restricts hits without scoring them.
results = es.search(
    index='myindex',
    doc_type='num_type',
    body={
        'fields': ['doc_value'],
        'size': 1000,
        'query': {'filtered': {
            'query': {'match_all': {}},
            'filter': {'term': {'r_int3': 929}},
        }},
    },
    scroll='10s',
    search_type='scan',
)

# Page through the scroll until a page comes back empty.
while True:
    results = es.scroll(scroll_id=results['_scroll_id'], scroll='10s')
    if len(results['hits']['hits']) == 0:
        break

 

The query runs pretty slowly, and I see a huge number of accesses to the
*.fdt (stored field data) file.

But I am asking for a document value field, so why does ES access the *.fdt file?

Thanks a lot in advance.
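
The test loop above throws each page away. A sketch of a variant that collects the requested values is shown here, assuming each hit carries them as a list under a "fields" key (the shape this test expects from ES 1.x; worth checking against a real response). The helper name iter_field_values is hypothetical, and client stands for any object with a scroll(scroll_id=..., scroll=...) method, such as the es object above.

```python
def iter_field_values(client, first_page, field, scroll="10s"):
    """Yield the values of `field` from every hit of a scan/scroll search."""
    scroll_id = first_page["_scroll_id"]
    while True:
        page = client.scroll(scroll_id=scroll_id, scroll=scroll)
        hits = page["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            # Assumption: requested fields come back as lists under "fields".
            for value in hit.get("fields", {}).get(field, []):
                yield value
        # A response may carry a new scroll id; always continue with the latest.
        scroll_id = page.get("_scroll_id", scroll_id)


# Usage with a live cluster (not run here):
# values = list(iter_field_values(es, results, "doc_value"))
```

Writing it as a generator keeps only one page of hits in memory at a time, which matters for result sets of around 100,000 documents.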

 

