fragment_size not used for simple queries

Neamar Tucote Tue, 25 Feb 2014 01:39:27 -0800

Hello,

Using the highlight API for a simple query like this:


curl localhost:9200/company_52fb7b90c8318c4dc800006b/_search -d'{
  "fields": [],
  "query": {
    "filtered": {
      "query": {
        "match": {
          "_all": "i do not"
        }
      }
    }
  },
  "highlight": {
    "fields": {
      "metadatas.*": {
        "number_of_fragments" : 1,
        "fragment_size" : 20
      }
    }
  }
}'

This should return snippets whose size does not exceed 20 characters. Most 
of the time this works; however, I have one document, analyzed with the 
same mappings, which yields really long snippets. In fact, the snippet is 
not truncated at all and contains the whole text.

Here is a sample working as expected:

{"took":21,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":19,"max_score":0.24860834,"hits":[{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c5949ba7daaa265ffdd8","_score":0.24860834,"highlight":{"metadatas.text":[",
 
and <em>do</em> not 
hesitate"]}},{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c5949ba7daaa265ffdd6","_score":0.14883985,"highlight":{"metadatas.text":["
 
take his child.\n<em>I</em> 
<em>do</em>"]}},{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c57a9ba7daaa265ffdc8","_score":0.1365959,"highlight":{"metadatas.text":["
 
resident of DC, <em>I</em> am"]}}]}}

And here is the unruly one:

{"took":122,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":19,"max_score":0.24860834,"hits":[{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c5949ba7daaa265ffdd8","_score":0.24860834,"highlight":{"metadatas.text":[",
 
and <em>do</em> not 
hesitate"]}},{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c5949ba7daaa265ffdd6","_score":0.14883985,"highlight":{"metadatas.text":["
 
take his child.\n<em>I</em> 
<em>do</em>"]}},{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c57a9ba7daaa265ffdc8","_score":0.1365959,"highlight":{"metadatas.text":["
 
resident of DC, <em>I</em> 
am"]}},{"_index":"company_52fb7b90c8318c4dc800006b","_type":"document","_id":"5309c57a9ba7daaa265ffdc7","_score":0.13437755,"highlight":{"metadatas.text":[".\n<em>I</em>
 
<em>do</em> not enlighten those who are not eager to learn, nor 
arouse\nthose who are not anxious to give an explanation themselves. If 
<em>I</em>\nhave presented one corner of the square and they cannot 
come\nback to me with the other three, <em>I</em> should not go over the 
points\nagain.\n― Confucius\nBesides explaining JavaScript, this book tries 
to be an introduction to the basic\nprinciples of programming. Programming, 
it turns out, is hard. The\nfundamental rules are, most of the time, simple 
and clear. But programs,\nwhile built on top of these basic rules, tend to 
become complex enough to\nintroduce their own rules, their own complexity. 
Because of this, programming\nis rarely simple or predictable. As Donald 
Knuth, who is something of a\nfounding father of the field, says, it is an 
art.\nTo get something out of this book, more than just passive reading is 
required.\nTry to stay sharp, make an effort to solve the exercises, and 
only continue on\nwhen you are reasonably sure you understand the material 
that came before.\nThe computer programmer is a creator of universes for 
which he\nalone is responsible. Universes of virtually unlimited complexity 
can\nbe created in the form of computer programs.\n― Joseph Weizenbaum, 
Computer Power and Human Reason\nA program is many things. It is a piece of 
text typed by a programmer, it is\nthe directing force that makes the 
computer <em>do</em> what it does, it is data in the\ncomputer's memory, 
yet it controls the actions performed on this same\nmemory. Analogies that 
try to compare programs to objects we are familiar\nwith tend to fall 
short, but a superficially fitting one is that of a machine. The\ngears of 
a mechanical watch fit together ingeniously, and if the watchmaker\nwas any 
good, it will accurately show the time for many years. The elements\nof a 
program fit together in a similar way, and if the programmer knows what\nhe 
is doing, the program will run without crashing.\nA computer is a machine 
built to act as a host for these immaterial machines.\nComputers themselves 
can only <em>do</em> stupidly straightforward things. The reason\nthey are 
so useful is that they <em>do</em> these things at an incredibly high 
speed. A\nprogram can, by ingeniously combining many of these simple 
actions, <em>do</em> very\ncomplicated things.\nTo some of us, writing 
computer programs is a fascinating game. A program\nis a building of 
thought. It is costless to build, weightless, growing easily under\nour 
typing hands. If we get carried away, its size and complexity will grow 
out\nof control, confusing even the one who created it. This is the main 
problem of\nprogramming. It is why so much of today's software tends to 
crash, fail,\nscrew up.\nWhen a program works, it is beautiful. The art of 
programming is the skill of\ncontrolling complexity. The great program is 
subdued, made simple in its\ncomplexity.\nToday, many programmers believe 
that this complexity is best managed by\nusing only a small set of 
well-understood techniques in their programs. They\nhave composed strict 
rules about the form programs should have, and the\nmore zealous among them 
will denounce those who break these rules as bad\nprogrammers.\nWhat 
hostility to the richness of programming! To try to reduce it to\nsomething 
straightforward and predictable, to place a taboo on all the weird\nand 
beautiful programs. The landscape of programming techniques is\nenormous, 
fascinating in its diversity, still largely unexplored. It is 
certainly\nlittered with traps and snares, luring the inexperienced 
programmer into all\nkinds of horrible mistakes, but that only means you 
should proceed with\ncaution, keep your wits about you. As you learn, there 
will always be new\nchallenges, new territory to explore. The programmer 
who refuses to keep\nexploring will surely stagnate, forget his joy, lose 
the will to program (and\nbecome a manager).\nAs far as <em>I</em> am 
concerned, the definite criterion for a program is whether it is\ncorrect. 
Efficiency, clarity, and size are also important, but how to balance\nthese 
against each other is always a matter of judgement, a judgement that\neach 
programmer must make for himself. Rules of thumb are useful, but 
one\nshould never be afraid to break them.\nIn the beginning, at the birth 
of computing, there were no programming\nlanguages. Programs looked 
something like this:\n00110001 00000000 00000000\n00110001 00000001 
00000001\n00110011 00000001 00000010\n01010001 00001011 00000010\n00100010 
00000010 00001000\n01000011 00000001 00000000\n01000001 00000001 
00000001\n00010000 00000010 00000000\n01100010 00000000 00000000\nThat is a 
program to add the numbers from one to ten together, and print out\nthe 
result (1 + 2 + ... + 10 = 55). It could run on a very simple kind 
of\ncomputer. To program early computers, it was necessary to set large 
arrays\nof switches in the right position, or punch holes in strips of 
cardboard and\nfeed them to the computer. You can imagine how this was a 
tedious,\nerror-prone procedure. Even the writing of simple programs 
required much\ncleverness and discipline, complex ones were nearly 
inconceivable.\nOf course, manually entering these arcane patterns of bits 
(which is what the\n1s and 0s above are generally called) did give the 
programmer a profound\nsense of being a mighty wizard. And that has to be 
worth something, in terms\nof job satisfaction.\nEach line of the program 
contains a single instruction. It could be written in\nEnglish like 
this:\nStore the number 0 in memory location 01.\nStore the number 1 in 
memory location 12.\nStore the value of memory location 1 in memory 
location 23.\nSubtract the number 11 from the value in memory location 
24.\nIf the value in memory location 2 is the number 0, continue 
with\ninstruction 9\n5.\nAdd the value of memory location 1 to memory 
location 06.\nAdd the number 1 to the value of memory location 
17.\nContinue with instruction 38.\nOutput the value of memory location 
09.\nWhile that is more readable than the binary soup, it is still rather 
unpleasant.\nIt might help to use names instead of numbers for the 
instructions and\nmemory locations:\nSet 'total' to 0\nSet 'count' to 
1\n[loop]\nSet 'compare' to 'count'\nSubtract 11 from 'compare'\nIf 
'compare' is zero, continue at [end]\nAdd 'count' to 'total'\nAdd 1 to 
'count'\nContinue at [loop]\n[end]\nOutput 'total'\nAt this point it is not 
too hard to see how the program works. Can you? The\nfirst two lines give 
two memory locations their starting values: total will be\nused to build up 
the result of the program, and count keeps track of the\nnumber that we are 
currently looking at. The lines using compare are probably\nthe weirdest 
ones. What the program wants to <em>do</em> is see if count is equal 
to\n11, in order to decide whether it can stop yet. Because the machine is 
so\nprimitive, it can only test whether a number is zero, and make a 
decision\n(jump) based on that. So it uses the memory location labelled 
compare to\ncompute the value of count - 11, and makes a decision based on 
that value.\nThe next two lines add the value of count to the result, and 
increment count\nby one every time the program has decided that it is not 
11 yet.\nHere is the same program in JavaScript:\nvar total = 0, count = 
1;\nwhile (count &lt;= 10) {\ntotal += count;\ncount += 
1;\n}\nprint(total);\nThis gives us a few more improvements. Most 
importantly, there is no need\nto specify the way we want the program to 
jump back and forth anymore.\nThe magic word while takes care of that. It 
continues executing the lines\nbelow it as long as the condition it was 
given holds: count &lt;= 10, which means\n'count is less than or equal to 
10'. Apparently, there is no need anymore to\ncreate a temporary value and 
compare that to zero. This was a stupid little\ndetail, and the power of 
programming languages is that they take care of\nstupid little details for 
us.\nFinally, here is what the program could look like if we happened to 
have the\nconvenient operations range and sum available, which respectively 
create a\ncollection of numbers within a range and compute the sum of a 
collection of\nnumbers:\nprint(sum(range(1, 10)));\nThe moral of this 
story, then, is that the same program can be expressed in\nlong and short, 
unreadable and readable ways. The first version of the\nprogram was 
extremely obscure, while this last one is almost English: print\nthe sum of 
the range of numbers from 1 to 10. (We will see in later chapters\nhow to 
build things like sum and range.)\nA good programming language helps the 
programmer by providing a more\nabstract way to express himself. It hides 
uninteresting details, provides\nconvenient building blocks (such as the 
while construct), and, most of the\ntime, allows the programmer to add 
building blocks himself (such as the sum\nand range 
operations).\nJavaScript is the language that is, at the moment, mostly 
being used to <em>do</em> all\nki......[truncated]

Am I doing anything wrong? Over the course of 3 months, the problem was 
only reported twice (on two distinct documents); all other documents 
behaved correctly.
Interestingly, changing the query to something more complex returns a 
valid, correctly truncated snippet.
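
To spot the affected documents, a minimal Python sketch along these lines 
could be run over a parsed search response. The function name 
oversized_fragments and the tolerance factor are illustrative assumptions, 
not part of Elasticsearch; the factor is an arbitrary allowance for 
fragments that run slightly over fragment_size.

```python
import re

# Matches the default <em> / </em> highlight tags so they are not counted.
EM_TAG = re.compile(r"</?em>")


def oversized_fragments(response, fragment_size, tolerance=3):
    """Return (doc_id, field, length) for every highlight fragment whose
    text, with <em> tags removed, is longer than fragment_size * tolerance.

    `response` is the search response already parsed into a dict.
    `tolerance` is a hypothetical safety factor, not an Elasticsearch setting.
    """
    limit = fragment_size * tolerance
    found = []
    for hit in response.get("hits", {}).get("hits", []):
        for field, fragments in hit.get("highlight", {}).items():
            for fragment in fragments:
                length = len(EM_TAG.sub("", fragment))
                if length > limit:
                    found.append((hit["_id"], field, length))
    return found


# Made-up response with one normal and one untruncated fragment.
sample = {"hits": {"hits": [
    {"_id": "a", "highlight": {"metadatas.text": ["and <em>do</em> not hesitate"]}},
    {"_id": "b", "highlight": {"metadatas.text": ["<em>I</em> do " + "x" * 200]}},
]}}

print(oversized_fragments(sample, fragment_size=20))
```

With the made-up sample above, only document "b" is reported. Against a 
real response, the returned ids would be the documents to inspect.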

