[
https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523067#comment-15523067
]
Tim Allison edited comment on LUCENE-5317 at 9/26/16 1:38 PM:
--------------------------------------------------------------
I received a personal email asking for some more background on this capability.
Here goes (apologies for some repetition with the issue description)...
For an example of concordance output, see these
[slides|https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf].
Slides 23 and 24 cover LUCENE-5317, and slides 25-28 cover LUCENE-5318.
The notion is that you present every occurrence of the term in a central
column, with {{x}} words of context to the left and right. The user can sort on
the words before the target term to see what modifies it, on the words after
the target term to see what it modifies, or on order of appearance within the
documents to effectively read everything in their docs that matters to them.
By {{target term}}, of course, I mean any term/phrase that can be represented
by a SpanQuery.
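To make the layout and the sort orders concrete, here is a minimal sketch in plain Python. It is not the patch's code and does not use Lucene: the regex tokenizer, the single-token target, and all names ({{Hit}}, {{concordance}}, {{sort_by_left}}, etc.) are assumptions made up for illustration.

```python
import re
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    doc_id: str
    position: int     # token index of the target within its document
    left: List[str]   # up to `window` tokens before the target
    target: str
    right: List[str]  # up to `window` tokens after the target


def tokenize(text: str) -> List[str]:
    # Stand-in for a real analyzer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def concordance(docs: Dict[str, str], target: str, window: int = 3) -> List[Hit]:
    """Collect every occurrence of a single-token target with its context."""
    target = target.lower()
    hits = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                hits.append(Hit(doc_id, i, left, tok, right))
    return hits


def sort_by_left(hits: List[Hit]) -> List[Hit]:
    # Compare the word nearest the target first, then the next one out.
    return sorted(hits, key=lambda h: h.left[::-1])


def sort_by_right(hits: List[Hit]) -> List[Hit]:
    return sorted(hits, key=lambda h: h.right)


def sort_by_appearance(hits: List[Hit]) -> List[Hit]:
    return sorted(hits, key=lambda h: (h.doc_id, h.position))


def format_hit(hit: Hit, width: int = 24) -> str:
    # Right-align the left context so the target lines up in a central column.
    left = " ".join(hit.left)
    right = " ".join(hit.right)
    return f"{left:>{width}}  [{hit.target}]  {right:<{width}}  ({hit.doc_id})"
```

Sorting on the left context puts hits with the same preceding word next to each other regardless of which document they came from, which is the point of the layout.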
This kind of view of the data is extremely helpful for linguists and
philologists trying to understand how words are being used. It also has
practical applications for anyone doing "analytic" search, that is, anyone who
wants to see every time a term/phrase appears -- lawyers, patent examiners, etc.
This view of the data is fundamentally different from snippets, which typically
show the three or so best chunks where the search terms appear. Snippets allow
the user to determine whether a document is relevant, but the user then has to
open the document. Snippets are great if the user is seeking the best document
to answer the information need. For "analytic searchers", however, concordance
results save the user the step of opening the document; they can see
_every time_ their term/phrase appears. Also, if their documents are lengthy,
the concordance lets them see the potentially hundreds of times that their
term/phrase appears in each document instead of the three or so snippets they
might see with traditional search engines.
"But you can increase the number of snippets to whatever you want..." Yes, you
can, but the layout of the concordance allows you to see patterns across
documents very easily. Again, the results are sorted by words to the left or
right, not by which document the target appeared in.
This [link|https://wmtang.org/corpus-linguistics/corpus-linguistics] shows some
output from a concordancer (AntConc). Wikipedia's best description is under
key word in context ([KWIC|https://en.wikipedia.org/wiki/Key_Word_in_Context]).
If you're into tree-ware,
[Oakes|https://global.oup.com/academic/product/statistics-for-corpus-linguistics-9780748608171?cc=us&lang=en&]
has a great introduction to concordances among many other useful topics!
> Concordance capability
> ----------------------
>
> Key: LUCENE-5317
> URL: https://issues.apache.org/jira/browse/LUCENE-5317
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/search
> Affects Versions: 4.5
> Reporter: Tim Allison
> Labels: patch
> Attachments: LUCENE-5317.patch, LUCENE-5317.patch,
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts
> performing analytic search vs. traditional snippeting/document retrieval
> tasks. By "analytic search," I mean that the user wants to browse every time
> a term appears (or at least the topn) in a subset of documents and see the
> words before and after.
> Concordance technology is far simpler and less interesting than IR relevance
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to
> obtain character offsets. There is plenty of room for optimizations and
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this
> patch.
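As a rough illustration of the "reanalyzing to obtain character offsets" step mentioned in the description, here is a small plain-Python sketch. It is not the patch's code: the regex tokenizer stands in for a real analyzer, the span is assumed to be a half-open range of token positions {{[span_start, span_end)}}, and the function names are hypothetical.

```python
import re
from typing import List, Tuple


def tokens_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Re-tokenize the text, keeping each token's (start, end) character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def context_by_offsets(text: str, span_start: int, span_end: int,
                       window: int = 3) -> Tuple[str, str, str]:
    """Map a token-position span to character offsets and slice the original text.

    span_start/span_end are token positions (end exclusive, by assumption).
    Returns (left context, target, right context) as substrings of `text`,
    so original casing and punctuation are preserved.
    """
    toks = tokens_with_offsets(text)
    first = max(0, span_start - window)       # first context token
    last = min(len(toks), span_end + window)  # one past the last context token
    target_start = toks[span_start][1]
    target_end = toks[span_end - 1][2]
    left_start = toks[first][1]
    right_end = toks[last - 1][2]
    return (text[left_start:target_start],
            text[target_start:target_end],
            text[target_end:right_end])
```

The search side only yields token positions, so the stored text has to be tokenized again to recover where those tokens sit in the original string. That second pass is the cost the description alludes to when it mentions room for optimization.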