[jira] [Comment Edited] (LUCENE-5317) Concordance capability

Tim Allison (JIRA) Mon, 26 Sep 2016 06:38:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523067#comment-15523067
 ]


Tim Allison edited comment on LUCENE-5317 at 9/26/16 1:38 PM:
--------------------------------------------------------------

I received a personal email asking for some more background on this capability. 
 Here goes (apologies for some repetition with the issue description)...

For an example of concordance output, see these 
[slides|https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf]:
slides 23 and 24 cover LUCENE-5317, and slides 25-28 cover LUCENE-5318.

The notion is that you present every occurrence of the term in a central 
column, with {{x}} words of context to the left and right.  The user can sort 
on the words before the target term to see what modifies it, on the words after 
the target term to see what it modifies, or on order of appearance within the 
documents to effectively read everything in their docs that matters to them.

 By {{target term}}, of course, I mean any term/phrase that can be represented 
by a SpanQuery.
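
As a rough illustration of the layout and sort orders described above, here is 
a minimal sketch in plain Java.  It is not the code in the patch: the names 
({{ConcordanceSketch}}, {{Window}}, {{build}}, {{sorted}}) are made up for this 
example, a naive split on non-word characters stands in for a real analyzer, 
and the target is assumed to be a single lowercase word rather than anything a 
SpanQuery could express.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Illustrative only: all names here are hypothetical, not the patch's API.
class ConcordanceSketch {

    /** One concordance row: left context, target, right context. */
    static final class Window {
        final int docId;
        final String left;
        final String target;
        final String right;

        Window(int docId, String left, String target, String right) {
            this.docId = docId;
            this.left = left;
            this.target = target;
            this.right = right;
        }
    }

    /** Collects every occurrence of a lowercase target, with up to x words per side. */
    static List<Window> build(List<String> docs, String target, int x) {
        List<Window> rows = new ArrayList<>();
        for (int d = 0; d < docs.size(); d++) {
            // Naive tokenization (an assumption): lowercase, split on non-word chars.
            List<String> words =
                Arrays.asList(docs.get(d).toLowerCase(Locale.ROOT).split("\\W+"));
            for (int i = 0; i < words.size(); i++) {
                if (!words.get(i).equals(target)) {
                    continue;
                }
                int from = Math.max(0, i - x);
                int to = Math.min(words.size(), i + x + 1);
                rows.add(new Window(d,
                    String.join(" ", words.subList(from, i)),
                    words.get(i),
                    String.join(" ", words.subList(i + 1, to))));
            }
        }
        return rows; // rows are already in order of appearance
    }

    /** Sort key for "words before": the word nearest the target comes first. */
    static String reversedWords(String context) {
        List<String> words = new ArrayList<>(Arrays.asList(context.split(" ")));
        Collections.reverse(words);
        return String.join(" ", words);
    }

    static final Comparator<Window> BY_WORDS_BEFORE =
        Comparator.comparing((Window w) -> reversedWords(w.left));

    static final Comparator<Window> BY_WORDS_AFTER =
        Comparator.comparing((Window w) -> w.right);

    /** Returns a sorted copy, leaving the order-of-appearance list intact. */
    static List<Window> sorted(List<Window> rows, Comparator<Window> order) {
        List<Window> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The quick brown fox", "A lazy brown dog and a brown cat");
        for (Window w : sorted(build(docs, "brown", 2), BY_WORDS_AFTER)) {
            System.out.printf("%12s | %s | %s%n", w.left, w.target, w.right);
        }
    }
}
```

Note that the "words before" key reverses the left context so that rows group 
on the word immediately preceding the target, which is the usual convention in 
concordancers; the rows are kept across documents in one list, so either sort 
mixes hits from different documents.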

This kind of view of the data is extremely helpful to linguists and 
philologists who want to understand how words are being used.  It also has 
practical applications for anyone doing "analytic" search, that is, anyone who 
wants to see every time a term/phrase appears -- lawyers, patent examiners, etc.

This view of the data is fundamentally different from snippets, which typically 
show the three or so best chunks where the search terms appear.  Snippets let 
the user determine whether a document is relevant, but the user then has to 
open the document.  Snippets are great if the user is seeking the best document 
to answer the information need.  For "analytic searchers", however, 
concordance results save the step of opening the document; they can see 
_every time_ their term/phrase appears.  Also, if their documents are lengthy, 
the concordance lets them see the potentially hundreds of times that their 
term/phrase appears in each document instead of the three or so snippets they 
might see with traditional search engines.

"But you can increase the number of snippets to whatever you want..."  Yes, you 
can, but the layout of the concordance allows you to see patterns across 
documents very easily.  Again, the results are sorted by words to the left or 
right, not by which document the target appeared in.

This [link|https://wmtang.org/corpus-linguistics/corpus-linguistics] shows some 
output from a concordancer (AntConc).  Wikipedia's best description is under 
key word in context ([KWIC|https://en.wikipedia.org/wiki/Key_Word_in_Context]). 
If you're into tree-ware, 
[Oakes|https://global.oup.com/academic/product/statistics-for-corpus-linguistics-9780748608171?cc=us&lang=en&;]
 has a great introduction to concordances among many other useful topics!


> Concordance capability
> ----------------------
>
>                 Key: LUCENE-5317
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5317
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/search
>    Affects Versions: 4.5
>            Reporter: Tim Allison
>              Labels: patch
>         Attachments: LUCENE-5317.patch, LUCENE-5317.patch, 
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts 
> performing analytic search vs. traditional snippeting/document retrieval 
> tasks.  By "analytic search," I mean that the user wants to browse every time 
> a term appears (or at least the topn)  in a subset of documents and see the 
> words before and after.  
> Concordance technology is far simpler and less interesting than IR relevance 
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the 
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to 
> obtain character offsets.  There is plenty of room for optimizations and 
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this 
> patch.

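The issue description says the patch runs getSpans() and then reanalyzes the 
text to obtain character offsets.  A minimal sketch of that second step, in 
plain Java with no Lucene classes, might look like the following.  Everything 
here is an assumption made for illustration: {{OffsetSketch}} and 
{{charOffsets}} are hypothetical names, a {{\w+}} regex stands in for the real 
analyzer, and the span is taken as token positions with an inclusive start and 
an exclusive end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical names, not code from the patch.
class OffsetSketch {

    // Stand-in for an analyzer: every run of word characters is one token.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /**
     * Maps a token-position span [startPos, endPos) to character offsets
     * {startChar, endChar} by re-tokenizing the text.
     * Returns null if the span does not fit the token stream.
     */
    static int[] charOffsets(String text, int startPos, int endPos) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }
        if (startPos < 0 || startPos >= endPos || endPos > tokens.size()) {
            return null;
        }
        int startChar = tokens.get(startPos)[0];   // start of first token in span
        int endChar = tokens.get(endPos - 1)[1];   // end of last token in span
        return new int[] {startChar, endChar};
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        int[] off = charOffsets(text, 1, 3);
        System.out.println(text.substring(off[0], off[1]));
    }
}
```

The sketch re-tokenizes the whole text for each call, which is the simplest 
form of the idea; caching the token offsets per document would be one obvious 
optimization.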

