[ https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523067#comment-15523067 ]
Tim Allison edited comment on LUCENE-5317 at 9/26/16 1:40 PM:
--------------------------------------------------------------

I received a personal email asking for some more background on this capability. Here goes (apologies for some repetition with the issue description)...

For an example of concordance output, see these [slides|https://github.com/tballison/share/blob/master/slides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf]. Slides 23 and 24 cover LUCENE-5317, and slides 25-28 cover LUCENE-5318.

The notion is that you present every occurrence of the term in a central column, with {{x}} words of context to the left and right. The user can sort on the words before the target term to see what modifies it, sort on the words after the target term to see what it modifies, or sort on order of appearance within the documents to effectively read everything in their docs that matters to them. By {{target term}}, of course, I mean any term/phrase that can be represented by a SpanQuery.

This kind of view of the data is extremely helpful to linguists and philologists who want to understand how words are being used. It also has practical applications for anyone doing "analytic" search, that is, anyone who wants to see every time a term/phrase appears -- lawyers, patent examiners, etc.

This view of the data is fundamentally different from snippets, which typically show the three or so best chunks where the search terms appear and are typically ordered _per document_. Snippets let the user determine whether a document is relevant; the user then has to open the document. Snippets are great if users are seeking the best document to answer their information need. For "analytic searchers", however, concordance results can save the user the step of opening the document; they can see _every time_ their term/phrase appears.
Also, for "analytic searchers", if their documents are lengthy, the concordance allows them to see the potentially hundreds of times that their term/phrase appears in each document instead of the three or so snippets they might see with traditional search engines.

"But you can increase the number of snippets to whatever you want..." Yes, you can, but the layout of the concordance allows you to see patterns across documents very easily. Again, the results are sorted by words to the left or right, not by which document the target appeared in.

This [link|https://wmtang.org/corpus-linguistics/corpus-linguistics] shows some output from a concordancer (AntConc). Wikipedia's best description is under key word in context ([KWIC|https://en.wikipedia.org/wiki/Key_Word_in_Context]). If you're into tree-ware, [Oakes|https://global.oup.com/academic/product/statistics-for-corpus-linguistics-9780748608171?cc=us&lang=en&] has a great introduction to concordances among many other useful topics!
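
To make the concordance layout and its sort orders concrete, here is a minimal Python sketch. It is not the code in the patch and does not use Lucene: the names {{Hit}}, {{concordance}}, {{sort_hits}} and {{render}} are hypothetical, tokenization is a plain regex, and the target is a single lowercase word rather than an arbitrary SpanQuery.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One row of the concordance (hypothetical structure, for illustration)."""
    doc_id: int
    position: int       # token index of the target within its document
    left: List[str]     # up to `width` words before the target
    target: str
    right: List[str]    # up to `width` words after the target


def concordance(docs: List[str], target: str, width: int = 3) -> List[Hit]:
    """Collect every occurrence of `target`, with context, across all docs."""
    hits = []
    for doc_id, text in enumerate(docs):
        tokens = re.findall(r"\w+", text.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                hits.append(Hit(doc_id, i, left, tok, right))
    return hits


def sort_hits(hits: List[Hit], order: str = "after") -> List[Hit]:
    """Sort rows by words before, words after, or order of appearance."""
    if order == "before":
        # Compare the word nearest the target first, then move outward.
        return sorted(hits, key=lambda h: h.left[::-1])
    if order == "after":
        return sorted(hits, key=lambda h: h.right)
    return sorted(hits, key=lambda h: (h.doc_id, h.position))


def render(hits: List[Hit], col_width: int = 30) -> List[str]:
    """Right-align the left context so the target forms a central column."""
    return [
        f"{' '.join(h.left):>{col_width}}  {h.target}  {' '.join(h.right)}"
        for h in hits
    ]


if __name__ == "__main__":
    sample = ["The quick fox jumps. A lazy fox sleeps.", "Red fox runs away."]
    for line in render(sort_hits(concordance(sample, "fox", width=2), "after")):
        print(line)
```

Sorting on the reversed left context is what groups rows by the word immediately before the target, which is the "what modifies it" view; sorting on the right context gives the "what it modifies" view. Unlike snippets, every occurrence becomes a row, regardless of which document it came from.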
> Concordance capability
> ----------------------
>
> Key: LUCENE-5317
> URL: https://issues.apache.org/jira/browse/LUCENE-5317
> Project: Lucene - Core
> Issue Type: New Feature
> Components: core/search
> Affects Versions: 4.5
> Reporter: Tim Allison
> Labels: patch
> Attachments: LUCENE-5317.patch, LUCENE-5317.patch,
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts
> performing analytic search vs. traditional snippeting/document retrieval
> tasks. By "analytic search," I mean that the user wants to browse every time
> a term appears (or at least the topn) in a subset of documents and see the
> words before and after.
> Concordance technology is far simpler and less interesting than IR relevance
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to
> obtain character offsets. There is plenty of room for optimizations and
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this
> patch.