This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 0229d5a additional descriptive material
0229d5a is described below
commit 0229d5a4b317d533b7557475079aa68236aabb73
Author: Paul King <[email protected]>
AuthorDate: Sun Nov 24 17:24:13 2024 +1000
additional descriptive material
---
site/src/site/blog/groovy-lucene.adoc | 188 +++++++++++++++++++++++-----------
1 file changed, 127 insertions(+), 61 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc
b/site/src/site/blog/groovy-lucene.adoc
index 8299c00..06b67ad 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -11,15 +11,15 @@ https://solr.apache.org[Apache Solr] to crawl/index those
web pages and search u
For this post, we are going to search for the
information we require from the original source
(https://asciidoc.org/[AsciiDoc]) files.
We'll first look at how we can find project references using regular
expressions
-and then using Apache Lucene.
+and then using https://lucene.apache.org/[Apache Lucene].
== Finding project names with a regex
For the sake of this post, let's assume that project references will
-include the work "Apache" followed by the project name. To make it more
+include the word "Apache" followed by the project name. To make it more
interesting, we'll also include references to Eclipse projects.
-We'll also make provision for project with subprojects, at least for
-Apache Commons, so this will pick up names like Apache Commons Math
+We'll also make provision for projects with subprojects, at least for
+Apache Commons, so this will pick up names like "Apache Commons Math"
for instance. We'll exclude Apache Groovy since that would hit possibly
every Groovy blog post. We'll also exclude a bunch of words that appear in
commonly used phrases like "Apache License" and "Apache Projects".
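
As a rough sketch of the kind of pattern just described (the post's actual
regex and exclusion list may well differ), something along these lines would do:

[source,groovy]
----
// illustrative only: case-insensitive, allows an Apache or Eclipse prefix,
// gives Apache Commons subprojects special treatment, and skips a few
// excluded words such as "License" and "Projects"
var pattern = ~/(?i)\b(Apache|Eclipse) (?!Groovy|License|Projects)(Commons \w+|\w+)/

var text = 'We use Apache Commons Math, Apache Lucene and Eclipse Collections under the Apache License'
var found = text.findAll(pattern) { full, foundation, name -> "$foundation $name".toLowerCase() }
assert found == ['apache commons math', 'apache lucene', 'eclipse collections']
----
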
@@ -227,7 +227,7 @@ new IndexWriter(indexDir, config).withCloseable { writer ->
<4> Also store the name of the file
With an index defined, we'd typically now perform some kind of search.
-We'll do just that shortly, but first for the kind of information we are
interested in,
+We'll do just that shortly, but first, for the kind of information we are
interested in,
part of the Lucene API lets us explore the index. Here is how we might do that:
[source,groovy]
@@ -276,8 +276,8 @@ docFreq.sort(byReverseValue).take(10).each { k, v ->
----
<1> Get all index terms
<2> Look for terms which match project names, so we can save them to a set
-<3> Grab hit frequency metadata for our term
-<4> Grab document frequency metadata for our term
+<3> Grab hit frequency metadata for each term in our set of terms
+<4> Grab document frequency metadata for each term in our set of terms
When we run this we see:
@@ -385,8 +385,8 @@ module to highlight hits as part of potentially displaying
them on a web page
as part of some web search functionality. For us, we are going to just
pick out the terms of interest, project names that match our query.
-We the highlight functionality to work, we ask the indexer to store some
additional information
-when indexing about term positions. The index code changes to look like this:
+For the highlight functionality to work, we ask the indexer to store some
additional information
+when indexing, in particular term positions and offsets. The index code
changes to look like this:
[source,groovy]
----
@@ -420,9 +420,12 @@ List<String> handleHit(ScoreDoc hit, Query query,
DirectoryReader dirReader) {
var fieldQuery = new FieldQuery(query, dirReader, phraseHighlight,
fieldMatch)
var stack = new FieldTermStack(dirReader, hit.doc, 'content', fieldQuery)
var phrases = new FieldPhraseList(stack, fieldQuery)
- phrases.phraseList*.termsInfos*.text.flatten()
+ phrases.phraseList*.termsInfos*.text.flatten() // <1>
}
----
+<1> Converts the `FieldPhraseList` into a list of `TermInfo` instances and then into a
list of strings
+
+Now we can run our query code:
[source,groovy]
----
@@ -518,14 +521,14 @@ var analyzer = new ProjectNameAnalyzer()
var indexDir = new ByteBuffersDirectory()
var taxonDir = new ByteBuffersDirectory()
var config = new IndexWriterConfig(analyzer)
-var indexWriter = new IndexWriter(indexDir, config)
-var taxonWriter = new DirectoryTaxonomyWriter(taxonDir)
-
-var fConfig = new FacetsConfig().tap {
- setHierarchical("projectNameCounts", true)
- setMultiValued("projectNameCounts", true)
- setMultiValued("projectFileCounts", true)
- setMultiValued("projectHitCounts", true)
+var indexWriter = new IndexWriter(indexDir, config) // <1>
+var taxonWriter = new DirectoryTaxonomyWriter(taxonDir) // <2>
+
+var fConfig = new FacetsConfig().tap { // <3>
+ setHierarchical('projectNameCounts', true)
+ setMultiValued('projectNameCounts', true)
+ setMultiValued('projectFileCounts', true)
+ setMultiValued('projectHitCounts', true)
setIndexFieldName('projectHitCounts', '$projectHitCounts')
}
@@ -541,10 +544,10 @@ new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) {
file ->
document.add(new StringField('name', file.name, Field.Store.YES))
if (projects) {
println "$file.name: $projects"
- projects.each { k, v ->
- document.add(new IntAssociationFacetField(v,
"projectHitCounts", k))
- document.add(new FacetField("projectFileCounts", k))
- document.add(new FacetField("projectNameCounts", k.split()))
+ projects.each { k, v -> // <4>
+ document.add(new IntAssociationFacetField(v,
'projectHitCounts', k))
+ document.add(new FacetField('projectFileCounts', k))
+ document.add(new FacetField('projectNameCounts', k.split()))
}
}
indexWriter.addDocument(fConfig.build(taxonWriter, document))
@@ -553,6 +556,10 @@ new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file
->
indexWriter.close()
taxonWriter.close()
----
+<1> Our normal index writer
+<2> A writer for our taxonomy
+<3> Define some properties for the facets we are interested in
+<4> We add our facets of interest to our document
Since we are collecting this data during indexing, we can print it out:
@@ -592,6 +599,8 @@ zipping-collections-with-groovy.adoc:
[eclipse collections:4]
Now when doing searches, we can extract the taxonomy information along with
other info.
With `projectHitCounts` we can gather the taxonomy metadata for the top hits
from our search.
+We'll use `MatchAllDocsQuery` to match all documents, i.e. the metadata will
be for
+all documents.
[source,groovy]
----
@@ -603,22 +612,19 @@ var fc = FacetsCollectorManager.search(searcher, new
MatchAllDocsQuery(), 0, fcm
var topN = 5
var projects = new TaxonomyFacetIntAssociations('$projectHitCounts',
taxonReader, fConfig, fc, AssociationAggregationFunction.SUM)
-var hitCounts = projects.getTopChildren(topN,
"projectHitCounts").labelValues.collect{
- [label: it.label, hits: it.value, files: it.count]
-}
+var hitData = projects.getTopChildren(topN, 'projectHitCounts').labelValues
println "\nFrequency of total hits mentioning a project (top $topN):"
-hitCounts.sort{ m -> -m.hits }.each { m ->
- var label = "$m.label ($m.hits)"
- println "${label.padRight(32)} ${bar(m.hits, 0, 50, 50)}"
+hitData.each { m ->
+ var label = "$m.label ($m.value)"
+ println "${label.padRight(32)} ${bar(m.value, 0, 50, 50)}"
}
println "\nFrequency of documents mentioning a project (top $topN):"
-hitCounts.sort{ m -> -m.files }.each { m ->
- var label = "$m.label ($m.files)"
- println "${label.padRight(32)} ${bar(m.files * 2, 0, 20, 20)}"
+hitData.each { m ->
+ var label = "$m.label ($m.count)"
+ println "${label.padRight(32)} ${bar(m.count * 2, 0, 20, 20)}"
}
-
----
When running this we can see the frequencies for the total hits and number of
files:
@@ -646,16 +652,16 @@ apache mxnet (1) ██▏
NOTE: At the time of writing, there is a bug in sorting for the second of
these graphs.
A https://github.com/apache/lucene/issues/14008[fix] is coming.
-Now, the taxonomy information about file counts is for the selected top hits
based on number of hits.
-One of our other facets (`projectFileCounts`) lets us look at the top
frequencies of references in files. Let's look at how we can query that
information:
+Now, the taxonomy information about document frequency is for the top hits
scored using the number of hits.
+One of our other facets (`projectFileCounts`) tracks document frequency
independently.
+Let's look at how we can query that information:
[source,groovy]
----
var facets = new FastTaxonomyFacetCounts(taxonReader, fConfig, fc)
println "\nFrequency of documents mentioning a project (top $topN):"
-var fileCounts = facets.getTopChildren(topN, "projectFileCounts")
-println fileCounts
+println facets.getTopChildren(topN, 'projectFileCounts')
----
The output looks like this:
@@ -675,7 +681,9 @@ dim=projectFileCounts path=[] value=-1 childCount=27
When comparing this result, with the result from our previous facet,
we can see that commons csv is mentioned in more files than mxnet,
-even though mxnet is mentioned more times.
+even though mxnet is mentioned more times. In general, you'd decide
+which document frequency is of more interest to you, and you'd skip
+the `projectFileCounts` facet if you didn't need that extra information.
Our final facet (`projectNameCounts`) is a hierarchical facet. These are
typically used interactively
when "browsing" search results. We can look at project names by first word,
e.g. the foundation.
@@ -685,12 +693,12 @@ Here is the code which does that.
[source,groovy]
----
-['apache', 'commons'].inits().reverseEach { path ->
+['apache', 'commons'].inits().reverseEach { path -> // <1>
println "Frequency of documents mentioning a project with path $path (top
$topN):"
- var nameCounts = facets.getTopChildren(topN, "projectNameCounts", *path)
- println "$nameCounts"
+ println "${facets.getTopChildren(topN, 'projectNameCounts', *path)}"
}
----
+<1> The `inits()` method returns all prefixes of a list including the empty
list.
The output looks like this:
@@ -725,7 +733,7 @@ which might just use the latter.
[source,groovy]
----
-var parser = new QueryParser("content", analyzer)
+var parser = new QueryParser('content', analyzer)
var query = parser.parse(/apache\ * AND eclipse\ * AND emoji*/)
var results = searcher.search(query, topN)
var storedFields = searcher.storedFields()
@@ -734,12 +742,20 @@ assert results.totalHits.value() == 1 &&
----
This query shows that there is exactly one blog post that mentions
-Apache projects, Eclipse projects and also emojis.
+Apache projects, Eclipse projects, and also emojis.
+
+Facets are a really powerful feature. Given that we are indexing asciidoc
source
+files, we could even use libraries like
https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
+to extract more metadata from our source files and store them as facets.
+We could for instance extract titles, author(s), keywords, publication dates and
so forth.
+This would allow us to make some pretty powerful searches.
+We leave this as an exercise for the reader.
+But if you try, please let us know how you go!
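
As a hint of how that might start, here is a rough sketch of our own (not from
the post); it reuses `baseDir`, `fConfig`, `indexWriter` and `taxonWriter` from
the earlier faceting example, assumes the AsciidoctorJ dependency is on the
classpath, and the facet names are made up:

[source,groovy]
----
import org.apache.lucene.document.Document
import org.apache.lucene.facet.FacetField
import org.asciidoctor.Asciidoctor
import org.asciidoctor.Options

var asciidoctor = Asciidoctor.Factory.create()
fConfig.setMultiValued('authorCounts', true) // a post can have several authors
new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
    var adoc = asciidoctor.loadFile(file, Options.builder().build()) // <1>
    var document = new Document()
    // ... add the content, name and project facet fields as before ...
    if (adoc.doctitle) {
        document.add(new FacetField('titleCounts', adoc.doctitle))
    }
    adoc.authors.each { author ->
        document.add(new FacetField('authorCounts', author.fullName))
    }
    indexWriter.addDocument(fConfig.build(taxonWriter, document))
}
----
<1> Parse the AsciiDoc file so we can read its title and author metadata
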
== More complex queries
As a final example, we chose earlier to extract project names at index time.
-We could have instead used the normal analyzer at the cost of needing more
+We could have instead used a more typical analyzer at the cost of needing more
complex span queries to pull out our project names at search time.
Let's have a look at what the code for that scenario could look like.
@@ -769,11 +785,22 @@ new IndexWriter(indexDir, config).withCloseable { writer
->
----
Now our queries will need to be more complex. We have a few options up our
sleeve,
-but we'll choose to put together our queries using some low level query
classes.
-We'll look for "apache commons <namepart>"
-or "(apache|eclipse) <namepart>",
-where _namepart_ is the project name
-without the foundation prefix.
+but we'll choose to put together our queries using some of Lucene's low-level
query classes.
+
+NOTE: Before considering Lucene's low-level query classes, you might
+want to look at some of Lucene's higher-level query classes like the
`QueryParser` class.
+It supports representing a query as a string and includes support for phrases,
+ranges, regex terms and so forth. As far as I am aware, it doesn't support
+a regex within a phrase, hence the low-level classes we'll explore below.
+
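For instance, a single query string can combine a regex term, a phrase and a
range (an illustrative query of our own, not one from the post):

[source,groovy]
----
import org.apache.lucene.queryparser.classic.QueryParser

// illustrative only: a regex term, a phrase, and a range on the name field,
// all expressed in the classic QueryParser string syntax
var parser = new QueryParser('content', analyzer)
var query = parser.parse('/apache|eclipse/ AND "commons math" AND name:[a TO m]')
println query
----
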
+We'll look for expressions like "apache commons <suffix>"
+or "(apache|eclipse) <suffix>",
+where _suffix_ is the project name
+without the foundation prefix, or in the case of Apache Commons, the
subproject name.
+
+Instead of having a list of stop words (excluded words) like in our regex,
+we'll just have a list of allowable project suffix names.
+It wouldn't be hard to swap to the stop word approach if we wanted.
[source,groovy]
----
@@ -786,19 +813,19 @@ var projects = [
'nlpcraft', 'pekko', 'hugegraph', 'tinkerpop', 'commons',
'cli', 'opennlp', 'ofbiz', 'codec', 'kie', 'flink'
]
-var namepart = new SpanMultiTermQueryWrapper(new RegexpQuery(
+var suffix = new SpanMultiTermQueryWrapper(new RegexpQuery( // <1>
new Term('content', "(${projects.join('|')})")))
-// look for apache commons <namepart>
+// look for apache commons <suffix>
SpanQuery[] spanTerms = ['apache', 'commons'].collect{
new SpanTermQuery(new Term('content', it))
-} + namepart
+} + suffix
var apacheCommons = new SpanNearQuery(spanTerms, 0, true)
-// look for (apache|eclipse) <namepart>
+// look for (apache|eclipse) <suffix>
var foundation = new SpanMultiTermQueryWrapper(new RegexpQuery(
new Term('content', '(apache|eclipse)')))
-var otherProject = new SpanNearQuery([foundation, namepart] as SpanQuery[], 0,
true)
+var otherProject = new SpanNearQuery([foundation, suffix] as SpanQuery[], 0,
true)
var builder = new BooleanQuery.Builder(minimumNumberShouldMatch: 1)
builder.add(otherProject, BooleanClause.Occur.SHOULD)
@@ -807,6 +834,7 @@ var query = builder.build()
var results = searcher.search(query, 30)
println "Total documents with hits for $query --> $results.totalHits"
----
+<1> Regex queries are wrapped so they can be used within a span query
When we run this we see the same number of hits as before:
@@ -814,17 +842,49 @@ When we run this we see the same number of hits as before:
Total documents with hits for
(spanNear([SpanMultiTermQueryWrapper(content:/(apache|eclipse)/),
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pekko|hugegraph|tinkerpop|commons|cli|opennlp|ofbiz|codec|kie|flink)/)],
0, true) spanNear([content:apache, content:commons,
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pek
[...]
----
-Using the `StandardAnalyzer` has numerous advantages.
+Another thing we might want to consider for this example is to make use of
+Groovy's excellent Domain Specific Language (DSL) capabilities.
+By defining one helper method, `span`, and providing one metaprogramming
+extension for `or` on Lucene's `Query` class, we can rewrite the last 20 lines
+of the previous example in a more compact and understandable form:
+
+[source,groovy]
+----
+var suffix = "(${projects.join('|')})"
+var query = span('apache', 'commons', ~suffix) | span(~'(apache|eclipse)',
~suffix)
+var results = searcher.search(query, 30)
+println "Total documents with hits for $query --> $results.totalHits"
+----
+
+Running the code gives the same output as previously.
+
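In case you're wondering, the `span` helper and the `or` extension aren't part
of Lucene; here is one way they might be written (a sketch of our own, with
imports shown for Lucene 9+ where the span classes live in the `lucene-queries`
module):

[source,groovy]
----
import java.util.regex.Pattern
import org.apache.lucene.index.Term
import org.apache.lucene.queries.spans.*
import org.apache.lucene.search.*

SpanQuery span(Object... parts) { // <1>
    SpanQuery[] clauses = parts.collect { part ->
        part instanceof Pattern
            ? new SpanMultiTermQueryWrapper(new RegexpQuery(new Term('content', part.pattern())))
            : new SpanTermQuery(new Term('content', part.toString()))
    }
    new SpanNearQuery(clauses, 0, true)
}

ExpandoMetaClass.enableGlobally() // <2>
Query.metaClass.or = { Query other -> // <3>
    new BooleanQuery.Builder(minimumNumberShouldMatch: 1)
        .add(delegate, BooleanClause.Occur.SHOULD)
        .add(other, BooleanClause.Occur.SHOULD)
        .build()
}
----
<1> Strings become exact span terms; regex patterns (from Groovy's `~` operator) become wrapped regexp queries
<2> Enable metaClass inheritance (call this early, before the query classes are first used) so the method added to `Query` is seen by its subclasses
<3> Groovy maps the `|` operator onto the `or` method
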
+We can try out our DSL on other terms:
+
+[source,groovy]
+----
+query = span('jackson', 'databind') | span(~'virt.*', 'threads')
+results = searcher.search(query, 30)
+println "Total documents with hits for $query --> $results.totalHits"
+----
+
+When run, we'll now see this output:
+
+----
+Total documents with hits for (spanNear([content:jackson, content:databind],
0, true) spanNear([SpanMultiTermQueryWrapper(content:/virt.*/),
content:threads], 0, true))~1 --> 8 hits
+----
+
+Using the `StandardAnalyzer` with span queries certainly opens up the
possibility
+of a much wider range of queries. But `StandardAnalyzer` also has other
advantages.
It has baked into it the ability for stop words, smart word breaking,
lowercasing
and other features. Other built-in analyzers might also be useful. We could of
course,
-also make our regex-based analyzer smarter. Many of Lucene's features are in
reusable
-pieces.
+also make our regex-based analyzer smarter. The fact that many of Lucene's
features
+are in reusable pieces certainly helps.
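
To see some of those baked-in behaviours in isolation, here is a small
illustrative snippet (our own, not from the post); it assumes the
`lucene-analysis-common` dependency for the English stop word set:

[source,groovy]
----
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

var sa = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET)
var tokens = []
sa.tokenStream('content', 'The Apache Lucene library!').withCloseable { ts ->
    var term = ts.addAttribute(CharTermAttribute)
    ts.reset()                                     // required before iterating
    while (ts.incrementToken()) tokens << term.toString()
    ts.end()
}
assert tokens == ['apache', 'lucene', 'library']   // lowercased, "The" dropped
----
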
-Another advantage of the `StandardAnalyzer` is that it properly handles emojis
in our index.
-Our regex analyzer in its current form only looks for "word" characters which
doesn't
-work with emoji characters, although it could be expanded to support them.
+A fun advantage of the `StandardAnalyzer` is that it properly handles emojis
in our index.
+Our regex analyzer in its current form only looks for "regex word" characters
which doesn't
+include emoji characters, although it could be expanded to support them.
-Given that we've used `StandardAnalyzer` here, let's look again look at terms
+Given that we've used `StandardAnalyzer` here, let's look again at terms
in our index but this time pull out emojis instead of project names:
[source,groovy]
@@ -853,7 +913,13 @@ When run, you should see something like this (flag emojis
may not show up on som
image:img/LuceneWithStandardAnalyzer.png[]
+== References
+
+* Lucene project https://lucene.apache.org/[website]
+* Source code https://github.com/paulk-asert/groovy-lucene[examples] for this
blog post
+
== Conclusion
We have analyzed the Groovy blog posts looking for referenced projects
-using regular expressions and Apache Lucene.
+using regular expressions and Apache Lucene. Hopefully this gives you a taste
+of the Lucene APIs and some of Groovy's features.