This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 0229d5a additional descriptive material
0229d5a is described below
commit 0229d5a4b317d533b7557475079aa68236aabb73
Author: Paul King <[email protected]>
AuthorDate: Sun Nov 24 17:24:13 2024 +1000
additional descriptive material
---
site/src/site/blog/groovy-lucene.adoc | 188 +++++++++++++++++++++++-----------
1 file changed, 127 insertions(+), 61 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc
b/site/src/site/blog/groovy-lucene.adoc
index 8299c00..06b67ad 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -11,15 +11,15 @@ https://solr.apache.org[Apache Solr] to crawl/index those
web pages and search u
For this post, we are going to search for the
information we require from the original source
(https://asciidoc.org/[AsciiDoc]) files.
We'll first look at how we can find project references using regular
expressions
-and then using Apache Lucene.
+and then using https://lucene.apache.org/[Apache Lucene].
== Finding project names with a regex
For the sake of this post, let's assume that project references will
-include the work "Apache" followed by the project name. To make it more
+include the word "Apache" followed by the project name. To make it more
interesting, we'll also include references to Eclipse projects.
-We'll also make provision for project with subprojects, at least for
-Apache Commons, so this will pick up names like Apache Commons Math
+We'll also make provision for projects with subprojects, at least for
+Apache Commons, so this will pick up names like "Apache Commons Math"
for instance. We'll exclude Apache Groovy since that would hit possibly
every Groovy blog post. We'll also exclude a bunch of words that appear in
commonly used phrases like "Apache License" and "Apache Projects".
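
As a rough sketch of the kind of pattern just described (the post's actual
regex and exclusion list may well differ), something along these lines would do:

[source,groovy]
----
// illustrative only: case-insensitive, allows an Apache or Eclipse prefix,
// gives Apache Commons subprojects special treatment, and skips a few
// excluded words such as "License" and "Projects"
var pattern = ~/(?i)\b(Apache|Eclipse) (?!Groovy|License|Projects)(Commons \w+|\w+)/

var text = 'We use Apache Commons Math, Apache Lucene and Eclipse Collections under the Apache License'
var found = text.findAll(pattern) { full, foundation, name -> "$foundation $name".toLowerCase() }
assert found == ['apache commons math', 'apache lucene', 'eclipse collections']
----
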
@@ -227,7 +227,7 @@ new IndexWriter(indexDir, config).withCloseable { writer ->
<4> Also store the name of the file
With an index defined, we'd typically now perform some kind of search.
-We'll do just that shortly, but first for the kind of information we are
interested in,
+We'll do just that shortly, but first, for the kind of information we are
interested in,
part of the Lucene API lets us explore the index. Here is how we might do that:
[source,groovy]
@@ -276,8 +276,8 @@ docFreq.sort(byReverseValue).take(10).each { k, v ->
----
<1> Get all index terms
<2> Look for terms which match project names, so we can save them to a set
-<3> Grab hit frequency metadata for our term
-<4> Grab document frequency metadata for our term
+<3> Grab hit frequency metadata for each term in our set of terms
+<4> Grab document frequency metadata for each term in our set of terms
When we run this we see:
@@ -385,8 +385,8 @@ module to highlight hits as part of potentially displaying
them on a web page
as part of some web search functionality. For us, we are going to just
pick out the terms of interest, project names that match our query.
-We the highlight functionality to work, we ask the indexer to store some
additional information
-when indexing about term positions. The index code changes to look like this:
+For the highlight functionality to work, we ask the indexer to store some
additional information
+when indexing, in particular term positions and offsets. The index code
changes to look like this:
[source,groovy]
----
@@ -420,9 +420,12 @@ List<String> handleHit(ScoreDoc hit, Query query,
DirectoryReader dirReader) {
var fieldQuery = new FieldQuery(query, dirReader, phraseHighlight,
fieldMatch)
var stack = new FieldTermStack(dirReader, hit.doc, 'content', fieldQuery)
var phrases = new FieldPhraseList(stack, fieldQuery)
- phrases.phraseList*.termsInfos*.text.flatten()
+ phrases.phraseList*.termsInfos*.text.flatten() // <1>
}
----
+<1> Converts the `FieldPhraseList` into a list of `TermInfo` instances and then into a
list of strings
+
+Now we can run our query code:
[source,groovy]
----
@@ -518,14 +521,14 @@ var analyzer = new ProjectNameAnalyzer()
var indexDir = new ByteBuffersDirectory()
var taxonDir = new ByteBuffersDirectory()
var config = new IndexWriterConfig(analyzer)
-var indexWriter = new IndexWriter(indexDir, config)
-var taxonWriter = new DirectoryTaxonomyWriter(taxonDir)
-
-var fConfig = new FacetsConfig().tap {
- setHierarchical("projectNameCounts", true)
- setMultiValued("projectNameCounts", true)
- setMultiValued("projectFileCounts", true)
- setMultiValued("projectHitCounts", true)
+var indexWriter = new IndexWriter(indexDir, config) // <1>
+var taxonWriter = new DirectoryTaxonomyWriter(taxonDir) // <2>
+
+var fConfig = new FacetsConfig().tap { // <3>
+ setHierarchical('projectNameCounts', true)
+ setMultiValued('projectNameCounts', true)
+ setMultiValued('projectFileCounts', true)
+ setMultiValued('projectHitCounts', true)
setIndexFieldName('projectHitCounts', '$projectHitCounts')
}
@@ -541,10 +544,10 @@ new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) {
file ->
document.add(new StringField('name', file.name, Field.Store.YES))
if (projects) {
println "$file.name: $projects"
- projects.each { k, v ->
- document.add(new IntAssociationFacetField(v,
"projectHitCounts", k))
- document.add(new FacetField("projectFileCounts", k))
- document.add(new FacetField("projectNameCounts", k.split()))
+ projects.each { k, v -> // <4>
+ document.add(new IntAssociationFacetField(v,
'projectHitCounts', k))
+ document.add(new FacetField('projectFileCounts', k))
+ document.add(new FacetField('projectNameCounts', k.split()))
}
}
indexWriter.addDocument(fConfig.build(taxonWriter, document))
@@ -553,6 +556,10 @@ new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file
->
indexWriter.close()
taxonWriter.close()
----
+<1> Our normal index writer
+<2> A writer for our taxonomy
+<3> Define some properties for the facets we are interested in
+<4> We add our facets of interest to our document
Since we are collecting this data during indexing, we can print it out:
@@ -592,6 +599,8 @@ zipping-collections-with-groovy.adoc:
[eclipse collections:4]
Now when doing searches, we can extract the taxonomy information along with
other info.
With `projectHitCounts` we can gather the taxonomy metadata for the top hits
from our search.
+We'll use `MatchAllDocsQuery` to match all documents, i.e. the metadata will
be for
+all documents.
[source,groovy]
----
@@ -603,22 +612,19 @@ var fc = FacetsCollectorManager.search(searcher, new
MatchAllDocsQuery(), 0, fcm
var topN = 5
var projects = new TaxonomyFacetIntAssociations('$projectHitCounts',
taxonReader, fConfig, fc, AssociationAggregationFunction.SUM)
-var hitCounts = projects.getTopChildren(topN,
"projectHitCounts").labelValues.collect{
- [label: it.label, hits: it.value, files: it.count]
-}
+var hitData = projects.getTopChildren(topN, 'projectHitCounts').labelValues
println "\nFrequency of total hits mentioning a project (top $topN):"
-hitCounts.sort{ m -> -m.hits }.each { m ->
- var label = "$m.label ($m.hits)"
- println "${label.padRight(32)} ${bar(m.hits, 0, 50, 50)}"
+hitData.each { m ->
+ var label = "$m.label ($m.value)"
+ println "${label.padRight(32)} ${bar(m.value, 0, 50, 50)}"
}
println "\nFrequency of documents mentioning a project (top $topN):"
-hitCounts.sort{ m -> -m.files }.each { m ->
- var label = "$m.label ($m.files)"
- println "${label.padRight(32)} ${bar(m.files * 2, 0, 20, 20)}"
+hitData.each { m ->
+ var label = "$m.label ($m.count)"
+ println "${label.padRight(32)} ${bar(m.count * 2, 0, 20, 20)}"
}
-
----
When running this we can see the frequencies for the total hits and number of
files:
@@ -646,16 +652,16 @@ apache mxnet (1) ██▏
NOTE: At the time of writing, there is a bug in sorting for the second of
these graphs.
A https://github.com/apache/lucene/issues/14008[fix] is coming.
-Now, the taxonomy information about file counts is for the selected top hits
based on number of hits.
-One of our other facets (`projectFileCounts`) lets us look at the top
frequencies of references in files. Let's look at how we can query that
information:
+Now, the taxonomy information about document frequency is for the top hits
scored using the number of hits.
+One of our other facets (`projectFileCounts`) tracks document frequency
independently.
+Let's look at how we can query that information:
[source,groovy]
----
var facets = new FastTaxonomyFacetCounts(taxonReader, fConfig, fc)
println "\nFrequency of documents mentioning a project (top $topN):"
-var fileCounts = facets.getTopChildren(topN, "projectFileCounts")
-println fileCounts
+println facets.getTopChildren(topN, 'projectFileCounts')
----
The output looks like this:
@@ -675,7 +681,9 @@ dim=projectFileCounts path=[] value=-1 childCount=27
When comparing this result, with the result from our previous facet,
we can see that commons csv is mentioned in more files than mxnet,
-even though mxnet is mentioned more times.
+even though mxnet is mentioned more times. In general, you'd decide
+which document frequency is of more interest to you, and you'd skip
+the `projectFileCounts` facet if you didn't need that extra information.
Our final facet (`projectNameCounts`) is a hierarchical facet. These are
typically used interactively
when "browsing" search results. We can look at project names by first word,
e.g. the foundation.
@@ -685,12 +693,12 @@ Here is the code which does that.
[source,groovy]
----
-['apache', 'commons'].inits().reverseEach { path ->
+['apache', 'commons'].inits().reverseEach { path -> // <1>
println "Frequency of documents mentioning a project with path $path (top
$topN):"
- var nameCounts = facets.getTopChildren(topN, "projectNameCounts", *path)
- println "$nameCounts"
+ println "${facets.getTopChildren(topN, 'projectNameCounts', *path)}"
}
----
+<1> The `inits()` method returns all prefixes of a list including the empty
list.
The output looks like this:
@@ -725,7 +733,7 @@ which might just use the latter.
[source,groovy]
----
-var parser = new QueryParser("content", analyzer)
+var parser = new QueryParser('content', analyzer)
var query = parser.parse(/apache\ * AND eclipse\ * AND emoji*/)
var results = searcher.search(query, topN)
var storedFields = searcher.storedFields()
@@ -734,12 +742,20 @@ assert results.totalHits.value() == 1 &&
----
This query shows that there is exactly one blog post that mentions
-Apache projects, Eclipse projects and also emojis.
+Apache projects, Eclipse projects, and also emojis.
+
+Facets are a really powerful feature. Given that we are indexing asciidoc
source
+files, we could even use libraries like
https://github.com/asciidoctor/asciidoctorj[AsciidoctorJ]
+to extract more metadata from our source files and store them as facets.
+We could for instance extract titles, author(s), keywords, publication dates and
so forth.
+This would allow us to make some pretty powerful searches.
+We leave this as an exercise for the reader.
+But if you try, please let us know how you go!
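
As a hint of how that might start, here is a rough sketch of our own (not from
the post); it reuses `baseDir`, `fConfig`, `indexWriter` and `taxonWriter` from
the earlier faceting example, assumes the AsciidoctorJ dependency is on the
classpath, and the facet names are made up:

[source,groovy]
----
import org.apache.lucene.document.Document
import org.apache.lucene.facet.FacetField
import org.asciidoctor.Asciidoctor
import org.asciidoctor.Options

var asciidoctor = Asciidoctor.Factory.create()
fConfig.setMultiValued('authorCounts', true) // a post can have several authors
new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
    var adoc = asciidoctor.loadFile(file, Options.builder().build()) // <1>
    var document = new Document()
    // ... add the content, name and project facet fields as before ...
    if (adoc.doctitle) {
        document.add(new FacetField('titleCounts', adoc.doctitle))
    }
    adoc.authors.each { author ->
        document.add(new FacetField('authorCounts', author.fullName))
    }
    indexWriter.addDocument(fConfig.build(taxonWriter, document))
}
----
<1> Parse the AsciiDoc file so we can read its title and author metadata
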
== More complex queries
As a final example, we chose earlier to extract project names at index time.
-We could have instead used the normal analyzer at the cost of needing more
+We could have instead used a more typical analyzer at the cost of needing more
complex span queries to pull out our project names at search time.
Let's have a look at what the code for that scenario could look like.
@@ -769,11 +785,22 @@ new IndexWriter(indexDir, config).withCloseable { writer
->
----
Now our queries will need to be more complex. We have a few options up our
sleeve,
-but we'll choose to put together our queries using some low level query
classes.
-We'll look for "apache commons <namepart>"
-or "(apache|eclipse) <namepart>",
-where _namepart_ is the project name
-without the foundation prefix.
+but we'll choose to put together our queries using some of Lucene's low-level
query classes.
+
+NOTE: Before considering Lucene's low-level query classes, you might
+want to look at some of Lucene's higher-level query classes like the
`QueryParser` class.
+It supports representing a query as a string and includes support for phrases,
+ranges, regex terms and so forth. As far as I am aware, it doesn't support
+a regex within a phrase, hence the low-level classes we'll explore below.
+
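For instance, a single query string can combine a regex term, a phrase and a
range (an illustrative query of our own, not one from the post):

[source,groovy]
----
import org.apache.lucene.queryparser.classic.QueryParser

// illustrative only: a regex term, a phrase, and a range on the name field,
// all expressed in the classic QueryParser string syntax
var parser = new QueryParser('content', analyzer)
var query = parser.parse('/apache|eclipse/ AND "commons math" AND name:[a TO m]')
println query
----
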
+We'll look for expressions like "apache commons <suffix>"
+or "(apache|eclipse) <suffix>",
+where _suffix_ is the project name
+without the foundation prefix, or in the case of Apache Commons, the
subproject name.
+
+Instead of having a list of stop words (excluded words) like in our regex,
+we'll just have a list of allowable project suffix names.
+It wouldn't be hard to swap to the stop word approach if we wanted.
[source,groovy]
----
@@ -786,19 +813,19 @@ var projects = [
'nlpcraft', 'pekko', 'hugegraph', 'tinkerpop', 'commons',
'cli', 'opennlp', 'ofbiz', 'codec', 'kie', 'flink'
]
-var namepart = new SpanMultiTermQueryWrapper(new RegexpQuery(
+var suffix = new SpanMultiTermQueryWrapper(new RegexpQuery( // <1>
new Term('content', "(${projects.join('|')})")))
-// look for apache commons <namepart>
+// look for apache commons <suffix>
SpanQuery[] spanTerms = ['apache', 'commons'].collect{
new SpanTermQuery(new Term('content', it))
-} + namepart
+} + suffix
var apacheCommons = new SpanNearQuery(spanTerms, 0, true)
-// look for (apache|eclipse) <namepart>
+// look for (apache|eclipse) <suffix>
var foundation = new SpanMultiTermQueryWrapper(new RegexpQuery(
new Term('content', '(apache|eclipse)')))
-var otherProject = new SpanNearQuery([foundation, namepart] as SpanQuery[], 0,
true)
+var otherProject = new SpanNearQuery([foundation, suffix] as SpanQuery[], 0,
true)
var builder = new BooleanQuery.Builder(minimumNumberShouldMatch: 1)
builder.add(otherProject, BooleanClause.Occur.SHOULD)
@@ -807,6 +834,7 @@ var query = builder.build()
var results = searcher.search(query, 30)
println "Total documents with hits for $query --> $results.totalHits"
----
+<1> Regex queries are wrapped so they can be used within a span query
When we run this we see the same number of hits as before:
@@ -814,17 +842,49 @@ When we run this we see the same number of hits as before:
Total documents with hits for
(spanNear([SpanMultiTermQueryWrapper(content:/(apache|eclipse)/),
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pekko|hugegraph|tinkerpop|commons|cli|opennlp|ofbiz|codec|kie|flink)/)],
0, true) spanNear([content:apache, content:commons,
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pek
[...]
----
-Using the `StandardAnalyzer` has numerous advantages.
+Another thing we might want to consider for this example is to make use of
+Groovy's excellent Domain Specific Language (DSL) capabilities.
+By defining one helper method, `span`, and providing one metaprogramming
+extension for `or` on Lucene's `Query` class, we can rewrite the last 20 lines
+of the previous example in a more compact and understandable form:
+
+[source,groovy]
+----
+var suffix = "(${projects.join('|')})"
+var query = span('apache', 'commons', ~suffix) | span(~'(apache|eclipse)',
~suffix)
+var results = searcher.search(query, 30)
+println "Total documents with hits for $query --> $results.totalHits"
+----
+
+Running the code gives the same output as previously.
+
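In case you're wondering, the `span` helper and the `or` extension aren't part
of Lucene; here is one way they might be written (a sketch of our own, with
imports shown for Lucene 9+ where the span classes live in the `lucene-queries`
module):

[source,groovy]
----
import java.util.regex.Pattern
import org.apache.lucene.index.Term
import org.apache.lucene.queries.spans.*
import org.apache.lucene.search.*

SpanQuery span(Object... parts) { // <1>
    SpanQuery[] clauses = parts.collect { part ->
        part instanceof Pattern
            ? new SpanMultiTermQueryWrapper(new RegexpQuery(new Term('content', part.pattern())))
            : new SpanTermQuery(new Term('content', part.toString()))
    }
    new SpanNearQuery(clauses, 0, true)
}

ExpandoMetaClass.enableGlobally() // <2>
Query.metaClass.or = { Query other -> // <3>
    new BooleanQuery.Builder(minimumNumberShouldMatch: 1)
        .add(delegate, BooleanClause.Occur.SHOULD)
        .add(other, BooleanClause.Occur.SHOULD)
        .build()
}
----
<1> Strings become exact span terms; regex patterns (from Groovy's `~` operator) become wrapped regexp queries
<2> Enable metaClass inheritance (call this early, before the query classes are first used) so the method added to `Query` is seen by its subclasses
<3> Groovy maps the `|` operator onto the `or` method
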
+We can try out our DSL on other terms:
+
+[source,groovy]
+----
+query = span('jackson', 'databind') | span(~'virt.*', 'threads')
+results = searcher.search(query, 30)
+println "Total documents with hits for $query --> $results.totalHits"
+----
+
+When run, we'll now see this output:
+
+----
+Total documents with hits for (spanNear([content:jackson, content:databind],
0, true) spanNear([SpanMultiTermQueryWrapper(content:/virt.*/),
content:threads], 0, true))~1 --> 8 hits
+----
+
+Using the `StandardAnalyzer` with span queries certainly opens up the
possibility
+of a much wider range of queries. But `StandardAnalyzer` also has other
advantages.
It has baked into it the ability for stop words, smart word breaking,
lowercasing
and other features. Other built-in analyzers might also be useful. We could of
course,
-also make our regex-based analyzer smarter. Many of Lucene's features are in
reusable
-pieces.
+also make our regex-based analyzer smarter. The fact that many of Lucene's
features
+are in reusable pieces certainly helps.
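
To see some of those baked-in behaviours in isolation, here is a small
illustrative snippet (our own, not from the post); it assumes the
`lucene-analysis-common` dependency for the English stop word set:

[source,groovy]
----
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

var sa = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET)
var tokens = []
sa.tokenStream('content', 'The Apache Lucene library!').withCloseable { ts ->
    var term = ts.addAttribute(CharTermAttribute)
    ts.reset()                                     // required before iterating
    while (ts.incrementToken()) tokens << term.toString()
    ts.end()
}
assert tokens == ['apache', 'lucene', 'library']   // lowercased, "The" dropped
----
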
-Another advantage of the `StandardAnalyzer` is that it properly handles emojis
in our index.
-Our regex analyzer in its current form only looks for "word" characters which
doesn't
-work with emoji characters, although it could be expanded to support them.
+A fun advantage of the `StandardAnalyzer` is that it properly handles emojis
in our index.
+Our regex analyzer in its current form only looks for "regex word" characters
which doesn't
+include emoji characters, although it could be expanded to support them.
-Given that we've used `StandardAnalyzer` here, let's look again look at terms
+Given that we've used `StandardAnalyzer` here, let's look again at terms
in our index but this time pull out emojis instead of project names:
[source,groovy]
@@ -853,7 +913,13 @@ When run, you should see something like this (flag emojis
may not show up on som
image:img/LuceneWithStandardAnalyzer.png[]
+== References
+
+* Lucene project https://lucene.apache.org/[website]
+* Source code https://github.com/paulk-asert/groovy-lucene[examples] for this
blog post
+
== Conclusion
We have analyzed the Groovy blog posts looking for referenced projects
-using regular expressions and Apache Lucene.
+using regular expressions and Apache Lucene. Hopefully this gives you a taste
+of the Lucene APIs and some of Groovy's features.