This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 2c43bb3 additional descriptive material
2c43bb3 is described below
commit 2c43bb39a39164026678ac12eaadf06a3d8f6aa5
Author: Paul King <[email protected]>
AuthorDate: Sat Nov 23 16:34:20 2024 +1000
additional descriptive material
---
site/src/site/blog/groovy-lucene.adoc | 82 ++++++++++++++++++++++++++++++++++-
1 file changed, 80 insertions(+), 2 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc
b/site/src/site/blog/groovy-lucene.adoc
index 91722aa..8299c00 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -504,6 +504,14 @@ apache commons csv (6) ██████▏
== Using Lucene Facets
+As well as the metadata Lucene stores for its own purposes in the index,
+it provides a mechanism for storing custom metadata called facets.
+If we wanted to, we could store referenced project names using this mechanism.
+
+Let's use our regex to find project names and store the information in various facets.
+Lucene has a special taxonomy index which stores metadata about our metadata.
+We'll also enable that.
+
[source,groovy]
----
var analyzer = new ProjectNameAnalyzer()
@@ -546,6 +554,8 @@ indexWriter.close()
taxonWriter.close()
----
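+
+For reference, enabling the taxonomy index and storing a facet generally has
+the following shape (a hypothetical sketch only, assuming the `indexWriter`
+from the surrounding code; the directory choice, dimension name, and variable
+names are illustrative, not the post's exact code):
+
+[source,groovy]
+----
+// hypothetical sketch of taxonomy-backed facet indexing
+var taxonDir = new ByteBuffersDirectory()
+var taxonWriter = new DirectoryTaxonomyWriter(taxonDir)
+var config = new FacetsConfig()
+config.setHierarchical('projectNameCounts', true) // e.g. apache -> commons -> math
+var doc = new Document()
+doc.add(new FacetField('projectNameCounts', 'apache', 'commons', 'math'))
+indexWriter.addDocument(config.build(taxonWriter, doc))
+----
+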
+Since we are collecting this data during indexing, we can print it out.
+
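+One way to do that (a minimal sketch, assuming a hypothetical `docProjects`
+map from file name to per-project occurrence counts, accumulated while
+indexing) might look like:
+
+[source,groovy]
+----
+// hypothetical: docProjects maps each file name to a map of
+// project name -> occurrence count gathered during indexing
+docProjects.each { name, counts ->
+    println "$name: [${counts.collect { p, n -> "$p:$n" }.join(', ')}]"
+}
+----
+
+The output looks like this:
+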
++++
<pre>
apache-nlpcraft-with-groovy.adoc: [apache nlpcraft:5]
@@ -580,6 +590,8 @@ zipping-collections-with-groovy.adoc: [eclipse collections:4]
</pre>
++++
+Now when doing searches, we can extract the taxonomy information along with other info.
+With `projectHitCounts` we can gather the taxonomy metadata for the top hits from our search.
[source,groovy]
----
@@ -609,6 +621,8 @@ hitCounts.sort{ m -> -m.files }.each { m ->
----
+When running this we can see the frequencies for the total hits and number of files:
+
// entered below so that we don't hit this whole table as a bunch of references
++++
<pre>
@@ -629,6 +643,11 @@ apache mxnet (1) ██▏
</pre>
++++
+NOTE: At the time of writing, there is a bug in sorting for the second of these graphs.
+A https://github.com/apache/lucene/issues/14008[fix] is coming.
+
+Now, the file counts shown above are taxonomy information for the selected top hits,
+where selection was based on the number of hits.
+One of our other facets (`projectFileCounts`) lets us instead look at the top
+frequencies of references in files.
+Let's look at how we can query that information:
[source,groovy]
----
@@ -639,6 +658,8 @@ var fileCounts = facets.getTopChildren(topN, "projectFileCounts")
println fileCounts
----
+The output looks like this:
+
++++
<pre>
Frequency of documents mentioning a project (top 5):
@@ -647,11 +668,21 @@ dim=projectFileCounts path=[] value=-1 childCount=27
apache commons math (7)
apache spark (5)
apache ignite (4)
- apache commons csv (4)
+ apache commons csv (4)
</pre>
++++
+When comparing this result with the result from our previous facet,
+we can see that commons csv is mentioned in more files than mxnet,
+even though mxnet is mentioned more times.
+
+Our final facet (`projectNameCounts`) is a hierarchical facet.
+These are typically used interactively when "browsing" search results.
+We can look at project names by first word, e.g. the foundation.
+We could then drill down into "Apache" and find referenced projects,
+and then in the case of commons, we could drill down into its subprojects.
+Here is the code which does that:
+
[source,groovy]
----
['apache', 'commons'].inits().reverseEach { path ->
@@ -661,6 +692,8 @@ dim=projectFileCounts path=[] value=-1 childCount=27
}
----
+The output looks like this:
+
++++
<pre>
Frequency of documents mentioning a project with path [] (top 5):
@@ -687,6 +720,9 @@ dim=projectNameCounts path=[apache, commons] value=-1 childCount=7
</pre>
++++
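+For truly interactive drill-down, Lucene also offers `DrillDownQuery`.
+As a minimal sketch (assuming the `config`, `searcher`, and taxonomy
+variables used earlier; this is illustrative, not the code behind the
+output above), restricting a search to the "apache" branch might look like:
+
+[source,groovy]
+----
+// hypothetical sketch: drill down into facet values under "apache"
+var drillDown = new DrillDownQuery(config)
+drillDown.add('projectNameCounts', 'apache')
+var topDocs = searcher.search(drillDown, 10)
+println "Documents mentioning an apache project: $topDocs.totalHits"
+----
+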
+We now have a taxonomy index as well as the normal one,
+so we can still do ad hoc queries which might just use the latter.
+
[source,groovy]
----
var parser = new QueryParser("content", analyzer)
@@ -697,12 +733,15 @@ assert results.totalHits.value() == 1 &&
storedFields.document(results.scoreDocs[0].doc).get('name') == 'fruity-eclipse-collections.adoc'
----
+This query shows that there is exactly one blog post that mentions
+Apache projects, Eclipse projects and also emojis.
+
== More complex queries
As a final example, we chose earlier to extract project names at index time.
We could have instead used the normal analyzer at the cost of needing more complex span queries to pull out our project names at search time.
-Let's have a look at what the could for that scenario could look like.
+Let's have a look at what the code for that scenario could look like.
First, we'll do indexing with the `StandardAnalyzer`.
@@ -775,6 +814,45 @@ When we run this we see the same number of hits as before:
Total documents with hits for
(spanNear([SpanMultiTermQueryWrapper(content:/(apache|eclipse)/),
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pekko|hugegraph|tinkerpop|commons|cli|opennlp|ofbiz|codec|kie|flink)/)],
0, true) spanNear([content:apache, content:commons,
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pek
[...]
----
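+
+The query behind output like that might be built roughly as follows
+(a hedged sketch with the long project-name alternation shortened;
+the output above shows the full regex):
+
+[source,groovy]
+----
+// hypothetical sketch: match "apache" or "eclipse" immediately followed
+// by a known project word, using span queries over the standard index
+var wrap = { String regex ->
+    new SpanMultiTermQueryWrapper(new RegexpQuery(new Term('content', regex)))
+}
+var query = new SpanNearQuery([wrap('(apache|eclipse)'),
+                               wrap('(math|spark|lucene|collections)')] as SpanQuery[],
+                               0, true) // slop 0, in order
+----
+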
+Using the `StandardAnalyzer` has numerous advantages.
+It has baked-in support for stop words, smart word breaking, lowercasing,
+and other features. Other built-in analyzers might also be useful.
+We could, of course, also make our regex-based analyzer smarter.
+Many of Lucene's features come as reusable pieces.
+
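+As a quick illustration, here is a small sketch (with hypothetical sample
+text) that prints the tokens the `StandardAnalyzer` produces; note the
+word breaking and lowercasing:
+
+[source,groovy]
+----
+// print each token produced for a sample string
+var standard = new StandardAnalyzer()
+var stream = standard.tokenStream('content', 'Apache Commons Math is fun!')
+var term = stream.addAttribute(CharTermAttribute)
+stream.reset()
+while (stream.incrementToken()) {
+    println term   // apache, commons, math, is, fun
+}
+stream.end()
+stream.close()
+----
+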
+Another advantage of the `StandardAnalyzer` is that it properly handles emojis in our index.
+Our regex analyzer in its current form only looks for "word" characters,
+which doesn't work with emoji characters, although it could be expanded to support them.
+
+Given that we've used the `StandardAnalyzer` here, let's again look at terms
+in our index, but this time pull out emojis instead of project names:
+
+[source,groovy]
+----
+var vectors = reader.termVectors()
+var storedFields = reader.storedFields()
+
+// for each document, collect the set of terms made up entirely of emojis
+var emojis = [:].withDefault { [] as Set }
+for (docId in 0..<reader.maxDoc()) {
+    String name = storedFields.document(docId).get('name')
+    TermsEnum terms = vectors.get(docId, 'content').iterator()
+    while (terms.next() != null) {
+        PostingsEnum postingsEnum = terms.postings(null, PostingsEnum.ALL)
+        while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+            var string = terms.term().utf8ToString()
+            if (string.codePoints().allMatch(Character::isEmojiPresentation)) {
+                emojis[name] += string
+            }
+        }
+    }
+}
+emojis.collect { k, v -> "$k: ${v.join(', ')}" }.each { println it }
+----
+
+When run, you should see something like this (flag emojis may not show up on some platforms):
+
+image:img/LuceneWithStandardAnalyzer.png[]
+
== Conclusion
We have analyzed the Groovy blog posts looking for referenced projects