This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 2c43bb3 additional descriptive material
2c43bb3 is described below
commit 2c43bb39a39164026678ac12eaadf06a3d8f6aa5
Author: Paul King <[email protected]>
AuthorDate: Sat Nov 23 16:34:20 2024 +1000
additional descriptive material
---
site/src/site/blog/groovy-lucene.adoc | 82 ++++++++++++++++++++++++++++++++++-
1 file changed, 80 insertions(+), 2 deletions(-)
diff --git a/site/src/site/blog/groovy-lucene.adoc
b/site/src/site/blog/groovy-lucene.adoc
index 91722aa..8299c00 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -504,6 +504,14 @@ apache commons csv (6) ██████▏
== Using Lucene Facets
+As well as the metadata Lucene stores for its own purposes in the index,
+it provides a mechanism for storing custom metadata called facets.
+If we wanted to, we could store referenced project names using this mechanism.
+
+Let's use our regex to find project names and store the information in various facets.
+Lucene has a special taxonomy index which stores metadata about our metadata.
+We'll also enable that.
+
[source,groovy]
----
var analyzer = new ProjectNameAnalyzer()
@@ -546,6 +554,8 @@ indexWriter.close()
taxonWriter.close()
----
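+
+For reference, enabling the taxonomy index and storing a facet generally has
+the following shape (a hypothetical sketch only, assuming the `indexWriter`
+from the surrounding code; the directory choice, dimension name, and variable
+names are illustrative, not the post's exact code):
+
+[source,groovy]
+----
+// hypothetical sketch of taxonomy-backed facet indexing
+var taxonDir = new ByteBuffersDirectory()
+var taxonWriter = new DirectoryTaxonomyWriter(taxonDir)
+var config = new FacetsConfig()
+config.setHierarchical('projectNameCounts', true) // e.g. apache -> commons -> math
+var doc = new Document()
+doc.add(new FacetField('projectNameCounts', 'apache', 'commons', 'math'))
+indexWriter.addDocument(config.build(taxonWriter, doc))
+----
+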
+Since we are collecting this data during indexing, we can print it out.
+
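+One way to do that (a minimal sketch, assuming a hypothetical `docProjects`
+map from file name to per-project occurrence counts, accumulated while
+indexing) might look like:
+
+[source,groovy]
+----
+// hypothetical: docProjects maps each file name to a map of
+// project name -> occurrence count gathered during indexing
+docProjects.each { name, counts ->
+    println "$name: [${counts.collect { p, n -> "$p:$n" }.join(', ')}]"
+}
+----
+
+The output looks like this:
+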
++++
<pre>
apache-nlpcraft-with-groovy.adoc: [apache nlpcraft:5]
@@ -580,6 +590,8 @@ zipping-collections-with-groovy.adoc: [eclipse collections:4]
</pre>
++++
+Now when doing searches, we can extract the taxonomy information along with other info.
+With `projectHitCounts` we can gather the taxonomy metadata for the top hits from our search.
[source,groovy]
----
@@ -609,6 +621,8 @@ hitCounts.sort{ m -> -m.files }.each { m ->
----
+When running this we can see the frequencies for the total hits and number of files:
+
// entered below so that we don't hit this whole table as a bunch of references
++++
<pre>
@@ -629,6 +643,11 @@ apache mxnet (1) ██▏
</pre>
++++
+NOTE: At the time of writing, there is a bug in sorting for the second of these graphs.
+A https://github.com/apache/lucene/issues/14008[fix] is coming.
+
+Now, the file counts shown above are taxonomy information for the selected top hits,
+where selection was based on the number of hits.
+One of our other facets (`projectFileCounts`) lets us instead look at the top
+frequencies of references in files.
+Let's look at how we can query that information:
[source,groovy]
----
@@ -639,6 +658,8 @@ var fileCounts = facets.getTopChildren(topN, "projectFileCounts")
println fileCounts
----
+The output looks like this:
+
++++
<pre>
Frequency of documents mentioning a project (top 5):
@@ -647,11 +668,21 @@ dim=projectFileCounts path=[] value=-1 childCount=27
apache commons math (7)
apache spark (5)
apache ignite (4)
- apache commons csv (4)
+ apache commons csv (4)
</pre>
++++
+When comparing this result with the result from our previous facet,
+we can see that commons csv is mentioned in more files than mxnet,
+even though mxnet is mentioned more times.
+
+Our final facet (`projectNameCounts`) is a hierarchical facet.
+These are typically used interactively when "browsing" search results.
+We can look at project names by first word, e.g. the foundation.
+We could then drill down into "Apache" and find referenced projects,
+and then in the case of commons, we could drill down into its subprojects.
+Here is the code which does that:
+
[source,groovy]
----
['apache', 'commons'].inits().reverseEach { path ->
@@ -661,6 +692,8 @@ dim=projectFileCounts path=[] value=-1 childCount=27
}
----
+The output looks like this:
+
++++
<pre>
Frequency of documents mentioning a project with path [] (top 5):
@@ -687,6 +720,9 @@ dim=projectNameCounts path=[apache, commons] value=-1 childCount=7
</pre>
++++
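+For truly interactive drill-down, Lucene also offers `DrillDownQuery`.
+As a minimal sketch (assuming the `config`, `searcher`, and taxonomy
+variables used earlier; this is illustrative, not the code behind the
+output above), restricting a search to the "apache" branch might look like:
+
+[source,groovy]
+----
+// hypothetical sketch: drill down into facet values under "apache"
+var drillDown = new DrillDownQuery(config)
+drillDown.add('projectNameCounts', 'apache')
+var topDocs = searcher.search(drillDown, 10)
+println "Documents mentioning an apache project: $topDocs.totalHits"
+----
+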
+We now have a taxonomy index as well as the normal one,
+so we can still do ad hoc queries which might just use the latter.
+
[source,groovy]
----
var parser = new QueryParser("content", analyzer)
@@ -697,12 +733,15 @@ assert results.totalHits.value() == 1 &&
storedFields.document(results.scoreDocs[0].doc).get('name') == 'fruity-eclipse-collections.adoc'
----
+This query shows that there is exactly one blog post that mentions
+Apache projects, Eclipse projects and also emojis.
+
== More complex queries
As a final example, we chose earlier to extract project names at index time.
We could have instead used the normal analyzer at the cost of needing more complex span queries to pull out our project names at search time.
-Let's have a look at what the could for that scenario could look like.
+Let's have a look at what the code for that scenario could look like.
First, we'll do indexing with the `StandardAnalyzer`.
@@ -775,6 +814,45 @@ When we run this we see the same number of hits as before:
Total documents with hits for
(spanNear([SpanMultiTermQueryWrapper(content:/(apache|eclipse)/),
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pekko|hugegraph|tinkerpop|commons|cli|opennlp|ofbiz|codec|kie|flink)/)],
0, true) spanNear([content:apache, content:commons,
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pek
[...]
----
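+
+The query behind output like that might be built roughly as follows
+(a hedged sketch with the long project-name alternation shortened;
+the output above shows the full regex):
+
+[source,groovy]
+----
+// hypothetical sketch: match "apache" or "eclipse" immediately followed
+// by a known project word, using span queries over the standard index
+var wrap = { String regex ->
+    new SpanMultiTermQueryWrapper(new RegexpQuery(new Term('content', regex)))
+}
+var query = new SpanNearQuery([wrap('(apache|eclipse)'),
+                               wrap('(math|spark|lucene|collections)')] as SpanQuery[],
+                               0, true) // slop 0, in order
+----
+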
+Using the `StandardAnalyzer` has numerous advantages.
+It has baked-in support for stop words, smart word breaking, lowercasing,
+and other features. Other built-in analyzers might also be useful.
+We could, of course, also make our regex-based analyzer smarter.
+Many of Lucene's features come as reusable pieces.
+
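+As a quick illustration, here is a small sketch (with hypothetical sample
+text) that prints the tokens the `StandardAnalyzer` produces; note the
+word breaking and lowercasing:
+
+[source,groovy]
+----
+// print each token produced for a sample string
+var standard = new StandardAnalyzer()
+var stream = standard.tokenStream('content', 'Apache Commons Math is fun!')
+var term = stream.addAttribute(CharTermAttribute)
+stream.reset()
+while (stream.incrementToken()) {
+    println term   // apache, commons, math, is, fun
+}
+stream.end()
+stream.close()
+----
+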
+Another advantage of the `StandardAnalyzer` is that it properly handles emojis in our index.
+Our regex analyzer in its current form only looks for "word" characters,
+which doesn't work with emoji characters, although it could be expanded to support them.
+
+Given that we've used the `StandardAnalyzer` here, let's again look at terms
+in our index, but this time pull out emojis instead of project names:
+
+[source,groovy]
+----
+var vectors = reader.termVectors()
+var storedFields = reader.storedFields()
+
+// for each document, collect the set of terms made up entirely of emojis
+var emojis = [:].withDefault { [] as Set }
+for (docId in 0..<reader.maxDoc()) {
+    String name = storedFields.document(docId).get('name')
+    TermsEnum terms = vectors.get(docId, 'content').iterator()
+    while (terms.next() != null) {
+        PostingsEnum postingsEnum = terms.postings(null, PostingsEnum.ALL)
+        while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+            var string = terms.term().utf8ToString()
+            if (string.codePoints().allMatch(Character::isEmojiPresentation)) {
+                emojis[name] += string
+            }
+        }
+    }
+}
+emojis.collect { k, v -> "$k: ${v.join(', ')}" }.each { println it }
+----
+
+When run, you should see something like this (flag emojis may not show up on some platforms):
+
+image:img/LuceneWithStandardAnalyzer.png[]
+
== Conclusion
We have analyzed the Groovy blog posts looking for referenced projects