This is an automated email from the ASF dual-hosted git repository.

paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git

commit e37adf7b4c8c072aae523725a035dc8dba24f232
Author: Paul King <[email protected]>
AuthorDate: Fri Nov 22 21:08:04 2024 +1000

    additional examples
---
 site/src/site/blog/groovy-lucene.adoc | 649 ++++++++++++++++++++++++----------
 1 file changed, 457 insertions(+), 192 deletions(-)

diff --git a/site/src/site/blog/groovy-lucene.adoc 
b/site/src/site/blog/groovy-lucene.adoc
index ee877b7..e31fc0e 100644
--- a/site/src/site/blog/groovy-lucene.adoc
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -6,7 +6,11 @@ Paul King
 :description: This post looks at using Lucene to find references to other 
projects in Groovy's blog posts.
 
 The Groovy https://groovy.apache.org/blog/[blog posts] often reference other 
Apache projects.
-Let's have a look at how we can find such references, first using regular 
expressions
+Given that these pages are published, we could use something like 
https://nutch.apache.org[Apache Nutch] or
+https://solr.apache.org[Apache Solr] to crawl/index those web pages and search 
using those tools.
+For this post, we are going to search for the
+information we require from the original source 
(https://asciidoc.org/[AsciiDoc]) files.
+We'll first look at how we can find project references using regular 
expressions
 and then using Apache Lucene.
 
 == Finding project names with a regex
@@ -29,33 +33,24 @@ so we won't.
 
 [source,groovy]
 ----
-String tokenRegex = /(?ix)           # ignore case, enable whitespace & 
comments
-    \b                               # word boundary
-    (                                # start capture of all terms
-        (                            # capture project name
-            (apache|eclipse)\s       # foundation name
-            (commons\s)?             # optional subproject name
-                (                    # capture next word unless excluded word
-                    ?!(
-                        groovy       # excluded words
-                      | and
-                      | license
-                      | users
-                      | software
-                      | projects
-                      | https
-                      | or
-                      | prefixes
-                      | technologies
-                      )
-                )\w+                 # end capture #2
-        )
-        |                            # alternatively
-        (                            # capture non-project word
-            (?!(apache|eclipse))
-            \w+
-        )                            # end capture #3
-    )                                # end capture #1
+String tokenRegex = /(?ix)               # ignore case, enable whitespace & 
comments
+    \b                                   # word boundary
+    (                                    # start capture of all terms
+        (                                # capture project name term
+            (apache|eclipse)\s           # foundation name
+            (commons\s)?                 # optional subproject name
+            (
+                ?!(groovy                # negative lookahead for excluded 
words
+                | and   | license  | users
+                | https | projects | software
+                | or    | prefixes | technologies)
+            )\w+
+        )                                # end capture project name term
+        |                                # alternatively
+        (                                # capture non-project term
+            \w+?\b                       # non-greedily match any other words
+        )                                # end capture non-project term
+    )                                    # end capture term
 /
 ----
 
@@ -66,14 +61,24 @@ Feel free to make a compact (long) one-liner without 
comments if you prefer.
 == Finding project names using regex matching
 
 With our regex sorted, let's look at how you could use a Groovy matcher
-to find all the project names.
+to find all the project names. First we'll define one other common constant,
+the base directory for our blogs, which you might need to change if you
+are wanting to follow along and run these examples:
 
 [source,groovy]
 ----
-var blogBaseDir = 
'/projects/apache-websites/groovy-website/site/src/site/blog' // <1>
-var histogram = [:].withDefault { 0 }
+String baseDir = '/projects/apache-websites/groovy-website/site/src/site/blog' 
// <1>
+----
+<1> You'd need to check out the Groovy website and point to it here
+
+Now our script will traverse all the files in that directory, processing them 
with our regex
+and track the hits we find.
 
-new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->  // <2>
+[source,groovy]
+----
+var histogram = [:].withDefault { 0 } // <1>
+
+new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->  // <2>
     var m = file.text =~ tokenRegex // <3>
     var projects = m*.get(2).grep()*.toLowerCase()*.replaceAll('\n', ' ') // 
<4>
     var counts = projects.countBy() // <5>
@@ -83,14 +88,14 @@ new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { 
file ->  // <2>
     }
 }
 
-println()
+println "\nFrequency of total hits mentioning a project:"
 histogram.sort { e -> -e.value }.each { k, v -> // <8>
     var label = "$k ($v)"
     println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
 }
 ----
-<1> You'd need to check out the Groovy website and point to it here
-<2> This traverse the directory processing each asciidoc file
+<1> This is a map which provides a default value for non-existent keys
+<2> This traverses the directory, processing each AsciiDoc file
 <3> We define our matcher
 <4> This pulls out project names (capture group 2), ignores other words (using 
grep), converts to lowercase, and removes newlines for the case where a term 
might span over the end of a line
 <5> This aggregates the count hits for that file
@@ -105,7 +110,7 @@ The output looks like:
 <pre>
 apache-nlpcraft-with-groovy.adoc: [apache&nbsp;nlpcraft:5]
 classifying-iris-flowers-with-deep.adoc: [eclipse&nbsp;deeplearning4j:5, 
apache&nbsp;commons math:1, apache&nbsp;spark:2]
-community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1]
+community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1, apache&nbsp;spark:1, apache&nbsp;wayang:1, 
apache&nbsp;beam:1, apache&nbsp;flink:1]
 community-over-code-na-2023.adoc: [apache&nbsp;ignite:8, apache&nbsp;commons 
numbers:1, apache&nbsp;commons csv:1]
 deck-of-cards-with-groovy.adoc: [eclipse&nbsp;collections:5]
 deep-learning-and-eclipse-collections.adoc: [eclipse&nbsp;collections:7, 
eclipse&nbsp;deeplearning4j:2]
@@ -116,7 +121,7 @@ groovy-2-5-clibuilder-renewal.adoc: [apache&nbsp;commons 
cli:2]
 groovy-graph-databases.adoc: [apache&nbsp;age:11, apache&nbsp;hugegraph:3, 
apache&nbsp;tinkerpop:3]
 groovy-haiku-processing.adoc: [eclipse&nbsp;collections:3]
 groovy-list-processing-cheat-sheet.adoc: [eclipse&nbsp;collections:4, 
apache&nbsp;commons collections:3]
-groovy-lucene.adoc: [apache&nbsp;lucene:2, apache&nbsp;commons:1, 
apache&nbsp;commons math:2]
+groovy-lucene.adoc: [apache&nbsp;nutch:1, apache&nbsp;solr:1, 
apache&nbsp;lucene:2, apache&nbsp;commons:1, apache&nbsp;commons math:2]
 groovy-null-processing.adoc: [eclipse&nbsp;collections:6, apache&nbsp;commons 
collections:4]
 groovy-pekko-gpars.adoc: [apache&nbsp;pekko:4]
 groovy-record-performance.adoc: [apache&nbsp;commons codec:1]
@@ -124,7 +129,7 @@ handling-byte-order-mark-characters.adoc: 
[apache&nbsp;commons io:1]
 lego-bricks-with-groovy.adoc: [eclipse&nbsp;collections:6]
 matrix-calculations-with-groovy-apache.adoc: [apache&nbsp;commons math:6, 
eclipse&nbsp;deeplearning4j:1, apache&nbsp;commons:1]
 natural-language-processing-with-groovy.adoc: [apache&nbsp;opennlp:2, 
apache&nbsp;spark:1]
-reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:1]
+reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:2]
 set-operations-with-groovy.adoc: [eclipse&nbsp;collections:3]
 solving-simple-optimization-problems-with-groovy.adoc: [apache&nbsp;commons 
math:5, apache&nbsp;kie:1]
 using-groovy-with-apache-wayang.adoc: [apache&nbsp;wayang:9, 
apache&nbsp;spark:7, apache&nbsp;flink:1, apache&nbsp;commons csv:1, 
apache&nbsp;ignite:1]
@@ -132,36 +137,40 @@ whiskey-clustering-with-groovy-and.adoc: 
[apache&nbsp;ignite:7, apache&nbsp;waya
 wordle-checker.adoc: [eclipse&nbsp;collections:3]
 zipping-collections-with-groovy.adoc: [eclipse&nbsp;collections:4]
 
+Frequency of total hits mentioning a project:
 eclipse&nbsp;collections (50)         
██████████████████████████████████████████████████▏
 apache&nbsp;commons math (18)         ██████████████████▏
 apache&nbsp;ignite (17)               █████████████████▏
-apache&nbsp;spark (12)                ████████████▏
+apache&nbsp;spark (13)                █████████████▏
 apache&nbsp;mxnet (12)                ████████████▏
+apache&nbsp;wayang (11)               ███████████▏
 apache&nbsp;age (11)                  ███████████▏
-apache&nbsp;wayang (10)               ██████████▏
 eclipse&nbsp;deeplearning4j (8)       ████████▏
 apache&nbsp;commons collections (7)   ███████▏
+apache&nbsp;commons csv (6)           ██████▏
 apache&nbsp;nlpcraft (5)              █████▏
-apache&nbsp;commons csv (5)           █████▏
 apache&nbsp;pekko (4)                 ████▏
 apache&nbsp;hugegraph (3)             ███▏
 apache&nbsp;tinkerpop (3)             ███▏
+apache&nbsp;flink (2)                 ██▏
 apache&nbsp;commons cli (2)           ██▏
-apache&nbsp;commons (2)               ██▏
 apache&nbsp;lucene (2)                ██▏
+apache&nbsp;commons (2)               ██▏
 apache&nbsp;opennlp (2)               ██▏
 apache&nbsp;ofbiz (1)                 █▏
+apache&nbsp;beam (1)                  █▏
 apache&nbsp;commons numbers (1)       █▏
+apache&nbsp;nutch (1)                 █▏
+apache&nbsp;solr (1)                  █▏
 apache&nbsp;commons codec (1)         █▏
 apache&nbsp;commons io (1)            █▏
 apache&nbsp;kie (1)                   █▏
-apache&nbsp;flink (1)                 █▏
 </pre>
 ++++
 
-== Using Lucene
+== Indexing with Lucene
 
-image:https://www.apache.org/logos/res/lucene/default.png[lucene 
logo,100,float="right"]
+image:https://www.apache.org/logos/res/lucene/default.png[lucene 
logo,200,float="right"]
 Okay, regular expressions weren't that hard but in general we might want to 
search many things.
 Search frameworks like Lucene help with that. Let's see what it looks like to 
apply
 Lucene to our problem.
@@ -169,15 +178,18 @@ Lucene to our problem.
 First, we'll define a custom analyzer. Lucene is very flexible and comes with 
builtin
 analyzers. In a typical scenario, we might just search on all words.
 There's a builtin analyzer for that.
-If we used that, to query for our project names,
+If we used one of the builtin analyzers, to query for our project names,
 we'd construct a query that spanned multiple (word) terms.
-For the purposes of our little example, we are going to assume project names
-are indivisible terms and slice them up that way. There is a pattern tokenizer
+We'll look at what that might look like later, but
+for the purposes of our little example, we are going to assume project names
+are indivisible terms and slice up our documents that way.
+
+Luckily, Lucene has a pattern tokenizer
 which lets us reuse our existing regex.
 
 [source,groovy]
 ----
-class ApacheProjectAnalyzer extends Analyzer {
+class ProjectNameAnalyzer extends Analyzer {
     @Override
     protected TokenStreamComponents createComponents(String fieldName) {
         var src = new PatternTokenizer(~tokenRegex, 0)
@@ -191,13 +203,195 @@ Let's now tokenize our documents and let Lucene index 
them.
 
 [source,groovy]
 ----
-var analyzer = new ApacheProjectAnalyzer() // <1>
+var analyzer = new ProjectNameAnalyzer() // <1>
 var indexDir = new ByteBuffersDirectory() // <2>
 var config = new IndexWriterConfig(analyzer)
 
-var blogBaseDir = '/projects/apache-websites/groovy-website/site/src/site/blog'
 new IndexWriter(indexDir, config).withCloseable { writer ->
-    new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+    var indexedWithFreq = new FieldType(stored: true,
+        indexOptions: IndexOptions.DOCS_AND_FREQS,
+        storeTermVectors: true)
+    new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+        file.withReader { br ->
+            var document = new Document()
+            document.add(new Field('content', br.text, indexedWithFreq)) // <3>
+            document.add(new StringField('name', file.name, Field.Store.YES)) 
// <4>
+            writer.addDocument(document)
+        }
+    }
+}
+----
+<1> This is our regex-based analyzer
+<2> We'll use a memory-based index for our little example
+<3> Store content of document along with term position info
+<4> Also store the name of the file
+
+With an index defined, we'd typically now perform some kind of search.
+We'll do just that shortly, but first for the kind of information we are 
interested in,
+part of the Lucene API lets us explore the index. Here is how we might do that:
+
+[source,groovy]
+----
+var reader = DirectoryReader.open(indexDir)
+var vectors = reader.termVectors()
+var storedFields = reader.storedFields()
+
+Set projects = []
+for (docId in 0..<reader.maxDoc()) {
+    String name = storedFields.document(docId).get('name')
+    TermsEnum terms = vectors.get(docId, 'content').iterator() // <1>
+    var found = [:]
+    while (terms.next() != null) {
+        PostingsEnum postingsEnum = terms.postings(null, PostingsEnum.ALL)
+        while (postingsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
+            int freq = postingsEnum.freq()
+            var string = terms.term().utf8ToString().replaceAll('\n', ' ')
+            if (string.startsWith('apache ') || string.startsWith('eclipse ')) 
{ // <2>
+                found[string] = freq
+            }
+        }
+    }
+    if (found) {
+        println "$name: $found"
+        projects += found.keySet()
+    }
+}
+
+var terms = projects.collect { name -> new Term('content', name) }
+var byReverseValue = { e -> -e.value }
+
+println "\nFrequency of total hits mentioning a project (top 10):"
+var termFreq = terms.collectEntries { term -> [term.text(), 
reader.totalTermFreq(term)] } // <3>
+termFreq.sort(byReverseValue).take(10).each { k, v ->
+    var label = "$k ($v)"
+    println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
+}
+
+println "\nFrequency of documents mentioning a project (top 10):"
+var docFreq = terms.collectEntries { term -> [term.text(), 
reader.docFreq(term)] } // <4>
+docFreq.sort(byReverseValue).take(10).each { k, v ->
+    var label = "$k ($v)"
+    println "${label.padRight(32)} ${bar(v * 2, 0, 20, 20)}"
+}
+----
+<1> Get all index terms
+<2> Look for terms which match project names, so we can save them to a set
+<3> Grab hit frequency metadata for our term
+<4> Grab document frequency metadata for our term
+
+When we run this we see:
+
+// &nbsp; entered below so that we don't hit this whole table as a bunch of 
references
+++++
+<pre>
+apache-nlpcraft-with-groovy.adoc: [apache&nbsp;nlpcraft:5]
+classifying-iris-flowers-with-deep.adoc: [apache&nbsp;commons math:1, 
apache&nbsp;spark:2, eclipse&nbsp;deeplearning4j:5]
+community-over-code-eu-2024.adoc: [apache&nbsp;beam:1, apache&nbsp;commons 
math:2, apache&nbsp;flink:1, apache&nbsp;ignite:1, apache&nbsp;ofbiz:1, 
apache&nbsp;spark:1, apache&nbsp;wayang:1]
+community-over-code-na-2023.adoc: [apache&nbsp;commons csv:1, 
apache&nbsp;commons numbers:1, apache&nbsp;ignite:8]
+deck-of-cards-with-groovy.adoc: [eclipse&nbsp;collections:5]
+deep-learning-and-eclipse-collections.adoc: [eclipse&nbsp;collections:7, 
eclipse&nbsp;deeplearning4j:2]
+detecting-objects-with-groovy-the.adoc: [apache&nbsp;mxnet:12]
+fruity-eclipse-collections.adoc: [apache&nbsp;commons math:1, 
eclipse&nbsp;collections:9]
+fun-with-obfuscated-groovy.adoc: [apache&nbsp;commons math:1]
+groovy-2-5-clibuilder-renewal.adoc: [apache&nbsp;commons cli:2]
+groovy-graph-databases.adoc: [apache&nbsp;age:11, apache&nbsp;hugegraph:3, 
apache&nbsp;tinkerpop:3]
+groovy-haiku-processing.adoc: [eclipse&nbsp;collections:3]
+groovy-list-processing-cheat-sheet.adoc: [apache&nbsp;commons collections:3, 
eclipse&nbsp;collections:4]
+groovy-lucene.adoc: [apache&nbsp;commons:1, apache&nbsp;commons math:2, 
apache&nbsp;lucene:2, apache&nbsp;nutch:1, apache&nbsp;solr:1]
+groovy-null-processing.adoc: [apache&nbsp;commons collections:4, 
eclipse&nbsp;collections:6]
+groovy-pekko-gpars.adoc: [apache&nbsp;pekko:4]
+groovy-record-performance.adoc: [apache&nbsp;commons codec:1]
+handling-byte-order-mark-characters.adoc: [apache&nbsp;commons io:1]
+lego-bricks-with-groovy.adoc: [eclipse&nbsp;collections:6]
+matrix-calculations-with-groovy-apache.adoc: [apache&nbsp;commons:1, 
apache&nbsp;commons math:6, eclipse&nbsp;deeplearning4j:1]
+natural-language-processing-with-groovy.adoc: [apache&nbsp;opennlp:2, 
apache&nbsp;spark:1]
+reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:2]
+set-operations-with-groovy.adoc: [eclipse&nbsp;collections:3]
+solving-simple-optimization-problems-with-groovy.adoc: [apache&nbsp;commons 
math:4, apache&nbsp;kie:1]
+using-groovy-with-apache-wayang.adoc: [apache&nbsp;commons csv:1, 
apache&nbsp;flink:1, apache&nbsp;ignite:1, apache&nbsp;spark:7, 
apache&nbsp;wayang:9]
+whiskey-clustering-with-groovy-and.adoc: [apache&nbsp;commons csv:2, 
apache&nbsp;ignite:7, apache&nbsp;spark:2, apache&nbsp;wayang:1]
+wordle-checker.adoc: [eclipse&nbsp;collections:3]
+zipping-collections-with-groovy.adoc: [eclipse&nbsp;collections:4]
+
+Frequency of total hits mentioning a project (top 10):
+eclipse&nbsp;collections (50)         
██████████████████████████████████████████████████▏
+apache&nbsp;commons math (17)         █████████████████▏
+apache&nbsp;ignite (17)               █████████████████▏
+apache&nbsp;spark (13)                █████████████▏
+apache&nbsp;mxnet (12)                ████████████▏
+apache&nbsp;wayang (11)               ███████████▏
+apache&nbsp;age (11)                  ███████████▏
+eclipse&nbsp;deeplearning4j (8)       ████████▏
+apache&nbsp;commons collections (7)   ███████▏
+apache&nbsp;commons csv (6)           ██████▏
+
+Frequency of documents mentioning a project (top 10):
+eclipse&nbsp;collections (10)         ████████████████████▏
+apache&nbsp;commons math (7)          ██████████████▏
+apache&nbsp;spark (5)                 ██████████▏
+apache&nbsp;ignite (4)                ████████▏
+apache&nbsp;commons csv (4)           ████████▏
+eclipse&nbsp;deeplearning4j (3)       ██████▏
+apache&nbsp;wayang (3)                ██████▏
+apache&nbsp;flink (2)                 ████▏
+apache&nbsp;commons collections (2)   ████▏
+apache&nbsp;commons (2)               ████▏
+
+</pre>
+++++
+
+So far, we have just displayed curated metadata about our index.
+But just to show that we have an index that supports searching,
+let's look for all documents which mention emojis.
+They often make programming examples a lot of fun!
+
+[source,groovy]
+----
+var parser = new QueryParser("content", analyzer)
+var searcher = new IndexSearcher(reader)
+var query = parser.parse('emoji*')
+var results = searcher.search(query, 10)
+println "\nTotal documents with hits for $query --> $results.totalHits"
+results.scoreDocs.each {
+    var doc = storedFields.document(it.doc)
+    println "${doc.get('name')}"
+}
+----
+
+When we run this we see:
+
+----
+Total documents with hits for content:emoji* --> 11 hits
+adventures-with-groovyfx.adoc
+create-groovy-blog.adoc
+deep-learning-and-eclipse-collections.adoc
+fruity-eclipse-collections.adoc
+groovy-haiku-processing.adoc
+groovy-lucene.adoc
+helloworldemoji.adoc
+seasons-greetings-emoji.adoc
+set-operations-with-groovy.adoc
+solving-simple-optimization-problems-with-groovy.adoc
+----
+
+Lucene has a very rich API. Let's now look at some alternative
+ways we could use Lucene.
+
+Rather than exploring index metadata, we'd more typically run queries
+and explore those results. We'll look at how to do that now.
+When exploring query results, we are going to use some classes in the 
`vectorhighlight`
+package in the `lucene-highlight` module. You'd typically use functionality in 
that
module to highlight hits when displaying them on a web page
as part of some web search functionality. For us, we are going to just
+pick out the terms of interest, project names that match our query.
+
+For the highlight functionality to work, we ask the indexer to store some 
additional information
+about term positions when indexing. The index code changes to look like this:
+
+[source,groovy]
+----
+new IndexWriter(indexDir, config).withCloseable { writer ->
+    new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
         file.withReader { br ->
             var document = new Document()
             var fieldType = new FieldType(stored: true,
@@ -205,70 +399,73 @@ new IndexWriter(indexDir, config).withCloseable { writer 
->
                 storeTermVectors: true,
                 storeTermVectorPositions: true,
                 storeTermVectorOffsets: true)
-            document.add(new Field('content', br.text, fieldType)) // <3>
-            document.add(new StringField('name', file.name, Field.Store.YES)) 
// <4>
+            document.add(new Field('content', br.text, fieldType))
+            document.add(new StringField('name', file.name, Field.Store.YES))
             writer.addDocument(document)
         }
     }
 }
+----
 
-var reader = DirectoryReader.open(indexDir)
-var searcher = new IndexSearcher(reader)
-var parser = new QueryParser("content", analyzer)
+We could have stored this additional information even for our previous example,
+but it wasn't needed previously.
 
-var query = parser.parse('apache* OR eclipse*') // <5>
-var results = searcher.search(query, 30) // <6>
-println "Total documents with hits for $query --> $results.totalHits"
+Next, we define a helper method to extract the actual project names from 
matches:
+
+[source,groovy]
+----
+List<String> handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader) {
+    boolean phraseHighlight = true
+    boolean fieldMatch = true
+    var fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, 
fieldMatch)
+    var stack = new FieldTermStack(dirReader, hit.doc, 'content', fieldQuery)
+    var phrases = new FieldPhraseList(stack, fieldQuery)
+    phrases.phraseList*.termsInfos*.text.flatten()
+}
+----
+
+[source,groovy]
+----
+var query = parser.parse(/apache\ * OR eclipse\ */) // <1>
+var results = searcher.search(query, 30) // <2>
+println "Total documents with hits for $query --> $results.totalHits\n"
 
 var storedFields = searcher.storedFields()
 var histogram = [:].withDefault { 0 }
-results.scoreDocs.each { ScoreDoc doc -> // <7>
-    var document = storedFields.document(doc.doc)
-    var found = handleHit(doc, query, reader) // <8>
-    println "${document.get('name')}: ${found*.replaceAll('\n', ' 
').countBy()}"
-    found.each { histogram[it.replaceAll('\n', ' ')] += 1 } // <9>
+results.scoreDocs.each { ScoreDoc scoreDoc -> // <3>
+    var doc = storedFields.document(scoreDoc.doc)
+    var found = handleHit(scoreDoc, query, reader) // <4>
+    println "${doc.get('name')}: ${found*.replaceAll('\n', ' ').countBy()}"
+    found.each { histogram[it.replaceAll('\n', ' ')] += 1 } // <5>
 }
-println()
 
-histogram.sort { e -> -e.value }.each { k, v -> // <10>
+println "\nFrequency of total hits mentioning a project:"
+histogram.sort { e -> -e.value }.each { k, v -> // <6>
     var label = "$k ($v)"
     println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
 }
-
-List<String> handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader) { 
// <11>
-    boolean phraseHighlight = true
-    boolean fieldMatch = true
-    FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, 
fieldMatch)
-    FieldTermStack stack = new FieldTermStack(dirReader, hit.doc, 'content', 
fieldQuery)
-    FieldPhraseList phrases = new FieldPhraseList(stack, fieldQuery)
-    phrases.phraseList*.termsInfos*.text.flatten()
-}
 ----
-<1> This is our regex-based analyzer
-<2> We'll use a memory-based index for our little example
-<3> Store content of document along with term position info
-<4> Also store the name of the file
-<5> Search for terms with the apache or eclipse prefixes
-<6> Perform our query with a limit of 30 results
-<7> Process each result
-<8> Pull out the actual matched terms
-<9> Also aggregate the counts
-<10> Display the aggregates as a pretty barchart
-<11> Helper method
+<1> Search for terms with the apache or eclipse prefixes
+<2> Perform our query with a limit of 30 results
+<3> Process each result
+<4> Pull out the actual matched terms
+<5> Also aggregate the counts
+<6> Display the aggregates as a pretty barchart
 
 The output is essentially the same as before:
 
 // &nbsp; used instead of space below so that we don't hit this whole table as 
a bunch of project references
 ++++
 <pre>
-Total documents with hits for content:apache* content:eclipse* --> 28 hits
+Total documents with hits for content:apache&nbsp;* content:eclipse&nbsp;* --> 
28 hits
+
 classifying-iris-flowers-with-deep.adoc: [eclipse&nbsp;deeplearning4j:5, 
apache&nbsp;commons math:1, apache&nbsp;spark:2]
 fruity-eclipse-collections.adoc: [eclipse&nbsp;collections:9, 
apache&nbsp;commons math:1]
 groovy-list-processing-cheat-sheet.adoc: [eclipse&nbsp;collections:4, 
apache&nbsp;commons collections:3]
 groovy-null-processing.adoc: [eclipse&nbsp;collections:6, apache&nbsp;commons 
collections:4]
 matrix-calculations-with-groovy-apache.adoc: [apache&nbsp;commons math:6, 
eclipse&nbsp;deeplearning4j:1, apache&nbsp;commons:1]
 apache-nlpcraft-with-groovy.adoc: [apache&nbsp;nlpcraft:5]
-community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1]
+community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1, apache&nbsp;spark:1, apache&nbsp;wayang:1, 
apache&nbsp;beam:1, apache&nbsp;flink:1]
 community-over-code-na-2023.adoc: [apache&nbsp;ignite:8, apache&nbsp;commons 
numbers:1, apache&nbsp;commons csv:1]
 deck-of-cards-with-groovy.adoc: [eclipse&nbsp;collections:5]
 deep-learning-and-eclipse-collections.adoc: [eclipse&nbsp;collections:7, 
eclipse&nbsp;deeplearning4j:2]
@@ -277,13 +474,13 @@ fun-with-obfuscated-groovy.adoc: [apache&nbsp;commons 
math:1]
 groovy-2-5-clibuilder-renewal.adoc: [apache&nbsp;commons cli:2]
 groovy-graph-databases.adoc: [apache&nbsp;age:11, apache&nbsp;hugegraph:3, 
apache&nbsp;tinkerpop:3]
 groovy-haiku-processing.adoc: [eclipse&nbsp;collections:3]
-groovy-lucene.adoc: [apache&nbsp;lucene:2, apache&nbsp;commons:1, 
apache&nbsp;commons math:2]
+groovy-lucene.adoc: [apache&nbsp;nutch:1, apache&nbsp;solr:1, 
apache&nbsp;lucene:2, apache&nbsp;commons:1, apache&nbsp;commons math:2]
 groovy-pekko-gpars.adoc: [apache&nbsp;pekko:4]
 groovy-record-performance.adoc: [apache&nbsp;commons codec:1]
 handling-byte-order-mark-characters.adoc: [apache&nbsp;commons io:1]
 lego-bricks-with-groovy.adoc: [eclipse&nbsp;collections:6]
 natural-language-processing-with-groovy.adoc: [apache&nbsp;opennlp:2, 
apache&nbsp;spark:1]
-reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:1]
+reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:2]
 set-operations-with-groovy.adoc: [eclipse&nbsp;collections:3]
 solving-simple-optimization-problems-with-groovy.adoc: [apache&nbsp;commons 
math:5, apache&nbsp;kie:1]
 using-groovy-with-apache-wayang.adoc: [apache&nbsp;wayang:9, 
apache&nbsp;spark:7, apache&nbsp;flink:1, apache&nbsp;commons csv:1, 
apache&nbsp;ignite:1]
@@ -291,30 +488,17 @@ whiskey-clustering-with-groovy-and.adoc: 
[apache&nbsp;ignite:7, apache&nbsp;waya
 wordle-checker.adoc: [eclipse&nbsp;collections:3]
 zipping-collections-with-groovy.adoc: [eclipse&nbsp;collections:4]
 
+Frequency of total hits mentioning a project (top 10):
 eclipse&nbsp;collections (50)         
██████████████████████████████████████████████████▏
 apache&nbsp;commons math (18)         ██████████████████▏
 apache&nbsp;ignite (17)               █████████████████▏
-apache&nbsp;spark (12)                ████████████▏
+apache&nbsp;spark (13)                █████████████▏
 apache&nbsp;mxnet (12)                ████████████▏
+apache&nbsp;wayang (11)               ███████████▏
 apache&nbsp;age (11)                  ███████████▏
-apache&nbsp;wayang (10)               ██████████▏
 eclipse&nbsp;deeplearning4j (8)       ████████▏
 apache&nbsp;commons collections (7)   ███████▏
-apache&nbsp;nlpcraft (5)              █████▏
-apache&nbsp;commons csv (5)           █████▏
-apache&nbsp;pekko (4)                 ████▏
-apache&nbsp;hugegraph (3)             ███▏
-apache&nbsp;tinkerpop (3)             ███▏
-apache&nbsp;commons (2)               ██▏
-apache&nbsp;commons cli (2)           ██▏
-apache&nbsp;lucene (2)                ██▏
-apache&nbsp;opennlp (2)               ██▏
-apache&nbsp;ofbiz (1)                 █▏
-apache&nbsp;commons numbers (1)       █▏
-apache&nbsp;commons codec (1)         █▏
-apache&nbsp;commons io (1)            █▏
-apache&nbsp;kie (1)                   █▏
-apache&nbsp;flink (1)                 █▏
+apache&nbsp;commons csv (6)           ██████▏
 </pre>
 ++++
 
@@ -322,7 +506,7 @@ apache&nbsp;flink (1)                 █▏
 
 [source,groovy]
 ----
-var analyzer = new ApacheProjectAnalyzer()
+var analyzer = new ProjectNameAnalyzer()
 var indexDir = new ByteBuffersDirectory()
 var taxonDir = new ByteBuffersDirectory()
 var config = new IndexWriterConfig(analyzer)
@@ -337,18 +521,15 @@ var fConfig = new FacetsConfig().tap {
     setIndexFieldName('projectHitCounts', '$projectHitCounts')
 }
 
-var blogBaseDir = '/projects/apache-websites/groovy-website/site/src/site/blog'
-new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
     var m = file.text =~ tokenRegex
     var projects = m*.get(2).grep()*.toLowerCase()*.replaceAll('\n', ' 
').countBy()
     file.withReader { br ->
         var document = new Document()
-        var fieldType = new FieldType(stored: true,
-            indexOptions: 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
-            storeTermVectors: true,
-            storeTermVectorPositions: true,
-            storeTermVectorOffsets: true)
-        document.add(new Field('content', br.text, fieldType))
+        var indexedWithFreq = new FieldType(stored: true,
+            indexOptions: IndexOptions.DOCS_AND_FREQS,
+            storeTermVectors: true)
+        document.add(new Field('content', br.text, indexedWithFreq))
         document.add(new StringField('name', file.name, Field.Store.YES))
         if (projects) {
             println "$file.name: $projects"
@@ -363,46 +544,13 @@ new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { 
file ->
 }
 indexWriter.close()
 taxonWriter.close()
-println()
-
-var reader = DirectoryReader.open(indexDir)
-var searcher = new IndexSearcher(reader)
-var taxonReader = new DirectoryTaxonomyReader(taxonDir)
-var fcm = new FacetsCollectorManager()
-var fc = FacetsCollectorManager.search(searcher, new MatchAllDocsQuery(), 10, 
fcm).facetsCollector()
-
-var projects = new TaxonomyFacetIntAssociations('$projectHitCounts', 
taxonReader, fConfig, fc, AssociationAggregationFunction.SUM)
-var hitCounts = projects.getTopChildren(10, "projectHitCounts")
-println hitCounts
-
-var facets = new FastTaxonomyFacetCounts(taxonReader, fConfig, fc)
-var fileCounts = facets.getTopChildren(10, "projectFileCounts")
-println fileCounts
-
-var nameCounts = facets.getTopChildren(10, "projectNameCounts")
-println nameCounts
-nameCounts = facets.getTopChildren(10, "projectNameCounts", 'apache')
-println nameCounts
-nameCounts = facets.getTopChildren(10, "projectNameCounts", 'apache', 
'commons')
-println nameCounts
-
-var parser = new QueryParser("content", analyzer)
-var query = parser.parse('apache* AND eclipse*')
-var results = searcher.search(query, 10)
-println "Total documents with hits for $query --> $results.totalHits"
-var storedFields = searcher.storedFields()
-results.scoreDocs.each { ScoreDoc doc ->
-    var document = storedFields.document(doc.doc)
-    println "${document.get('name')}"
-}
 ----
 
-// &nbsp; entered below so that we don't hit this whole table as a bunch of 
references
 ++++
 <pre>
 apache-nlpcraft-with-groovy.adoc: [apache&nbsp;nlpcraft:5]
 classifying-iris-flowers-with-deep.adoc: [eclipse&nbsp;deeplearning4j:5, 
apache&nbsp;commons math:1, apache&nbsp;spark:2]
-community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1]
+community-over-code-eu-2024.adoc: [apache&nbsp;ofbiz:1, apache&nbsp;commons 
math:2, apache&nbsp;ignite:1, apache&nbsp;spark:1, apache&nbsp;wayang:1, 
apache&nbsp;beam:1, apache&nbsp;flink:1]
 community-over-code-na-2023.adoc: [apache&nbsp;ignite:8, apache&nbsp;commons 
numbers:1, apache&nbsp;commons csv:1]
 deck-of-cards-with-groovy.adoc: [eclipse&nbsp;collections:5]
 deep-learning-and-eclipse-collections.adoc: [eclipse&nbsp;collections:7, 
eclipse&nbsp;deeplearning4j:2]
@@ -413,7 +561,7 @@ groovy-2-5-clibuilder-renewal.adoc: [apache&nbsp;commons 
cli:2]
 groovy-graph-databases.adoc: [apache&nbsp;age:11, apache&nbsp;hugegraph:3, 
apache&nbsp;tinkerpop:3]
 groovy-haiku-processing.adoc: [eclipse&nbsp;collections:3]
 groovy-list-processing-cheat-sheet.adoc: [eclipse&nbsp;collections:4, 
apache&nbsp;commons collections:3]
-groovy-lucene.adoc: [apache&nbsp;lucene:2, apache&nbsp;commons:1, 
apache&nbsp;commons math:2]
+groovy-lucene.adoc: [apache&nbsp;nutch:1, apache&nbsp;solr:1, 
apache&nbsp;lucene:2, apache&nbsp;commons:1, apache&nbsp;commons math:2]
 groovy-null-processing.adoc: [eclipse&nbsp;collections:6, apache&nbsp;commons 
collections:4]
 groovy-pekko-gpars.adoc: [apache&nbsp;pekko:4]
 groovy-record-performance.adoc: [apache&nbsp;commons codec:1]
@@ -421,7 +569,7 @@ handling-byte-order-mark-characters.adoc: 
[apache&nbsp;commons io:1]
 lego-bricks-with-groovy.adoc: [eclipse&nbsp;collections:6]
 matrix-calculations-with-groovy-apache.adoc: [apache&nbsp;commons math:6, 
eclipse&nbsp;deeplearning4j:1, apache&nbsp;commons:1]
 natural-language-processing-with-groovy.adoc: [apache&nbsp;opennlp:2, 
apache&nbsp;spark:1]
-reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:1]
+reading-and-writing-csv-files.adoc: [apache&nbsp;commons csv:2]
 set-operations-with-groovy.adoc: [eclipse&nbsp;collections:3]
 solving-simple-optimization-problems-with-groovy.adoc: [apache&nbsp;commons 
math:5, apache&nbsp;kie:1]
 using-groovy-with-apache-wayang.adoc: [apache&nbsp;wayang:9, 
apache&nbsp;spark:7, apache&nbsp;flink:1, apache&nbsp;commons csv:1, 
apache&nbsp;ignite:1]
@@ -429,64 +577,181 @@ whiskey-clustering-with-groovy-and.adoc: 
[apache&nbsp;ignite:7, apache&nbsp;waya
 wordle-checker.adoc: [eclipse&nbsp;collections:3]
 zipping-collections-with-groovy.adoc: [eclipse&nbsp;collections:4]
 
-dim=projectHitCounts path=[] value=-1 childCount=24
-  eclipse&nbsp;collections (50)
-  apache&nbsp;commons math (18)
-  apache&nbsp;ignite (17)
-  apache&nbsp;spark (12)
-  apache&nbsp;mxnet (12)
-  apache&nbsp;age (11)
-  apache&nbsp;wayang (10)
-  eclipse&nbsp;deeplearning4j (8)
-  apache&nbsp;commons collections (7)
-  apache&nbsp;nlpcraft (5)
-
-dim=projectFileCounts path=[] value=-1 childCount=24
+</pre>
+++++
+
+
+[source,groovy]
+----
+var reader = DirectoryReader.open(indexDir)
+var searcher = new IndexSearcher(reader)
+var taxonReader = new DirectoryTaxonomyReader(taxonDir)
+var fcm = new FacetsCollectorManager()
+var fc = FacetsCollectorManager.search(searcher, new MatchAllDocsQuery(), 0, 
fcm).facetsCollector()
+
+var topN = 5
+var projects = new TaxonomyFacetIntAssociations('$projectHitCounts', 
taxonReader, fConfig, fc, AssociationAggregationFunction.SUM)
+var hitCounts = projects.getTopChildren(topN, 
"projectHitCounts").labelValues.collect{
+    [label: it.label, hits: it.value, files: it.count]
+}
+
+println "\nFrequency of total hits mentioning a project (top $topN):"
+hitCounts.sort{ m -> -m.hits }.each { m ->
+    var label = "$m.label ($m.hits)"
+    println "${label.padRight(32)} ${bar(m.hits, 0, 50, 50)}"
+}
+
+println "\nFrequency of documents mentioning a project (top $topN):"
+hitCounts.sort{ m -> -m.files }.each { m ->
+    var label = "$m.label ($m.files)"
+    println "${label.padRight(32)} ${bar(m.files * 2, 0, 20, 20)}"
+}
+
+----
+
+// &nbsp; entered below so that we don't hit this whole table as a bunch of 
references
+++++
+<pre>
+Frequency of total hits mentioning a project (top 5):
+eclipse&nbsp;collections (50)         
██████████████████████████████████████████████████▏
+apache&nbsp;commons math (18)         ██████████████████▏
+apache&nbsp;ignite (17)               █████████████████▏
+apache&nbsp;spark (13)                █████████████▏
+apache&nbsp;mxnet (12)                ████████████▏
+
+Frequency of documents mentioning a project (top 5):
+eclipse&nbsp;collections (10)         ████████████████████▏
+apache&nbsp;commons math (7)          ██████████████▏
+apache&nbsp;spark (5)                 ██████████▏
+apache&nbsp;ignite (4)                ████████▏
+apache&nbsp;mxnet (1)                 ██▏
+
+</pre>
+++++
+
+
+[source,groovy]
+----
+var facets = new FastTaxonomyFacetCounts(taxonReader, fConfig, fc)
+
+println "\nFrequency of documents mentioning a project (top $topN):"
+var fileCounts = facets.getTopChildren(topN, "projectFileCounts")
+println fileCounts
+----
+
+++++
+<pre>
+Frequency of documents mentioning a project (top 5):
+dim=projectFileCounts path=[] value=-1 childCount=27
   eclipse&nbsp;collections (10)
   apache&nbsp;commons math (7)
-  apache&nbsp;spark (4)
+  apache&nbsp;spark (5)
   apache&nbsp;ignite (4)
-  apache&nbsp;commons csv (4)
-  eclipse&nbsp;deeplearning4j (3)
-  apache&nbsp;commons collections (2)
-  apache&nbsp;commons (2)
-  apache&nbsp;wayang (2)
-  apache&nbsp;nlpcraft (1)
+  apache&nbsp;commons csv (4)
+
+</pre>
+++++
 
+[source,groovy]
+----
+['apache', 'commons'].inits().reverseEach { path ->
+    println "Frequency of documents mentioning a project with path $path (top 
$topN):"
+    var nameCounts = facets.getTopChildren(topN, "projectNameCounts", *path)
+    println "$nameCounts"
+}
+----
+
+++++
+<pre>
+Frequency of documents mentioning a project with path [] (top 5):
 dim=projectNameCounts path=[] value=-1 childCount=2
   apache (21)
   eclipse (12)
 
-dim=projectNameCounts path=[apache] value=-1 childCount=15
+Frequency of documents mentioning a project with path [apache] (top 5):
+dim=projectNameCounts path=[apache] value=-1 childCount=18
   commons (16)
-  spark (4)
+  spark (5)
   ignite (4)
-  wayang (2)
-  nlpcraft (1)
-  ofbiz (1)
-  mxnet (1)
-  age (1)
-  hugegraph (1)
-  tinkerpop (1)
-
-dim=projectNameCounts path=[apache,&nbsp;commons] value=-1 childCount=7
+  wayang (3)
+  flink (2)
+
+Frequency of documents mentioning a project with path [apache,&nbsp;commons] (top 
5):
+dim=projectNameCounts path=[apache,&nbsp;commons] value=-1 childCount=7
   math (7)
   csv (4)
   collections (2)
   numbers (1)
   cli (1)
-  codec (1)
-  io (1)
 
-Total documents with hits for +content:apache* +content:eclipse* --> 5 hits
-classifying-iris-flowers-with-deep.adoc
-fruity-eclipse-collections.adoc
-groovy-list-processing-cheat-sheet.adoc
-groovy-null-processing.adoc
-matrix-calculations-with-groovy-apache.adoc
 </pre>
 ++++
 
+[source,groovy]
+----
+var parser = new QueryParser("content", analyzer)
+var query = parser.parse(/apache\ * AND eclipse\ * AND emoji*/)
+var results = searcher.search(query, topN)
+var storedFields = searcher.storedFields()
+assert results.totalHits.value() == 1 &&
+    storedFields.document(results.scoreDocs[0].doc).get('name') == 
'fruity-eclipse-collections.adoc'
+----
+
+== More complex queries
+
+[source,groovy]
+----
+var analyzer = new StandardAnalyzer()
+var indexDir = new ByteBuffersDirectory()
+var config = new IndexWriterConfig(analyzer)
+
+new IndexWriter(indexDir, config).withCloseable { writer ->
+    new File(baseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+        file.withReader { br ->
+            var document = new Document()
+            var fieldType = new FieldType(stored: true,
+                indexOptions: 
IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
+                storeTermVectors: true,
+                storeTermVectorPositions: true,
+                storeTermVectorOffsets: true)
+            document.add(new Field('content', br.text, fieldType))
+            document.add(new StringField('name', file.name, Field.Store.YES))
+            writer.addDocument(document)
+        }
+    }
+}
+----
+
+[source,groovy]
+----
+IndexReader reader = DirectoryReader.open(indexDir)
+var searcher = new IndexSearcher(reader)
+
+var namepart = new SpanMultiTermQueryWrapper(new RegexpQuery(new 
Term("content", '''(
+math|spark|lucene|collections|deeplearning4j
+|beam|wayang|csv|io|numbers|ignite|mxnet|age
+|nlpcraft|pekko|hugegraph|tinkerpop|commons
+|cli|opennlp|ofbiz|codec|kie|flink
+)'''.replaceAll('\n', ''))))
+
+var (apache, commons) = ['apache', 'commons'].collect{ new Term('content', it) 
}
+var apacheCommons = new SpanNearQuery([new SpanTermQuery(apache), new 
SpanTermQuery(commons), namepart] as SpanQuery[], 0, true)
+
+var foundation = new SpanMultiTermQueryWrapper(new RegexpQuery(new 
Term("content", "(apache|eclipse)")))
+var otherProject = new SpanNearQuery([foundation, namepart] as SpanQuery[], 0, 
true)
+
+var builder = new BooleanQuery.Builder(minimumNumberShouldMatch: 1)
+builder.add(otherProject, BooleanClause.Occur.SHOULD)
+builder.add(apacheCommons, BooleanClause.Occur.SHOULD)
+var query = builder.build()
+var results = searcher.search(query, 30)
+println "Total documents with hits for $query --> $results.totalHits"
+----
+
+----
+Total documents with hits for 
(spanNear([SpanMultiTermQueryWrapper(content:/(apache|eclipse)/), 
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pekko|hugegraph|tinkerpop|commons|cli|opennlp|ofbiz|codec|kie|flink)/)],
 0, true) spanNear([content:apache, content:commons, 
SpanMultiTermQueryWrapper(content:/(math|spark|lucene|collections|deeplearning4j|beam|wayang|csv|io|numbers|ignite|mxnet|age|nlpcraft|pek
 [...]
+----
+
 == Conclusion
 
 We have analyzed the Groovy blog posts looking for referenced projects


Reply via email to