This is an automated email from the ASF dual-hosted git repository.
paulk pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/groovy-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 53329d1 draft blog about using lucene with groovy
53329d1 is described below
commit 53329d16574e6dc4fdd1cc3e312fb1d1fc3b6cbf
Author: Paul King <[email protected]>
AuthorDate: Tue Nov 19 07:08:00 2024 +1000
draft blog about using lucene with groovy
---
site/src/site/blog/groovy-lucene.adoc | 490 ++++++++++++++++++++++++++++++++++
1 file changed, 490 insertions(+)
diff --git a/site/src/site/blog/groovy-lucene.adoc
b/site/src/site/blog/groovy-lucene.adoc
new file mode 100644
index 0000000..25ae523
--- /dev/null
+++ b/site/src/site/blog/groovy-lucene.adoc
@@ -0,0 +1,490 @@
+= Searching with Lucene
+Paul King
+:revdate: 2024-11-18T20:30:00+00:00
+:draft: true
+:keywords: aggregation, search, lucene, groovy
+:description: This post looks at using Lucene to find references to other projects in Groovy's blog posts.
+
+The Groovy https://groovy.apache.org/blog/[blog posts] often reference other Apache projects.
+Let's have a look at how we can find such references, first using regular expressions
+and then using Apache Lucene.
+
+== Finding project names with a regex
+
+For the sake of this post, let's assume that project references will
+include the word "Apache" followed by the project name. To make it more
+interesting, we'll also include references to Eclipse projects.
+We'll also make provision for projects with subprojects, at least for
+Apache Commons, so this will pick up names like Apache Commons Math
+for instance. We'll exclude Apache Groovy since that would hit possibly
+every Groovy blog post. We'll also exclude a bunch of words that appear in
+commonly used phrases like "Apache License" and "Apache Projects".
+
+This is by no means a perfect name reference finder. For example,
+we often refer to Apache Commons Math by its full name when first introduced,
+but later in a post we fall back to the friendlier "Commons Math" reference,
+where the "Apache" is understood from the context. We could make the regex
+more elaborate to cater for such cases, but there isn't really any benefit,
+so we won't.
+
+[source,groovy]
+----
+String tokenRegex = /(?ix)          # ignore case, enable whitespace & comments
+    \b                              # word boundary
+    (                               # start capture of all terms
+        (                           # capture project name
+            (apache|eclipse)\s      # foundation name
+            (commons\s)?            # optional subproject name
+            (?!(                    # next word must not be an excluded word
+                groovy              # excluded words
+                | and
+                | license
+                | users
+                | software
+                | projects
+                | https
+                | technologies
+            ))\w+
+        )                           # end capture #2
+        |                           # alternatively
+        (                           # capture non-project word
+            (?!(apache|eclipse))
+            \w+
+        )                           # end capture #3
+    )                               # end capture #1
+/
+----
+
+We've used Groovy's multiline slashy string to save having to escape backslashes.
+We've also enabled regex whitespace and comments to explain the regex.
+Feel free to make it a compact (long) one-liner without comments if you prefer.
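+
+To sanity-check the pattern, here is what a compact, comment-free equivalent looks
+like when run against a made-up sample sentence (illustration only):
+
+[source,groovy]
+----
+String tokenRegex = /(?i)\b(((apache|eclipse)\s(commons\s)?(?!(groovy|and|license|users|software|projects|https|technologies))\w+)|((?!(apache|eclipse))\w+))/
+
+var text = 'We used Apache Commons Math under the Apache License with Apache Groovy.'
+// capture group 2 holds project names; grep() drops the nulls from ordinary words
+var projects = (text =~ tokenRegex)*.get(2).grep()*.toLowerCase()
+assert projects == ['apache commons math']
+----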
+
+== Finding project names using regex matching
+
+With our regex sorted, let's look at how you could use a Groovy matcher
+to find all the project names.
+
+[source,groovy]
+----
+var blogBaseDir = '/projects/apache-websites/groovy-website/site/src/site/blog' // <1>
+var histogram = [:].withDefault { 0 }
+
+new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file -> // <2>
+ var m = file.text =~ tokenRegex // <3>
+    var projects = m*.get(2).grep()*.toLowerCase()*.replaceAll('\n', ' ').countBy() // <4>
+ if (projects) {
+ println "$file.name: $projects" // <5>
+ projects.each { k, v -> histogram[k] += v } // <6>
+ }
+}
+
+println()
+
+histogram.sort { e -> -e.value }.each { k, v -> // <7>
+ var label = "$k ($v)"
+ println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
+}
+----
+<1> You'd need to check out the Groovy website and point to it here
+<2> This traverses the directory, processing each asciidoc file
+<3> We define our matcher
+<4> This pulls out project names (capture group 2), ignores other words (using grep), then aggregates the hits for that file
+<5> We print out each blog post file name and its project references
+<6> We add the file aggregates to the overall aggregates
+<7> We print out a pretty ASCII bar chart summarising the overall aggregates
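+
+The `bar` helper isn't shown in the snippet above. A minimal sketch
+(an assumption, matching the `bar(value, min, max, width)` usage) could be:
+
+[source,groovy]
+----
+// render value (between min and max) as a proportional bar of up to width blocks
+static String bar(int value, int min, int max, int width) {
+    int blocks = ((value - min) * width).intdiv(max - min)
+    ('█' * blocks) + '▏'
+}
+
+assert bar(18, 0, 50, 50) == '█' * 18 + '▏'
+----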
+
+The output looks like:
+
+// entered below so that we don't hit this whole table as a bunch of references
+++++
+<pre>
+apache-nlpcraft-with-groovy.adoc: [apache nlpcraft:5]
+classifying-iris-flowers-with-deep.adoc: [eclipse deeplearning4j:5, apache commons math:1, apache spark:2]
+community-over-code-eu-2024.adoc: [apache ofbiz:1, apache commons math:2, apache ignite:1]
+community-over-code-na-2023.adoc: [apache ignite:8, apache commons numbers:1, apache commons csv:1]
+deck-of-cards-with-groovy.adoc: [eclipse collections:5]
+deep-learning-and-eclipse-collections.adoc: [eclipse collections:7, eclipse deeplearning4j:2]
+detecting-objects-with-groovy-the.adoc: [apache mxnet:12]
+fruity-eclipse-collections.adoc: [eclipse collections:9, apache commons math:1]
+fun-with-obfuscated-groovy.adoc: [apache commons math:1]
+groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
+groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
+groovy-haiku-processing.adoc: [eclipse collections:3]
+groovy-list-processing-cheat-sheet.adoc: [eclipse collections:4, apache commons collections:3]
+groovy-lucene.adoc: [apache lucene:2, apache commons:1, apache commons math:2]
+groovy-null-processing.adoc: [eclipse collections:6, apache commons collections:4]
+groovy-pekko-gpars.adoc: [apache pekko:4]
+groovy-record-performance.adoc: [apache commons codec:1]
+handling-byte-order-mark-characters.adoc: [apache commons io:1]
+lego-bricks-with-groovy.adoc: [eclipse collections:6]
+matrix-calculations-with-groovy-apache.adoc: [apache commons math:6, eclipse deeplearning4j:1, apache commons:1]
+natural-language-processing-with-groovy.adoc: [apache opennlp:2, apache spark:1]
+reading-and-writing-csv-files.adoc: [apache commons csv:1]
+set-operations-with-groovy.adoc: [eclipse collections:3]
+solving-simple-optimization-problems-with-groovy.adoc: [apache commons math:5, apache kie:1]
+using-groovy-with-apache-wayang.adoc: [apache wayang:9, apache spark:7, apache flink:1, apache commons csv:1, apache ignite:1]
+whiskey-clustering-with-groovy-and.adoc: [apache ignite:7, apache wayang:1, apache spark:2, apache commons csv:2]
+wordle-checker.adoc: [eclipse collections:3]
+zipping-collections-with-groovy.adoc: [eclipse collections:4]
+
+eclipse collections (50)         ██████████████████████████████████████████████████▏
+apache commons math (18)         ██████████████████▏
+apache ignite (17)               █████████████████▏
+apache spark (12)                ████████████▏
+apache mxnet (12)                ████████████▏
+apache age (11)                  ███████████▏
+apache wayang (10)               ██████████▏
+eclipse deeplearning4j (8)       ████████▏
+apache commons collections (7)   ███████▏
+apache nlpcraft (5)              █████▏
+apache commons csv (5)           █████▏
+apache pekko (4)                 ████▏
+apache hugegraph (3)             ███▏
+apache tinkerpop (3)             ███▏
+apache commons cli (2)           ██▏
+apache commons (2)               ██▏
+apache lucene (2)                ██▏
+apache opennlp (2)               ██▏
+apache ofbiz (1)                 █▏
+apache commons numbers (1)       █▏
+apache commons codec (1)         █▏
+apache commons io (1)            █▏
+apache kie (1)                   █▏
+apache flink (1)                 █▏
+</pre>
+++++
+
+== Using Lucene
+
+Okay, regular expressions weren't that hard, but in general we might want to search for many things.
+Search frameworks like Lucene help with that. Let's see what it looks like to
+apply Lucene to our problem.
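+
+The snippets below assume the relevant Lucene modules are on the classpath.
+With Groovy you might grab them like this (the module names are the standard
+Maven coordinates; the version shown is just an example of a recent release):
+
+[source,groovy]
+----
+@Grab('org.apache.lucene:lucene-core:9.11.1')
+@Grab('org.apache.lucene:lucene-analysis-common:9.11.1')
+@Grab('org.apache.lucene:lucene-queryparser:9.11.1')
+@Grab('org.apache.lucene:lucene-highlighter:9.11.1')
+@Grab('org.apache.lucene:lucene-facet:9.11.1')
+import org.apache.lucene.analysis.Analyzer
+----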
+
+First, we'll define a custom analyzer. Lucene is very flexible and comes with
+builtin analyzers. In a typical scenario, we might just search on all words,
+and there's a builtin analyzer for that. If we used it, to query for our
+project names we'd construct a query spanning multiple (word) terms.
+For the purposes of our little example, we are going to assume project names
+are indivisible terms and slice them up that way. There is a pattern tokenizer
+which lets us reuse our existing regex.
+
+[source,groovy]
+----
+class ApacheProjectAnalyzer extends Analyzer {
+ @Override
+ protected TokenStreamComponents createComponents(String fieldName) {
+ var src = new PatternTokenizer(~tokenRegex, 0)
+ var result = new LowerCaseFilter(src)
+ new TokenStreamComponents(src, result)
+ }
+}
+----
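+
+As a quick sanity check, we can pull tokens straight from the analyzer
+(a sketch; it assumes `tokenRegex` from earlier and a `CharTermAttribute`
+import from `org.apache.lucene.analysis.tokenattributes`):
+
+[source,groovy]
+----
+var analyzer = new ApacheProjectAnalyzer()
+var stream = analyzer.tokenStream('content', 'Uses Apache Lucene and Eclipse Collections')
+var termAttr = stream.addAttribute(CharTermAttribute)
+stream.reset()
+var tokens = []
+while (stream.incrementToken()) { tokens << termAttr.toString() }
+stream.end()
+stream.close()
+println tokens // e.g. [uses, apache lucene, and, eclipse collections]
+----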
+
+Let's now tokenize our documents and let Lucene index them.
+
+[source,groovy]
+----
+var analyzer = new ApacheProjectAnalyzer() // <1>
+var indexDir = new ByteBuffersDirectory() // <2>
+var config = new IndexWriterConfig(analyzer)
+var writer = new IndexWriter(indexDir, config)
+
+var blogBaseDir = '/projects/apache-websites/groovy-website/site/src/site/blog'
+new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+ file.withReader { br ->
+ var document = new Document()
+        var fieldType = new FieldType(stored: true,
+            indexOptions: IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
+            storeTermVectors: true,
+            storeTermVectorPositions: true,
+            storeTermVectorOffsets: true)
+        document.add(new Field('content', br.text, fieldType)) // <3>
+        document.add(new StringField('name', file.name, Field.Store.YES)) // <4>
+ writer.addDocument(document)
+ }
+}
+writer.close()
+
+var reader = DirectoryReader.open(indexDir)
+var searcher = new IndexSearcher(reader)
+var parser = new QueryParser("content", analyzer)
+
+var query = parser.parse('apache* OR eclipse*') // <5>
+var results = searcher.search(query, 30) // <6>
+println "Total documents with hits for $query --> $results.totalHits"
+
+var storedFields = searcher.storedFields()
+var histogram = [:].withDefault { 0 }
+results.scoreDocs.each { ScoreDoc doc -> // <7>
+ var document = storedFields.document(doc.doc)
+ var found = handleHit(doc, query, reader) // <8>
+    println "${document.get('name')}: ${found*.replaceAll('\n', ' ').countBy()}"
+    found.each { histogram[it.replaceAll('\n', ' ')] += 1 } // <9>
+}
+
+println()
+
+histogram.sort { e -> -e.value }.each { k, v -> // <10>
+ var label = "$k ($v)"
+ println "${label.padRight(32)} ${bar(v, 0, 50, 50)}"
+}
+
+List<String> handleHit(ScoreDoc hit, Query query, DirectoryReader dirReader) { // <11>
+    boolean phraseHighlight = Boolean.TRUE
+    boolean fieldMatch = Boolean.TRUE
+    FieldQuery fieldQuery = new FieldQuery(query, dirReader, phraseHighlight, fieldMatch)
+    FieldTermStack stack = new FieldTermStack(dirReader, hit.doc, 'content', fieldQuery)
+    FieldPhraseList phrases = new FieldPhraseList(stack, fieldQuery)
+    phrases.phraseList*.termsInfos*.text.flatten()
+}
+----
+<1> This is our regex-based analyzer
+<2> We'll use a memory-based index for our little example
+<3> Store content of document along with term position info
+<4> Also store the name of the file
+<5> Search for terms with the apache or eclipse prefixes
+<6> Perform our query with a limit of 30 results
+<7> Process each result
+<8> Pull out the actual matched terms
+<9> Also aggregate the counts
+<10> Display the aggregates as a pretty bar chart
+<11> Helper method
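+
+As an aside, the parsed query could equally be built programmatically,
+which avoids query-syntax escaping issues. A sketch using standard Lucene
+query classes (`QueryParser` turns `apache*` into a prefix query anyway):
+
+[source,groovy]
+----
+var query = new BooleanQuery.Builder()
+    .add(new PrefixQuery(new Term('content', 'apache')), BooleanClause.Occur.SHOULD)
+    .add(new PrefixQuery(new Term('content', 'eclipse')), BooleanClause.Occur.SHOULD)
+    .build()
+----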
+
+The output is essentially the same as before:
+
+// used instead of space below so that we don't hit this whole table as a bunch of project references
+++++
+<pre>
+Total documents with hits for content:apache* content:eclipse* --> 28 hits
+classifying-iris-flowers-with-deep.adoc: [eclipse deeplearning4j:5, apache commons math:1, apache spark:2]
+fruity-eclipse-collections.adoc: [eclipse collections:9, apache commons math:1]
+groovy-list-processing-cheat-sheet.adoc: [eclipse collections:4, apache commons collections:3]
+groovy-null-processing.adoc: [eclipse collections:6, apache commons collections:4]
+matrix-calculations-with-groovy-apache.adoc: [apache commons math:6, eclipse deeplearning4j:1, apache commons:1]
+apache-nlpcraft-with-groovy.adoc: [apache nlpcraft:5]
+community-over-code-eu-2024.adoc: [apache ofbiz:1, apache commons math:2, apache ignite:1]
+community-over-code-na-2023.adoc: [apache ignite:8, apache commons numbers:1, apache commons csv:1]
+deck-of-cards-with-groovy.adoc: [eclipse collections:5]
+deep-learning-and-eclipse-collections.adoc: [eclipse collections:7, eclipse deeplearning4j:2]
+detecting-objects-with-groovy-the.adoc: [apache mxnet:12]
+fun-with-obfuscated-groovy.adoc: [apache commons math:1]
+groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
+groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
+groovy-haiku-processing.adoc: [eclipse collections:3]
+groovy-lucene.adoc: [apache lucene:2, apache commons:1, apache commons math:2]
+groovy-pekko-gpars.adoc: [apache pekko:4]
+groovy-record-performance.adoc: [apache commons codec:1]
+handling-byte-order-mark-characters.adoc: [apache commons io:1]
+lego-bricks-with-groovy.adoc: [eclipse collections:6]
+natural-language-processing-with-groovy.adoc: [apache opennlp:2, apache spark:1]
+reading-and-writing-csv-files.adoc: [apache commons csv:1]
+set-operations-with-groovy.adoc: [eclipse collections:3]
+solving-simple-optimization-problems-with-groovy.adoc: [apache commons math:5, apache kie:1]
+using-groovy-with-apache-wayang.adoc: [apache wayang:9, apache spark:7, apache flink:1, apache commons csv:1, apache ignite:1]
+whiskey-clustering-with-groovy-and.adoc: [apache ignite:7, apache wayang:1, apache spark:2, apache commons csv:2]
+wordle-checker.adoc: [eclipse collections:3]
+zipping-collections-with-groovy.adoc: [eclipse collections:4]
+
+eclipse collections (50)         ██████████████████████████████████████████████████▏
+apache commons math (18)         ██████████████████▏
+apache ignite (17)               █████████████████▏
+apache spark (12)                ████████████▏
+apache mxnet (12)                ████████████▏
+apache age (11)                  ███████████▏
+apache wayang (10)               ██████████▏
+eclipse deeplearning4j (8)       ████████▏
+apache commons collections (7)   ███████▏
+apache nlpcraft (5)              █████▏
+apache commons csv (5)           █████▏
+apache pekko (4)                 ████▏
+apache hugegraph (3)             ███▏
+apache tinkerpop (3)             ███▏
+apache commons (2)               ██▏
+apache commons cli (2)           ██▏
+apache lucene (2)                ██▏
+apache opennlp (2)               ██▏
+apache ofbiz (1)                 █▏
+apache commons numbers (1)       █▏
+apache commons codec (1)         █▏
+apache commons io (1)            █▏
+apache kie (1)                   █▏
+apache flink (1)                 █▏
+</pre>
+++++
+
+== Using Lucene Facets
+
+Lucene also lets us compute aggregates using facets. We store facet
+information at indexing time and can then efficiently retrieve aggregated
+counts when searching. Let's redo our aggregation using facets.
+
+[source,groovy]
+----
+var analyzer = new ApacheProjectAnalyzer()
+var indexDir = new ByteBuffersDirectory()
+var taxonDir = new ByteBuffersDirectory()
+var config = new IndexWriterConfig(analyzer)
+var indexWriter = new IndexWriter(indexDir, config)
+var taxonWriter = new DirectoryTaxonomyWriter(taxonDir)
+
+var fConfig = new FacetsConfig().tap {
+ setHierarchical("projectNameCounts", true)
+ setMultiValued("projectNameCounts", true)
+ setMultiValued("projectFileCounts", true)
+ setMultiValued("projectHitCounts", true)
+ setIndexFieldName('projectHitCounts', '$projectHitCounts')
+}
+
+var blogBaseDir = '/projects/apache-websites/groovy-website/site/src/site/blog'
+new File(blogBaseDir).traverse(nameFilter: ~/.*\.adoc/) { file ->
+ var m = file.text =~ tokenRegex
+    var projects = m*.get(2).grep()*.toLowerCase()*.replaceAll('\n', ' ').countBy()
+ file.withReader { br ->
+ var document = new Document()
+        var fieldType = new FieldType(stored: true,
+            indexOptions: IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS,
+            storeTermVectors: true,
+            storeTermVectorPositions: true,
+            storeTermVectorOffsets: true)
+ document.add(new Field('content', br.text, fieldType))
+ document.add(new StringField('name', file.name, Field.Store.YES))
+ if (projects) {
+ println "$file.name: $projects"
+ projects.each { k, v ->
+                document.add(new IntAssociationFacetField(v, "projectHitCounts", k))
+ document.add(new FacetField("projectFileCounts", k))
+ document.add(new FacetField("projectNameCounts", k.split()))
+ }
+ }
+ indexWriter.addDocument(fConfig.build(taxonWriter, document))
+ }
+}
+indexWriter.close()
+taxonWriter.close()
+println()
+
+var reader = DirectoryReader.open(indexDir)
+var searcher = new IndexSearcher(reader)
+var taxonReader = new DirectoryTaxonomyReader(taxonDir)
+var fcm = new FacetsCollectorManager()
+var fc = FacetsCollectorManager.search(searcher, new MatchAllDocsQuery(), 10, fcm).facetsCollector()
+
+var projects = new TaxonomyFacetIntAssociations('$projectHitCounts', taxonReader, fConfig, fc, AssociationAggregationFunction.SUM)
+var hitCounts = projects.getTopChildren(10, "projectHitCounts")
+println hitCounts
+
+var facets = new FastTaxonomyFacetCounts(taxonReader, fConfig, fc)
+var fileCounts = facets.getTopChildren(10, "projectFileCounts")
+println fileCounts
+
+var nameCounts = facets.getTopChildren(10, "projectNameCounts")
+println nameCounts
+nameCounts = facets.getTopChildren(10, "projectNameCounts", 'apache')
+println nameCounts
+nameCounts = facets.getTopChildren(10, "projectNameCounts", 'apache', 'commons')
+println nameCounts
+
+var parser = new QueryParser("content", analyzer)
+var query = parser.parse('apache* AND eclipse*')
+var results = searcher.search(query, 10)
+println "Total documents with hits for $query --> $results.totalHits"
+var storedFields = searcher.storedFields()
+results.scoreDocs.each { ScoreDoc doc ->
+ var document = storedFields.document(doc.doc)
+ println "${document.get('name')}"
+}
+----
+
+// entered below so that we don't hit this whole table as a bunch of references
+++++
+<pre>
+apache-nlpcraft-with-groovy.adoc: [apache nlpcraft:5]
+classifying-iris-flowers-with-deep.adoc: [eclipse deeplearning4j:5, apache commons math:1, apache spark:2]
+community-over-code-eu-2024.adoc: [apache ofbiz:1, apache commons math:2, apache ignite:1]
+community-over-code-na-2023.adoc: [apache ignite:8, apache commons numbers:1, apache commons csv:1]
+deck-of-cards-with-groovy.adoc: [eclipse collections:5]
+deep-learning-and-eclipse-collections.adoc: [eclipse collections:7, eclipse deeplearning4j:2]
+detecting-objects-with-groovy-the.adoc: [apache mxnet:12]
+fruity-eclipse-collections.adoc: [eclipse collections:9, apache commons math:1]
+fun-with-obfuscated-groovy.adoc: [apache commons math:1]
+groovy-2-5-clibuilder-renewal.adoc: [apache commons cli:2]
+groovy-graph-databases.adoc: [apache age:11, apache hugegraph:3, apache tinkerpop:3]
+groovy-haiku-processing.adoc: [eclipse collections:3]
+groovy-list-processing-cheat-sheet.adoc: [eclipse collections:4, apache commons collections:3]
+groovy-lucene.adoc: [apache lucene:2, apache commons:1, apache commons math:2]
+groovy-null-processing.adoc: [eclipse collections:6, apache commons collections:4]
+groovy-pekko-gpars.adoc: [apache pekko:4]
+groovy-record-performance.adoc: [apache commons codec:1]
+handling-byte-order-mark-characters.adoc: [apache commons io:1]
+lego-bricks-with-groovy.adoc: [eclipse collections:6]
+matrix-calculations-with-groovy-apache.adoc: [apache commons math:6, eclipse deeplearning4j:1, apache commons:1]
+natural-language-processing-with-groovy.adoc: [apache opennlp:2, apache spark:1]
+reading-and-writing-csv-files.adoc: [apache commons csv:1]
+set-operations-with-groovy.adoc: [eclipse collections:3]
+solving-simple-optimization-problems-with-groovy.adoc: [apache commons math:5, apache kie:1]
+using-groovy-with-apache-wayang.adoc: [apache wayang:9, apache spark:7, apache flink:1, apache commons csv:1, apache ignite:1]
+whiskey-clustering-with-groovy-and.adoc: [apache ignite:7, apache wayang:1, apache spark:2, apache commons csv:2]
+wordle-checker.adoc: [eclipse collections:3]
+zipping-collections-with-groovy.adoc: [eclipse collections:4]
+
+dim=projectHitCounts path=[] value=-1 childCount=24
+ eclipse collections (50)
+ apache commons math (18)
+ apache ignite (17)
+ apache spark (12)
+ apache mxnet (12)
+ apache age (11)
+ apache wayang (10)
+ eclipse deeplearning4j (8)
+ apache commons collections (7)
+ apache nlpcraft (5)
+
+dim=projectFileCounts path=[] value=-1 childCount=24
+ eclipse collections (10)
+ apache commons math (7)
+ apache spark (4)
+ apache ignite (4)
+ apache commons csv (4)
+ eclipse deeplearning4j (3)
+ apache commons collections (2)
+ apache commons (2)
+ apache wayang (2)
+ apache nlpcraft (1)
+
+dim=projectNameCounts path=[] value=-1 childCount=2
+ apache (21)
+ eclipse (12)
+
+dim=projectNameCounts path=[apache] value=-1 childCount=15
+ commons (16)
+ spark (4)
+ ignite (4)
+ wayang (2)
+ nlpcraft (1)
+ ofbiz (1)
+ mxnet (1)
+ age (1)
+ hugegraph (1)
+ tinkerpop (1)
+
+dim=projectNameCounts path=[apache, commons] value=-1 childCount=7
+ math (7)
+ csv (4)
+ collections (2)
+ numbers (1)
+ cli (1)
+ codec (1)
+ io (1)
+
+Total documents with hits for +content:apache* +content:eclipse* --> 5 hits
+classifying-iris-flowers-with-deep.adoc
+fruity-eclipse-collections.adoc
+groovy-list-processing-cheat-sheet.adoc
+groovy-null-processing.adoc
+matrix-calculations-with-groovy-apache.adoc
+</pre>
+++++
+
+== Conclusion
+
+We have analyzed the Groovy blog posts looking for referenced projects
+using regular expressions and Apache Lucene.