Hello All,
I have been given the enviable job of upgrading existing faceted taxonomy
indexes from 3.6 to 5.3.
To make sure that I have everything in working order, I have written a little
program to “smoke test” the upgrade. Facets retrieved in version 3 should be
retrievable in version 5, or our upgrade has failed.
Unfortunately, I can’t seem to put together a quick program to validate my data
once it is upgraded to version 5. Can someone tell me where I have gone off
the rails?
In this email, I include:
1. The 3.6.2 validation code … (establishes what should be seen after the
upgrade runs)
1.1. mvn dependencies
1.2. source code
1.3. output
2. The Lucene upgrade shell script
3. The 5.3.1 validation code (that generates nulls and isn’t quite
right)
3.1. mvn dependencies
3.2. source code
4. The url for the compressed tar file of the index data stored in drop box.
Here are the key Maven dependencies that I used for the 3.6 source:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-facet</artifactId>
<version>3.6.2</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
<version>3.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>3.6.0</version>
</dependency>
Here is the code to retrieve facet data from the version 3.6 index (which does
work against Lucene 3.6):
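import java.io.File;
import java.util.List;
import org.apache.lucene.facet.search.FacetsCollector;
import org.apache.lucene.facet.search.params.CountFacetRequest;
import org.apache.lucene.facet.search.params.FacetSearchParams;
import org.apache.lucene.facet.search.results.FacetResult;
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiCollector;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class FacetRunner {
    public static void main(final String[] args) throws Exception {
        // Open the 3.6 document index.
        File indexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene3/data/doc-index/lucene");
        Directory indexDir = new SimpleFSDirectory(indexDirFile);
        IndexReader indexReader = IndexReader.open(indexDir);
        Searcher searcher = new IndexSearcher(indexReader);
        // Open the 3.6 taxonomy index.
        File taxonomyIndexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene3/data/facets");
        Directory taxonomyIndexDir = new SimpleFSDirectory(taxonomyIndexDirFile);
        TaxonomyReader taxo = new DirectoryTaxonomyReader(taxonomyIndexDir);
        // Alternative query: new Term("text", "clarissa")
        Term aTerm = new Term("$facets", "$fulltree$");
        Query q = new TermQuery(aTerm);
        TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
        // Request the top 10 counts under brs_recipient_domain.
        FacetSearchParams facetSearchParams = new FacetSearchParams();
        facetSearchParams.addFacetRequest(new CountFacetRequest(
                new CategoryPath("brs_recipient_domain"), 10));
        FacetsCollector facetsCollector = new FacetsCollector(facetSearchParams, indexReader, taxo);
        searcher.search(q, MultiCollector.wrap(tdc, facetsCollector));
        List<FacetResult> res = facetsCollector.getFacetResults();
        for (FacetResult facetResult : res) {
            System.out.println(facetResult.toString());
        }
    }
}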
Output looks like:
Request: brs_recipient_domain nRes=10 nLbl=10
Num valid Descendants (up to specified depth): 486
Facet Result Node with 10 sub result nodes.
Name: brs_recipient_domain
Value: 2896.0
Residue: 1497.0
Subresult #0
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/enron.com
Value: 1979.0
Residue: 0.0
Subresult #1
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/aol.com
Value: 124.0
Residue: 0.0
Subresult #2
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/bracepatt.com
Value: 84.0
Residue: 0.0
Subresult #3
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/txu.com
Value: 63.0
Residue: 0.0
Subresult #4
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/hotmail.com
Value: 46.0
Residue: 0.0
Subresult #5
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/teneo-test.com
Value: 42.0
Residue: 0.0
Subresult #6
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/yahoo.com
Value: 41.0
Residue: 0.0
Subresult #7
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/dttus.com
Value: 34.0
Residue: 0.0
Subresult #8
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/velaw.com
Value: 30.0
Residue: 0.0
Subresult #9
Facet Result Node with 0 sub result nodes.
Name: brs_recipient_domain/netzero.net
Value: 28.0
Residue: 0.0
Process finished with exit code 0
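To compare this output with the 5.3.1 run mechanically rather than by eye, the Name/Value pairs could be parsed into a map. This is only a rough sketch: FacetOutputParser is a made-up helper name, not a Lucene class, and it assumes the output keeps exactly the line layout shown above (a "Name:" line followed by a "Value:" line).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): turns the "Name:"/"Value:" lines
// printed by the 3.6 program into a label -> count map.
final class FacetOutputParser {

    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> counts = new LinkedHashMap<>();
        String currentName = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("Name:")) {
                // Remember the label until its "Value:" line shows up.
                currentName = line.substring("Name:".length()).trim();
            } else if (line.startsWith("Value:") && currentName != null) {
                String number = line.substring("Value:".length()).trim();
                counts.put(currentName, Double.parseDouble(number));
                currentName = null;
            }
            // All other lines (Residue, Subresult, ...) are ignored.
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample lines copied from the 3.6 output above.
        List<String> sample = Arrays.asList(
                "Name: brs_recipient_domain",
                "Value: 2896.0",
                "Residue: 1497.0",
                "Subresult #0",
                "Facet Result Node with 0 sub result nodes.",
                "Name: brs_recipient_domain/enron.com",
                "Value: 1979.0",
                "Residue: 0.0");
        for (Map.Entry<String, Double> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The map keeps insertion order, so the entries come back in the same order as the facet output.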
To upgrade the indexes, I have written a shell script that runs the
IndexUpgrader twice: first using the 4.10.4 core jar to bring the facet index
and the document index to 4, then using the 5.3.1 jars to bring both to 5.
#!/bin/sh
export JARS_HOME=/users/scott/projects/prototypes/lucene-3-and-5/jars
echo "===>>>>>migrating lucene data from 3 to 4<<<<<========="
echo
export LUCENE_4_PATH=$JARS_HOME/lucene-core-4.10.4.jar
date "+DATE: %Y-%m-%d%nTIME: %H:%M:%S"
echo "upgrading facets taxonomy indices from 3 to 4 with command time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader facets"
time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader facets
echo
echo "upgrading document indices from 3 to 4 with command time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene"
time java -cp $LUCENE_4_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene
echo
echo "===>>>>>migrating lucene data from 4 to 5<<<<<========="
echo
export LUCENE_5_PATH=$JARS_HOME/lucene-backward-codecs-5.3.1.jar:$JARS_HOME/lucene-core-5.3.1.jar
echo "upgrading facets taxonomy indices from 4 to 5 with command time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader facets"
time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader facets
echo
echo "upgrading document indices from 4 to 5 with command time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene"
time java -cp $LUCENE_5_PATH org.apache.lucene.index.IndexUpgrader doc-index/lucene
echo
echo "done upgrading from lucene 3 to lucene 5"
date "+DATE: %Y-%m-%d%nTIME: %H:%M:%S"
No errors occur.
At this point, my indexes look like version 5 Lucene indexes.
Now I want to validate my indexes and pull similar (if not the same) data from
the upgraded indexes.
Here are the Maven dependencies for the 5.3.1 source:
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-facet</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queries</artifactId>
<version>5.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>5.3.1</version>
</dependency>
Here is my 5.3.1 program. It returns nulls. What am I doing wrong?
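import java.io.File;
import java.nio.file.Path;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.LabelAndValue;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;

public class FacetRunner {
    public static void main(final String[] args) throws Exception {
        // Open the upgraded document index.
        File indexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene5/data/doc-index/lucene");
        Path indexDirFilePath = indexDirFile.toPath();
        Directory indexDir = new SimpleFSDirectory(indexDirFilePath);
        IndexReader indexReader = DirectoryReader.open(indexDir);
        IndexSearcher searcher = new IndexSearcher(indexReader);
        // Open the upgraded taxonomy index.
        File taxonomyIndexDirFile = new File("/Users/scott/projects/prototypes/lucene-3-and-5/lucene5/data/facets");
        Path taxonomyIndexDirFilePath = taxonomyIndexDirFile.toPath();
        Directory taxonomyIndexDir = new SimpleFSDirectory(taxonomyIndexDirFilePath);
        TaxonomyReader taxo = new DirectoryTaxonomyReader(taxonomyIndexDir);
        Term aTerm = new Term("$facets", "$fulltree$");
        Query q = new TermQuery(aTerm);
        FacetsCollector facetsCollector = new FacetsCollector();
        //searcher.search(q, MultiCollector.wrap(tdc, facetsCollector));
        //FacetsCollector.search(searcher, new MatchAllDocsQuery(),10,facetsCollector);
        FacetsCollector.search(searcher, q, 10, facetsCollector);
        FacetsConfig config = new FacetsConfig();
        //config.set
        Facets facets = new FastTaxonomyFacetCounts(taxo, config, facetsCollector);
        // Ask for the top 10 children of brs_recipient_domain and print them.
        FacetResult result = facets.getTopChildren(10, "brs_recipient_domain");
        for (LabelAndValue labelValue : result.labelValues) {
            System.out.println(String.format("%s (%s)", labelValue.label, labelValue.value));
        }
    }
}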
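For the comparison itself, something along these lines is what I have in mind. It is a rough sketch and not part of Lucene: FacetSmokeCheck and mismatches are hypothetical names, and it assumes the counts from both versions have already been collected into maps keyed by the bare domain label. The sample data uses two of the 3.6 counts shown earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Lucene): lists every label whose count in
// the upgraded index is missing or differs from the 3.6 baseline.
final class FacetSmokeCheck {

    static List<String> mismatches(Map<String, Long> expected, Map<String, Long> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Long> e : expected.entrySet()) {
            Long got = actual.get(e.getKey());
            if (got == null) {
                problems.add("missing: " + e.getKey());
            } else if (!got.equals(e.getValue())) {
                problems.add("count differs for " + e.getKey()
                        + ": expected " + e.getValue() + ", got " + got);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Baseline counts taken from the 3.6 output.
        Map<String, Long> expected = new LinkedHashMap<>();
        expected.put("enron.com", 1979L);
        expected.put("aol.com", 124L);

        // Made-up "upgraded" result in which aol.com has gone missing.
        Map<String, Long> actual = new LinkedHashMap<>();
        actual.put("enron.com", 1979L);

        for (String problem : mismatches(expected, actual)) {
            System.out.println(problem);
        }
    }
}
```

An empty list would mean every facet retrieved in version 3 was also retrieved, with the same count, in version 5.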
Here is the url to a gzipped tar that contains the index (not yet upgraded):
https://www.dropbox.com/s/qbr7ogwgekatrdf/faceted_lucene_data.tar.gz?dl=0
Thanks for your help.
Scott