Hi,
I'm new to Blur and have spent some time today learning the 0.2.3 API.
I'm having trouble dumping the terms of a Blur index.
Here's some code that uses Iface.terms and mostly works (see below), but
it fails depending on the size parameter passed to Iface.terms.
It wasn't obvious to me how to detect the end-of-terms condition, so if
there's a cleaner way, please let me know.
public static void DumpTerms(Iface blurClient, String tableName)
    throws BlurException, TException
{
    Schema schema = blurClient.schema(tableName);
    for (Map<String,ColumnDefinition> familyDef : schema.getFamilies().values()) {
        for (ColumnDefinition columnDef : familyDef.values()) {
            DumpTermsForColumn(blurClient, tableName, columnDef);
        }
    }
}
public static void DumpTermsForColumn(Iface blurClient,
                                      String tableName,
                                      ColumnDefinition columnDef)
    throws BlurException, TException
{
    String family = columnDef.getFamily();
    String column = columnDef.getColumnName();
    String type = columnDef.getFieldType();
    System.out.println(columnDef);
    if (   !type.equals(TextFieldTypeDefinition.NAME)
        && !type.equals(StringFieldTypeDefinition.NAME)) {
        System.out.println(" WARNING: terms unavailable for type " + type);
        return;
    }
    String startTerm = "";
    int termCount = 0;
    final short termFetchSize = 20; // loop logic assumes this is at least 2
    while (true) {
        List<String> terms = blurClient.terms(tableName,
                                              family,
                                              column,
                                              startTerm,
                                              termFetchSize);
        if (   terms.isEmpty()
            || (terms.size() == 1 && terms.get(0).equals(startTerm))) {
            return;
        }
        for (String term : terms) {
            if (term.equals(startTerm)) {
                // 1st term is startTerm on calls 2-N of blurClient.terms
                continue;
            }
            if (term.isEmpty()) {
                // empty string returned when termFetchSize > terms left
                return;
            }
            startTerm = term;
            long termFreq = blurClient.recordFrequency(tableName,
                                                       family,
                                                       column,
                                                       term);
            System.out.println(" term " + ++termCount
                               + ": [" + term + "] freq=" + termFreq);
        }
    }
}
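To make the paging logic easier to check without a running cluster, here is a minimal, self-contained sketch of the same loop. TermSource and inMemory are hypothetical stand-ins, not part of the Blur API: the stand-in assumes that a terms call returns terms starting at (and including) the start term, and pads a short page with empty strings, which is only what the code above appears to observe. It is a sketch of the end-of-terms detection under those assumptions, not a statement of how Iface.terms is specified to behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class TermPagingSketch {

    /** Hypothetical stand-in for a call like Iface.terms(table, family, column, startWith, size). */
    interface TermSource {
        List<String> terms(String startWith, short size);
    }

    /** Collects all terms by paging, resuming each call from the last term seen. */
    static List<String> allTerms(TermSource source, short fetchSize) {
        if (fetchSize < 2) {
            // Each page after the first repeats the start term, so a page
            // size of 1 could never make progress.
            throw new IllegalArgumentException("fetchSize must be at least 2");
        }
        List<String> result = new ArrayList<>();
        String startTerm = "";
        while (true) {
            List<String> page = source.terms(startTerm, fetchSize);
            int added = 0;
            for (String term : page) {
                if (term.isEmpty()) {
                    return result;          // assumed padding: no terms left
                }
                if (term.equals(startTerm)) {
                    continue;               // overlap with the previous page
                }
                result.add(term);
                startTerm = term;
                added++;
            }
            if (added == 0) {
                return result;              // empty page or only the overlap term
            }
        }
    }

    /** In-memory fake that mimics the assumed behavior (inclusive start, "" padding). */
    static TermSource inMemory(String... terms) {
        TreeSet<String> sorted = new TreeSet<>(Arrays.asList(terms));
        return (startWith, size) -> {
            List<String> page = new ArrayList<>();
            for (String t : sorted.tailSet(startWith, true)) {
                if (page.size() == size) {
                    break;
                }
                page.add(t);
            }
            while (page.size() < size) {
                page.add("");
            }
            return page;
        };
    }

    public static void main(String[] args) {
        TermSource source = inMemory("andy", "beck", "dave", "erik", "hunt");
        System.out.println(allTerms(source, (short) 2));
        System.out.println(allTerms(source, (short) 20));
    }
}
```

The "no progress" check (added == 0) replaces the two separate tests for an empty page and a page holding only the start term, so the loop ends whether or not the server pads short pages.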
ColumnDefinition(family:technology, columnName:author, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
term 1: [andy] freq=1
term 2: [beck] freq=1
term 3: [dave] freq=1
term 4: [douglas] freq=1
term 5: [erik] freq=2
term 6: [gospodnetic] freq=1
term 7: [hatcher] freq=2
term 8: [hofstadter] freq=1
term 9: [howard] freq=1
term 10: [hunt] freq=1
term 11: [husted] freq=1
term 12: [kent] freq=1
term 13: [lewis] freq=1
term 14: [loughran] freq=1
term 15: [massol] freq=1
term 16: [otis] freq=1
term 17: [papert] freq=1
term 18: [seymour] freq=1
term 19: [ship] freq=1
term 20: [steve] freq=1
term 21: [ted] freq=1
term 22: [thomas] freq=1
term 23: [vincent] freq=1
ColumnDefinition(family:technology, columnName:title, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
term 1: [action] freq=3
term 2: [an] freq=1
term 3: [ant] freq=1
term 4: [bach] freq=1
term 5: [braid] freq=1
term 6: [development] freq=1
term 7: [escher] freq=1
term 8: [eternal] freq=1
term 9: [explained] freq=1
term 10: [extreme] freq=1
term 11: [g] freq=1
term 12: [golden] freq=1
term 13: [in] freq=3
term 14: [java] freq=1
term 15: [junit] freq=1
term 16: [lucene] freq=1
term 17: [mindstorms] freq=1
term 18: [pragmatic] freq=1
term 19: [programmer] freq=1
term 20: [programming] freq=1
term 21: [tapestry] freq=1
term 22: [the] freq=1
term 23: [u00f6del] freq=1
term 24: [with] freq=1
ColumnDefinition(family:technology, columnName:pubmonth, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
term 1: [197903] freq=1
term 2: [198001] freq=1
term 3: [199910] freq=2
term 4: [200208] freq=1
term 5: [200310] freq=1
term 6: [200403] freq=1
term 7: [200406] freq=1
ColumnDefinition(family:technology, columnName:subject, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
term 1: [agile] freq=2
term 2: [ant] freq=1
term 3: [apache] freq=1
term 4: [artificial] freq=1
term 5: [build] freq=1
term 6: [children] freq=1
term 7: [components] freq=1
term 8: [computers] freq=1
term 9: [developer] freq=1
term 10: [development] freq=2
term 11: [driven] freq=1
term 12: [education] freq=1
term 13: [extreme] freq=1
term 14: [ideas] freq=1
term 15: [intelligence] freq=1
term 16: [interface] freq=1
term 17: [jakarta] freq=1
term 18: [java] freq=1
term 19: [junit] freq=2
term 20: [logo] freq=1
term 21: [lucene] freq=1
term 22: [mathematics] freq=1
term 23: [methodology] freq=2
term 24: [mock] freq=1
term 25: [music] freq=1
term 26: [number] freq=1
term 27: [objects] freq=1
term 28: [powerful] freq=1
term 29: [pragmatic] freq=1
term 30: [programming] freq=1
term 31: [search] freq=1
term 32: [tapestry] freq=1
term 33: [test] freq=1
term 34: [testing] freq=1
term 35: [theory] freq=1
term 36: [tool] freq=1
term 37: [tools] freq=1
term 38: [unit] freq=1
term 39: [user] freq=1
ColumnDefinition(family:technology, columnName:isbn, subColumnName:null, fieldLessIndexed:false, fieldType:string, properties:null, sortable:false)
term 1: [020161622X] freq=1
term 2: [0201616416] freq=1
term 3: [0465026567] freq=1
term 4: [0465046290] freq=1
term 5: [1930110588] freq=1
term 6: [1930110995] freq=1
term 7: [1932394117] freq=1
term 8: [tbd] freq=1
ColumnDefinition(family:technology, columnName:url, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
term 1: [0201616416] freq=1
term 2: [0465026567] freq=1
term 3: [antbook] freq=1
term 4: [detail] freq=2
term 5: [exec] freq=2
term 6: [http] freq=8
term 7: [index.shtml] freq=1
term 8: [lewisship] freq=1
term 9: [lucene] freq=1
term 10: [massol] freq=1
term 11: [obidos] freq=2
term 12: [ppbook] freq=1
term 13: [tg] freq=2
term 14: [www.amazon.com] freq=2
term 15: [www.manning.com] freq=4
term 16: [www.papert.org] freq=1
term 17: [www.pragmaticprogrammer.com] freq=1
Exception in thread "main" BlurException(message:Call execution exception
[[lia, technology, url, www.pragmaticprogrammer.com, 20]],
stackTraceStr:java.lang.ArrayIndexOutOfBoundsException: 128
at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:104)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextLeaf(BlockTreeTermsReader.java:2467)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next(BlockTreeTermsReader.java:2459)
at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next(BlockTreeTermsReader.java:2139)
at org.apache.blur.index.ExitableReader$ExitableTermsEnum.next(ExitableReader.java:233)
at org.apache.blur.manager.IndexManager.terms(IndexManager.java:1031)
at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:982)
at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:976)
at org.apache.blur.utils.ForkJoin$2.call(ForkJoin.java:63)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
at org.apache.blur.concurrent.BlurThreadPoolExecutor$1.run(BlurThreadPoolExecutor.java:83)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
, errorType:UNKNOWN)
at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26728)
at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26696)
at org.apache.blur.thrift.generated.Blur$terms_result.read(Blur.java:26638)
at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.blur.thrift.generated.Blur$Client.recv_terms(Blur.java:1212)
at org.apache.blur.thrift.generated.SafeClientGen.recv_terms(SafeClientGen.java:508)
at org.apache.blur.thrift.generated.Blur$Client.terms(Blur.java:1195)
at org.apache.blur.thrift.generated.SafeClientGen.terms(SafeClientGen.java:942)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:60)
at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:56)
at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:197)
at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:56)
at com.sun.proxy.$Proxy0.terms(Unknown Source)
at hoodware.sandbox.blur.BlurIndexMain.DumpTermsForColumn(BlurIndexMain.java:88)
at hoodware.sandbox.blur.BlurIndexMain.DumpTerms(BlurIndexMain.java:64)
at hoodware.sandbox.blur.BlurIndexMain.main(BlurIndexMain.java:38)
The code works if I change termFetchSize to 2 instead of 20.
The command "blur terms lia technology.url" hits the same exception.
The command "blur terms lia technology.url -s2" does not hit the exception,
but goes into an infinite loop after it outputs: "- | www.pragmaticprogrammer.com "
Attached is the CSV file that I loaded into an empty table. It's a
reformatted version of the sample data from the Lucene in Action book
(taken from the data directory in
http://www.manning-source.com/books/hatcher2/LuceneInAction.zip).
I created the table with the commands:
hadoop fs -mkdir lia_input
hadoop fs -copyFromLocal ~/projects/lucene/LuceneInAction.csv lia_input
hadoop fs -mkdir tables
blur create -t lia -c 2 -l tables/lia
foreach family (health technology philosophy education)
    blur definecolumn lia $family title text
    blur definecolumn lia $family isbn string
    blur definecolumn lia $family author text
    # blur definecolumn lia $family pubmonth date -p dateFormat yyyyMM
    blur definecolumn lia $family pubmonth text # must be text for Blur.Iface.terms
    blur definecolumn lia $family subject text
    blur definecolumn lia $family url text
end
blur csvloader -c localhost:40010 -A -a -t lia -i lia_input -s';' \
-d 'health title isbn author pubmonth subject url' \
-d 'technology title isbn author pubmonth subject url' \
-d 'philosophy title isbn author pubmonth subject url' \
-d 'education title isbn author pubmonth subject url'
Please let me know if you have any ideas on what I'm doing wrong.
Thanks,
-- Tom
health;Imperial Secrets of Health and Longevity;0936185511;Bob Flaws;199401;diet chinese medicine qi gong health herbs;http://www.bluepoppy.com/acb/showdetl.cfm?&DID
education;A Modern Art of Education;0854402624;Rudolf Steiner;198106;education philosophy psychology practice Waldorf;http://www.amazon.com/exec/obidos/tg/detail/-/0854402624
technology;G\u00F6del, Escher, Bach: an Eternal Golden Braid;0465026567;Douglas Hofstadter;197903;artificial intelligence number theory mathematics music;http://www.amazon.com/exec/obidos/tg/detail/-/0465026567
technology;Lucene in Action;tbd;Otis Gospodnetic,Erik Hatcher;200406;lucene search;http://www.manning.com/lucene
technology;Extreme Programming Explained;0201616416;Kent Beck;199910;extreme programming agile test driven development methodology;http://www.amazon.com/exec/obidos/tg/detail/-/0201616416
technology;Mindstorms;0465046290;Seymour Papert;198001;children computers powerful ideas LOGO education;http://www.papert.org/
technology;Java Development with Ant;1930110588;Erik Hatcher,Steve Loughran;200208;apache jakarta ant build tool junit java development;http://www.manning.com/antbook
technology;JUnit in Action;1930110995;Vincent Massol,Ted Husted;200310;junit unit testing mock objects;http://www.manning.com/massol
technology;The Pragmatic Programmer;020161622X;Dave Thomas,Andy Hunt;199910;pragmatic agile methodology developer tools;http://www.pragmaticprogrammer.com/ppbook/index.shtml
technology;Tapestry in Action;1932394117;Howard Lewis-Ship;200403;tapestry web user interface components;http://www.manning.com/lewisship
philosophy;Tao Te Ching \u9053\u5FB7\u7D93;0060812451;Stephen Mitchell;198810;taoism;http://www.amazon.com/exec/obidos/tg/detail/-/0060812451