[
https://issues.apache.org/jira/browse/HADOOP-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528212
]
stack commented on HADOOP-1913:
-------------------------------
This is a nice-looking addition, Ning.
Here's a couple of comments:
Shouldn't IndexConf extend HBaseConfiguration? Otherwise you won't have the
hbase settings in the mix. (Would IndexConfiguration be a better name than
IndexConf?)
You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME. You
should fix that; otherwise it won't apply when Hudson tries to apply it.
The way you add the per-column config. into a hadoop configuration is very
cute. I'm unclear how multiple columns are done. Should there be a columns
element to hold multiple column elements? I'd suggest you add javadoc with an
example config. ('cos trying to conjure the xml produced by the code takes
a little effort).
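To make the javadoc suggestion concrete, here is a rough sketch of the kind of
layout I mean: one columns element wrapping one column element per column,
serialized to a string that could be set as a single configuration value.
Everything in it is hypothetical: the class ColumnConfigSketch, the
ColumnOptions fields, and the element names (columns, column, name, index,
store, tokenize) are illustrative guesses, not what the patch actually produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch only: not the classes or the XML schema from the patch. */
public class ColumnConfigSketch {

    /** Per-column index options; the field names are illustrative assumptions. */
    static final class ColumnOptions {
        final boolean index;
        final boolean store;
        final boolean tokenize;

        ColumnOptions(boolean index, boolean store, boolean tokenize) {
            this.index = index;
            this.store = store;
            this.tokenize = tokenize;
        }
    }

    /** Serializes every column under a single <columns> parent element. */
    static String toXml(Map<String, ColumnOptions> columns) {
        StringBuilder sb = new StringBuilder("<columns>\n");
        for (Map.Entry<String, ColumnOptions> e : columns.entrySet()) {
            ColumnOptions o = e.getValue();
            sb.append("  <column>\n")
              .append("    <name>").append(escape(e.getKey())).append("</name>\n")
              .append("    <index>").append(o.index).append("</index>\n")
              .append("    <store>").append(o.store).append("</store>\n")
              .append("    <tokenize>").append(o.tokenize).append("</tokenize>\n")
              .append("  </column>\n");
        }
        return sb.append("</columns>\n").toString();
    }

    /** Minimal escaping for XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in insertion order in the output.
        Map<String, ColumnOptions> columns = new LinkedHashMap<>();
        columns.put("contents:", new ColumnOptions(true, false, true));
        columns.put("anchor:", new ColumnOptions(true, true, false));
        System.out.println(toXml(columns));
    }
}
```

Something along these lines in the IndexConf javadoc would save readers from
having to work out the xml from the code.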
Ning, have you tried your patch on a distributed cluster? Does your column
trick get properly distributed out, and does your LuceneDocumentWrapper work
in the distributed context?
Did you use lucene 2.2 or something else?
I had a problem compiling:
{code}
[javac] Compiling 14 source files to /Users/stack/Documents/checkouts/hadoop-trunk/build/contrib/hbase/test
[javac] /Users/stack/Documents/checkouts/hadoop-trunk/src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestTableIndex.java:255: cannot find symbol
[javac] symbol : variable DONE_NAME
[javac] location: class org.apache.hadoop.hbase.mapred.IndexOutputFormat
[javac] if (IndexOutputFormat.DONE_NAME.equals(name)) {
{code}
> [HBase] Build a Lucene index on an HBase table
> ----------------------------------------------
>
> Key: HADOOP-1913
> URL: https://issues.apache.org/jira/browse/HADOOP-1913
> Project: Hadoop
> Issue Type: New Feature
> Components: contrib/hbase
> Reporter: Ning Li
> Priority: Minor
> Attachments: build_table_index.patch
>
>
> This patch provides a Reducer class and other related classes which help to
> build a Lucene index on an HBase table. The index build part is similar to
> that of Nutch.
> - Each row is modeled as a Lucene document: row key is indexed in its
> untokenized form, column name-value pairs are Lucene field name-value pairs.
> - IndexConf is used to configure various Lucene parameters, specify whether
> to optimize an index and which columns to index and/or store, in tokenized or
> untokenized form, etc.
> - The number of reduce tasks decides the number of indexes (partitions).
> The index(es) are stored in the output path of the job configuration.
> - The index build process is done in the reduce phase. Users can use the
> map phase to join rows from different tables or to pre-parse/analyze column
> content, etc.
> - A junit test is added to test the build of an index on an HBase table
> with an identity mapper. It also serves as an example on how to use the new
> classes.
> - BuildTableIndex is provided to help build an index on an HBase table.
> It should be moved to an examples package if HBase decides to have one.
> This patch requires the inclusion of the Lucene library.