Hi Tom,

Thanks! Comments below:
On Sep 22, 2011, at 2:30 PM, Thomas Bennett wrote:

> Hi,
>
> I have a few questions about building queries for filemgr lucene catalogs and
> I was thinking someone may be able to help me.
>
> I've ingested some files into a catalog and have then been using the command
> line tools (and aliases - thanks Cameron!) to query the catalog.
>
> I'm not too familiar with writing SQL queries, but I've been able to achieve
> the following types of queries:
>
> bin$ ./query_tool --url http://localhost:9000 --sql -query "SELECT
> Observer,Description,Duration,ExperimentID FROM KatFile WHERE
> Observer='jasper'" --sortBy Duration
>
> Which returns:
> .....
> jasper,a9909ae6-822b-11e0-a7a1-0060dd4721d8,Target track,637.841571569
> jasper,47c3a4da-822a-11e0-a7a1-0060dd4721d8,Target track,565.859450817
> jasper,777b0f34-8224-11e0-a7a1-0060dd4721d8,Target track,80.9798858166
>
> bin$ ./query_tool --url http://localhost:9000 --lucene -query
> 'Observer:sharmila'
>
> Which returns:
> .......
> ba9b292e-e506-11e0-ad74-9f1c5e7f0611
> b93dbc0d-e506-11e0-ad74-9f1c5e7f0611
> b7e530ec-e506-11e0-ad74-9f1c5e7f0611
> b66ff60b-e506-11e0-ad74-9f1c5e7f0611
> afc6556a-e506-11e0-ad74-9f1c5e7f0611
>
> Questions:
>
> • The SQL query does what I expect ;-) but with one problem - in what
> order will I receive the data? I can't figure out an automatic way to find
> out which column is which data.

Good question! It looks like the tool just prints the metadata in an arbitrary order, as opposed to the order in which you requested it. This is probably not a great thing to do, so can you file an issue and we can take a look at it?

> • Is full SQL query syntax supported?

Nope, it's just a small subset. You can see what's supported here:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/util/SqlParser.html

Improvements welcome! :)

> • The Lucene query returns the productID. Is there a class I can use
> that will return something similar to the sql query?
> (Although I should look at the code and find this out for myself - asking is
> free :-)

Heh, great question, but the answer is no. We didn't really standardize the output from these tools. I originally developed the QueryTool (which understood Lucene to begin with), and later Brian Foster added the SQL syntax to it, along with its associated response format. Maybe we should open an issue (and an associated wiki page) on standardizing the output. Feel free to propose something and I'll be happy to join in (hopefully others will too).

> • I've not yet tested any more complex SQL and Lucene queries - I was
> just wondering if there was any useful info out there that would show me
> some more funky example queries. So far I've found the lucene tutorial and
> sql quick ref. I'll tie this into the OODT Filemgr User Guide once I've
> figured these things out.

+1, that's the best place to start. Note that we also only support a limited subset of the Lucene syntax; see the following class:

http://oodt.apache.org/components/maven/apidocs/org/apache/oodt/cas/filemgr/tools/CASAnalyzer.html

> • I see the version of lucene being used is quite old (2.0.0 and the
> latest ver is 2.9.1). Is there any reason why OODT is using this old version?

I would *love* to upgrade to 2.9.1 or 2.9.4. Upgrading to 3.0 would break APIs for us, because Lucene changed to the ScoreCollector method for getting hits back (I believe in the 3.x series); however, we should be forwards compatible up through e.g. 2.9.4:

http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/2.9.4/

> • Should I be spending the effort to use a different catalog (i.e. a sql
> database) or are other OODT implementations using lucene?
>
> Thanks in advance for any help.

Great question. Most folks use Lucene to begin with, because it requires no external database or service - it just works out of the box.
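As a starting point for funkier queries, here's a sketch of a couple of Lucene-syntax queries of the kind the supported subset should handle. The field names are the ones from your examples; I haven't checked each of these against CASAnalyzer, so treat them as a starting point. The command lines are just assembled and printed here, since running them needs a live FM at that URL:

```shell
# Sketch only: assumes the FM URL and KatFile metadata fields from the
# examples above; we print the command lines rather than executing them.
FM_URL="http://localhost:9000"

# Boolean AND of two term queries:
Q1="Observer:jasper AND Description:track"

# Grouping with OR, combined with another required term:
Q2="(Observer:jasper OR Observer:sharmila) AND Description:track"

for Q in "$Q1" "$Q2"; do
  echo "./query_tool --url $FM_URL --lucene -query '$Q'"
done
```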
It also has a number of other advantages:

* Easy unit testing against your index.
* You can copy FM index directories around and share them between machines.
* You can test locally on your laptop by copying the FM index off of a server, and then spinning up a local FM from there. The file refs won't exist, but you can play around with the catalog and most other things will work.
* You can open the FM index in Luke (http://getopt.org/luke/) and then browse and query the index using the full Lucene syntax.
* It's fairly scalable (up to tens of millions of products). You can scale beyond that, but you have to get into index partitioning, backups, etc. Also, at that scale, time queries suffer from token explosion (e.g., a range query from 2001-01-01T00:00:00.000Z to 2003-01-01T00:00:00.000Z will explode), mainly due to the SerDe format we used in the LuceneCatalog for storing CAS metadata and product information. This could be improved to scale beyond a few million products, but no one has invested the effort into that yet; people typically just switch to a SQL RDBMS and the DataSourceCatalog at that point.

To move your existing index to the DataSourceCatalog, there's a tool in the FM that I wrote called ExpImpCatalog. You can find it here:

http://s.apache.org/Xuq

To use the tool in an existing FM deployment, do the following:

1. Stand up a new FM that you are going to configure with your DataSourceCatalog:
   - change the port to 9010
   - if your existing FM is in e.g. /usr/local/filemgr, put this new one in /usr/local/filemgr2
   - configure it with the DataSourceCatalog
   - set up your DB and bake the parameters into the FM config

2.
Go into /usr/local/filemgr/bin (your existing, Lucene-based FM) and run:

   java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog

You should see:

   ]$ java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog
   ExpImpCatalog [options] --source <url> --dest <url> --unique
   [--types <comma separated list of product type names>]
   [--sourceCatProps <file> --destCatProps <file>]

The tool works as follows. You give it either a combination of --source and --dest, OR a combination of --sourceCatProps and --destCatProps.

In the --source and --dest case, it will import all of the source catalog into the dest catalog via XML-RPC, talking to your source FM URL and your dest FM URL.

In the --sourceCatProps and --destCatProps case, it will do the same thing, except it won't use XML-RPC as the transport layer: it will simply instantiate a copy of the source Catalog interface object and the dest Catalog interface object (in a single JVM), and import one product and its metadata at a time from source to dest. I made the props-based portion of the tool to avoid transferring large metadata and product objects over XML-RPC, and to keep them within a single JVM.

The --unique parameter will skip importing a source product ID into the dest catalog if that product ID already exists there. The --types parameter specifies a comma separated list of Product Types to export from the source catalog into the dest catalog; if --types is omitted, all product types are assumed.

So, there is an easy way to migrate from an existing Lucene index FM catalog into any other Catalog fronted by the FM. Another thing people sometimes do, if they have the source data and the ingestion pipeline, is simply blow away the Lucene (or whatever) Catalog, and then re-ingest using the Crawler/FM/Curation pipeline into e.g. a new DataSourceCatalog that they configure their existing FM to use.

Hope that helps explain things.
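To make the XML-RPC form concrete, here's a sketch of the migration invocation. The URLs and ports follow the layout described above (9000 for the existing Lucene FM, 9010 for the new DataSourceCatalog FM), and the KatFile product type is taken from the earlier query examples; adjust all of these for your deployment. The command is printed rather than executed, since it needs both FMs running:

```shell
# Hypothetical migration invocation, assuming the FM layout described
# above: existing Lucene-backed FM on port 9000, new DataSourceCatalog
# FM on port 9010, and the KatFile product type from the earlier examples.
SRC_URL="http://localhost:9000"
DEST_URL="http://localhost:9010"

CMD="java -Djava.ext.dirs=../lib org.apache.oodt.cas.filemgr.tools.ExpImpCatalog \
  --source $SRC_URL --dest $DEST_URL --unique --types KatFile"

# Printed rather than run here, since it requires both FMs to be up:
echo "$CMD"
```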
These would probably be good javadocs, plus wiki pages for these tools and migration :)

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
