[ https://issues.apache.org/jira/browse/LUCENE-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-2844:
---------------------------------

    Attachment: benchmark-geo.patch

> benchmark geospatial performance based on geonames.org
> ------------------------------------------------------
>
>                 Key: LUCENE-2844
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2844
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: David Smiley
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: benchmark-geo.patch
>
>
> Until now (with this patch), the benchmark contrib module did not include a
> means to test geospatial data. This patch includes some new files and
> changes to existing ones. Here is a summary of what is being added in this
> patch per file (all files below are within the benchmark contrib module)
> along with my notes:
> Changes:
> * build.xml -- Add dependency on Lucene's spatial module and Solr.
> ** It was a real pain to figure out the convoluted ant build system to make
> this work, and I doubt I did it the proper way.
> ** Rob Muir thought it would be a good idea to make the benchmark contrib
> module a top-level module (i.e. alongside analysis) so that it can depend
> on everything:
> http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2157824.html
> I agree.
> * ReadTask.java -- Added a search.useHitTotal boolean option that will use
> the total hits number for reporting purposes, instead of the existing
> behavior.
> ** The existing behavior (i.e. when search.useHitTotal=false) doesn't look
> very useful since the response integer is the sum of several things instead
> of just one thing. I don't see how anyone makes use of it.
> Note that on my local system, I also changed ReportTask & RepSelectByPrefTask
> to not include the '-' every other line, and also changed Format.java to not
> use commas in the numbers. These changes are to make copy-pasting into Excel
> more streamlined.
> New Files:
> * geoname-spatial.alg -- my algorithm file.
> ** Note the ":0" trailing the Populate sequence. This is a trick I use to
> skip building the index, since it takes a while to build and I'm not
> interested in benchmarking index construction. You'll want to set this to :1
> for the first run and then put it back to :0 for further runs, as long as
> doc.geo.schemaField and any other configuration elements affecting the
> index stay the same.
> ** In the patch, doc.geo.schemaField=geohash, but unless you're tinkering
> with SOLR-2155, you'll probably want to set this to "latlon".
> * GeoNamesContentSource.java -- a ContentSource for a geonames.org data file
> (either a single country like US.txt or allCountries.txt).
> ** Uses a subclass of DocData to store all the fields. The existing DocData
> wasn't very applicable to data that is not composed of a title and body.
> ** Doesn't reuse the docdata parameter to getNextDocData(); a new one is
> created every time.
> ** Only supports content.source.forever=false.
> * GeoNamesDocMaker.java -- a subclass of DocMaker that works very differently
> than the existing DocMaker.
> ** Instead of assuming that each line from geonames.org will correspond to
> one Lucene document, this implementation supports, via configuration,
> creating a variable number of documents, each with a variable number of
> points taken randomly from a GeoNamesContentSource.
> ** doc.geo.docsToGenerate: The number of documents to generate. If blank it
> defaults to the number of rows in GeoNamesContentSource.
> ** doc.geo.avgPlacesPerDoc: The average number of places to be added to a
> document. A random number between 0 and one less than twice this amount is
> chosen on a per document basis. If this is set to 1, then exactly one is
> always used. In order to support a value greater than 1, use the geohash
> field type and incorporate SOLR-2155 (geohash prefix technique).
> ** doc.geo.oneDocPerPlace: Whether at most one document should use the same
> place.
> In other words, can more than one document have the same place? If so, set
> this to false.
> ** doc.geo.schemaField: references a field name in schema.xml. The field
> should implement SpatialQueryable.
> * GeoPerfData.java -- a singleton class storing data in memory that is
> shared by GeoNamesDocMaker.java and GeoQueryMaker.java.
> ** content.geo.zeroPopSubst: if a population is encountered that is <= 0,
> then use this population value instead. Default is 100.
> ** content.geo.maxPlaces: A limit on the number of rows read in from
> GeoNamesContentSource.java can be set here. Defaults to Integer.MAX_VALUE.
> ** GeoPerfData is primarily responsible for reading in data from
> GeoNamesContentSource into memory to store the lat, lon, and population.
> When a random place is asked for, you get one weighted according to
> population. The idea is to skew the data towards more referenced places, and
> a population number is a decent way of doing it.
> * GeoQueryMaker.java -- returns random queries from GeoPerfData by taking a
> random point and using a particular configured radius. A pure lat-lon
> bounding box query is ultimately done.
> ** query.geo.radiuskm: The radius of the query in kilometers.
> * schema.xml -- a Solr schema file to configure SpatialQueryable fields
> referenced by doc.geo.schemaField.
> When I run this algorithm as provided with the file in the patch, I get this
> result:
> {noformat}
> Operation  round  ____km  runCnt  recsPerRun         rec/s  elapsedSec   avgUsedMem  avgTotalMem
> Search_40      0     350       1     4811687  1,206,541.38        3.99  117,722,664  191,934,464
> {noformat}
> The key metrics I use are the average milliseconds per query and the average
> places per query. The number of queries performed is the trailing numeric
> suffix to Operation. The formulas:
> * avg ms/query: elapsedSec*1000/queries == 98.8
> * avg places / query: recsPerRun/queries == 120,292
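The two formulas in the issue description can be checked with a few lines of Java. This is only a sketch: the class and method names (GeoBenchmarkMetrics, queriesFromOperation, and so on) are made up for illustration and are not part of the patch, and the inputs are the rounded values printed in the report row. With elapsedSec taken as 3.99, the first formula comes out at roughly 99.75 ms rather than the 98.8 quoted; the description does not say which unrounded figures were used.

```java
/** Illustrative helper (not part of the patch): per-query metrics from one report row. */
public final class GeoBenchmarkMetrics {

    private GeoBenchmarkMetrics() {}

    /** Number of queries, taken from the trailing numeric suffix, e.g. "Search_40" -> 40. */
    public static int queriesFromOperation(String operation) {
        int underscore = operation.lastIndexOf('_');
        if (underscore < 0 || underscore == operation.length() - 1) {
            throw new IllegalArgumentException("No numeric suffix in: " + operation);
        }
        return Integer.parseInt(operation.substring(underscore + 1));
    }

    /** avg ms/query = elapsedSec * 1000 / queries */
    public static double avgMsPerQuery(double elapsedSec, int queries) {
        return elapsedSec * 1000.0 / queries;
    }

    /** avg places/query = recsPerRun / queries */
    public static double avgPlacesPerQuery(long recsPerRun, int queries) {
        return (double) recsPerRun / queries;
    }

    public static void main(String[] args) {
        // Values copied from the Search_40 row of the report in the description.
        int queries = queriesFromOperation("Search_40");
        System.out.printf("avg ms/query:     %.2f%n", avgMsPerQuery(3.99, queries));
        System.out.printf("avg places/query: %.0f%n", avgPlacesPerQuery(4811687L, queries));
    }
}
```

Parsing the query count from the Operation name keeps the helper usable for other rows of the same report without hard-coding 40.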