[jira] Created: (LUCENE-2844) benchmark geospatial performance based on geonames.org

David Smiley (JIRA) Mon, 03 Jan 2011 08:59:12 -0800

benchmark geospatial performance based on geonames.org
------------------------------------------------------

Key: LUCENE-2844
URL: https://issues.apache.org/jira/browse/LUCENE-2844
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/benchmark
Reporter: David Smiley
Priority: Minor
Fix For: 4.0

Until this patch, the benchmark contrib module had no means of testing
geospatial data. This patch adds some new files and changes existing ones.
Below is a per-file summary of what the patch adds (all files are within the
benchmark contrib module), along with my notes:

Changes:
* build.xml -- Add dependency on Lucene's spatial module and Solr.
** It was a real pain to figure out the convoluted ant build system to make
this work, and I doubt I did it the proper way.
** Rob Muir thought it would be a good idea to make the benchmark contrib
module a top-level module (i.e. alongside analysis) so that it can depend on
everything.
http://lucene.472066.n3.nabble.com/Re-Geospatial-search-in-Lucene-Solr-tp2157146p2157824.html
I agree.
* ReadTask.java -- Added a search.useHitTotal boolean option that will use the
total hits number for reporting purposes, instead of the existing behavior.
** The existing behavior (i.e. when search.useHitTotal=false) doesn't look very
useful since the response integer is the sum of several things instead of just
one thing. I don't see how anyone makes use of it.

Note that on my local system, I also changed ReportTask & RepSelectByPrefTask
to not include the '-' every other line, and also changed Format.java to not
use commas in the numbers. These changes make copy-pasting into Excel more
streamlined.

New Files:
* geoname-spatial.alg -- my algorithm file.
** Note the ":0" trailing the Populate sequence. This is a trick I use to
skip building the index, since it takes a while to build and I'm not interested
in benchmarking index construction. You'll want to set this to :1 for the first
run so that the index gets built, and then put it back to :0 for further runs,
as long as doc.geo.schemaField and any other configuration elements affecting
the index stay the same.
** In the patch, doc.geo.schemaField=geohash, but unless you're tinkering with
SOLR-2155, you'll probably want to set this to "latlon".
* GeoNamesContentSource.java -- a ContentSource for a geonames.org data file
(either a single country like US.txt or allCountries.txt).
** Uses a subclass of DocData to store all the fields. The existing DocData
wasn't very applicable to data that is not composed of a title and body.
** Doesn't reuse the docdata parameter to getNextDocData(); a new one is
created every time.
** Only supports content.source.forever=false
* GeoNamesDocMaker.java -- a subclass of DocMaker that works very differently
than the existing DocMaker.
** Instead of assuming that each line from geonames.org will correspond to one
Lucene document, this implementation supports, via configuration, creating a
variable number of documents, each with a variable number of points taken
randomly from a GeoNamesContentSource.
** doc.geo.docsToGenerate: The number of documents to generate. If blank it
defaults to the number of rows in GeoNamesContentSource.
** doc.geo.avgPlacesPerDoc: The average number of places to be added to a
document. A random number between 0 and one less than twice this amount is
chosen on a per document basis. If this is set to 1, then exactly one is
always used. In order to support a value greater than 1, use the geohash field
type and incorporate SOLR-2155 (geohash prefix technique).
** doc.geo.oneDocPerPlace: Whether at most one document may use a given
place. In other words, if more than one document can have the same place, set
this to false.
** doc.geo.schemaField: references a field name in schema.xml. The field
should implement SpatialQueryable.
* GeoPerfData.java -- a singleton storing data in memory that is shared by
GeoNamesDocMaker.java and GeoQueryMaker.java.
** content.geo.zeroPopSubst: if a population is encountered that is <= 0, then
use this population value instead. Default is 100.
** content.geo.maxPlaces: A limit on the number of rows read in from
GeoNamesContentSource.java can be set here. Defaults to Integer.MAX_VALUE.
** GeoPerfData is primarily responsible for reading in data from
GeoNamesContentSource into memory to store the lat, lon, and population. When
a random place is asked for, you get one weighted according to population. The
idea is to skew the data towards more referenced places, and a population
number is a decent way of doing it.
* GeoQueryMaker.java -- returns random queries from GeoPerfData by taking a
random point and using a particular configured radius. A pure lat-lon bounding
box query is ultimately done.
** query.geo.radiuskm: The radius of the query in kilometers.
* schema.xml -- a Solr schema file to configure SpatialQueryable fields
referenced by doc.geo.schemaField.
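
As a rough illustration of the population weighting in GeoPerfData and the
doc.geo.avgPlacesPerDoc draw described above, here is a minimal sketch. It is
not the code from the patch: the class PopulationWeightedPlaces and its methods
are hypothetical, it assumes the substitute population is positive, and it
reads "between 0 and one less than twice this amount" as an inclusive range.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch; not the GeoPerfData or GeoNamesDocMaker code from the patch. */
final class PopulationWeightedPlaces {
    // cumulativePop[i] = sum of the (substituted) populations of places 0..i
    private final long[] cumulativePop;
    private final Random random;

    PopulationWeightedPlaces(long[] populations, long zeroPopSubst, Random random) {
        if (populations.length == 0 || zeroPopSubst <= 0) {
            throw new IllegalArgumentException("need at least one place and a positive substitute");
        }
        this.cumulativePop = new long[populations.length];
        this.random = random;
        long total = 0;
        for (int i = 0; i < populations.length; i++) {
            // Same idea as content.geo.zeroPopSubst: replace populations <= 0.
            total += populations[i] <= 0 ? zeroPopSubst : populations[i];
            cumulativePop[i] = total;
        }
    }

    /** Index of a random place, chosen with probability proportional to its population. */
    int nextPlaceIndex() {
        long total = cumulativePop[cumulativePop.length - 1];
        long target = Math.min(total - 1, (long) (random.nextDouble() * total));
        return indexFor(target);
    }

    /** Maps a value in [0, total) to the first place whose cumulative population exceeds it. */
    int indexFor(long target) {
        int pos = Arrays.binarySearch(cumulativePop, target + 1);
        return pos >= 0 ? pos : -pos - 1;
    }

    /** Number of places for one document, given doc.geo.avgPlacesPerDoc (assumed >= 1). */
    static int placesForDoc(int avgPlacesPerDoc, Random random) {
        if (avgPlacesPerDoc == 1) {
            return 1; // exactly one place is always used
        }
        return random.nextInt(2 * avgPlacesPerDoc); // 0 .. 2*avg - 1, inclusive
    }
}
```

The cumulative array keeps each random pick at O(log n) via binary search,
which matters when allCountries.txt is loaded in full.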

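GeoQueryMaker's query ends up as a lat-lon bounding box around a random point.
The sketch below shows one common way to derive such a box from a center point
and query.geo.radiuskm. It illustrates the general technique only, not the
patch's or Solr's actual code: the class RadiusBoundingBox, the spherical Earth
radius of 6371 km, and the missing dateline handling are all assumptions made
for the sketch.

```java
/** Hypothetical helper; not part of the patch or of Solr. */
final class RadiusBoundingBox {
    static final double EARTH_RADIUS_KM = 6371.0; // mean radius, spherical model (assumption)

    /** Returns {minLat, minLon, maxLat, maxLon} in degrees for a circle of radiusKm. */
    static double[] around(double latDeg, double lonDeg, double radiusKm) {
        double angular = radiusKm / EARTH_RADIUS_KM; // radius as an angle, in radians
        double dLat = Math.toDegrees(angular);
        double minLat = Math.max(-90.0, latDeg - dLat);
        double maxLat = Math.min(90.0, latDeg + dLat);
        if (minLat <= -90.0 || maxLat >= 90.0) {
            // The circle touches a pole, so every longitude is inside the box.
            return new double[] {minLat, -180.0, maxLat, 180.0};
        }
        // Longitude span widens with latitude; the clamp guards against rounding above 1.
        double ratio = Math.min(1.0, Math.sin(angular) / Math.cos(Math.toRadians(latDeg)));
        double dLon = Math.toDegrees(Math.asin(ratio));
        // No dateline handling: minLon/maxLon may fall outside [-180, 180].
        return new double[] {minLat, lonDeg - dLon, maxLat, lonDeg + dLon};
    }
}
```
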
When I run this algorithm as provided with the file in the patch, I get this
result:
{noformat}
Operation  round  ____km  runCnt  recsPerRun  rec/s         elapsedSec  avgUsedMem   avgTotalMem
Search_40  0      350     1       4811687     1,206,541.38  3.99        117,722,664  191,934,464
{noformat}

The key metrics I use are the average milliseconds per query and the average
places per query. The number of queries performed is the trailing numeric
suffix of Operation (40 in Search_40). The formulas:
* avg ms/query: elapsedSec*1000/queries == 99.75
* avg places / query: recsPerRun/queries == 120,292
