Re: Solr geospatial index?

2015-01-11 Thread Matteo Tarantino
Wow, thank you David!
It is really kind of you to spend your time writing all of this information
for me. This will be very helpful for my thesis work.

Thank you again.
MT




 On Sat, Jan 10, 2015 at 10:26 AM, Matteo Tarantino 
 matteo.tarant...@gmail.com wrote:

 Hi all,
 I hope I'm not bothering you, but I think I'm writing to the only mailing
 list that can help me with my question.

 I am writing my master's thesis about Geographical Information Retrieval
 (GIR) and I'm using Solr to create a little geospatial search engine.
 Reading papers about GIR, I noticed that these systems use a separate data
 structure (like an R-tree http://it.wikipedia.org/wiki/R-tree) to store
 the geographical coordinates of documents, but I have found nothing about
 how Solr manages coordinates.

 Can someone help me and, most of all, can someone point me to documents
 that explain how and where Solr stores spatial information?

 Thank you in advance
 Matteo





Re: Solr geospatial index?

2015-01-10 Thread Jack Krupansky
Every field has its own index based on the type of the field.

-- Jack Krupansky







Re: Solr geospatial index?

2015-01-10 Thread Matteo Tarantino
Thank you for your reply,
I have read the documentation, but I still don't understand whether Solr
creates two different indexes, one for the text of the documents and one for
the geographic information of the documents (something like this:
http://imgur.com/E0R3alo )






Re: Solr geospatial index?

2015-01-10 Thread Jack Krupansky
See the Solr reference guide section on Spatial Search:
https://cwiki.apache.org/confluence/display/solr/Spatial+Search

-- Jack Krupansky




Re: Solr geospatial index?

2015-01-10 Thread david.w.smi...@gmail.com
Hello Matteo,

Welcome. You are not bothering me or us; you are asking in the right place.

Jack’s right in terms of the field type dictating how it works.

LatLonType simply stores the latitude and longitude internally as separate
floating point fields, and it does efficient range queries over them for
bounding-box queries.  Lucene has remarkably fast/efficient range queries
over numbers based on a Trie/PrefixTree. In fact, systems like TitanDB leave
such queries to Lucene.  For point-radius queries, it iterates over all of
the values in memory in a brute-force fashion (not scalable, but it may be
fine).
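
To make the two query styles concrete, here is a minimal Python sketch. It is
not Solr or Lucene code: the sample documents, the function names, and the
haversine distance are assumptions chosen for illustration, and the
bounding-box test here is a plain scan rather than an indexed range query.

```python
import math

# Hypothetical sample data: document id -> (latitude, longitude) in degrees.
DOCS = {
    "rome": (41.9028, 12.4964),
    "milan": (45.4642, 9.1900),
    "paris": (48.8566, 2.3522),
}


def bbox_filter(docs, min_lat, max_lat, min_lon, max_lon):
    """Bounding box: one range test on latitude AND one on longitude."""
    return sorted(
        doc_id
        for doc_id, (lat, lon) in docs.items()
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
    )


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    d_phi = p2 - p1
    d_lam = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(d_lam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def radius_filter(docs, lat, lon, km):
    """Point-radius: brute force, computing the distance to every document."""
    return sorted(
        doc_id
        for doc_id, (d_lat, d_lon) in docs.items()
        if haversine_km(lat, lon, d_lat, d_lon) <= km
    )
```

The bounding-box case needs only two independent numeric comparisons per
document, which is why it maps well onto numeric range queries; the radius
case has to touch every candidate point.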

BBoxField is similar in spirit to LatLonType; each side of an indexed
rectangle gets its own floating point field internally.

Note that for both field types listed above, the underlying storage and
range queries use built-in numeric fields.
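
As a rough illustration of why four numeric fields are enough for rectangles,
the sketch below expresses an "intersects" test as four numeric comparisons.
The Rect type and its field names are hypothetical, and the sketch ignores
rectangles that cross the dateline.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle kept as four plain numbers, one per side (names are illustrative)."""
    min_x: float
    max_x: float
    min_y: float
    max_y: float


def intersects(indexed: Rect, query: Rect) -> bool:
    """Two rectangles overlap when their extents overlap on both axes."""
    return (
        indexed.min_x <= query.max_x
        and indexed.max_x >= query.min_x
        and indexed.min_y <= query.max_y
        and indexed.max_y >= query.min_y
    )
```

Each of the four comparisons is a range condition on a single number, so the
whole predicate can be answered by combining ordinary numeric range queries.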

SpatialRecursivePrefixTreeFieldType (RPT for short) is interesting in that
it supports indexing essentially any shape by representing the indexed
shape as multiple grid squares.  Non-point shapes (e.g. a polygon) are
approximated; if you need accuracy, you should additionally store the
vector geometry and validate the results in a 2nd pass (see
SerializedDVStrategy for help with that).  RPT, like Lucene’s numeric
fields, uses a Trie/PrefixTree but encodes two dimensions, not one.
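
The grid-square idea can be sketched with a simple quad tree over
latitude/longitude. This is an illustrative Python sketch, not the encoding
Lucene actually uses: the quadrant letters and the function name are
assumptions made for this example.

```python
def quad_cells(lat, lon, levels):
    """Return the tokens of the grid squares containing a point, coarsest first.

    Each level splits the current square into four quadrants and appends one
    letter: A = north-west, B = north-east, C = south-west, D = south-east.
    """
    min_lat, max_lat = -90.0, 90.0
    min_lon, max_lon = -180.0, 180.0
    letters = {(True, False): "A", (True, True): "B",
               (False, False): "C", (False, True): "D"}
    token = ""
    cells = []
    for _ in range(levels):
        mid_lat = (min_lat + max_lat) / 2
        mid_lon = (min_lon + max_lon) / 2
        north = lat >= mid_lat
        east = lon >= mid_lon
        # Shrink the square to the quadrant that contains the point.
        if north:
            min_lat = mid_lat
        else:
            max_lat = mid_lat
        if east:
            min_lon = mid_lon
        else:
            max_lon = mid_lon
        token += letters[(north, east)]
        cells.append(token)
    return cells


def in_cell(point_cells, cell_token):
    """A point lies in a grid square if the square's token is one of its tokens."""
    return cell_token in point_cells
```

For example, quad_cells(41.9028, 12.4964, 4) yields the tokens B, BC, BCA and
BCAA. Every finer square's token starts with the token of the square that
contains it, so a 2-D containment question becomes a lookup on string
prefixes, which is the kind of thing a terms index handles well.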

The Trie/PrefixTree concept underlies both RPT and numeric fields, which
are approaches to using Lucene’s terms index to encode prefixes.  So the
big point here is that Lucene/Solr doesn’t have side indexes using
fundamentally different technologies for different types of data.  Instead,
Lucene’s one versatile index looks up terms (for keyword search), numbers,
AND 2-d spatial.  For keyword search, the term is a word; for numbers, the
term represents a contiguous range of values (e.g. 100-200); and for 2-d
spatial, a term is a grid square (a 2-D range).
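
Here is a small Python sketch of the "a term represents a contiguous range of
values" idea for numbers. It is a simplification and not Lucene's actual
implementation: the 16-bit width, the 4-bit step, and all function names are
assumptions for illustration, and only non-negative integers are handled.

```python
def prefix_terms(value, bits=16, step=4):
    """Terms indexed for one value: the value at progressively coarser precision."""
    return [(shift, value >> shift) for shift in range(0, bits, step)]


def term_range(term):
    """The contiguous range of values that a (shift, prefix) term stands for."""
    shift, prefix = term
    low = prefix << shift
    return low, low + (1 << shift) - 1


def cover_range(low, high, bits=16, step=4):
    """Greedily cover [low, high] with as few (shift, prefix) terms as possible."""
    terms = []
    while low <= high:
        shift = 0
        # Grow the block while it stays aligned and inside [low, high].
        while shift + step < bits:
            size = 1 << (shift + step)
            if low % size != 0 or low + size - 1 > high:
                break
            shift += step
        terms.append((shift, low >> shift))
        low += 1 << shift
    return terms


def matches(value, query_terms):
    """A value matches if any of its indexed terms is one of the query terms."""
    return bool(set(prefix_terms(value)) & set(query_terms))
```

With these settings, the term (4, 9) stands for the values 144 through 159,
and the range 100 to 200 is covered by 26 terms instead of 101 individual
values. A range query then becomes a handful of exact term lookups, the same
operation a keyword search performs.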

I am aware many other DBs put spatial data in R-Trees, and I have no
interest in investing energy in doing that in Lucene.  That isn’t to say I
think that other DBs shouldn’t be using R-Trees.  I think a system based on
sorted keys/terms (like Lucene and Cassandra, Accumulo, HBase, and others)
already has a powerful/versatile index, such that adding something different
doesn’t warrant the complexity.  And Lucene’s underlying index
continues to improve.  I am most excited about an “auto-prefixing”
technique McCandless has been working on that will bring performance up to
the next level for numeric and spatial data in Lucene’s index.

If you’d like to learn more about RPT and Lucene/Solr spatial, I suggest my
“Spatial Deep Dive” presentation at Lucene Revolution in San Diego, May
2013:  Lucene / Solr 4 Spatial Deep Dive
https://www.youtube.com/watch?v=L2cUGv0Rebs&list=PLsj1Ri57ZE94ulvk2vI_WoJrDYs3ckmH0&index=31
Also, my article here illustrates some RPT concepts in terms of indexing:
http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
