Re: Techniques for Retrieving Hits

2018-05-14 Thread Shawn Heisey
On 5/14/2018 3:13 PM, Terry Steichen wrote:
> I posted this note because I've not seen list comments pertaining to the
> job of actually locating and retrieving hitlist documents.

How documents are retrieved will be highly dependent on your setup. 
Here's how things usually go:

If the original data came from a database, then the system where people
do their searches should know how to talk to the database, and use
information in the search results to look up the full original document
in the database.
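
As a rough illustration of the database case, here is a minimal sketch. It assumes each search hit carries an `id` field matching the primary key of a hypothetical `documents` table with a `body` column; the table, the column names, and the use of SQLite are all stand-ins for whatever the real system uses.

```python
import sqlite3


def fetch_full_documents(conn, hits, id_field="id"):
    """Look up the full original document for each search hit.

    `hits` is a list of dicts as they might come back from a search
    response; only the id field is used. The table and column names
    are hypothetical.
    """
    ids = [hit[id_field] for hit in hits]
    if not ids:
        return {}
    # One placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, body FROM documents WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0]: row[1] for row in rows}
```

Hits whose id is no longer in the database are simply absent from the result, so the caller can decide how to report stale index entries.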

If the source data is on a file server, then the system where people do
their searches will need to have the file server storage mounted.  It
will then use information in the search results to access the full
original document.
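
A minimal sketch of the file server case, assuming the index stores a path relative to the mounted archive (the function name and the idea of a stored relative path are assumptions, not something Solr provides). The check keeps a bad or malicious stored value from escaping the mount point.

```python
from pathlib import Path


def resolve_document_path(archive_root, relative_path):
    """Map a path taken from a search result to a file under the mount.

    Raises ValueError if the result would land outside `archive_root`,
    for example because of "../" segments or an absolute path.
    """
    root = Path(archive_root).resolve()
    candidate = (root / relative_path).resolve()
    # Accept only paths strictly below the archive root.
    if root not in candidate.parents:
        raise ValueError(f"path escapes archive root: {relative_path!r}")
    return candidate
```

The search application would then open the returned path and stream the file to the user, after applying whatever access checks it needs.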

Ditto for any other kind of canonical data store with Solr as the search
engine.

The system where searches are done will be implemented by you.  It will
be up to that system to handle any kind of security filtering for both
Solr searches and document access.
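
One way the search application might apply security filtering on the Solr side is to attach a filter query (`fq`) that the end user cannot override. The sketch below assumes a hypothetical `acl_group` field on each indexed document and that the application already knows the user's groups; neither comes from this thread.

```python
from urllib.parse import urlencode


def build_search_params(user_query, user_groups, acl_field="acl_group"):
    """Build Solr request parameters restricted to the user's groups.

    `acl_field` is a hypothetical field listing the groups allowed to
    see a document. The user's text goes in `q`; the restriction goes
    in `fq`, which the application alone controls.
    """
    if not user_groups:
        raise ValueError("refusing to search without any group")

    def quote(value):
        # Escape backslashes and double quotes inside a quoted term.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    allowed = " OR ".join(quote(group) for group in user_groups)
    return [
        ("q", user_query),
        ("fq", f"{acl_field}:({allowed})"),
        ("wt", "json"),
    ]


# Example: query string to append to the application's Solr select URL.
query_string = urlencode(build_search_params("annual report", ["staff", "legal"]))
```

The same group check has to be repeated when the full document is fetched, since a user could otherwise guess a document's location without searching for it.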

Solr should not be exposed directly to end users.  Most of the time,
what's in Solr is not particularly sensitive ... but when Solr is
exposed to people who cannot be trusted, those end users may be able to
change or delete any data in Solr.  They might also be able to send
denial of service queries directly to Solr.
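
A sketch of what the intermediate application might do before forwarding a request, assuming it only ever talks to the search handler and passes along a small allowlist of parameters. The limits and names here are illustrative assumptions.

```python
MAX_ROWS = 100      # assumed cap on page size
MAX_START = 10_000  # assumed cap on paging depth


def _bounded_int(value, default, low, high):
    """Parse an integer and clamp it to [low, high]; fall back to default."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return default
    return max(low, min(number, high))


def sanitize_params(raw):
    """Keep only q, start and rows from an end user's request.

    Everything else the user sent is dropped, so the user cannot
    select another request handler or pass extra options to Solr.
    """
    return {
        "q": str(raw.get("q", "*:*")),
        "start": _bounded_int(raw.get("start"), 0, 0, MAX_START),
        "rows": _bounded_int(raw.get("rows"), 10, 1, MAX_ROWS),
    }
```

This limits what can be sent, but it does not by itself make every possible `q` value cheap to execute; the application may still want timeouts or rate limits of its own.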

Thanks,
Shawn



Re: Techniques for Retrieving Hits

2018-05-14 Thread Terry Steichen
Shawn,

As noted in my embedded comments below, I don't really see the problem
you apparently do.

Maybe I'm missing something important (which certainly wouldn't be the
first, or last, time that has happened).

I posted this note because I've not seen list comments pertaining to the
job of actually locating and retrieving hitlist documents. 

My way "seems" to work, and it is quite simple and compact.  I just
threw it out seeking a sanity check from others.

Terry


On 05/14/2018 11:32 AM, Shawn Heisey wrote:
> On 5/14/2018 6:46 AM, Terry Steichen wrote:
>> In order to allow users to retrieve the documents that match a query, I
>> make use of the embedded Jetty container to provide file server
>> functionality.  To make this happen, I provide a symbolic link between
>> the actual document archive, and the Jetty file server.  This seems
>> somewhat of a kludge, and I'm wondering if others have a better way to
>> retrieve the desired documents?  (I'm not too concerned about security
>> because I use ssh port forwarding to connect to remote authenticated
>> clients.)
>
> This is not a recommended usage for the servlet container where Solr
> runs.
But if the retrieval traffic is light, what's the problem?
>
> Solr is a search engine.  It is not designed to be a data store,
> although some people do use it that way.
Perhaps I didn't explain it right, but I'm not using it as a datastore
(other than the fact that I keep the actual file repository on the same
machine on which Solr runs).  I've got plenty of storage, so that's not
an issue, and, as I mentioned above, traffic is quite light.
>
> If systems running Solr clients want to access all the information for
> a document when the search results do not contain all the information,
> they should use what IS in the search results to access that data from
> the system where it is stored -- that could be a database, a file
> server, a webserver, or similar.
Perhaps I'm missing something, but search results cannot "contain all
the information", can they?  I use highlighting, but that just shows a
few snippets - not a substitute for the document itself.
>
> Thanks,
> Shawn
>
>



Re: Techniques for Retrieving Hits

2018-05-14 Thread Shawn Heisey

On 5/14/2018 6:46 AM, Terry Steichen wrote:

> In order to allow users to retrieve the documents that match a query, I
> make use of the embedded Jetty container to provide file server
> functionality.  To make this happen, I provide a symbolic link between
> the actual document archive, and the Jetty file server.  This seems
> somewhat of a kludge, and I'm wondering if others have a better way to
> retrieve the desired documents?  (I'm not too concerned about security
> because I use ssh port forwarding to connect to remote authenticated
> clients.)


This is not a recommended usage for the servlet container where Solr runs.

Solr is a search engine.  It is not designed to be a data store, 
although some people do use it that way.


If systems running Solr clients want to access all the information for a 
document when the search results do not contain all the information, 
they should use what IS in the search results to access that data from 
the system where it is stored -- that could be a database, a file 
server, a webserver, or similar.


Thanks,
Shawn