If you store a hash code of each word rather than the actual word, you
should be able to search for terms without being able to retrieve the
original text. You can trade precision for "security" through the number
of bits in the hash code (e.g. 32 or 64 bits). I'd think a 64-bit hash
would be a reasonable midpoint.
hash64("dog") = 4312311231123121;
"body:4312311231123121" returns every document containing "dog", but also
any other document containing a word that hashes to the same value.
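
A minimal sketch of the idea, not Lucene code: the dict below is a toy
stand-in for an inverted index, and hash64 is built here from BLAKE2b
truncated to 8 bytes, which is my assumption (any 64-bit hash would do).
The optional key argument is also an assumption, added because an unkeyed
hash of ordinary words can be guessed by hashing a word list.

```python
import hashlib


def hash64(word: str, key: bytes = b"") -> int:
    """Hypothetical 64-bit token hash: BLAKE2b truncated to 8 bytes."""
    data = word.lower().encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    return int.from_bytes(digest, "big")


def index_document(index: dict, doc_id: str, text: str, key: bytes = b"") -> None:
    """Store only hashed tokens; the words themselves never enter the index."""
    for word in text.split():
        index.setdefault(hash64(word, key), set()).add(doc_id)


def search(index: dict, word: str, key: bytes = b"") -> set:
    """Hash the query term the same way and look up the matching documents."""
    return index.get(hash64(word, key), set())


index = {}
index_document(index, "animals.doc", "Let's sell cat and dog robot toys")
print(search(index, "dog"))
```

The query side hashes the term exactly as the indexing side does, so a
search for "dog" still finds the document, while the index itself holds
only integers.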
Walt Stoneburner wrote:
Have an interesting scenario I'd like to get your take on with respect
to Lucene:
A data provider (e.g. someone with a private website or corporately
shared directory of proprietary documents) has requested their content
be indexed with Lucene so employees can be redirected to it, but
with one proviso -- under no circumstances should that content be stored
in, or be recoverable from, the index.
Is that even possible?
The data owner's request makes sense in the context of them wanting to
retain full access control via logins as well as collecting access
metrics.
If the token 'CAT' points to C:\Corporate\animals.doc and the token
'DOG' also points there, then great, CAT AND DOG will give that
document a higher rating, though it is not possible to reconstruct
(with any great accuracy) what the actual document content is.
However, if for the sake of using the NEAR operator with Lucene the
tokens are stored as LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7
THIS:8 DECEMBER:9 ... then someone could pull all tokens for
animals.doc and reconstitute the token stream.
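
To make the concern concrete, here is a small sketch of that
reconstruction. It is a toy positional index, not Lucene's actual storage
format; build_positional_index and reconstruct are hypothetical helper
names used only for illustration.

```python
def build_positional_index(doc_id: str, text: str) -> dict:
    """Map each token to a list of (doc_id, position) postings, 1-based."""
    index = {}
    for pos, token in enumerate(text.upper().split(), start=1):
        index.setdefault(token, []).append((doc_id, pos))
    return index


def reconstruct(index: dict, doc_id: str) -> str:
    """Collect every posting for doc_id and sort by position to rebuild the text."""
    positioned = [
        (pos, token)
        for token, postings in index.items()
        for posting_doc, pos in postings
        if posting_doc == doc_id
    ]
    return " ".join(token for _, token in sorted(positioned))


idx = build_positional_index(
    "animals.doc", "Let's sell cat and dog robot toys this December"
)
print(reconstruct(idx, "animals.doc"))
```

Without the positions, the same lookup would yield only an unordered set
of tokens per document.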
Does Lucene have any kind of trade-off for working with "secure" (and
I use this term loosely) data?
-wls
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]