You can make no assumptions about locality in terms of where separate documents
land on disk. I suppose if you have the whole corpus at index time you could
index these "similar" documents contiguously. Then, assuming there were
absolutely never any updates/deletes, I _think_ the docs might tend to be
contiguous on disk, but that's very iffy and based on several assumptions.

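If you do control indexing end to end, the closest supported mechanism I know
of is index sorting (IndexWriterConfig.setIndexSort, available since Lucene
6.2), which keeps each segment ordered by a field value rather than by arrival
order. A rough sketch below; the field names (group_id, id, body) and the path
are just placeholders for whatever you actually have, and remember the ordering
is per segment, so it only approaches "contiguous on disk" after merges:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class GroupSortedIndexing {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Keep each segment sorted by group_id so near-duplicates in the same
    // group end up adjacent within a segment.
    iwc.setIndexSort(new Sort(new SortField("group_id", SortField.Type.STRING)));

    try (IndexWriter writer =
             new IndexWriter(FSDirectory.open(Paths.get("/tmp/grouped-index")), iwc)) {
      Document doc = new Document();
      // The sort field has to be a doc-values field.
      doc.add(new SortedDocValuesField("group_id", new BytesRef("group-42")));
      doc.add(new StringField("id", "doc-1", Field.Store.YES));
      doc.add(new TextField("body", "near duplicate content here", Field.Store.YES));
      writer.addDocument(doc);
      // Ordering is per segment; merging (or an explicit forceMerge) is what
      // eventually pulls the groups together across segments.
      writer.forceMerge(1);
    }
  }
}
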
My base question is why you'd care about compressing 500G. Disk space is so
cheap that the expense of trying to control this dwarfs any imaginable
$avings, unless you're talking about a lot of 500G indexes. In other words
this seems like an XY problem: you're asking about compressing when you're
really concerned with something else.
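
That said, if smaller stored fields really is the goal, the knob I'd look at
first is the codec's stored-fields compression mode rather than document
ordering. Roughly something like the following (a sketch assuming a 6.x index;
the concrete codec class name tracks the Lucene release you're on):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat;
import org.apache.lucene.codecs.lucene62.Lucene62Codec;
import org.apache.lucene.index.IndexWriterConfig;

public class HighCompressionConfig {
  public static IndexWriterConfig newConfig() {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // BEST_COMPRESSION stores fields with DEFLATE instead of the default LZ4,
    // trading some indexing/retrieval CPU for smaller stored-field files.
    iwc.setCodec(new Lucene62Codec(Lucene50StoredFieldsFormat.Mode.BEST_COMPRESSION));
    return iwc;
  }
}

Note that existing segments keep whatever format they were written with until
merges rewrite them, so the full 500G wouldn't shrink until a forceMerge or a
reindex.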

Best,
Erick

On Tue, Nov 15, 2016 at 5:32 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> I have a large index (say 500GB) with a large percentage of near-duplicate
> documents.
>
> I have to keep the documents there (can't delete them) as the metadata is
> important.
>
> Is it possible to get the documents to be contiguous somehow?
>
> Once they are contiguous then they will compress very well - which I've
> already confirmed by writing the exact same document N times.
>
> IDEALLY I could use two fields and have a unique document ID but then a
> group_id so that they can be located on disk by the group_id... but I don't
> think this is possible.
>
> Can I just create a synthetic "id" field for this and assume that "id" is
> ordered on disk in the Lucene index?
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>

