You can make no assumptions about locality in terms of where separate documents land on disk. I suppose if you have the whole corpus at index time you could index these "similar" documents contiguously. Then, assuming there were absolutely never any updates/deletes, I _think_ the docs might tend to be contiguous on disk, but that's very iffy and based on several assumptions.
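As a rough stdlib-only sketch of why contiguity matters (this is not Lucene code — the class name, document contents, and the two-docs-per-chunk compression are all made up for illustration; it just uses java.util.zip.Deflater, with small fixed chunks loosely standing in for the fixed-size blocks Lucene's stored-field compression works on), grouping near-duplicate docs next to each other shrinks the total compressed size, while interleaving them defeats the deduplication:

```java
import java.util.Random;
import java.util.zip.Deflater;

public class GroupedCompression {

    // Deterministic pseudo-random text so each doc has little internal redundancy.
    static String randomText(long seed, int len) {
        Random r = new Random(seed);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + r.nextInt(26)));
        }
        return sb.toString();
    }

    // DEFLATE a byte array and return the compressed length in bytes.
    static int deflatedSize(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 128];
        int total = 0;
        while (!d.finished()) {
            total += d.deflate(buf);
        }
        d.end();
        return total;
    }

    // Compress the doc stream two docs at a time (a toy stand-in for
    // per-chunk stored-field compression) and sum the chunk sizes.
    static int chunkedSize(String[] docs) {
        int total = 0;
        for (int i = 0; i < docs.length; i += 2) {
            total += deflatedSize((docs[i] + docs[i + 1]).getBytes());
        }
        return total;
    }

    // Returns {grouped, interleaved} compressed sizes for the same six docs.
    static int[] demoSizes() {
        String a = randomText(1, 400); // "group A" base document
        String b = randomText(2, 400); // "group B" base document
        String[] grouped = {
            a + "v1", a + "v2", a + "v3", b + "v1", b + "v2", b + "v3"
        };
        String[] interleaved = {
            a + "v1", b + "v1", a + "v2", b + "v2", a + "v3", b + "v3"
        };
        return new int[] { chunkedSize(grouped), chunkedSize(interleaved) };
    }

    public static void main(String[] args) {
        int[] s = demoSizes();
        System.out.println("grouped=" + s[0] + " interleaved=" + s[1]);
    }
}
```

With grouped order, two of the three chunks contain near-identical docs and compress to roughly one doc's size; interleaved, every chunk holds two unrelated docs and compresses hardly at all. The same effect is why the "write the exact same document N times" experiment below compresses so well.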
My base question is why you'd care about compressing 500G. Disk space is so cheap that the expense of trying to control this dwarfs any imaginable $avings, unless you're talking about a lot of 500G indexes. In other words this seems like an XY problem: you're asking about compressing when you're really concerned with something else.

Best,
Erick

On Tue, Nov 15, 2016 at 5:32 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> I have a large index (say 500GB) with a large percentage of near-duplicate
> documents.
>
> I have to keep the documents there (can't delete them) as the metadata is
> important.
>
> Is it possible to get the documents to be contiguous somehow?
>
> Once they are contiguous then they will compress very well - which I've
> already confirmed by writing the exact same document N times.
>
> Ideally I could use two fields and have a unique document ID but then a
> group_id so that they can be located on disk by the group_id... but I don't
> think this is possible.
>
> Can I just create a synthetic "id" field for this and assume that "id" is
> ordered on disk in the lucene index?
>
> --
>
> We're hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org