Logically, if you tokenize and put the results in a multivalued field, you should be able to get all the values back in sequence?

On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:
> Hello,
>
> If I use very basic tokenizers, e.g. space based and no filters, can I
> reconstruct the text from the tokenized form?
>
> So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>
> I know we store enough information, but I don't know internal API
> enough to know what I should be looking at for reconstruction
> algorithm.
>
> Any hints?
>
> The XY problem is that I want to store large amount of very repeatable
> text into Solr. I want the index to be as small as possible, so
> thought if I just pre-tokenized, my dictionary will be quite small.
> And I will be reconstructing some final form anyway.
>
> The other option is to just use compressed fields on stored field, but
> I assume that does not take cross-document efficiencies into account.
> And, it will be a read-only index after build, so I don't care about
> updates messing things up.
>
> Regards,
> Alex
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
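
To make the round trip concrete, here is a minimal sketch in plain Python. It is not Solr or Lucene API; the names build_index and reconstruct are made up for illustration. It assumes whitespace-only tokenization with no filters, and it shows one term dictionary shared across documents, with each document kept as an ordered list of term ids.

```python
def build_index(docs):
    """Whitespace-tokenize each document against one shared dictionary.

    Hypothetical helper for illustration only, not a Solr/Lucene call.
    Returns (terms, encoded): terms[i] is the token with id i, and
    encoded[d] is the ordered list of term ids for document d.
    """
    term_ids = {}   # token -> id, shared by all documents
    terms = []      # id -> token
    encoded = []
    for text in docs:
        ids = []
        for token in text.split():
            if token not in term_ids:
                term_ids[token] = len(terms)
                terms.append(token)
            ids.append(term_ids[token])
        encoded.append(ids)
    return terms, encoded


def reconstruct(terms, ids):
    """Rebuild the text by joining the tokens in their stored order."""
    return " ".join(terms[i] for i in ids)


terms, encoded = build_index(["This is a test", "This is a second test"])
print(reconstruct(terms, encoded[0]))  # This is a test
```

Note that the reconstruction is exact only when the original tokens were separated by single spaces. Runs of spaces, tabs, and newlines are lost at tokenization time, so they would have to be stored separately if they matter.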