Re: Can I reconstruct text from tokens?

Ramkumar R. Aiyengar Wed, 16 Apr 2014 09:00:06 -0700

Logically, if you tokenize the text and put the tokens in a multivalued
field, you should be able to get all the values back in sequence?
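
A rough sketch of the round trip, in plain Python rather than any Solr or
Lucene API. The ordered list stands in for a multivalued field, and the
function names are made up for illustration. It assumes tokens are separated
by single spaces; a whitespace tokenizer throws away the exact separators, so
runs of spaces, tabs or newlines would come back as a single space.

```python
def tokenize(text):
    """Split on whitespace, like a basic whitespace tokenizer with no filters."""
    return text.split()


def reconstruct(tokens):
    """Join the tokens back in their stored order, one space between each."""
    return " ".join(tokens)


# The list plays the role of the multivalued field: one value per token,
# kept in insertion order.
tokens = tokenize("This is a test")
restored = reconstruct(tokens)
```

So the reconstruction is exact only as long as nothing but the order and the
token text matters; anything the tokenizer drops is gone.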
On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:


> Hello,
>
> If I use very basic tokenizers, e.g. whitespace-based with no filters,
> can I reconstruct the text from the tokenized form?
>
> So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>
> I know we store enough information, but I don't know the internal API
> well enough to know what I should be looking at for a reconstruction
> algorithm.
>
> Any hints?
>
> The XY problem is that I want to store a large amount of highly
> repetitive text in Solr. I want the index to be as small as possible,
> so I thought that if I just pre-tokenized, my dictionary would be
> quite small. And I will be reconstructing some final form anyway.
>
> The other option is to just use compression on the stored field, but
> I assume that does not take cross-document efficiencies into account.
> And, it will be a read-only index after build, so I don't care about
> updates messing things up.
>
> Regards,
>    Alex
>
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
