Re: Store byte array in StoredField using zlib compression

Prashant Saxena Sat, 26 Oct 2024 05:50:45 -0700

I just need to store compressed strings to save space. If it can be done in
any other way, I'm OK with that.



On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:

>
> On Sat, 26 Oct 2024, Prashant Saxena wrote:
>
> > PyLucene 10.0.0
> >
> > I'm trying to store a long text by compressing it first using zlib
> >
> > *doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))*
> >
> > The resulting index size is *~83 MB*. When reading its value back using
> >
> > *c = doc.getBinaryValue("contents")*
> >
> > It's returning 'NoneType' and when using
> >
> > *c = doc.get("contents")*
> >
> > It's returning a string which cannot be decompressed.
> >
> > When using
> >
> > *doc.add(StoredField("contents", JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
> >
> > The resulting index size is *~160 MB*. There is no problem in getting its
> > value using
> >
> >
> >
> > *c = doc.getBinaryValue("contents")*
> > *cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
> >
> > *Question 1:* Why does the index size almost double when using JArray?
>
> Because the value you're passing is actually processed correctly?
>
> > *Question 2:* How do you correctly create and store compressed binary
> > data in StoredField?
>
> If you want a Python bytes object, like b'abcd', to be seen by Lucene (Java)
> as a byte array, you should wrap it with a JArray('byte') like you did.
> Otherwise, it's seen as a string (I need to double-check) and not handled
> correctly.
>
> > I am using PyLucene in my current project. Please advise me if I should
> > post my questions on the java-user list instead of here.
>
> This particular question is specific to PyLucene and should be asked here,
> like you did ;-)
>
> Andi..
>
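For reference, here is a minimal sketch of the compress/decompress round trip
discussed in the thread, using only the Python standard library. The helper
names compress_text and decompress_text are hypothetical, made up for
illustration. The PyLucene calls appear only as comments, copied from the
messages above, because they need a running JVM.

```python
import zlib


def compress_text(text: str, level: int = 6) -> bytes:
    """Encode text as UTF-8 and compress it with zlib (hypothetical helper)."""
    return zlib.compress(text.encode("utf-8"), level)


def decompress_text(blob: bytes) -> str:
    """Reverse compress_text: inflate with zlib, then decode UTF-8."""
    return zlib.decompress(blob).decode("utf-8")


# Repetitive sample text, chosen only so the size reduction is visible.
sample = "a long text " * 1000
blob = compress_text(sample)

# The round trip must give back exactly the original string.
assert decompress_text(blob) == sample
print(len(sample.encode("utf-8")), "->", len(blob), "bytes")

# With PyLucene (not executed here), following the calls quoted in the thread:
#   doc.add(StoredField("contents", JArray('byte')(blob)))
#   c = doc.getBinaryValue("contents")
#   text = decompress_text(c.bytes.bytes_)
```

The key point from the thread is that the compressed bytes object is wrapped in
JArray('byte') before being handed to StoredField, so that it is stored and
read back as binary data rather than as a string.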
