I just need to store compressed strings to save space. If it can be done in any other way, I'm OK with that.
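
For reference, here is a minimal sketch of the round trip described in the quoted message below. It assumes PyLucene is installed and the JVM has already been started with lucene.initVM(). The helper names pack, unpack, add_compressed and read_compressed are hypothetical and not part of PyLucene; the StoredField, JArray('byte') and getBinaryValue calls are the ones used in the original post.

```python
import zlib


def pack(text: str) -> bytes:
    """Compress a string into zlib-compressed UTF-8 bytes."""
    return zlib.compress(text.encode("utf-8"))


def unpack(blob: bytes) -> str:
    """Inverse of pack(): decompress and decode back to a string."""
    return zlib.decompress(blob).decode("utf-8")


def add_compressed(doc, name: str, text: str) -> None:
    """Store compressed text in a Lucene Document as a binary stored field.

    Assumption: PyLucene is importable and lucene.initVM() was already called.
    The imports are done lazily so pack()/unpack() work without PyLucene.
    """
    from lucene import JArray
    from org.apache.lucene.document import StoredField

    # Wrap the Python bytes in JArray('byte') so Lucene receives a Java
    # byte[]; this is the form that worked in the original post.
    doc.add(StoredField(name, JArray('byte')(pack(text))))


def read_compressed(doc, name: str) -> str:
    """Read the field back and decompress it.

    The accessor chain (.bytes.bytes_) is the one from the original post.
    """
    ref = doc.getBinaryValue(name)
    return unpack(ref.bytes.bytes_)
```

Keeping the compression helpers separate from the Lucene calls makes it possible to check the zlib round trip on its own, without an index.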

On Sat, Oct 26, 2024 at 6:11 PM Andi Vajda <va...@apache.org> wrote:

> On Sat, 26 Oct 2024, Prashant Saxena wrote:
>
> > PyLucene 10.0.0
> >
> > I'm trying to store a long text by compressing it first using zlib
> >
> > *doc.add(StoredField("contents", zlib.compress(ftext.encode('utf-8'))))*
> >
> > The resulting index size is *~83 MB*. When reading its value back using
> >
> > *c = doc.getBinaryValue("contents")*
> >
> > It's returning 'NoneType' and when using
> >
> > *c = doc.get("contents")*
> >
> > It's returning a string which cannot be decompressed.
> >
> > When using
> >
> > *doc.add(StoredField("contents",
> > JArray('byte')(zlib.compress(ftext.encode('utf-8')))))*
> >
> > The resulting index size is *~160 MB*. There is no problem in getting its
> > value using
> >
> > *c = doc.getBinaryValue("contents")*
> > *cc = zlib.decompress(c.bytes.bytes_).decode('utf-8')*
> >
> > *Question 1:* Why does the index size almost double when using JArray?
>
> Because the value you're passing is actually processed correctly?
>
> > *Question 2:* How do you correctly create and store compressed binary data
> > in StoredField?
>
> If you want a Python bytes object, like b'abcd', to be seen by Lucene (Java)
> as a byte array, you should wrap it with a JArray('byte') like you did.
> Otherwise, it's seen as a string (I need to double-check) and not handled
> correctly.
>
> > I am using PyLucene in my current project. Please advise me if I should
> > post my questions on the java-user list instead of here.
>
> This particular question is specific to PyLucene and should be asked here,
> like you did ;-)
>
> Andi..