On 11.08.2016, at 19:43, Marshall Schor <[email protected]> wrote:
>
> I'm working on this now.
>
> I note that the new load(InputStream, CasMgrSerializer, CAS, boolean) method is
> "public". Is there some code (perhaps in DKPro) that needs this form?
>
> If not, I'll remove this method and make the reading to create the
> CasMgrSerializer "lazy" - not done until needed.
Yep, I need something like that in DKPro.
When the type system information (TSI) is stored outside the binary CAS in a
separate file, that TSI file would otherwise have to be re-read for every CAS
file. Being able to pass the CasMgrSerializer to load() allows me to read it
only once.
> Not sure about zipping the type system - we have 3 choices, perhaps:
> 1) nothing, 2) zip, 3) custom compression zip (like the rest of form 6).
>
> I'm leaning toward doing this work (if ever done) later.
I've been putting that off since implementing the BinaryCasReader/Writer :)
Probably doesn't hurt if it gets put off a bit further.
I had a quick look at the CasMgrSerializer - you called it highly inefficient.
It doesn't look that inefficient. At least it uses primitive and String arrays
and not collections :)
> ================
>
> I have one more question - there's a comment which I don't see implemented -
> which says that when a set of deserializations is being done with the same
> type system, the extra work to handle the type system is only done once:
>
> * This method avoids the repeated loading of the typesystem and index definitions
> * from a stream when loading many CASes in a row.
>
> How do you think that should be implemented?
Well, that's happening when I read the CasMgrSerializer from a separate file - as
explained above:

    casMgr = read(casMgrFile)
    for (file in directory) {
        load(file, casMgr, CAS, boolean)
    }
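
To make the read-once pattern concrete, here is a minimal self-contained sketch in plain Java. It does not use the UIMA API: TypeSystemInfo, readTypeSystemInfo() and the toy file formats are hypothetical stand-ins for the CasMgrSerializer and the real load() method, chosen only so that the example runs on its own. The counter shows that the type system data is parsed once, no matter how many CAS files follow.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReadOnceSketch {

    /** Hypothetical stand-in for the parsed type system info (the CasMgrSerializer role). */
    public static final class TypeSystemInfo {
        final List<String> typeNames;

        TypeSystemInfo(List<String> typeNames) {
            this.typeNames = typeNames;
        }
    }

    /** Counts how often the type system stream is parsed. */
    public static int tsiReads = 0;

    /** Toy TSI format (an assumption for this sketch): one type name per line. */
    public static TypeSystemInfo readTypeSystemInfo(InputStream in) throws IOException {
        tsiReads++;
        List<String> names = new ArrayList<>();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
        return new TypeSystemInfo(names);
    }

    /** Toy CAS format (an assumption): each byte is an index into the type names. */
    public static List<String> load(InputStream casData, TypeSystemInfo tsi) throws IOException {
        List<String> resolved = new ArrayList<>();
        int code;
        while ((code = casData.read()) != -1) {
            resolved.add(tsi.typeNames.get(code));
        }
        return resolved;
    }

    /** Parses the type system info once, then reuses it for every CAS payload. */
    public static List<List<String>> loadAll(byte[] tsiBytes, List<byte[]> casFiles)
            throws IOException {
        TypeSystemInfo tsi = readTypeSystemInfo(new ByteArrayInputStream(tsiBytes));
        List<List<String>> result = new ArrayList<>();
        for (byte[] casFile : casFiles) {
            result.add(load(new ByteArrayInputStream(casFile), tsi));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] tsi = "Token\nSentence".getBytes(StandardCharsets.UTF_8);
        List<byte[]> files = Arrays.asList(new byte[] {0, 1}, new byte[] {1});
        // prints: [[Token, Sentence], [Sentence]] tsiReads=1
        System.out.println(loadAll(tsi, files) + " tsiReads=" + tsiReads);
    }
}
```

In the real code the caller would hold on to the deserialized CasMgrSerializer in the same way and hand it to each load() call; if load() only accepted streams, the separate TSI file would be parsed again for every CAS file.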
Cheers,
-- Richard