In my ongoing quest to create a PDF parser in Perl6, I have some
Rakudo/PGE/Parrot questions. These are low-urgency, and some of what
I'm asking about may not be implemented yet...
1) byte orientation
PDF's syntax is inherently an 8-bit ASCII superset. Some subsections
may be interpreted as some multi-byte encoding or even binary, but
low-level parsers can safely work solely in the string-as-byte-array
domain.
How do I make a grammar work on bytes instead of chars? Is that a
property of the $.target string?
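To illustrate the byte-versus-char distinction, here is a minimal sketch in Python (an analogy only, not Perl6 or PGE code). The sample data and the names OBJ_HEADER, find_object_offset and decodes_as_utf8 are made up for this example. It shows a pattern matched directly against raw bytes, so the reported position is a byte offset, while the same buffer cannot even be decoded as UTF-8 text.

```python
import re

# Made-up PDF fragment: a header, a comment line holding four high-bit
# bytes (as many PDF files have), then one object.
data = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< >>\nendobj\n"

# A bytes pattern (rb"...") matches against bytes, not decoded characters.
OBJ_HEADER = re.compile(rb"(\d+) (\d+) obj")


def find_object_offset(buf):
    """Return the byte offset of the first 'N G obj' header, or None."""
    m = OBJ_HEADER.search(buf)
    return m.start() if m else None


def decodes_as_utf8(buf):
    """Report whether the buffer is valid UTF-8 text."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


offset = find_object_offset(data)
print(offset, data[offset:offset + 7], decodes_as_utf8(data))
```

Offsets stored in a PDF cross-reference table are byte offsets, which is why the match position has to be counted in bytes for this kind of parsing.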
2) file as lazy string
PDF files are largely random access, but individual segments have
arbitrary lengths. Rather than slurping in the whole file or
guessing at segment lengths, I'd like to emulate a string via a
wrapper around a seekable file, and then apply my grammar to that
fake string. I think I can accomplish this by subclassing PGE::Match
and overriding new(), text() and item() appropriately. text() would
seek to appropriate locations in the file and buffer chunks at a
time. From there, I could substr the desired passages.
Does anyone know of any implementation details that would make this
lazy-string approach work or fail? Has anyone tried this?
It seems like the runtime/parrot/library/Stream classes parallel what
I want to accomplish.
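The lazy-string idea can be sketched roughly as follows, in Python rather than Perl6, and independent of PGE::Match. The class name LazyFileString, its substr() method and the chunk size are all hypothetical, chosen only for illustration. The wrapper seeks into the file on demand, buffers fixed-size chunks, and cuts the requested passage out of the buffered data.

```python
import io
import os


class LazyFileString:
    """Read-only, string-like view over a seekable binary file (hypothetical)."""

    def __init__(self, fh, chunk_size=4096):
        self.fh = fh
        self.chunk_size = chunk_size
        self.chunks = {}  # chunk index -> bytes, filled only on demand
        self.length = fh.seek(0, os.SEEK_END)  # seek returns the new position

    def __len__(self):
        return self.length

    def _chunk(self, index):
        # Read a chunk from the file the first time it is needed.
        if index not in self.chunks:
            self.fh.seek(index * self.chunk_size)
            self.chunks[index] = self.fh.read(self.chunk_size)
        return self.chunks[index]

    def substr(self, offset, count):
        """Return up to `count` bytes starting at byte `offset`."""
        if offset < 0 or count < 0:
            raise ValueError("offset and count must be non-negative")
        end = min(offset + count, self.length)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        joined = b"".join(self._chunk(i) for i in range(first, last + 1))
        start = offset - first * self.chunk_size
        return joined[start:start + (end - offset)]


# Usage with an in-memory stand-in for a real file opened in binary mode.
sample = b"%PDF-1.4\n" + b"x" * 100 + b"\nstartxref\n42\n%%EOF\n"
fake = LazyFileString(io.BytesIO(sample), chunk_size=16)
print(fake.substr(0, 8))
```

This sketch never evicts chunks, so a long-running parse would eventually hold the whole file in memory. It also does not answer the open question of whether a grammar engine can be pointed at such an object instead of a real string.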
3) gzip
Has anyone worked on a zlib interface?
Thanks,
Chris