On Jun 20, 2006, at 11:29 PM, David Balmain wrote:
This seems like a lot to bundle to me, when like you said, it will
probably be available on all platforms that Lucy might target. I don't
see the problem with calling back to the native API. We are going to
have to provide call-backs for things like memory allocation and
exception handling so I don't think an extra inflate and deflate
callback is going to hurt. But if you feel strongly about this I'm not
too fussed.
I could be persuaded. Callbacks to Perl from C are verbose and kind
of hard to get your head around, so I have something of an
instinctive aversion to them.
Here's a wrapper for compress() in Perl:
    use Compress::Zlib qw( compress );

    sub compress_it {
        my $input = shift;
        return compress($input);
    }
... and here's the equivalent function rendered in XS, calling back
to Perl...
    SV*
    compress_it(SV *input)
    {
        SV *retval;
        int count;
        dSP;                                     /* declare stack pointer */
        ENTER;                                   /* opening bracket for a callback */
        SAVETMPS;                                /* opening bracket for temporaries */
        PUSHMARK(SP);                            /* start arg stack */
        XPUSHs( sv_2mortal( newSVsv(input) ) );  /* pass copy of input */
        PUTBACK;                                 /* close arg stack */
        count = call_pv("Compress::Zlib::compress", G_SCALAR); /* invoke compress() */
        SPAGAIN;                                 /* refresh local stack pointer */
        if (count != 1)
            croak("compress() returned %d values", count);
        retval = newSVsv(POPs);                  /* copy result before temps are freed */
        PUTBACK;                                 /* sync global stack pointer */
        FREETMPS;                                /* closing bracket for temporaries */
        LEAVE;                                   /* closing bracket for a callback */
        return sv_2mortal(retval);               /* mortalize in the caller's scope */
    }
Untested. ;)
Larry Wall apparently has said that XS wasn't any simpler because it
had to be efficient...
The Snowball stemmers are also available via CPAN; I now maintain
that distribution (Lingua::Stem::Snowball). However, other platforms
probably won't have something like that available, and even within
the Perl world, bundling Snowball means greater flexibility with
regard to how Lucy interacts with it.
This I agree with. I've bundled it with Ferret. I've also bundled the
lists of stopwords from http://snowball.tartarus.org/.
Yeah, I didn't mention that, but same thing: we should bundle it.
And same thing, I maintain the CPAN distro Lingua::StopWords, but
other targets wouldn't have access to something like that.
Do you plan on
doing any other analysis at the C level or do you just want to make
the Snowball parser available in the target API?
I think we'll want to render TokenBatch in C and make the Snowball
Stemmer able to act on the TokenBatch's member strings directly. If
I have to call back to Perl, I'll have to wrap token text in a Perl
scalar then recover it back into the TokenBatch, and it won't be as
efficient -- more copy ops.
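
To make the "act on member strings directly" idea concrete, here is a
rough sketch in C. Everything in it is hypothetical: the Token and
TokenBatch layouts, the function names, and especially toy_stem(),
which merely strips a trailing "s" and stands in for a real Snowball
call. The point is only that the transform rewrites each token's
buffer in place, with no round trip through a Perl scalar.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout: each token owns a length-counted text buffer. */
typedef struct Token {
    char   *text;
    size_t  len;
} Token;

typedef struct TokenBatch {
    Token  *tokens;
    size_t  size;   /* number of tokens in use */
    size_t  cap;    /* number of tokens allocated */
} TokenBatch;

/* Append a copy of `text`. Returns 0 on success, -1 if allocation fails. */
int
TokenBatch_append(TokenBatch *batch, const char *text, size_t len)
{
    char *copy;

    if (batch->size == batch->cap) {
        size_t  new_cap = batch->cap ? batch->cap * 2 : 8;
        Token  *grown   = realloc(batch->tokens, new_cap * sizeof(Token));
        if (grown == NULL) return -1;
        batch->tokens = grown;
        batch->cap    = new_cap;
    }
    copy = malloc(len + 1);
    if (copy == NULL) return -1;
    memcpy(copy, text, len);
    copy[len] = '\0';
    batch->tokens[batch->size].text = copy;
    batch->tokens[batch->size].len  = len;
    batch->size++;
    return 0;
}

/* Stand-in for a stemmer (NOT Snowball): drop one trailing 's' in place. */
void
toy_stem(Token *token)
{
    if (token->len > 1 && token->text[token->len - 1] == 's') {
        token->len--;
        token->text[token->len] = '\0';
    }
}

/* Run the stand-in stemmer over every token without copying any text. */
void
TokenBatch_stem_all(TokenBatch *batch)
{
    size_t i;

    assert(batch != NULL);
    for (i = 0; i < batch->size; i++) {
        toy_stem(&batch->tokens[i]);
    }
}

/* Free every token's text and the token array itself. */
void
TokenBatch_destroy(TokenBatch *batch)
{
    size_t i;

    for (i = 0; i < batch->size; i++) {
        free(batch->tokens[i].text);
    }
    free(batch->tokens);
    batch->tokens = NULL;
    batch->size   = 0;
    batch->cap    = 0;
}
```

Start from a zeroed TokenBatch, call TokenBatch_append() once per
token, then TokenBatch_stem_all(), and finish with
TokenBatch_destroy().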
I'm not sure how the other Analyzers will be implemented.
We need vsnprintf for formatting error messages, which may include
user-controllable input and which are therefore ripe for buffer
overflow attack. There are many variants available -- see
<http://www.ijs.si/software/snprintf/> for links to a few (some are
outdated). We may be able to derive something from APR's
implementation if we can't find one with a compatible license we can
just bundle and #include.
I think I'd rather derive something from APR's implementation.
OK, that's cool. This is another example of something that Perl
supplied so I didn't have to.
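
For illustration, this is roughly the kind of bounded formatting at
stake. It is a sketch only: format_error() is a made-up name, and the
code assumes a C99-conformant vsnprintf (one that returns the length
the full output would have had), which is precisely the behavior a
bundled or APR-derived implementation would need to guarantee.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format an error message into a fixed-size buffer.
 * Returns 0 if the message fit, 1 if it was truncated, -1 on a
 * formatting error. The buffer is always NUL-terminated on return.
 * Assumes C99 vsnprintf semantics. */
int
format_error(char *buf, size_t buf_size, const char *pattern, ...)
{
    va_list args;
    int     needed;

    assert(buf != NULL && buf_size > 0);

    va_start(args, pattern);
    needed = vsnprintf(buf, buf_size, pattern, args);
    va_end(args);

    if (needed < 0) {
        buf[0] = '\0';   /* leave a valid empty string behind */
        return -1;
    }
    /* `needed` excludes the terminating NUL, so >= means truncation. */
    return (size_t)needed >= buf_size ? 1 : 0;
}
```

User-controllable text should only ever be passed as an argument, as
in format_error(buf, sizeof(buf), "bad field: %s", user_text), and
never as the pattern itself.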
I've done all these before bar the external sort.
The external sort, I have nailed. I've been honing that sucker for a
while now.
What is the byte buffer for in particular?
KinoSearch does a lot of serialization and deserialization. It's
really handy to have a string that knows its own length when you're
doing stuff like concat and truncate ops all the time.
The external sorter takes ByteBuffers as its args. Without that it
would have to take char* and a string length at the same time. It
would get really messy. Think qsort with strings that may contain
null bytes.
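
A minimal sketch of that last point, using a hypothetical ByteBuf
struct rather than KinoSearch's actual ByteBuffer: because each
element carries its own length, the comparison routine can use memcmp
and still order strings correctly past an embedded null byte, where
strcmp would stop early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a ByteBuffer: a pointer plus an explicit
 * length, so embedded null bytes are ordinary data. */
typedef struct ByteBuf {
    const char *ptr;
    size_t      len;
} ByteBuf;

/* qsort-compatible comparison: bytewise over the shared prefix, then
 * the shorter buffer sorts first. */
int
ByteBuf_compare(const void *va, const void *vb)
{
    const ByteBuf *a       = (const ByteBuf *)va;
    const ByteBuf *b       = (const ByteBuf *)vb;
    size_t         min_len = a->len < b->len ? a->len : b->len;
    int            cmp     = min_len ? memcmp(a->ptr, b->ptr, min_len) : 0;

    if (cmp != 0)        return cmp;
    if (a->len < b->len) return -1;
    if (a->len > b->len) return 1;
    return 0;
}

/* Sort an array of ByteBufs in place. */
void
ByteBuf_sort(ByteBuf *bufs, size_t count)
{
    assert(bufs != NULL || count == 0);
    if (count > 1) {
        qsort(bufs, count, sizeof(ByteBuf), ByteBuf_compare);
    }
}
```

With plain char* arguments, the caller would have to thread a separate
length through every comparison; here the length travels with the
data.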
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/