hi all.
i've been programming in perl for a few years, but i'm definitely no
expert. the beginner's list seemed like the best place to ask for
help on this program/solution i'm working on, so here goes...
i am charged with the task of parsing a (basically) infinite set of
plain text files. certain sets of files have identifiable
similarities (which i refer to as 'types'), but often within those
sets there are small variations (which i refer to as 'versions of a
type'). most of the files, however, take on this general form:
[start of file]
[data that is treated as one discrete 'entry']
[string that occurs at the end of each entry, or 'delimiter']
[entry]
[delimiter]
[entry]
[delimiter]
...
[end of file]
that is, a file is comprised of entries. the end goal is to extract a
set of metadata from each entry, and insert the metadata and the
entirety of the entry into a database table. so an entry may look
like this:
[(start of entry)]
[a bunch of stuff i don't care about at the moment]
[a timestamp 't1']
[an ip address]
[a bunch of stuff i don't care about at the moment]
[a few lines above a domain name that I am looking for
a domain name that i am looking for
a few lines after a domain name that I am looking for]
[a bunch of stuff i don't care about]
[a timestamp 't2']
[(end of entry)]
[delimiter]
so in this example the metadata i wish to extract from each entry is
't1, ip, domain_info, t2', and the database table's columns are 't1,
t2, ip, domain_info, entry'. i pull the metadata out of the entry, and
insert a row consisting of the metadata and the entirety of the entry
itself. with me so far? i hope i'm making good sense here, it's all
very clear in my mind but sometimes that's a problem when i try to
communicate it to others :-)
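to make the extraction step concrete, here is a minimal sketch of pulling 't1, ip, domain_info, t2' out of a single entry. the sample entry, the four patterns, and the names %pattern, %meta and %row are all made up for illustration (real entries differ per type and version), and domain_info is simplified to the domain name alone rather than the lines around it:

```perl
use strict;
use warnings;

# hypothetical sample entry; the real layout differs per type/version
my @entry = (
    "stuff i don't care about\n",
    "IP: 192.0.2.10 Date:2008-01-15 10:22:33\n",
    "more stuff\n",
    "lookup for some.domain.com\n",
    "more stuff\n",
    "Closed:2008-01-15 10:25:01\n",
    "----\n",
);

# made-up patterns, one per piece of metadata; each captures into $1
my %pattern = (
    t1          => qr/Date:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})/,
    ip          => qr/IP:\s*(\d{1,3}(?:\.\d{1,3}){3})/,
    domain_info => qr/(some\.domain\.com)/,
    t2          => qr/Closed:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})/,
);

# scan the entry line by line; keep the first match for each name
my %meta;
for my $line (@entry) {
    for my $name (keys %pattern) {
        next if exists $meta{$name};
        $meta{$name} = $1 if $line =~ $pattern{$name};
    }
}

# the row that would go to the database: metadata plus the whole entry
my %row = (%meta, entry => join('', @entry));
print "$_ = $meta{$_}\n" for sort keys %meta;
```

the point is only that each piece of metadata is one named regex with one capture group, and the database row is that metadata plus the untouched entry text.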
so here's what i've been doing so far, in a compacted pseudo-code-ish
form.
## Config is a package that associates metadata names with the regular
## expressions used to find that metadata in an entry, and provides an
## accessor for each expression, e.g. $config->ip, $config->domain_text.
my $config = Config->new('type','version');
## a method call does not interpolate inside a regex, so fetch it first
my $delimiter = $config->delimiter;
my @entry;
while (<FILE>) {
    ## every line, including the delimiter line, is part of the entry
    push @entry, $_;
    if ($_ =~ /$delimiter/) {
        ## Class is a package that finds the metadata in the entry and
        ## inserts the metadata and the entry into the database
        my $object = Class->new(\@entry,$config);
        $object->parse;
        $object->insert;
        @entry = ();
    }
}
this works well enough, but I want to try to do it in a more... I
don't know, professional (?) way, utilizing one of the gazillion
language parsers available on CPAN. the thing is, I don't know
anything about these 'language' or 'lexical' or whatever parsers, and
the implementation of Class is really nothing more than scanning each
entry array for regular expressions. The Config class reads an XML
file that is written for a given 'type and version' of files. the
reasoning behind this is that when i encounter a new version, i can
create a config file for it by modifying another one of the same
type. these config files have a structure sort of like this:
<parser>
<metadata_name>
<expression>[regex to match that metadata into $1]</expression>
<index>[index where I expect to find this bit of metadata,
or * if the whole array needs to be searched]</index>
</metadata_name>
<!-- or a more precise example -->
<t1>
<expression>IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})</expression>
<index>-3</index>
</t1>
<domain_info>
<expression>(some\.domain\.com)</expression>
<index>*</index>
</domain_info>
</parser>
so this file gets read by XML::in in the Config package and that's how
the Class package knows what to do to get the data into the database
in the form it should have. make sense?
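as a rough sketch of how one of those config rules gets applied to an entry: the %rules hash below is a hypothetical stand-in for whatever the Config package returns after reading the XML, and extract() and the sample entry are made up for illustration, not the actual Class implementation:

```perl
use strict;
use warnings;

# hypothetical stand-in for the parsed config file (one rule per name)
my %rules = (
    t1 => {
        expression => 'IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})',
        index      => -3,
    },
    domain_info => {
        expression => '(some\.domain\.com)',
        index      => '*',
    },
);

# apply one rule to an entry (array ref of lines); returns $1 or undef
sub extract {
    my ($rule, $entry) = @_;
    my $expr = $rule->{expression};
    my $re   = qr/$expr/;
    # '*' means search every line; otherwise look only at the given index
    my @lines = $rule->{index} eq '*'
        ? @$entry
        : ($entry->[ $rule->{index} ]);
    for my $line (@lines) {
        next unless defined $line;    # index may fall outside a short entry
        return $1 if $line =~ $re;
    }
    return;
}

# made-up entry: the t1 line sits third from the end (index -3)
my @entry = (
    "header\n",
    "lookup for some.domain.com\n",
    "IP: 192.0.2.10 Date:2008-01-15 10:22:33\n",
    "footer\n",
    "----\n",
);

my %meta;
$meta{$_} = extract($rules{$_}, \@entry) for keys %rules;
print "$_ = $meta{$_}\n" for sort keys %meta;
```

a negative index counts from the end of the entry array, which is why -3 works for metadata that always sits a fixed distance above the delimiter.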
what would you guys recommend to use for something like this? i
perused the CPAN docs for stuff like Parse::Lex and Parse::Yapp, but I don't
know if that's what I'm looking for, or where to find a tutorial
explaining about tokens and all that kind of stuff.
Basically, anyone who has any input about any of this, I'd be really
pumped to hear it.
Thanks,
Neal