hi all.

i've been programming in perl for a few years, but i'm definitely no expert. the beginner's list seemed like the best place to ask for help on this program/solution i'm working on, so here goes...

i am charged with the task of parsing a (basically) infinite set of plain text files. certain sets of files have identifiable similarities (which i refer to as 'types'), but often within those sets there are small variations (which i refer to as 'versions of a type'). most of the files, however, take on this general form:

[start of file]
[data that is treated as one discrete 'entry']
[string that occurs at the end of each entry, or 'delimiter']
[entry]
[delimiter]
[entry]
[delimiter]
...
[end of file]

that is, a file is comprised of entries. the end goal is to extract a set of metadata from each entry, and insert the metadata and the entirety of the entry into a database table. so an entry may look like this:

[(start of entry)]
[a bunch of stuff i don't care about at the moment]
[a timestamp 't1']
[an ip address]
[a bunch of stuff i don't care about at the moment]
[a few lines above a domain name that I am looking for
a domain name that i am looking for
a few lines after a domain name that I am looking for]
[a bunch of stuff i don't care about]
[a timestamp 't2']
[(end of entry)]
[delimiter]

so in this example the metadata i wish to extract from each entry is 't1, ip, domain_info, t2', and the database table's columns are 't1, t2, ip, domain_info, entry'. i pull the metadata out of the entry, and insert a row consisting of the metadata and the entirety of the entry itself. with me so far? i hope i'm making good sense here, it's all very clear in my mind but sometimes that's a problem when i try to communicate it to others :-)
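to make the goal concrete, here's a rough sketch of pulling those four fields out of a single entry. the sample entry lines, the patterns, and the extract_metadata name are all made up for illustration (the real files look different, and domain_info is simplified to just the domain name rather than the surrounding lines):

```perl
use strict;
use warnings;

# hypothetical entry; real entries have a different layout
my @entry = (
    "some header line\n",
    "Start:2005-03-01 10:15:00\n",
    "IP:192.0.2.10\n",
    "filler i don't care about\n",
    "lookup for some.domain.com failed\n",
    "more filler\n",
    "End:2005-03-01 10:15:42\n",
);

# made-up patterns, each capturing the value it finds into $1
my %pattern = (
    t1          => qr/^Start:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})/,
    ip          => qr/^IP:(\d{1,3}(?:\.\d{1,3}){3})/,
    domain_info => qr/(some\.domain\.com)/,
    t2          => qr/^End:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})/,
);

# returns a hash ref shaped like one database row:
# the metadata fields plus the whole entry as a single string
sub extract_metadata {
    my ($lines, $patterns) = @_;
    my %row = (entry => join('', @$lines));
    for my $name (keys %$patterns) {
        for my $line (@$lines) {
            if ($line =~ $patterns->{$name}) {
                $row{$name} = $1;
                last;    # first matching line wins
            }
        }
    }
    return \%row;
}

my $row = extract_metadata(\@entry, \%pattern);
print "$_ => $row->{$_}\n" for qw(t1 t2 ip domain_info);
```

the returned hash has one key per table column, so in principle it could be handed straight to whatever does the insert.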

so here's what i've been doing so far, in a compacted pseudo-code-ish form.

## (where Config is a package that associates metadata names with regular
## expressions to find that metadata in an entry, and a set of accessors
## for the regular expressions used to extract each piece of metadata... i.e.
## $config->ip, $config->domain_text, etc.)
my $config = Config->new('type','version');

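## method calls don't interpolate inside a regex, so fetch the pattern first
my $delimiter = $config->delimiter;

my @entry;    ## start empty ('= undef' would leave one undef element in it)
while (my $line = <FILE>) {
        ## every line belongs to the current entry, delimiter line included
        push @entry, $line;
        next unless $line =~ /$delimiter/;

        ## (where Class is a package that finds the metadata from the entry and
        ## inserts the metadata and entry into the database)
        my $object = Class->new(\@entry, $config);

        $object->parse;
        $object->insert;
        @entry = ();
}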

this works well enough, but I want to try to do it in a more... I don't know, professional (?) way, utilizing one of the gazillion language parsers available on CPAN. the thing is, I don't know anything about these 'language' or 'lexical' or whatever parsers, and the implementation of Class is really nothing more than scanning each entry array for regular expressions. The Config class reads an XML file that is written for a given 'type and version' of files. the reasoning behind this is that when i encounter a new version, i can create a config file for it by modifying another one of the same type. these config files have a structure sort of like this:

<parser>
        <metadata_name>
                <expression>[regex to match that metadata into $1]</expression>
                <index>[index where I expect to find this bit of metadata,
                        or * if the whole array needs to be searched]</index>
        </metadata_name>

        <!-- or a more precise example -->
        <t1>
                <expression>IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})</expression>
                <index>-3</index>
        </t1>
        
        <domain_info>
                <expression>(some\.domain\.com)</expression>
                <index>*</index>
        </domain_info>
</parser>

so this file gets read by XMLin (from XML::Simple) in the Config package, and that's how the Class package knows what to do to get the data into the database the way it should be. make sense?
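for what it's worth, here's roughly how i picture the expression/index pairs being applied to an entry array. this is only a sketch: the %rules hash is a hand-written stand-in for whatever the XML config yields, and apply_rule and the sample entry are hypothetical names and data, not the actual Config or Class code:

```perl
use strict;
use warnings;

# hand-written stand-in for the parsed config: name => { expression, index }
my %rules = (
    t1 => {
        expression => 'IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})',
        index      => -3,
    },
    domain_info => {
        expression => '(some\.domain\.com)',
        index      => '*',
    },
);

# hypothetical entry, delimiter line included as the last element
my @entry = (
    "stuff i don't care about\n",
    "host some.domain.com seen\n",
    "IP:192.0.2.10 Date:2005-03-01 10:15:00\n",
    "more stuff\n",
    "-----\n",
);

# apply one rule to an entry: look only at the line at 'index',
# or at every line when index is '*'; returns $1 or undef
sub apply_rule {
    my ($rule, $lines) = @_;
    my $re = qr/$rule->{expression}/;
    my $n  = @$lines;

    my @candidates;
    if ($rule->{index} eq '*') {
        @candidates = @$lines;
    } else {
        my $i = $rule->{index};
        return if $i >= $n || $i < -$n;    # index falls outside the entry
        @candidates = ($lines->[$i]);      # negative index counts from the end
    }

    for my $line (@candidates) {
        return $1 if $line =~ $re;
    }
    return;
}

my $t1     = apply_rule($rules{t1},          \@entry);
my $domain = apply_rule($rules{domain_info}, \@entry);
print "t1 => $t1\ndomain_info => $domain\n";
```

the fixed index keeps the search to a single line when the position is known, and '*' falls back to scanning the whole entry.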

what would you guys recommend to use for something like this? i perused the CPAN docs for stuff like Parse::Lex and Parse::Yapp, but I don't know if that's what I'm looking for, or where to find a tutorial explaining tokens and all that kind of stuff.

Basically, anyone who has any input about any of this, I'd be really pumped to hear it.

Thanks,
Neal
