a question about parsing

Neal Clark Fri, 16 Feb 2007 00:30:25 -0800

hi all.

i've been programming in perl for a few years, but i'm definitely no expert. the beginner's list seemed like the best place to ask for help on this program/solution i'm working on, so here goes...

i am charged with the task of parsing a (basically) infinite set of plain text files. certain sets of files have identifiable similarities (which i refer to as 'types'), but often within those sets there are small variations (which i refer to as 'versions of a type'). most of the files, however, take on this general form:


[start of file]
[data that is treated as one discrete 'entry']
[string that occurs at the end of each entry, or 'delimiter']
[entry]
[delimiter]
[entry]
[delimiter]
...
[end of file]

that is, a file is composed of entries. the end goal is to extract a set of metadata from each entry, and insert the metadata and the entirety of the entry into a database table. so an entry may look like this:


[(start of entry)]
[a bunch of stuff i don't care about at the moment]
[a timestamp 't1']
[an ip address]
[a bunch of stuff i don't care about at the moment]
[a few lines above a domain name that I am looking for
a domain name that i am looking for
a few lines after a domain name that I am looking for]
[a bunch of stuff i don't care about]
[a timestamp 't2']
[(end of entry)]
[delimiter]

so in this example the metadata i wish to extract from each entry is 't1, ip, domain_info, t2', and the database table's columns are 't1, t2, ip, domain_info, entry'. i pull the metadata out of the entry, and insert a row consisting of the metadata and the entirety of the entry itself. with me so far? i hope i'm making good sense here, it's all very clear in my mind but sometimes that's a problem when i try to communicate it to others :-)
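to make the extraction step concrete, here is a minimal sketch of pulling 't1, ip, domain_info, t2' out of one entry and building the row that would go into the table. the sample entry text, the patterns in %pattern, and the helper name extract_row are all made up for illustration, and the sketch only captures the domain name itself rather than the lines around it; real entries would need their own expressions.

```perl
use strict;
use warnings;

# hypothetical patterns; each one captures a single piece of metadata into $1
my %pattern = (
    t1          => qr/^Start:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/m,
    ip          => qr/\bIP:\s*(\d{1,3}(?:\.\d{1,3}){3})/,
    domain_info => qr/(some\.domain\.com)/,
    t2          => qr/^End:\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/m,
);

# takes the whole entry as one string and returns a hash ref whose keys
# mirror the table columns: t1, t2, ip, domain_info, entry
sub extract_row {
    my ($entry) = @_;
    my %row = (entry => $entry);
    for my $name (keys %pattern) {
        # a match in list context returns the captures; no match leaves undef
        my ($value) = $entry =~ $pattern{$name};
        $row{$name} = $value;
    }
    return \%row;
}

# invented sample entry, ending with an invented delimiter line
my $entry = join '',
    "Start: 2007-02-16 00:30:25\n",
    "IP: 10.0.0.1\n",
    "lookup for some.domain.com\n",
    "End: 2007-02-16 00:30:27\n",
    "-----\n";

my $row = extract_row($entry);
print "$_ = $row->{$_}\n" for qw(t1 t2 ip domain_info);
```

the hash that comes back has one key per column, so it could be handed more or less directly to whatever does the database insert.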

so here's what i've been doing so far, in a compacted pseudo-code-ish form.

## (where Config is a package that associates metadata names with regular
## expressions to find that metadata in an entry, and a set of accessors
## for the regular expressions used to extract each piece of metadata... i.e.
## $config->ip, $config->domain_text, etc.)
my $config = Config->new('type','version');

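my $delimiter = $config->delimiter;   # a method call won't interpolate inside /.../
my @entry;                            # start empty; '= undef' would store one undef element
while (<FILE>) {
        if ($_ =~ /$delimiter/) {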
                push @entry, $_;

                ## (where Class is a package that finds the metadata from the entry and
                ## inserts the metadata and entry into the database)
                my $object = Class->new(\@entry,$config);

                $object->parse;
                $object->insert;
                @entry = ();
        } else {
                push @entry, $_;
        }
}

this works well enough, but I want to try to do it in a more... I don't know, professional (?) way, utilizing one of the gazillion language parsers available on CPAN. the thing is, I don't know anything about these 'language' or 'lexical' or whatever parsers, and the implementation of Class is really nothing more than scanning each entry array for matches of the regular expressions. The Config class reads an XML file that is written for a given 'type and version' of files. the reasoning behind this is that when i encounter a new version, i can create a config file for it by modifying another one of the same type. these config files have a structure sort of like this:


<parser>
        <metadata_name>
                <expression>[regex to match that metadata into $1]</expression>
                <index>[index where I expect to find this bit of metadata,
                        or * if the whole array needs to be searched]</index>
        </metadata_name>

        <!-- or a more precise example -->
        <t1>
                <expression>IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})</expression>
                <index>-3</index>
        </t1>
        
        <domain_info>
                <expression>(some\.domain\.com)</expression>
                <index>*</index>
        </domain_info>
</parser>

so this file gets read by XML::in in the Config package, and that's how the Class package knows what to do to get the data into the database the way that it should be there. make sense?
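as a rough sketch of how one expression/index pair from a config like the one above might be applied to an entry: this assumes the index is a plain perl array index into the entry's lines (so -3 means the third line from the end) and that * means every line gets searched. find_metadata, the %rules hash, and the sample lines are hypothetical stand-ins, not the real Config or Class code.

```perl
use strict;
use warnings;

# find_metadata is a hypothetical helper, not part of the real Class package.
#   $lines : array ref holding the lines of one entry
#   $rule  : hash ref { expression => regex string, index => integer or '*' }
# returns whatever the expression captured into $1, or undef if nothing matched
sub find_metadata {
    my ($lines, $rule) = @_;
    my $re = qr/$rule->{expression}/;

    # '*' means search every line; otherwise look only at the one index
    # (negative indices count back from the end of the array)
    my @candidates = $rule->{index} eq '*'
        ? @$lines
        : ( $lines->[ $rule->{index} ] );

    for my $line (@candidates) {
        next unless defined $line;      # the index pointed outside the entry
        return $1 if $line =~ $re;
    }
    return;
}

# invented entry, one array element per line, delimiter line last
my @entry = (
    "some header line\n",
    "lookup for some.domain.com\n",
    "IP:10.0.0.1 Date:2007-02-16 00:30:25\n",
    "trailer line\n",
    "=====\n",
);

# the same two rules as in the XML example, written out as a perl hash
my %rules = (
    t1 => {
        expression => 'IP:.*?\sDate:(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})',
        index      => -3,
    },
    domain_info => {
        expression => '(some\.domain\.com)',
        index      => '*',
    },
);

for my $name (sort keys %rules) {
    my $value = find_metadata(\@entry, $rules{$name});
    printf "%s = %s\n", $name, defined $value ? $value : '(not found)';
}
```

keeping the index in the rule means a new 'version' only needs a different number in the config, at the cost of breaking quietly if a file gains or loses a line.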

what would you guys recommend to use for something like this? i perused the CPAN docs for stuff like Parse::Lex and YAPP, but I don't know if that's what I'm looking for, or where to find a tutorial explaining about tokens and all that kind of stuff.

Basically, if anyone has any input about any of this, I'd be really pumped to hear it.


Thanks,
Neal
