Hi Peter,

Thanks for your patient and detailed answers.

I have examined the XML file and all <Topic/>s with <link/> children are followed by their own <ExternalPage/>s. If there is a <Topic/> without a <link/> inside, it is followed by a new <Topic/>. So I think your code is safe enough to use :-).

However, where I'm stuck is: after putting some debug statements into your code, I found that the execution order of these two subroutines is:

<Topic/> -> if it has no <link/> child, go to the next <Topic/> -> if it has a <link/> child, go to its following <ExternalPage/>s -> and so on, until there are no more <Topic/> nodes. I don't know which part of the code controls this order. Is it because of the order of the two handlers you set in twig_handlers, which tells the twig processing to interleave them?

Another thing is:

my ($twig, $child) = @_; <- I know this means that you assign the values you caught to these two variables, but where do these values come from? Are they from 'Topic' => \&_topic_handler? If so, what are they?

Sorry to bother you again,

Nan



From: Peter Rabbitson <[EMAIL PROTECTED]>
To: beginners@perl.org
Subject: Re: Errors on processing 2GB XML file by using XML:Simple
Date: Tue, 17 May 2005 07:40:14 -0500

> Your code looks great and it works perfectly, with only some minor problems
> which might be due to the XML file itself (I think). However, having compared
> your code with mine, there are some things I'd like to ask you if you don't mind.


Not that much :)

> 1) what's the main difference in memory load between setting handlers and
> not setting handlers before calling $parser->parsefile($xml)?
>
> Does it mean that yours actually accesses the XML file partially - the first
> handler only treats <Topic/> and the last handler only considers
> <ExternalPage/>? If so, does setting handlers actually change the way
> the file is loaded?


Take a look at the XML snippet you sent me as sample data. You have a
regular text file with variable data fields (in other words, different
keywords/flags/operators/tags etc. have unpredictable length/size in bytes).
So the only way to read such a file is to go byte by byte and analyze
everything as we go. What you were doing in your code was this byte-by-byte
reading until you hit EOF, keeping everything in memory, which was taking a
correspondingly large amount of memory. Now let's break your example apart (I
am deliberately omitting lots of data and adding some):

<RDF>
        <Topic>
                <1st topic related data>
        </Topic>

        <ExternalPage>
                <1st external page related to 1st topic>
        </ExternalPage>

        <ExternalPage>
                <2nd external page still related to 1st topic>
        </ExternalPage>

        <Topic>
                <2nd topic related data>
        </Topic>

        <ExternalPage>
                <1st external page related to 2nd topic>
        </ExternalPage>

        <Topic>
                <3rd topic>
        </Topic>

        <SomeOtherTag>
                <Some Other Data>
        </SomeOtherTag>

        <Topic>
                <4th topic>
        </Topic>

        <ExternalPage>
                <1st external page related to 4th topic>
        </ExternalPage>

        <ExternalPage>
                <2nd external page still related to 4th topic>
        </ExternalPage>
</RDF>

Then the following parser declaration:

my $parser = XML::Twig->new (   twig_handlers => {
                                        'Topic' => \&_topic_handler,
                                        'ExternalPage' => \&_links_handler,
                                },
                        );

$parser->parse($xml);

simply means:

Start walking through the XML data (variable, file, URL) and keep going
until you see a completed tag (an opening tag followed by an arbitrary amount
of data and then the closing tag). If the tag we just found matches the twig
handler <Topic>...</Topic>, call the subroutine _topic_handler and pass as
arguments the "twig" - in other words this particular parser/tag object - and
the element with all its "children", in other words all subtags between
<Topic> and </Topic>. At the end of each _topic_handler run I took all I
needed from the passed twig, so I can safely throw it away, thus reclaiming
memory: I execute a ->purge. The same goes for _links_handler.
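
To make the mechanics concrete, here is a minimal runnable sketch of that setup (assuming XML::Twig is installed; the tiny inline document and the @seen array are only for illustration). It also shows what the two handler arguments are: XML::Twig calls each handler with the twig object and the element that just closed, and handlers fire in document order, regardless of the order they are listed in twig_handlers.

```perl
use strict;
use warnings;
use XML::Twig;

my @seen;   # records the order in which handlers fire, for illustration

sub _topic_handler {
    my ($twig, $topic) = @_;     # XML::Twig passes the twig object and the element
    push @seen, 'Topic:' . ($topic->att('about') // '');
    $twig->purge;                # discard everything parsed so far, reclaiming memory
}

sub _links_handler {
    my ($twig, $page) = @_;
    push @seen, 'ExternalPage';
    $twig->purge;
}

my $parser = XML::Twig->new(
    twig_handlers => {
        'Topic'        => \&_topic_handler,
        'ExternalPage' => \&_links_handler,
    },
);

# Tiny inline sample standing in for the real file:
$parser->parse('<RDF><Topic about="t1"/><ExternalPage/><Topic about="t2"/></RDF>');
print join(',', @seen), "\n";    # prints Topic:t1,ExternalPage,Topic:t2
```

Note the output order: each handler runs when its tag's closing tag is parsed, so the interleaving you saw in your debug output simply mirrors the document order of the file, not the order of the keys in twig_handlers.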

> 2) My understanding of your code is: first you looked at the <Topic/> nodes
> and checked whether they have <link/> children; if they do, you saved them
> into a hash table for later <ExternalPage/> comparisons. But my question
> is, how are you going to search all <Topic/> and all <ExternalPage/> nodes
> one by one by just calling the subroutine once, without using any kind of
> loop? And how can you link these two handlers together?


It is the $parser->parse($xml) line that creates the loop - it will keep
going as long as there is data in the XML, just like while (<>) will keep
going as long as there is input on STDIN. Every time we see a tag defined in
twig_handlers we call the corresponding subroutine and do whatever we
need to do. Keep in mind that if the parser encounters something that is not
described in the handlers (such as SomeOtherTag between the 3rd and 4th
Topic), it will simply be ignored without occupying any memory. This is why
you can use XML::Twig to process huge files from which you need only a few
scattered tags.

> 3) My original intention was, for each <Topic/> with valid <link/>
> child/children, to open a file in a directory named exactly the same as
> what is found in Topic->att('about'), then write all the link information
> found in <ExternalPage/>, then close the file. However, after reading
> your code time and again, I don't know where I should close the
> filehandle, because sub _links_handler is used for finding the links one
> by one and I don't know when an <ExternalPage/> has finished parsing.


This is exactly why I said: if a <Topic> is not ALWAYS followed by its OWN
<ExternalPage> links, you are screwed. If that is the case you will never
know whether you should expect yet another <ExternalPage> tag somewhere after
1GB of data that refers to a <Topic> at the very beginning of the file. Thus
I assume in my code that when we see another <Topic> we are done with the
previous one; this is why I was completely reassigning %want_links, as I do
not expect any more info pertaining to the previous <Topic>. This is where
you should close your files as well - keep a global variable $last_filename
and close its filehandle at the beginning of each new <Topic>.
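
A sketch of that close-on-the-next-<Topic> idea. The helper name _rotate_file and using att('about') directly as a filename are my illustrative choices, not from the original code (a real 'about' value is usually a URL and would need sanitizing before being used as a filename):

```perl
use strict;
use warnings;

my ($last_fh, $last_filename);

# Close the previous topic's file (if any) and open the next one.
sub _rotate_file {
    my ($name) = @_;
    if ($last_fh) {
        close $last_fh or warn "close $last_filename: $!";
    }
    $last_filename = $name;
    open $last_fh, '>', $last_filename
        or die "open $last_filename: $!";
    return $last_fh;
}

# Inside _topic_handler, a new <Topic> means the previous one is complete:
#
#   sub _topic_handler {
#       my ($twig, $topic) = @_;
#       _rotate_file( $topic->att('about') );
#       # ... refill %want_links as before ...
#       $twig->purge;
#   }
```

One loose end: when $parser->parse() returns, the very last file is still open, so close $last_fh once more after the parse call.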


> Is there any suggestion about this?

In your case you could go a slightly different way, one that will work for
any mixture of <Topic> and <ExternalPage> tags, even if the pages come BEFORE
the topic tag itself:

=== each _topic_handler should:

* See what links are present in the tag and collect them in some accessible
manner. You could stuff them into a hash keyed on the link itself, with
Topic->att('about') as the value, or, if the number of topics does not permit
it (the hash grows out of memory), you can use a DBM file or something similar.


* See if any of the newly collected links are dangling links, and create
files for them (see below). Delete those links from the dangling hash/DB

* Purge the twig we were working on

=== each _links_handler should:

* See if the links it contains are already listed in the hash/database
described above. If they are, create the files as needed and delete the links
from the hash/DB

* If there are links but no <Topic> 'about' reference yet, stuff the data
into a dangling-links hash/database so they can wait until the right <Topic>
is found

* Purge the twig we were working on

If you write everything correctly you will end up with the files you need, a
hash/DB containing all links for which info was missing, and a hash/DB
containing all info for which a topic was missing - which should be all you
will ever want from a script like this.
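
The bookkeeping above can be sketched with two plain hashes. The sub names note_topic/note_page and the write_file placeholder are mine, purely to show the matching logic outside of the handlers; swap the hashes for DBM ties if they outgrow memory:

```perl
use strict;
use warnings;

my %topic_for_link;   # link => 'about' attribute of a <Topic> still waiting for its page
my %dangling;         # link => page data that arrived before its <Topic>
my @written;          # stand-in for actually creating the files

sub write_file {      # placeholder for "open the file and write the link info"
    my ($about, $data) = @_;
    push @written, [ $about, $data ];
}

sub note_topic {      # what each _topic_handler would do
    my ($about, @links) = @_;
    for my $link (@links) {
        if ( exists $dangling{$link} ) {            # the page came first
            write_file( $about, delete $dangling{$link} );
        }
        else {
            $topic_for_link{$link} = $about;        # wait for the page
        }
    }
}

sub note_page {       # what each _links_handler would do
    my ($link, $data) = @_;
    if ( exists $topic_for_link{$link} ) {          # the topic came first
        write_file( delete $topic_for_link{$link}, $data );
    }
    else {
        $dangling{$link} = $data;                   # wait for the topic
    }
}

# Works regardless of which side arrives first:
note_topic( 'Topic_A', 'link1' );
note_page( 'link1', 'page data 1' );   # topic seen first
note_page( 'link2', 'page data 2' );   # page seen first
note_topic( 'Topic_B', 'link2' );
print scalar @written, "\n";           # prints 2
```

Whatever is left in %dangling or %topic_for_link at the end of the run is exactly the "missing info" / "missing topic" leftovers described above.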


However, first examine your file, and if you can determine that ALL <Topic>s
are followed by THEIR OWN <ExternalPage>s, you can safely use what I wrote as
a base.


Peter

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>




