Actually, my client and I do care how fast it is, even if that means more work
and more tests to safeguard accuracy. I did try Nokogiri, which I enjoyed
getting to know, but it also plods in at ~150 seconds, which is simply
unacceptable for someone waiting at a browser. That is what I was trying to
get at in my original post, and I should have provided more data: am I wasting
my time with unrealistic expectations of any XML parser for this task?

Unless anyone can point out a more efficient search (code and example XML
below), it seems practical, in the absence of other ideas, to use a regex at
least to narrow down the candidate records before handing them to an XML
parser for the details. The alternative is to put the data into a db, which I
am trying to avoid.
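
To make the regex idea concrete, here is a rough sketch of what I have in
mind, not something I have benchmarked. The method names candidate_records and
matching_records are made up for illustration. It assumes that <Record>
elements are never nested, that <Last> is written exactly as in the sample
below (no attributes, CDATA, or surrounding whitespace), and that the whole
file is already loaded into a string. REXML only ever sees the few candidate
chunks, so any parser could be swapped in for that step.

```ruby
require 'rexml/document'

# Hypothetical helper: find each <Last>name</Last> in the raw text and cut out
# the <Record>...</Record> that encloses it, without parsing the whole file.
# Caveat: the raw text is XML-escaped, so a name containing "&" would appear
# as "&amp;" in the file and would not match as written here.
def candidate_records(xml_text, last_name)
  needle = %r{<Last>#{Regexp.escape(last_name)}</Last>}
  chunks = []
  pos = 0
  while (hit = xml_text.index(needle, pos))
    start  = xml_text.rindex('<Record>', hit)   # nearest opening tag before the hit
    finish = xml_text.index('</Record>', hit)   # nearest closing tag after the hit
    break if start.nil? || finish.nil?

    finish += '</Record>'.length
    chunks << xml_text[start...finish]
    pos = finish                                # continue after this record
  end
  chunks
end

# Hypothetical helper: parse only the candidate chunks with a real XML parser
# and keep the records whose <First> also matches.
def matching_records(xml_text, last_name, first_name)
  candidate_records(xml_text, last_name)
    .map { |chunk| REXML::Document.new(chunk).root }
    .select { |record| record.elements['First']&.text == first_name }
end
```

It would be called as matching_records(File.read(path), last_name, first_name),
where path is wherever the file lives. The parser still has the final say on
each record, so the regex is only a pre-filter.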

Below, the second statement is what takes forever, understandably:
@gsa_epls_xml_doc = Nokogiri::HTML(doc_xml)
@gsa_epls_xml_doc.xpath("//records/record[last='#{last_name}' and
first='#{first_name}']").each do |possible_match_record| ...


File structure: <Records> containing a lot (65 MB) of <Record> nodes.

<Records>
  <Record>
    <Prefix></Prefix>
    <First>Vr</First>
    <Middle>A</Middle>
    <Last>C</Last>
    <Suffix></Suffix>
    <Classification>Individual</Classification>
    <CTType>Reciprocal</CTType>
    <Addresses>
      <Address>
        <City>R</City>
        <ZIP>11576</ZIP>
        <Province/>
        <State>NY</State>
        <DUNS/>
      </Address>
    </Addresses>
    <References/>
    <Actions>
     <Action>
       <ActionDate>22-Apr-2004</ActionDate>
       <TermDate>Indef.</TermDate>
       <CTCode>Z2</CTCode>
       <AgencyComponent>OPM</AgencyComponent>
     </Action>
     <Action>
       <ActionDate>19-Feb-2004</ActionDate>
       <TermDate>Indef.</TermDate>
       <CTCode>Z1</CTCode>
       <AgencyComponent>HHS</AgencyComponent>
     </Action>
    </Actions>
    <Description/>
    <AgencyIdentifiers/>
  </Record>
   .
   .
   .
   n
</Records>




On Thu, Aug 5, 2010 at 11:55 AM, Marnen Laibow-Koser
<[email protected]> wrote:

> David Kahn wrote:
> > Got a question hopefully someone can answer -
> >
> > I am working on functionality to match on certain nodes of a largish
> > (65mb) xml file. I implemented this with REXML and was 2 minutes and
> > counting before I killed the process. After this, I just opened the
> > console and loaded the file into a string and did a regex search for my
> > data -- the result was almost instantaneous.
> >
> > The question is, if I can get away with it, am I better off just going
> > the regex route, or is it really worth my while to investigate a faster
> > XML parser (I know REXML is notorious for being slow,
>
> Then why the heck are you even bringing it up in this situation?  I
> *think* Nokogiri is supposed to be much faster.
>
> > but given how fast it was to call a regex on the file, I am thinking
> > that this will still be faster than all parsers).
>
> Who cares how fast it is if it's inaccurate?  Regular expressions are
> the wrong tool for parsing XML, because they can't cope easily (or at
> all) with lots of valid XML constructs.  If you're parsing XML, use an
> actual XML parser, or you risk serious errors.
>
> >
> > Any comments or suggestions appreciated.
> >
> > David
>
> Best,
> --
> Marnen Laibow-Koser
> http://www.marnen.org
> [email protected]
> --
> Posted via http://www.ruby-forum.com/.
>
> --
> You received this message because you are subscribed to the Google Groups
> "Ruby on Rails: Talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/rubyonrails-talk?hl=en.
>
>
