On Sun, Dec 16, 2012 at 12:10 AM, Paul Mena <[email protected]> wrote:
> I'm a Ruby Newbie trying to write a program to process thousands of HTML
> files, extracting pertinent text and inserting it into a MySQL database.
> Ruby seems ideally suited to the task in general, and I've already used
> Nokogiri to extract comment text. What I need to do next is to print -
> and then ultimately delete or strip - the text between "pre" tags.
>
> Picture some html like this:
>
> <html>
> <head>
> <title>My Title</title>
> </head>
> <body>
> <h1>My Heading</h1>
> <strong>From:</strong>Me<br>
> <strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
> <!-- body="start" -->
> <p>
> text line 1
> <br>
> text line 2
> <br>
> text line 3
> <br>
> <p><pre>
> very important text
> more important text
> would you believe even more important text?
> </pre>
> <p><!-- body="end" -->
> </body>
> </html>
>
> I basically need to do 2 things: 1) to print only the text between the 2
> "pre" tags, and then 2) to print all of the non-tagged text between the
> "body" comments - minus the text between the "pre" tags. I've been
> messing with this for a couple of hours - unsuccessfully - but I'm still
> convinced that this is the right tool for the job.

If you need to do more HTML and XML manipulation, learning XPath is a
good investment! You can look here for a start:
http://www.w3schools.com/Xpath/default.asp

_One_ way to achieve what you want:

require 'nokogiri'

text = <<HTML
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Heading</h1>
<strong>From:</strong>Me<br>
<strong>Date:</strong> Wed Dec 05 2012 - 18:17:49 EST
<!-- body="start" -->
<p>
text line 1
<br>
text line 2
<br>
text line 3
<br>
<p><pre>
very important text
more important text
would you believe even more important text?
</pre>
<p><!-- body="end" -->
</body>
</html>
HTML

dom = Nokogiri.HTML(text)

puts dom.xpath('/html/body//pre/text()').map(&:to_s)
puts '---'
puts dom.xpath('/html/body//text()[not(ancestor::pre)]').map(&:to_s)

You can also process nodes individually if you replace ".map..." with
".each" and a block which receives the node and does something with it.

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
