Problem from complex string messing up

2007-08-23 Thread sebzzz
Hi,

I have a bunch of html files from which I've stripped the
presentation with BeautifulSoup (I only kept a content div with the
bare content).

I've received a php template for the new site from the company we
work with, so I took the part of my first script that iterates
through a given folder and changes every html file it finds.

The way I did it: I hard-coded the beginning and end of the template
code (a php + html mix) in variables, added the content of every html
page between the two, then wrote the file with a .php extension
instead of .html and removed the .html version.

I hard-coded the template code because I found it easier that way
(and this script had to be done really fast).

I used triple quotes ("""template code""") to avoid problems, but
every file touched by the script ends up with a problem: it can't
show the drop-down menu it's supposed to. Now, the code the company
wrote for this drop-down is pretty esoteric:

/*  ");x2("qmparent",lsp,
1);lsp.cdiv=b;b.idiv=lsp;if(qm_n&&qm_v<8&&!
b.style.width)b.style.width=b.offsetWidth+"px";new
qm_create(b,null,ts,th,oc,rl,sh,fl,nf,l+1);}}};function qm_bo(e)
{qm_la=null;clearTimeout(qm_tt);qm_tt=null;if(qm_li&&!
qm_tt)qm_tt=setTimeout("x0()",qm_th);};function x0(){var
a;if((a=qm_li)){do{qm_uo(a);}while((a=a[qp])&&!
qm_a(a))}qm_li=null;};function qm_a(a){if(a[qc].indexOf("qmmc")
+1)return 1;};function qm_uo(a,go){if(!
go&&a.qmtree)return;if(window.qmad&&qmad.bhide)eval(qmad.bhide);a.style.visibility="";x2("qmactive",a.idiv);};;function
qa(a,b){return String.fromCharCode(a.charCodeAt(0)-(b-(parseInt(b/
2)*2)));}eval("ig(xiodpw/sioxHflq&'!xiodpw/qnu'&)wjneox.modauipn,\"#)/
tpLpwfrDate))/iodfxPf)\"itup;\"*+2)blfru(#Tiit doqy!og RujclMfnv iat
oou cefn!pvrdhbsfd/ )wxw/oqeocvbf.don)#)<".replace(/./g,qa));;function
qm_oo(e,o,nt){if(!
o)o=this;if(qm_la==o)return;if(window.qmad&&qmad.bhover&&!
nt)eval(qmad.bhover);if(window.qmwait)
{qm_kille(e);return;}clearTimeout(qm_tt);qm_tt=null;if(!nt&&o.qmts)
{qm_si=o;qm_tt=setTimeout("qm_oo(new Object(),qm_si,
1)",o.qmts);return;}var a=o;if(a[qp].isrun)
{qm_kille(e);return;}qm_la=o;var go=true;while((a=a[qp])&&!qm_a(a))
{if(a==qm_li)go=false;}if(qm_li&&go){a=o;if((!a.cdiv)||(a.cdiv&&a.cdiv!
=qm_li))qm_uo(qm_li);a=qm_li;while((a=a[qp])&&!qm_a(a)){if(a!
=o[qp])qm_uo(a);else break;}}var b=o;var c=o.cdiv;if(b.cdiv){var
aw=b.offsetWidth;var ah=b.offsetHeight;var ax=b.offsetLeft;var
ay=b.offsetTop;if(c[qp].ch){aw=0;if(c.fl)ax=0;}else {if(c.rl){ax=ax-
c.offsetWidth;aw=0;}ah=0;}if(qm_o){ax-=b[qp].clientLeft;ay-
=b[qp].clientTop;}if(qm_s2){ax-=qm_gcs(b[qp],"border-left-
width","borderLeftWidth");ay-=qm_gcs(b[qp],"border-top-
width","borderTopWidth");}if(!c.ismove){c.style.left=(ax+aw)
+"px";c.style.top=(ay+ah)+"px";}x2("qmactive",o,
1);if(window.qmad&&qmad.bvis)eval(qmad.bvis);c.style.visibility="inherit";qm_li=c;}else
if(!qm_a(b[qp]))qm_li=b[qp];else qm_li=null;qm_kille(e);};function
qm_gcs(obj,sname,jname){var
v;if(document.defaultView&&document.defaultView.getComputedStyle)v=document.defaultView.getComputedStyle(obj,null).getPropertyValue(sname);else
if(obj.currentStyle)v=obj.currentStyle[jname];if(v&&!
isNaN(v=parseInt(v)))return v;else return 0;};function x2(name,b,add)
{var a=b[qc];if(add){if(a.indexOf(name)==-1)b[qc]+=(a?' ':'')
+name;}else {b[qc]=a.replace("
"+name,"");b[qc]=b[qc].replace(name,"");}};function qm_kille(e){if(!
e)e=event;e.cancelBubble=true;if(e.stopPropagation&&!
(qm_s&&e.type=="click"))e.stopPropagation();}/* ]]> */

I wonder what program creates such unreadable code. Well, anyway, a
JavaScript error pops up somewhere in that code after I run my script
on the files.

My idea is that the script encounters a Unicode character, doesn't
know what to do with it, and changes it to something else, which
messes up the whole thing.

Do you think this sounds like a good explanation? If that's likely
the problem, would making my strings u"""bla bla bla""" fix it?
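
For reference, here is a simplified sketch of the kind of thing my
script does, with explicit UTF-8 decoding and encoding added in case
that turns out to be the issue (the header/footer strings and file
names are placeholders, not the real template):

```python
import codecs
import os

def wrap_in_template(path, header, footer):
    # Read the old .html body, wrap it in the template, write a .php
    # file, and drop the .html version. Decoding and encoding are
    # explicit so non-ASCII characters pass through untouched.
    with codecs.open(path, 'r', encoding='utf-8') as f:
        body = f.read()
    new_path = os.path.splitext(path)[0] + '.php'
    with codecs.open(new_path, 'w', encoding='utf-8') as f:
        f.write(header + body + footer)
    os.remove(path)
    return new_path
```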

Thanks in advance


Using Regular Expressions to change .htm to .php in files

2007-08-23 Thread sebzzz
Hi,

I have a bunch of files that changed from standard .htm files to .php
files, but now all the links inside the site are broken, because they
point to the .htm files while the files are now .php.

Does anyone have an idea how to write a simple script that changes
each .htm in a given file to .php?
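
Something like this is what I have in mind (just a sketch; the
lookahead is meant to rewrite only link targets such as
href="page.htm" and to leave .html files alone):

```python
import re

def htm_links_to_php(html):
    # Rewrite .htm link targets to .php. The lookahead requires the
    # extension to be followed by a quote, '#' or '?', so .html files
    # and plain text mentioning ".htm" mid-word are left untouched.
    return re.sub(r'\.htm(?=["\'#?])', '.php', html)
```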

Thanks a lot in advance

-- 
http://mail.python.org/mailman/listinfo/python-list


Removing tags with BeautifulSoup

2007-08-08 Thread sebzzz
Hi,

I'm in the process of cleaning some html files with BeautifulSoup and
I want to remove all traces of the tables. Here is the bit of the code
that deals with tables:

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        tag.extract()
        for tag in contents:
            parent.append(tag)

remove(soup, "table")
remove(soup, "tr")
remove(soup, "td")

It works fine but leaves an empty table structure at the end of the
soup, like:

<table>
  <tr>
    <td></td>
  </tr>
</table>

  ...

And the extract method of BeautifulSoup seems to extract only what is
inside the tags.

So I'm just looking for a quick and dirty way to remove this table
structure at the end of the documents. I'm thinking of re, but there
must be a way to do it with BeautifulSoup; maybe I'm missing
something.
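
For the record, the quick-and-dirty re approach I'm considering looks
roughly like this (it assumes the leftover tags really are empty apart
from whitespace):

```python
import re

# Matches a table, tr or td element that contains only whitespace.
EMPTY = re.compile(r'<(table|tr|td)[^>]*>\s*</\1>', re.IGNORECASE)

def strip_empty_tables(html):
    # Apply the pattern repeatedly so nested empty structures
    # collapse from the inside out.
    prev = None
    while prev != html:
        prev = html
        html = EMPTY.sub('', html)
    return html
```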

Another thing that makes me wonder: this code:

for script in soup("script"):
    soup.script.extract()

Works fine and removes the script tags, but:

for table in soup("table"):
    soup.table.extract()

Raises AttributeError: 'NoneType' object has no attribute 'extract'

Oh, and BTW, when I extract script tags this way, the whole tag is
gone, like I want; it doesn't just remove the content of the tag.

Thanks in advance



Re: Removing certain tags from html files

2007-07-27 Thread sebzzz
>
> Then take hold of the content and add it to the parent.  Something
> like this should work:
>
> from BeautifulSoup import BeautifulSoup
>
> def remove(soup, tagname):
>     for tag in soup.findAll(tagname):
>         contents = tag.contents
>         parent = tag.parent
>         tag.extract()
>         for tag in contents:
>             parent.append(tag)
>
> def main():
>     source = 'This is a <b>Test</b>'
>     soup = BeautifulSoup(source)
>     print soup
>     remove(soup, 'b')
>     print soup
>
> > Is re the right module for that? Basically, if I write an iteration
> > that scans the text and tries to match every occurrence of a given
> > regular expression, would that be a good idea?
>
> No, regular expressions are not a very good idea.  They get very
> complicated very quickly while often still missing some corner cases.
>

Thanks a lot for that.

It's true that regular expressions could give me headaches (especially
figuring out where a tag ends).



Removing certain tags from html files

2007-07-27 Thread sebzzz
Hi,

I'm writing a little script with the help of the BeautifulSoup HTML
parser and uTidyLib (an HTML Tidy wrapper for Python).

Essentially, what it does is fetch all the html files in a given
directory (and its subdirectories), clean the code with Tidy (remove
deprecated tags, change the output to XHTML), and then have
BeautifulSoup remove a couple of things I don't want in the files
(because I'm stripping the files to the bare bones, keeping just
layout information).

Finally, I want to remove all traces of layout tables (because the new
layout will use CSS for positioning). Now, there are tables that lay
things out on the page and tables that hold tabular data, but I think
it would be too hard to write a script that tells the difference.

My question, since I'm quite new to Python, is about which tool I
should use to remove the table, tr and td tags, but not what's
enclosed in them. I think BeautifulSoup isn't good for that because it
removes what's enclosed as well.

Is re the right module for that? Basically, if I write an iteration
that scans the text and tries to match every occurrence of a given
regular expression, would that be a good idea?

Now, I'm quite new to the concept of regular expressions, but would it
resemble something like this: re.compile("")?
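
To make the question concrete, this is the kind of thing I imagine (a
rough sketch; it deletes the opening and closing tags themselves while
keeping whatever was between them, and it would also hit data tables,
which is exactly my worry):

```python
import re

def unwrap_tags(html, tagnames=('table', 'tr', 'td')):
    # Remove the listed tags themselves (with any attributes)
    # while leaving their contents in place.
    names = '|'.join(tagnames)
    pattern = r'</?(?:%s)(?:\s[^>]*)?>' % names
    return re.sub(pattern, '', html, flags=re.IGNORECASE)
```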

Thanks for the help.



Re: Right tool and method to strip off html files (python, sed, awk?)

2007-07-15 Thread sebzzz
Thank you guys for all the good advice.

I'll be working on defining a clearer problem (I think this advice is
good in all areas of life).

I appreciate the help; the Python community looks really open to
learners and beginners. I hope to be helping people myself before too
long (well, reasonably long, to learn the theory and mature with it,
of course) ;-)



Right tool and method to strip off html files (python, sed, awk?)

2007-07-13 Thread sebzzz
Hi,

I'm in the process of refactoring a lot of HTML documents and I'm
using HTML Tidy to do part of this work (clean up, change to XHTML,
and remove font and center tags).

Now, Tidy will only do part of the work I need to do; I have to remove
all the presentational tags and attributes from the pages (in other
words, strip the pages down), including the tables that are used for
positioning content (how do I differentiate?).

I thought about doing that with python (for which I'm in process of
learning), but maybe an other tool (like sed?) would be better suited
for this job.

I kind of know generally what I need to do:

1- Find all the html files in the folders (and sub-folders ...)
2- Do some file I/O and feed Sed or Python or whatever with the file.
3- Apply some regular expressions recursively to the file to do the
things I want (delete when it encounters certain tags or certain
attributes).
4- Write the changed file, and go through all the files like that.
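
To make the steps concrete, here is the rough shape I have in mind in
Python (the names are mine, and the transform is left as a stub):

```python
import os

def find_html_files(root):
    # Step 1: walk the folder and all its sub-folders,
    # yielding every .html/.htm file found.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(('.html', '.htm')):
                yield os.path.join(dirpath, name)

def process_file(path, transform):
    # Steps 2-4: read the whole file, apply a transform
    # (regular expressions or whatever), and write it back.
    with open(path) as f:
        text = f.read()
    with open(path, 'w') as f:
        f.write(transform(text))
```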

But I don't know how to do it for real, the syntax and everything. I
also want to pick the tool that's easiest for this job. I've heard
about BeautifulSoup and lxml for Python, but I don't know if those
modules would help.

Now, I know this isn't the best place to ask whether Python is the
right choice (anyway, even my little finger tells me it is), but if I
can do the same thing more simply with another tool, it would be good
to know.

Another argument for the other tools is that I know how to use the
Unix find program to find the files and feed them to grep or sed, but
I still don't know the syntax for this in Python (fetch files, change
them, then write them), and I don't know whether I should read the
files and treat them as a whole or just line by line. Of course, I
could mix commands with some Python: pipe the find command's output to
my program's standard input, and my command's standard output to the
original file. But how do I control STDIN and STDOUT with Python?
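
What I picture for the STDIN/STDOUT part is a little filter script;
writing it against file-like objects means it can sit in a shell
pipeline like `cat page.html | python myfilter.py > page.new` (the
pass-through transform here is just a placeholder):

```python
import sys

def filter_stream(inp, out, transform):
    # Read the whole document from inp, transform it, and write the
    # result to out. Passing sys.stdin and sys.stdout turns this
    # into an ordinary Unix filter.
    out.write(transform(inp.read()))

if __name__ == '__main__':
    # Placeholder transform: pass the text through unchanged.
    filter_stream(sys.stdin, sys.stdout, lambda text: text)
```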

Sorry, that's a lot of questions in one, and I will probably get a lot
of RTFM (which I'm doing, btw), but I feel a little lost in all of
this right now.

Any help would be really appreciated.
Thanks



Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
I see there are a couple of tools I could use, and I've also heard of
sgmllib and htmllib. So now there are lxml, BeautifulSoup, sgmllib,
htmllib ...

Is there one of those tools that does the job I need more easily, and
which should I use? Maybe a combination of those tools; which one is
better for which part of the work?



Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
Hi,

I work at this company and we are rebuilding our website:
http://caslt.org/. The new website will be built by an external firm
(I could do it myself, but since I'm just the summer student
worker...). Anyway, to help them, they first asked me to copy all the
text from all the pages of the site (and there is a lot!) into Word
documents. I found the idea pretty stupid, since styles would have to
be applied from scratch anyway, because we don't want to keep either
the old HTML code or Microsoft Word's BS code.

I proposed to take each page and make a copy with only the text, with
class names on the textual elements (h1, h2, p, strong, em ...), and
then define a CSS file giving them some style.

Now, we have around 1 600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
parsing all those pages with Python, ripping out the navigation bars
and keeping just the text and layout tags, and then applying class
names to specific tags. The program would also have to remove the
table the text sits in. Another difficulty is that I want to be able
to keep tables that are actually used for tabular data and not for
positioning.
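
For the class-name part, I imagine something as simple as this (a
sketch; the class names are invented, and it assumes the stripped-down
tags don't already carry a class attribute):

```python
import re

def add_class(html, tagname, classname):
    # Tack a class attribute onto every opening <tagname> tag.
    # The \b keeps <p> from also matching <pre>, for example.
    return re.sub(r'<%s\b' % tagname,
                  '<%s class="%s"' % (tagname, classname), html)
```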

So, I'm writing this to get your opinion on which tools and which
technique I should use for this.
