Problem with a complex string getting messed up
Hi, I have a bunch of html files from which I've stripped the presentation with BeautifulSoup (I only kept a content div with the bare content). I've received a php template for the new site from the company we work with, so I reused the part of my first script that iterates through a given folder and processes every html file it finds. The way I did it is I hard-coded the beginning and end of the template code (a php + html mix) in variables, inserted the content of each html page between them, then wrote the file with .php instead of .html and removed the .html version. I hard-coded the template code because I found it easier that way (and this script had to be done really fast); I used triple quotes ("""template code""") to avoid quoting problems. But every file touched by the script ends up with a problem and can't show the expected drop-down menu. Now, the code the company wrote for this drop-down is pretty esoteric:

/*");x2("qmparent",lsp, 1);lsp.cdiv=b;b.idiv=lsp;if(qm_n&&qm_v<8&&!b.style.width)b.style.width=b.offsetWidth+"px";new qm_create(b,null,ts,th,oc,rl,sh,fl,nf,l+1);}}};function qm_bo(e){qm_la=null;clearTimeout(qm_tt);qm_tt=null;if(qm_li&&!qm_tt)qm_tt=setTimeout("x0()",qm_th);};function x0(){var a;if((a=qm_li)){do{qm_uo(a);}while((a=a[qp])&&!qm_a(a))}qm_li=null;};function qm_a(a){if(a[qc].indexOf("qmmc")+1)return 1;};function qm_uo(a,go){if(!go&&a.qmtree)return;if(window.qmad&&qmad.bhide)eval(qmad.bhide);a.style.visibility="";x2("qmactive",a.idiv);};;function qa(a,b){return String.fromCharCode(a.charCodeAt(0)-(b-(parseInt(b/2)*2)));}eval("ig(xiodpw/sioxHflq&'!xiodpw/qnu'&)wjneox.modauipn,\"#)/ tpLpwfrDate))/iodfxPf)\"itup;\"*+2)blfru(#Tiit doqy!og RujclMfnv iat oou cefn!pvrdhbsfd/ )wxw/oqeocvbf.don)#)<".replace(/./g,qa));;function qm_oo(e,o,nt){if(!o)o=this;if(qm_la==o)return;if(window.qmad&&qmad.bhover&&!nt)eval(qmad.bhover);if(window.qmwait){qm_kille(e);return;}clearTimeout(qm_tt);qm_tt=null;if(!nt&&o.qmts){qm_si=o;qm_tt=setTimeout("qm_oo(new Object(),qm_si, 1)",o.qmts);return;}var a=o;if(a[qp].isrun){qm_kille(e);return;}qm_la=o;var go=true;while((a=a[qp])&&!qm_a(a)){if(a==qm_li)go=false;}if(qm_li&&go){a=o;if((!a.cdiv)||(a.cdiv&&a.cdiv!=qm_li))qm_uo(qm_li);a=qm_li;while((a=a[qp])&&!qm_a(a)){if(a!=o[qp])qm_uo(a);else break;}}var b=o;var c=o.cdiv;if(b.cdiv){var aw=b.offsetWidth;var ah=b.offsetHeight;var ax=b.offsetLeft;var ay=b.offsetTop;if(c[qp].ch){aw=0;if(c.fl)ax=0;}else{if(c.rl){ax=ax-c.offsetWidth;aw=0;}ah=0;}if(qm_o){ax-=b[qp].clientLeft;ay-=b[qp].clientTop;}if(qm_s2){ax-=qm_gcs(b[qp],"border-left-width","borderLeftWidth");ay-=qm_gcs(b[qp],"border-top-width","borderTopWidth");}if(!c.ismove){c.style.left=(ax+aw)+"px";c.style.top=(ay+ah)+"px";}x2("qmactive",o, 1);if(window.qmad&&qmad.bvis)eval(qmad.bvis);c.style.visibility="inherit";qm_li=c;}else if(!qm_a(b[qp]))qm_li=b[qp];else qm_li=null;qm_kille(e);};function qm_gcs(obj,sname,jname){var v;if(document.defaultView&&document.defaultView.getComputedStyle)v=document.defaultView.getComputedStyle(obj,null).getPropertyValue(sname);else if(obj.currentStyle)v=obj.currentStyle[jname];if(v&&!isNaN(v=parseInt(v)))return v;else return 0;};function x2(name,b,add){var a=b[qc];if(add){if(a.indexOf(name)==-1)b[qc]+=(a?' ':'')+name;}else{b[qc]=a.replace(" "+name,"");b[qc]=b[qc].replace(name,"");}};function qm_kille(e){if(!e)e=event;e.cancelBubble=true;if(e.stopPropagation&&!(qm_s&&e.type=="click"))e.stopPropagation();}/* ]]> */

I wonder what program generates such unreadable code. Well, anyway, a javascript error pops up somewhere in that code after I run my script on the files. My theory is that the script encounters a unicode character, doesn't know how to handle it, and changes it into something else, which messes the whole thing up. Do you people think this sounds like a good explanation?
If that's likely to be the problem, would making my strings unicode (u"""bla bla bla""") fix it? Thanks in advance
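If encoding really is the culprit, u"""...""" template strings only help if the file contents are decoded too; the usual fix is to decode each html file explicitly on read and encode again on write, so bytes are never silently mangled. A minimal sketch, assuming the pages are Latin-1 (adjust the codec to whatever the originals actually use); TEMPLATE_TOP and TEMPLATE_BOTTOM are hypothetical stand-ins for the hard-coded template halves from the post:

```python
import codecs

# Hypothetical placeholders for the hard-coded template halves.
TEMPLATE_TOP = u'<?php include "header.php"; ?>\n'
TEMPLATE_BOTTOM = u'\n<?php include "footer.php"; ?>\n'

def convert(html_path, php_path, encoding='latin-1'):
    # Decode explicitly so the content is unicode, not raw bytes.
    with codecs.open(html_path, 'r', encoding) as f:
        content = f.read()
    # Encode back out with the same codec when writing the .php file.
    with codecs.open(php_path, 'w', encoding) as f:
        f.write(TEMPLATE_TOP + content + TEMPLATE_BOTTOM)
```

If the template and the content are both unicode when they are concatenated, there is no ambiguous byte-string mixing left to go wrong.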
Using Regular Expressions to change .htm to .php in files
Hi, I have a bunch of files that have changed from standard htm files to php files, but all the links inside the site are now broken because they point to the .htm files while the files are now .php. Does anyone have an idea how to write a simple script that changes each .htm reference in a given file to .php? Thanks a lot in advance -- http://mail.python.org/mailman/listinfo/python-list
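For what it's worth, a minimal sketch of the substitution itself, assuming the links live in ordinary href attributes (the function name and pattern are mine, not from any library):

```python
import re

def htm_links_to_php(html):
    # Rewrite .htm/.html only inside href="..." or href='...',
    # so a literal ".htm" in visible text is left alone.
    return re.sub(r'(href\s*=\s*["\'][^"\']*?)\.html?(["\'])',
                  r'\1.php\2', html, flags=re.IGNORECASE)

print(htm_links_to_php('<a href="about.htm">about.htm</a>'))
# -> <a href="about.php">about.htm</a>
```

Wrap that in a loop over the files (read, substitute, write back) and you have the whole script; anchoring the match to the href attribute is what keeps the visible link text untouched.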
Removing tags with BeautifulSoup
Hi, I'm in the process of cleaning some html files with BeautifulSoup and I want to remove all traces of the tables. Here is the bit of the code that deals with tables:

def remove(soup, tagname):
    for tag in soup.findAll(tagname):
        contents = tag.contents
        parent = tag.parent
        tag.extract()
        for tag in contents:
            parent.append(tag)

remove(soup, "table")
remove(soup, "tr")
remove(soup, "td")

It works fine but leaves an empty table structure at the end of the soup. Like: ... And the extract method of BeautifulSoup seems to extract only what is inside the tags. So I'm just looking for a quick and dirty way to remove this table structure at the end of the documents. I'm thinking of using re, but there must be a way to do it with BeautifulSoup; maybe I'm missing something. Another thing that makes me wonder: this code:

for script in soup("script"):
    soup.script.extract()

works fine and removes script tags, but:

for table in soup("table"):
    soup.table.extract()

raises AttributeError: 'NoneType' object has no attribute 'extract'. Oh, and BTW, when I extract script tags this way, the whole tag is gone, like I want; it doesn't only remove the content of the tag. Thanks in advance
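A likely cause of the AttributeError (an assumption, since the full document isn't shown): soup.table re-finds the first remaining <table> on every pass, and because tables can be nested, extracting an outer table also removes the inner ones, so soup.table can become None before the loop over the original list finishes. Scripts never nest, which is why the same pattern works for them. Calling table.extract() on the loop variable avoids the repeated lookup entirely. As for a quick and dirty way to kill the leftover empty table structure, a stdlib-re sketch (assuming no '>' appears inside tag attributes):

```python
import re

# Strip any remaining table/tr/td tags, opening or closing,
# while keeping whatever they enclose.
TABLE_TAGS = re.compile(r'</?(?:table|tr|td)\b[^>]*>', re.IGNORECASE)

def strip_table_tags(html):
    return TABLE_TAGS.sub('', html)

print(strip_table_tags('<table><tr><td>cell</td></tr></table>'))
# -> cell
```

The \b keeps the pattern from eating unrelated tags like <thead> or <th>; it is only safe as a final cleanup pass on tags you already know are empty or disposable.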
Re: Removing certain tags from html files
> Then take hold of the content and add it to the parent. Something like
> this should work:
>
> from BeautifulSoup import BeautifulSoup
>
> def remove(soup, tagname):
>     for tag in soup.findAll(tagname):
>         contents = tag.contents
>         parent = tag.parent
>         tag.extract()
>         for tag in contents:
>             parent.append(tag)
>
> def main():
>     source = 'This is a <b>Test</b>'
>     soup = BeautifulSoup(source)
>     print soup
>     remove(soup, 'b')
>     print soup
>
> > Is re the good module for that? Basically, if I make an iteration that
> > scans the text and tries to match every occurrence of a given regular
> > expression, would it be a good idea?
>
> No, regular expressions are not a very good idea. They get very
> complicated very quickly while often still missing some corner cases.

Thanks a lot for that. It's true that regular expressions could give me headaches (especially finding where a tag ends).
Removing certain tags from html files
Hi, I'm doing a little script with the help of the BeautifulSoup HTML parser and uTidyLib (an HTML Tidy wrapper for python). Essentially what it does is fetch all the html files in a given directory (and its subdirectories), clean the code with Tidy (remove deprecated tags, change the output to xhtml) and then use BeautifulSoup to remove a couple of things that I don't want in the files (because I'm stripping the files to the bare bones, keeping just the layout information). Finally, I want to remove all traces of layout tables (because the new layout will use css for positioning). Now, there are tables that lay things out on the page and tables that represent tabular data, but I think it would be too hard to make a script that tells the difference. My question, since I'm quite new to python, is about what tool I should use to remove the table, tr and td tags, but not what's enclosed in them. I think BeautifulSoup isn't good for that because it removes what's enclosed as well. Is re the right module for that? Basically, if I write an iteration that scans the text and tries to match every occurrence of a given regular expression, would that be a good idea? Now, I'm quite new to the concept of regular expressions, but would it resemble something like this: re.compile("")? Thanks for the help.
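For comparison, here is a minimal sketch of the same tag-unwrapping using only the standard library's HTML parser (html.parser on Python 3, the HTMLParser module on older Pythons). A real parser like BeautifulSoup copes with malformed HTML far better, and newer bs4 releases even have tag.unwrap() for exactly this, so treat this purely as an illustration of the idea:

```python
from html.parser import HTMLParser

# Tags to drop while keeping everything enclosed in them.
STRIP = {"table", "tr", "td"}

class TableStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in STRIP:
            # get_starttag_text() preserves the original attributes.
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in STRIP:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

def strip_tables(html):
    p = TableStripper()
    p.feed(html)
    p.close()
    return "".join(p.out)

print(strip_tables('<table><tr><td><p>kept</p></td></tr></table>'))
# -> <p>kept</p>
```

The event-driven parser sidesteps the "where does the tag end" headache that a hand-rolled regex runs into.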
Re: Right tool and method to strip off html files (python, sed, awk?)
Thank you guys for all the good advice. I'll be working on defining a clearer problem (I think this advice is good for all areas of life). I appreciate the help; the python community looks really open to learners and beginners. I hope to be helping people myself before too long (well, reasonably long, to learn the theory and mature with it, of course) ;-)
Right tool and method to strip off html files (python, sed, awk?)
Hi, I'm in the process of refactoring a lot of HTML documents and I'm using html tidy to do part of this work (clean up, change to xhtml, and remove font and center tags). Now, Tidy will only do part of the work I need; I also have to remove all the presentational tags and attributes from the pages (in other words, rip the pages down), including the tables that are used for positioning of content (how do I differentiate them?). I thought about doing that with python (which I'm in the process of learning), but maybe another tool (like sed?) would be better suited for this job. I generally know what I need to do:

1- Find all html files in the folders (sub-folders ...)
2- Do some file I/O and feed sed or Python or whatever else with the file.
3- Recursively apply some regular expressions to the file to do the things I want (delete certain tags and certain attributes when it encounters them).
4- Write the changed file, and go through all the files like that.

But I don't know how to do it for real, the syntax and everything. I also want to pick the tool that's easiest for this job. I've heard about BeautifulSoup and lxml for Python, but I don't know if those modules would help. Now, I know this isn't the best place to ask whether python is the right choice (anyway, even my little finger tells me it is), but if I can do the same thing more simply with another tool it would be good to know. Another argument for the other tools is that I know how to use the unix find program to find the files and feed them to grep or sed, but I still don't know the syntax for doing this with python (fetch files, change them, then write them), and I don't know if I should read the files and treat them as a whole or just line by line. Of course I could mix commands with some python: pipe find's output to my program's standard input, and my program's standard output to the original file. But how do I control STDIN and STDOUT with python?
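The four steps above can be sketched in a few lines of Python. The regex here is a placeholder that only strips font and center tags, and the replies in this thread argue for a real parser over regexes, so treat this as the shape of the loop rather than the final tool:

```python
import os
import re

# Placeholder pattern: strip <font ...>, <center>, and their closers.
PATTERN = re.compile(r'</?(?:font|center)\b[^>]*>', re.IGNORECASE)

def clean_tree(root):
    for dirpath, dirnames, filenames in os.walk(root):    # step 1
        for name in filenames:
            if name.endswith(('.html', '.htm')):
                path = os.path.join(dirpath, name)
                with open(path) as f:                     # step 2
                    text = f.read()
                text = PATTERN.sub('', text)              # step 3
                with open(path, 'w') as f:                # step 4
                    f.write(text)
```

os.walk replaces the unix find, so there is no need to pipe through STDIN/STDOUT at all; reading each file as a whole (rather than line by line) also lets a pattern match tags that span line breaks.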
Sorry if that's a lot of questions in one, and I will probably get a lot of RTFM (which I'm doing, btw), but I feel a little lost in all of this right now. Any help would be really appreciated. Thanks
Re: Parsing HTML, extracting text and changing attributes.
I see there are a couple of tools I could use, and I also heard of sgmllib and htmllib. So now there are lxml, BeautifulSoup, sgmllib, htmllib ... Do any of those tools do the job I need more easily, and which should I use? Maybe a combination of those tools; which one is better for which part of the work?
Parsing HTML, extracting text and changing attributes.
Hi, I work at this company and we are re-building our website: http://caslt.org/. The new website will be built by an external firm (I could do it myself, but since I'm just the summer student worker...). Anyway, to help them, they first asked me to copy all the text from all the pages of the site (and there is a lot!) into Word documents. I found the idea pretty stupid, since the styling would have to be applied from scratch anyway: we want neither the old html code nor Microsoft Word's BS code behind it. I proposed to take each page and make a copy with only the text, with class names on the textual elements (h1, h2, p, strong, em ...), and then define a css file giving them some style. Now, we have around 1,600 documents to work on, and I thought I could challenge myself a bit and automate all the dull work. I thought about the possibility of parsing all those pages with python, ripping off the navigation bars and just keeping the text and layout tags, and then applying class names to specific tags. The program would also have to remove the tables the text sits in. Another difficulty is that I want to be able to keep tables that are actually used for tabular data, not positioning. So, I'm writing to get your opinion on what tools and techniques I should use to do this.
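On telling data tables from layout tables: one rough heuristic (my own assumption, not an established rule) is that tables carrying <th> cells, a <caption>, or a summary attribute were probably meant as tabular data, while bare td-only tables were probably layout. A stdlib sketch of that check:

```python
from html.parser import HTMLParser

class TableClassifier(HTMLParser):
    """Count markers that suggest a table holds real tabular data."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # are we inside a <table>?
        self.data_hints = 0   # th/caption/summary sightings

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.depth += 1
            if any(k == 'summary' for k, v in attrs):
                self.data_hints += 1
        elif self.depth and tag in ('th', 'caption'):
            self.data_hints += 1

    def handle_endtag(self, tag):
        if tag == 'table':
            self.depth -= 1

def looks_like_data_table(html):
    p = TableClassifier()
    p.feed(html)
    return p.data_hints > 0

print(looks_like_data_table('<table><tr><th>Year</th></tr></table>'))   # True
print(looks_like_data_table('<table><tr><td><img></td></tr></table>'))  # False
```

It will misjudge sloppily marked-up data tables (ones that use <td> for headers), so anything it flags as layout is worth a manual spot-check before deletion.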