Hi Andrea,

I faced a similar issue while organizing large-scale documents prepared by members of my group (many folks here are not conversant with TeX and write their documents in Word). My solution was to take their input through a wiki and convert the HTML to ConTeXt markup using filters written in Ruby (also see http://wiki.contextgarden.net/HTML_and_ConTeXt). Converting HTML syntax to ConTeXt syntax is very doable.
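In practice the whole thing is driven by a tiny script along the following lines. This is only a sketch: the file names and the name of the filter file ("wiki2context.rb") are placeholders for whatever your setup produces; the real work is done by scrape_the_page, attached further down.

  # minimal driver sketch -- paths and the filter file name are placeholders
  require 'rubygems'
  require 'hpricot'
  require 'wiki2context'   # hypothetical file holding scrape_the_page (below)

  page = "wiki_export/chapter1.html"           # an HTML page saved from the wiki
  File.open("chapter1.tex", "w") do |tex|
    File.open("chapter1_stripped.html", "w") do |html|
      # writes the ConTeXt source to tex and the cleaned-up HTML to html
      scrape_the_page(page, tex, html)
    end
  end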
If it is of any use, I attach the Ruby filters I use for this purpose. BTW, I use a Ruby library called "hpricot" to ease some of these conversions.

saji

...

def scrape_the_page(pagePath,oFile,hFile)

  items_to_remove = [
    "#menus",           # menus notice
    "div.markedup",
    "div.navigation",
    "head",             # table of contents
    "hr"
  ]

  doc = Hpricot(open(pagePath))

  # this may not be applicable to your case:
  # it removes some unnecessary markup from the Wiki pages
  @article = (doc/"#container").each do |content|
    # remove unnecessary content and edit links
    items_to_remove.each { |x| (content/x).remove }
  end

  # write the HTML content to file
  hFile.write @article.inner_html

  # How to replace various syntactic elements using Hpricot:

  # replace b elements nested deeper inside p with \bf
  (@article/"p/*/b").each do |pb|
    pb.swap("{\\bf #{pb.inner_html}}")
  end

  # replace p/b elements with \bf
  (@article/"p/b").each do |pb|
    pb.swap("{\\bf #{pb.inner_html}}")
  end

  # replace strong elements with \bf
  (@article/"strong").each do |ps|
    ps.swap("{\\bf #{ps.inner_html}}")
  end

  # replace h1 elements with \section
  (@article/"h1").each do |h1|
    h1.swap("\\section{#{h1.inner_html}}")
  end

  # replace h2 elements with \subsection
  (@article/"h2").each do |h2|
    h2.swap("\\subsection{#{h2.inner_html}}")
  end

  # replace h3 elements with \subsubsection
  (@article/"h3").each do |h3|
    h3.swap("\\subsubsection{#{h3.inner_html}}")
  end

  # replace h4 elements with \subsubsubsection
  (@article/"h4").each do |h4|
    h4.swap("\\subsubsubsection{#{h4.inner_html}}")
  end

  # replace h5 elements with \subsubsubsubsection
  (@article/"h5").each do |h5|
    h5.swap("\\subsubsubsubsection{#{h5.inner_html}}")
  end

  # replace <pre><code> by the equivalent command in ConTeXt
  (@article/"pre").each do |pre|
    pre.swap("\\startcode \n #{pre.at("code").inner_html} \n \\stopcode")
  end

  # when we encounter a reference to a figure inside the html,
  # replace it with a ConTeXt reference
  (@article/"a").each do |a|
    a.swap("\\in[#{a.inner_html}]")
  end

  # parse the 'alt' attribute of each <img> element for figure options and
  # replace <p><img> by the equivalent command in ConTeXt
  (@article/"p/img").each do |img|
    img_attrs = img.attributes['alt'].split(",")

    # separate the file name from the extension; have to take care of
    # file names that have a "." embedded in them
    img_src = img.attributes['src'].reverse.sub(/\w+\./,"").reverse
    # puts img_src

    # see if the position of the figure is indicated
    img_pos = "force"
    img_attrs.each do |arr|
      img_pos = arr.gsub("position=","") if arr.match("position=")
    end
    img_attrs.delete("position=#{img_pos}") unless img_pos == "force"

    # see if the array img_attrs contains a referral keyword
    if img_attrs.first.match(/\w+[=]\w+/)
      img_id = " "
    else
      img_id = img_attrs.first
      img_attrs.delete_at(0)
    end

    if img_pos == "force"
      if img.attributes['title']
        img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
      else
        img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} ")
      end
    else
      if img.attributes['title']
        img.swap(" \\placefigure\n [#{img_pos}][#{img_id}] \n {#{img.attributes['title']}} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
      else
        img.swap(" \\placefigure\n [#{img_pos}] \n {none} \n {\\externalfigure[#{img_src}][#{img_attrs.join(",")}]} \n ")
      end
    end
  end   # end of converting inside (@article/"p/img")

  # search for tables; if we find a caption keep it, otherwise add an empty one
  # Styling options: here I catch the div element called Col2 and
  # format the tex document in 2 columns
  # Tables: placing them
  # replace <table> by the equivalent command in ConTeXt
  (@article/"table").each do |tab|
    if tab.at("caption")
      tab.swap(" \\placetable[split]{#{tab.at("caption").inner_html}}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} ")
    else
      tab.swap(" \\placetable[split]{}\n {\\bTABLE \n #{tab.inner_html} \\eTABLE} \n ")
    end
  end

  # Tables: remove the caption
  (@article/"caption").each do |cap|
    cap.swap("\n")
  end

  # Now we transfer the syntactically altered html to a string object
  # and manipulate that object further
  newdoc = @article.inner_html

  # remove empty space at the beginning
  newdoc.gsub!(/^\s+/,"")

  # remove all elements we don't need
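  # For illustration only (a made-up snippet, not taken from a real wiki page):
  # at this point newdoc still holds plain HTML along the lines of
  #
  #   <p>Rainfall <em>anomalies</em> for 2007 &amp; 2008</p>
  #
  # and the substitutions below rewrite it to
  #
  #   Rainfall {\em anomalies} for 2007 \& 2008
  #
  # with the <p> tags turned into blank lines, <em> into {\em ...},
  # and the ampersand escaped for TeX.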
  newdoc.gsub!(/^<div.*/,"")
  newdoc.gsub!(/^<\/div.*/,"")
  newdoc.gsub!(/^<form.*/,"")
  newdoc.gsub!(/^<\/form.*/,"")
  newdoc.gsub!(/<p>/,"\n")
  newdoc.gsub!(/<\/p>/,"\n")
  newdoc.gsub!(/<u>/,"")
  newdoc.gsub!(/<\/u>/,"")
  newdoc.gsub!(/<ul>/,"\\startitemize[1]")
  newdoc.gsub!(/<\/ul>/,"\\stopitemize")
  newdoc.gsub!(/<ol>/,"\\startitemize[n]")
  newdoc.gsub!(/<\/ol>/,"\\stopitemize")
  newdoc.gsub!(/<li>/,"\\item ")
  newdoc.gsub!(/<\/li>/,"\n")
  newdoc.gsub!("_","\\_")
  newdoc.gsub!(/<table>/,"\\bTABLE \n")
  newdoc.gsub!(/<\/table>/,"\\eTABLE \n")
  newdoc.gsub!(/<tr>/,"\\bTR ")
  newdoc.gsub!(/<\/tr>/,"\\eTR ")
  newdoc.gsub!(/<td>/,"\\bTD ")
  newdoc.gsub!(/<\/td>/,"\\eTD ")
  newdoc.gsub!(/<th>/,"\\bTH ")
  newdoc.gsub!(/<\/th>/,"\\eTH ")
  newdoc.gsub!(/<center>/,"")
  newdoc.gsub!(/<\/center>/,"")
  newdoc.gsub!(/<em>/,"{\\em ")
  newdoc.gsub!(/<\/em>/,"}")
  newdoc.gsub!("^","")
  newdoc.gsub!("%","\\%")
  newdoc.gsub!("&amp;","&")
  newdoc.gsub!("&",'\\\&')
  newdoc.gsub!("$",'\\$')
  newdoc.gsub!(/<tbody>/,"\\bTABLEbody \n")
  newdoc.gsub!(/<\/tbody>/,"\\eTABLEbody \n")

  # ConTeXt does not mind "_" in figure file names and does not recognize \_ there,
  # so I have to catch these and replace \_ with _
  # First catch
  filter = /\/AnnRep07\/Figures\/(\w+\/)*(\w+\\_)*/
  if newdoc[filter]
    newdoc.gsub!(filter) { |fString| fString.gsub("\\_","_") }
  end
  # Second catch
  filter2 = /\/AnnRep07\/Figures\/(\w+\/)*\w+[-.]\w+\\_\w+/
  if newdoc[filter2]
    newdoc.gsub!(filter2) { |fString| fString.gsub("\\_","_") }
  end
  # Third catch: remove \_ inside []
  filter3 = /\[\w+\\_\w+\]/
  if newdoc[filter3]
    newdoc.gsub!(filter3) do |fString|
      puts fString
      fString.gsub("\\_","_")
    end
  end

  # remove the comment tags, which we used to embed ConTeXt commands
  newdoc.gsub!("<!--","")
  newdoc.gsub!("-->","")

  # add the full path to the images
  newdoc.gsub!("\/AnnRep07\/Figures\/","~\/AnnRep07\/Figures\/")

  # drop any remaining self-closing tags such as <br />
  newdoc.gsub!(/<\w+\s*\/>/,"")

  # puts newdoc
  # open file for output
  # outfil="#{oFile}.tex"
  # `rm #{outfil}`
  # fil=File.new(outfil,"a")
  # puts "Writing #{oFile}"
  oFile.write newdoc
end

# imgProps={}
# img_attrs.each do |arr|
#   imgProps['width']=arr.gsub("width=","") if arr.match("width=")
#   imgProps['position']=arr.gsub("position=","") if arr.match("position=")
# end


* Andrea Valle <[EMAIL PROTECTED]> [2007-11-10 02:30:36 +0100]:

> Hi to all (Idris, in particular, as we are always dealing with the same
> problems...),
>
> I just want to share some thoughts about the ol' damn' problem of
> converting to ConTeXt from Word et al.
>
>> As I told Andrea: For relatively simple documents (like the kind we use in
>> academic journals) it seems we can now
>>
>> 1) convert doc to odt using OOo
>> 2) convert odt to markdown using
>
> As suggested by Idris, I subscribed to the pandoc list, but I have to say
> that the activity is not exactly like the one on the ConTeXt list...
> So the actual support for ConTeXt conversion is not convincing. Besides, it's
> always better to put your own hands on your own machine...
>
> My problem is to convert a series of academic journals to ConTeXt. They
> come from the Humanities, so there is little structure (basically, mainly
> body text and footnotes).
> Far from me the idea of doing all the work automatically; I'd just like to be
> faster and more accurate in the conversion.
> (No particular interest in figures, as they are few; not so much in
> references either: they tend to be typographically inconsistent when done
> in a WYSIWYG environment, so they are difficult to parse.)
> Moreover, as the journal has already been published, we need to work with
> the final pdfs.
>
> After wasting my time with an awful pdf-to-html converter by Acrobat, I
> discovered this, which you may all know:
> http://pdftohtml.sourceforge.net/
>
> The html conversion is very, very good both in the resulting rendering and
> in the sources, but after some tweaking I got interested in the xml
> conversion it allows.
> The xml format essentially encodes the information related to the page;
> typically each line is an element. Plus, bold and italics are marked
> simply as <b> and <i>.
> I'm still struggling to understand something really operative about XML
> processing in ConTeXt, so I switched back to Python.
> I used an incremental sax parser with some replacements.
> This is today's draft.
> Original:
> http://www.semiotiche.it/andrea/membrana/02%20imp.pdf
>
> Recomposed (no setup at all, only \enableregime[utf]):
> http://www.semiotiche.it/andrea/membrana/02imp.pdf
>
> pdf --> pdftoxml --> xml --> python script --> tex --> pdf
>
> I recovered par, bold, em, footnotes, stripping dashes and reassembling
> the text with footnote references. Not bad as a first step.
>
> I guess that you xml gurus could probably do this much more easily and
> cleanly. So, I mean, just for my very specific needs, I can probably take
> the Word sources, convert them to pdf and then finally reach ConTeXt as
> discussed.
>
> Just some ideas to share with the list.
>
> Best
>
> -a-
>
> --------------------------------------------------
> Andrea Valle
> --------------------------------------------------
> CIRMA - DAMS
> Università degli Studi di Torino
> --> http://www.cirma.unito.it/andrea/
> --> [EMAIL PROTECTED]
> --------------------------------------------------

-- 
Saji N. Hameed

APEC Climate Center                                +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705            [EMAIL PROTECTED]
KOREA

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________