Re: [dev] suckless html to markdown (text)
Although implementations usually get this wrong, Markdown is supposed to be an extension of HTML; that is, any HTML document is also a Markdown document. Consequently, you can use cat(1) to convert.

    cat webpage.html > webpage.md

You likely also want to remove some of the HTML tags and use the Markdown equivalents. This is going to suck no matter how you do it because HTML is involved. But at least it is pretty short if you use an HTML parser. Attached is a demonstration in CSS.

/* html5doctor.com Reset v1.6.1 (http://html5doctor.com/html-5-reset-stylesheet/) - http://cssreset.com */
html,body,div,span,object,iframe,h1,h2,h3,h4,h5,h6,p,blockquote,pre,abbr,address,cite,code,del,dfn,em,img,ins,kbd,q,samp,small,strong,sub,sup,var,b,i,dl,dt,dd,ol,ul,li,fieldset,form,label,legend,table,caption,tbody,tfoot,thead,tr,th,td,article,aside,canvas,details,figcaption,figure,footer,header,hgroup,menu,nav,section,summary,time,mark,audio,video{margin:0;padding:0;border:0;outline:0;font-size:100%;vertical-align:baseline;background:transparent}
body{line-height:1}
article,aside,details,figcaption,figure,footer,header,hgroup,menu,nav,section{display:block}
nav ul{list-style:none}
blockquote,q{quotes:none}
blockquote:before,blockquote:after,q:before,q:after{content:none}
a{margin:0;padding:0;font-size:100%;vertical-align:baseline;background:transparent}
ins{background-color:#ff9;color:#000;text-decoration:none}
mark{background-color:#ff9;color:#000;font-style:italic;font-weight:bold}
del{text-decoration:line-through}
abbr[title],dfn[title]{border-bottom:1px dotted;cursor:help}
table{border-collapse:collapse;border-spacing:0}
hr{display:block;height:1px;border:0;border-top:1px solid #ccc;margin:1em 0;padding:0}
input,select{vertical-align:middle}
/* End reset */

body { position: relative; height: 100%; font-family: monospace; }

h1, h2, h3, h4, h5, h6 { font: inherit; }
h1:before { content: '# ' }
h2:before { content: '## ' }
h3:before { content: '### ' }
h4:before { content: '#### ' }
h5:before { content: '##### ' }

em { font-style: normal; }
em:before { content: '*' }
em:after { content: '*' }

a, a:visited, a:hover, a:active { color: inherit; text-decoration: inherit; }
a:before { content: "["; }
a:after { content: "](" attr(href) ")"; }

/* Now rewrite all of the html. */
table, tr { display: block; }
table:before { content: ""; }
table:after { content: ""; }
tr:before { content: ""; }
tr:after { content: ""; }
th:before { content: ""; }
th:after { content: ""; }
td:before { content: ""; }
td:after { content: ""; }
dl:before { content: ""; }
dl:after { content: ""; }
dd:before { content: ""; }
dd:after { content: ""; }
dt:before { content: ""; }
dt:after { content: ""; }
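
For comparison, the same heading, emphasis and link mapping can be sketched with an actual HTML parser. This is only a rough sketch, assuming Python's standard html.parser module rather than anything from the attachment; the names MarkdownEmitter and html_to_markdown are made up for illustration, and tags other than h1-h6, p, em and a are passed through as plain text.

```python
from html.parser import HTMLParser


class MarkdownEmitter(HTMLParser):
    """Hypothetical helper: maps a few HTML tags to Markdown markers."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.hrefs = []  # stack of href values for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append("#" * self.HEADINGS[tag] + " ")
        elif tag == "em":
            self.out.append("*")
        elif tag == "a":
            # A missing or valueless href becomes an empty link target.
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.out.append("\n\n")  # blank line after block elements
        elif tag == "em":
            self.out.append("*")
        elif tag == "a" and self.hrefs:
            self.out.append("](" + self.hrefs.pop() + ")")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html):
    """Convert an HTML string using the limited mapping above."""
    parser = MarkdownEmitter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Calling html_to_markdown('<h2>Title</h2><p>See <em>this</em>.</p>') gives a "## Title" line, a blank line, then "See *this*.". Lists, tables, code and nested block elements would all need their own handling, which is where most of the real work lies.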
Re: [dev] suckless html to markdown (text)
Quoth Alexander Krotov:
> > Ideally, with sed/awk, or better in C.
>
> "Parsing" HTML with sed is simply wrong.

This is a good point that I should have mentioned. I spent years using sed and awk to extract things from HTML, writing crawlers and suchlike, for personal projects. It can work, of course, but the result tends to be very obfuscated and fragile.

I haven't needed to do any such crawling for a while now (and often the data is easier to access as JSON, an unexpected side-effect of the horrors of JavaScript overuse), but if I needed to I'd likely look into using something like Go's html parsing these days. I'd rather have something slightly slower that's more robust and reusable, really. awk is a good fit for line-based parsing, and sed is good for stream transformation, but neither works well for parsing the machine-generated mountains of HTML that dominate the web today.
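
To make the fragility concrete, here is a small sketch comparing a sed-style, line-by-line pattern with a real parser on the same input. It assumes Python's standard re and html.parser modules; the sample markup and the names links_by_regex, LinkCollector and links_by_parser are invented for illustration.

```python
import re
from html.parser import HTMLParser

# Valid HTML that a line-oriented pattern trips over: the first tag
# spans two lines and uses single quotes, the second is upper case.
SAMPLE = """<a class="x"
   href='/one'>one</a>
<A HREF="/two">two</A>"""


def links_by_regex(text):
    """sed-style extraction: one fixed pattern applied to each line."""
    pattern = re.compile(r'<a href="([^"]*)"')
    found = []
    for line in text.splitlines():
        found.extend(pattern.findall(line))
    return found


class LinkCollector(HTMLParser):
    """Collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(text):
    collector = LinkCollector()
    collector.feed(text)
    collector.close()
    return collector.links
```

On SAMPLE the line-based pattern finds nothing, while the parser returns both targets. The pattern could be patched for each of these cases, but every patch makes it more obfuscated, which is the trade-off described above.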
Re: [dev] suckless html to markdown (text)
> Ideally, with sed/awk, or better in C.

"Parsing" HTML with sed is simply wrong. You need to use a decent HTML parsing library, as parsing HTML is complex. There is https://github.com/yujiahaol68/downmark, which uses the Go html library, but I have not tried it.

Seriously though, if you are not going to convert HTML to markdown every day and you are not building a long-term solution, just use pandoc.
Re: [dev] suckless html to markdown (text)
I'm afraid pandoc won't be considered suckless by most of the list, but I would second Nick's recommendation: pandoc is the only tool that eventually worked reliably for my tasks. Especially in a corporate environment, I appreciate that I can convert across formats, even to docx, and import to / export from Google Docs.

Actually, I also prepare my talks with a chain of [markdown and tex mix] --pandoc--> pdf, unless they are simple enough to fit in `sent`.

--s
Re: [dev] suckless html to markdown (text)
Hi Thuban,

Quoth Thuban:
> I'm looking for a suckless html to markdown (or text) tool.
> Ideally, with sed/awk, or better in C.

pandoc seems to always do a reasonable job - I use it daily for this. It's written in Haskell, which may not fit your definition of suckless, but it is widely used and seems quite sensible. It can also convert to formats like epub, if that's useful for you.

Nick
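
For anyone scripting this, a typical invocation can be wrapped as below. This is a sketch under assumptions: it uses pandoc's standard -f, -t and -o options, expects pandoc to be on PATH, and the function names pandoc_command and convert, as well as the file names in the usage note, are hypothetical.

```python
import shutil
import subprocess


def pandoc_command(src, dst, to="markdown"):
    """Build the argv for converting an HTML file with pandoc."""
    # -f/--from, -t/--to and -o/--output are standard pandoc options.
    return ["pandoc", "-f", "html", "-t", to, "-o", dst, src]


def convert(src, dst, to="markdown"):
    """Run pandoc if it is installed; return False if it is missing."""
    if shutil.which("pandoc") is None:
        return False  # pandoc not on PATH, nothing is converted
    # check=True raises CalledProcessError if pandoc exits non-zero.
    subprocess.run(pandoc_command(src, dst, to), check=True)
    return True
```

For example, convert("webpage.html", "webpage.md") writes Markdown, and passing to="plain" asks pandoc for plain text instead.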
Re: [dev] suckless html to markdown (text)
On Tue, 1 Jan 2019 at 13:33, Thuban wrote:
>
> Hi,
> I'm looking for a suckless html to markdown (or text) tool.
> Ideally, with sed/awk, or better in C.
>
> Any idea?
>
> Regards
> --
> thuban
>

Not directly relevant, but here is an md2html awk script I have used in the past: https://github.com/wlangstroth/simple-static/blob/master/md2html.awk
[dev] suckless html to markdown (text)
Hi,
I'm looking for a suckless html to markdown (or text) tool.
Ideally, with sed/awk, or better in C.

Any idea?

Regards
--
thuban