Re: [dev] suckless html to markdown (text)

2019-01-06 Thread Fischers Fritz
Although implementations usually get this wrong, Markdown is supposed
to be an extension of HTML; that is, any HTML document is also a
Markdown document. Consequently, you can use cat(1) to convert.

  cat webpage.html > webpage.md

You will likely also want to remove some of the HTML tags and use the
Markdown equivalents. This is going to suck no matter how you do it because
HTML is involved.

But at least it is pretty short if you use an HTML parser.
Attached is a demonstration in CSS.

/* html5doctor.com Reset v1.6.1 (http://html5doctor.com/html-5-reset-stylesheet/) - http://cssreset.com */
html,body,div,span,object,iframe,h1,h2,h3,h4,h5,h6,p,blockquote,pre,abbr,address,cite,code,del,dfn,em,img,ins,kbd,q,samp,small,strong,sub,sup,var,b,i,dl,dt,dd,ol,ul,li,fieldset,form,label,legend,table,caption,tbody,tfoot,thead,tr,th,td,article,aside,canvas,details,figcaption,figure,footer,header,hgroup,menu,nav,section,summary,time,mark,audio,video{margin:0;padding:0;border:0;outline:0;font-size:100%;vertical-align:baseline;background:transparent}
body{line-height:1}
article,aside,details,figcaption,figure,footer,header,hgroup,menu,nav,section{display:block}
nav ul{list-style:none}
blockquote,q{quotes:none}
blockquote:before,blockquote:after,q:before,q:after{content:none}
a{margin:0;padding:0;font-size:100%;vertical-align:baseline;background:transparent}
ins{background-color:#ff9;color:#000;text-decoration:none}
mark{background-color:#ff9;color:#000;font-style:italic;font-weight:bold}
del{text-decoration:line-through}
abbr[title],dfn[title]{border-bottom:1px dotted;cursor:help}
table{border-collapse:collapse;border-spacing:0}
hr{display:block;height:1px;border:0;border-top:1px solid #ccc;margin:1em 0;padding:0}
input,select{vertical-align:middle}
/* End reset */

body { position: relative; height: 100%; font-family: monospace; }

h1, h2, h3, h4, h5, h6 { font: inherit; }
h1:before { content: '# ' }
h2:before { content: '## ' }
h3:before { content: '### ' }
h4:before { content: '#### ' }
h5:before { content: '##### ' }
h6:before { content: '###### ' }

em { font-style: normal; }
em:before { content: '*' }
em:after { content: '*' }

a, a:visited, a:hover, a:active {
  color: inherit;
  text-decoration: inherit;
}
a:before { content: "["; }
a:after { content: "](" attr(href) ")"; }

/* Now rewrite all of the html. */
table, tr { display: block; }
table:before { content: ""; }
table:after { content: ""; }
tr:before { content: ""; }
tr:after { content: ""; }
th:before { content: ""; }
th:after { content: ""; }
td:before { content: ""; }
td:after { content: ""; }
dl:before { content: ""; }
dl:after { content: ""; }
dd:before { content: ""; }
dd:after { content: ""; }
dt:before { content: ""; }
dt:after { content: ""; }
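
For comparison, here is a minimal sketch of the "use an HTML parser" route,
assuming Python and its standard html.parser module (neither is mentioned
in the thread). It maps the same tags as the stylesheet above (headings,
em, a) and passes everything else through as text. The names MarkdownSketch
and html_to_markdown are made up for illustration.

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Map a handful of HTML tags to Markdown; other tags are dropped,
    and their text content passes through unchanged."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### ",
                "h4": "#### ", "h5": "##### ", "h6": "###### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments, joined at the end
        self.hrefs = []  # stack of href values for open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.out.append("\n" + self.HEADINGS[tag])
        elif tag == "em":
            self.out.append("*")
        elif tag == "a":
            # attrs is a list of (name, value) pairs; value may be None
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.out.append("\n")
        elif tag == "em":
            self.out.append("*")
        elif tag == "a" and self.hrefs:
            self.out.append("](" + self.hrefs.pop() + ")")

    def handle_data(self, data):
        self.out.append(data)


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


if __name__ == "__main__":
    print(html_to_markdown(
        '<h1>Title</h1><p>See <a href="http://example.com">this '
        '<em>link</em></a>.</p>'))
```

This does not normalize whitespace, skip script or style contents, or
handle lists, tables, or code blocks, so treat it as a starting point only.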


Re: [dev] suckless html to markdown (text)

2019-01-06 Thread Nick
Quoth Alexander Krotov:
> > Ideally, with sed/awk, or better in C.
> 
> "Parsing" HTML with sed is simply wrong.

This is a good point that I should have mentioned. I spent years 
using sed and awk to extract things from HTML, writing crawlers and 
suchlike, for personal projects. It can work, of course, but tends 
to be very obfuscated and fragile. I haven't needed to do any such 
crawling for a while now (and often the data is easier to access as
JSON, an unexpected side-effect of the horrors of JavaScript
overuse), but if I needed to I'd likely look into using something
like Go's html parsing these days.  I'd rather have something
slightly slower that's more robust and reusable, really.  awk is a
good fit for line-based parsing, and sed is good for stream
transformation, but neither works well for parsing the
machine-generated mountains of HTML that dominate the web today.
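
To make the fragility concrete, here is a small sketch in Python (an
assumption; the thread itself only mentions sed, awk, and Go). It compares
a line-at-a-time regex, in the spirit of a sed one-liner, with the standard
html.parser module on a snippet where the tag spans several lines and an
old link sits inside a comment. The sample HTML and all function names are
invented for illustration.

```python
import re
from html.parser import HTMLParser

# A start tag split over three lines, plus a commented-out link.
HTML = ('<p><a\n'
        '  class="x"\n'
        '  href="/one">One</a> <!-- <a href="/ghost">old</a> --></p>')


def links_by_line_regex(html):
    """sed/awk style: look at one line at a time for <a href="...">."""
    found = []
    for line in html.splitlines():
        found += re.findall(r'<a href="([^"]*)"', line)
    return found


class LinkCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("regex: ", links_by_line_regex(HTML))
    print("parser:", links_by_parser(HTML))
```

On this input the regex misses the real link, because the tag is split
across lines and href is not the first attribute, and it reports the
commented-out one instead. The parser sees the tag as a whole and skips
the comment.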



Re: [dev] suckless html to markdown (text)

2019-01-06 Thread Alexander Krotov

> Ideally, with sed/awk, or better in C.

"Parsing" HTML with sed is simply wrong.

You need to use a decent HTML parsing library, as parsing HTML is complex.

There is https://github.com/yujiahaol68/downmark, which uses the Go html
library, but I have not tried it.


Seriously though, if you are not going to convert HTML to markdown every 
day and you are not building a long-term solution, just use pandoc.
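
As a rough sketch of what "just use pandoc" can look like from a script,
the snippet below wraps the pandoc command line in Python (Python is an
assumption; a plain shell call works just as well). The -f, -t, and -o
options are pandoc's input format, output format, and output file; the
function names and file names are hypothetical.

```python
import shutil
import subprocess


def pandoc_command(src, dst):
    """Build the argument list for an HTML-to-Markdown conversion."""
    return ["pandoc", "-f", "html", "-t", "markdown", "-o", dst, src]


def convert(src, dst):
    """Run pandoc on src and write Markdown to dst."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    # check=True raises CalledProcessError if pandoc exits non-zero
    subprocess.run(pandoc_command(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, matching the cat(1) example in this thread.
    convert("webpage.html", "webpage.md")
```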





Re: [dev] suckless html to markdown (text)

2019-01-05 Thread ssd
I'm afraid pandoc won't be considered suckless by most of the list, but
I would second Nick's recommendation: pandoc is the only tool that
eventually worked reliably for my tasks.

Especially in a corporate environment, I appreciate that I can convert
across formats, even to docx, and import to / export from Google Docs.

I also prepare my talks with a chain of

[markdown and tex mix] --pandoc--> pdf

unless they are reasonably simple to fit in `sent`.

--s




Re: [dev] suckless html to markdown (text)

2019-01-04 Thread Nick
Hi Thuban,

Quoth Thuban: 
> I'm looking for a suckless  html to markdown (or text) tool.
> Ideally, with sed/awk, or better in C. 

pandoc seems to always do a reasonable job - I use it daily for 
this.  It's written in Haskell, which may not fit your definition of
suckless, but it is widely used and seems quite sensible. It can 
also convert to formats like epub, if that's useful for you.

Nick



Re: [dev] suckless html to markdown (text)

2019-01-02 Thread Calvin Morrison
On Tue, 1 Jan 2019 at 13:33, Thuban  wrote:
>
> Hi,
> I'm looking for a suckless  html to markdown (or text) tool.
> Ideally, with sed/awk, or better in C.
>
> Any idea?
>
> Regards
> --
> thuban
>

Not directly relevant, but here is an md2html awk script I have used in the past:

https://github.com/wlangstroth/simple-static/blob/master/md2html.awk



[dev] suckless html to markdown (text)

2019-01-01 Thread Thuban
Hi,
I'm looking for a suckless  html to markdown (or text) tool.
Ideally, with sed/awk, or better in C. 

Any idea?

Regards
-- 
thuban