Decoding HTML escape sequences

Hugo Florentino via Digitalmars-d-learn Mon, 12 May 2014 15:27:30 -0700

Hi, I have some documents where some strings appear as HTML escape sequences in one of these forms:


\x3C\x53\x43\x52\x49\x50\x54\x20\x4C\x41\x4E\x47\x55\x41\x47\x45\x3D\x22\x4A\x61\x76\x61\x53\x63\x72\x69\x70\x74\x22\x3e


%3C%53%43%52%49%50%54%20%4C%41%4E%47%55%41%47%45%3D%22%4A%61%76%61%53%63%72%69%70%74%22%3e

And I would like to recode them to readable form:

<SCRIPT LANGUAGE="JavaScript">

I tried something like this, using regular expressions and the uri module:

import std.stdio, std.file, std.encoding, std.string, std.regex, std.uri;


static auto re = regex(`(%[a-fA-F0-9]{2})`);

int main(in string[] args)
{
  if (args.length < 2)
  {
    writeln("Usage: unescape file1.htm > file2.htm");
    return -1;
  }
  auto input = cast(Latin1String) read(args[1]);
  string buffer;
  transcode(input, buffer);

  string output;
  foreach(m; matchAll(buffer, re)) output ~= decode(m.hit);

  writeln(output);

  return 0;
}


Unfortunately it doesn't seem to work 100%.
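
Two things in the code above look like probable causes. The loop appends only the decoded matches, so any text between escape sequences is dropped, and the pattern never matches the \xNN form. In addition, std.uri's decode is documented to leave escapes for reserved URI characters (such as %3D for "=") undecoded. A minimal sketch of one alternative, replacing each sequence in place with replaceAll, might look like the following. The helper name unescapeHex is hypothetical, and the sketch assumes every NN is an ASCII or Latin-1 code point rather than one byte of a multi-byte UTF-8 sequence.

```d
import std.conv : to;
import std.regex : regex, replaceAll;
import std.stdio : writeln;

/// Hypothetical helper (not part of Phobos): replaces every \xNN or %NN
/// sequence with the character it encodes and leaves all other text as is.
/// Assumption: each NN is an ASCII/Latin-1 code point.
string unescapeHex(string input)
{
    // Either prefix (\x or %), then exactly two hex digits in group 1.
    auto re = regex(`(?:\\x|%)([0-9a-fA-F]{2})`);

    // Parse the two digits as base 16 and emit that code point as UTF-8.
    return replaceAll!(m => (cast(dchar) m[1].to!uint(16)).to!string)(input, re);
}

void main()
{
    // Both escape styles may be mixed with ordinary text.
    writeln(unescapeHex(`a\x3Cb%3ec`));
}
```

Because replaceAll copies the unmatched parts of the input through unchanged, the surrounding document text survives, which the foreach over matchAll does not do.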

I would appreciate any suggestion.

Regards, Hugo
