On 05/23/2015 04:15 PM, savitha devi wrote:
To be exact, the JavaScript is in the HTML code. I am trying to write a
regular expression to find the email address embedded within the
JavaScript.

On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <ros...@gmail.com> wrote:

On Sat, May 23, 2015 at 4:46 PM, savitha devi <savith...@gmail.com> wrote:
I am developing a web scraper using HTMLParser. I need to extract a
text/email address from JavaScript within the HTML code. I am a beginner
in Python and totally lost here, so I need some help with this. The
JavaScript code is below:

<script type='text/javascript'>
  //<!--
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = '&#105;nf&#111;' + '&#64;';
  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' + 'd&#101;';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' + prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
  //-->

This is deliberately being done to prevent scripted usage. What
exactly are you needing to do this for?

You're basically going to have to execute the entire block of
JavaScript code, and then decode the entities to get to what you want.
Doing it manually is pretty easy; doing it automatically will
virtually require a language interpreter.
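
As an illustration of the manual route, here is a minimal Python 3 sketch. It
assumes the string fragments have been copied by hand from the script above,
in the order the script concatenates them; the names `fragments` and
`address` are made up for this example.

```python
import html

# Fragments copied by hand from the quoted script, in the order the
# script appends them to addy48218.
fragments = [
    '&#105;nf&#111;',
    '&#64;',
    'tsv-n&#101;&#117;r&#105;&#101;d',
    '&#46;',
    'd&#101;',
]

# html.unescape converts numeric character references such as &#64;
# into the characters they stand for.
address = html.unescape(''.join(fragments))
print(address)
```

This only covers the decoding step. Finding the fragments automatically is
the hard part.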

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list


This is just about nuts and bolts, not about the ethics of presumed intentions.

Hope it helps one way or another.

Frederic


-------------------------------------------------------------------------------

sample = '''//<!--
 document.getElementById('cloak48218').innerHTML = '';
 var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
 var path = 'hr' + 'ef' + '=';
 var addy48218 = '&#105;nf&#111;' + '&#64;';
 addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
'd&#101;';
 document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + ''>' + addy48218+'<\/a>';
 //-->'''

>>> import SE  # Download from PyPI at https://pypi.python.org/pypi/SE

>>> def make_se_translator ():

    # Make SE substitutions
    subs_list = []

    # Make &# code substitutions
    for i in range (256):
        subs_list.append ('&#%d;=%c' % (i, chr(i)))

    # Delete JavaScript boilerplate
    subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
    subs_list.append (' "var =" "\n=" //<!--= //-->= ')

    # JavaScript syntax? Tweaks needed to get the sample working
    subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')

    # Add more as needed trial and error style
    # subs_list.append ( . . . format: ' old=new "delete this=" '

    # Make text
    subs = '\n'.join (subs_list)

    # Make SE translator
    translator = SE.SE (subs)

    # return translator, subs  # print subs, if you want to see what they look like
    return translator


>>> translator = make_se_translator ()

>>> translation = translator (sample)

>>> print translation   # See
innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '='; addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.' +'de'; innerHTML += '<a ' + path +prefix + ':' + addy48218 + '>' + addy48218+'</a>';

>>> exec (translation.lstrip ())

>>> print innerHTML
<a href=mailto:i...@tsv-neuried.de>i...@tsv-neuried.de</a>
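
For anyone without SE at hand, the same result can probably be reached with
the standard library alone and without exec. This is a rough Python 3 sketch,
not a general JavaScript parser. It assumes the address is always built in a
variable named addy followed by digits, from single-quoted literals that
contain no escaped quotes. The helper name extract_cloaked_address is made up
for this example.

```python
import html
import re

# The script from the original post, kept in a raw string so the
# JavaScript backslashes survive unchanged.
script = r"""
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = '&#105;nf&#111;' + '&#64;';
  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
'd&#101;';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
"""

# Assignment statements of the form "addyNNN = ... ;" (optionally with
# "var"). The right-hand side is a run of quoted literals or other
# characters, so the semicolons inside entities like &#64; do not end
# the match early.
STATEMENT = re.compile(r"(?:var\s+)?addy\d+\s*=((?:'[^']*'|[^;'])*);")
LITERAL = re.compile(r"'([^']*)'")


def extract_cloaked_address(js):
    """Join the literals assigned to addyNNN and decode the entities."""
    pieces = []
    for match in STATEMENT.finditer(js):
        pieces.extend(LITERAL.findall(match.group(1)))
    return html.unescape(''.join(pieces))


print(extract_cloaked_address(script))
```

The pattern skips the innerHTML line because addy48218 is never the
assignment target there. Any other cloaking layout would need its own
pattern.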
