date:20090428

[Tutor] Can not run under python 2.6?

2009-04-28 Thread Jianchun Zhou

Hi, there:

I am new to Python and have run into a problem:

I have an application named canola. It was written for Python 2.5 and
runs normally under Python 2.5.

But under Python 2.6 it fails with:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 151, in _load_plugins
    classes = plg.load()
  File "/usr/lib/python2.6/site-packages/terra/core/plugin_manager.py", line 94, in load
    mod = self._ldr.load()
  File "/usr/lib/python2.6/site-packages/terra/core/module_loader.py", line 42, in load
    mod = __import__(modpath, fromlist=[mod_name])
ImportError: Import by filename is not supported.

Does anybody have an idea what I should do?


-- 
Best Regards
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] Add newline's, wrap, a long string

2009-04-28 Thread Martin Walsh

David wrote:
> vince spicer wrote:
>> first, grabbing output from an external command try:
>>
>> import commands
>>
>> USE = commands.getoutput('grep USE /tmp/comprookie2000/emege_info.txt
>> |head -n1|cut -d\\"-f2')
>>  
>> then you can wrap strings,
>>
>> import textwrap
>>
>> Lines = textwrap.wrap(USE, 80) # return a list
>>
>> so in short:
>>
>> import commands, textwrap
>> data = textwrap.wrap(commands.getoutput('my command'), 80)
>>
>>
>>
>> Vince
> Thanks Vince,
> I could not get command to work, but I did not try very hard;
> ["cut: the delimiter must be a single character Try `cut --help' for
> more", 'information. head: write error: Broken pipe']

Ah, I see. This error is most likely due to the typo (missing space
before -f2).

> 
> But textwrap did the trick, here is what I came up with;
> 
> #!/usr/bin/python
> 
> import subprocess
> import os
> import textwrap
> import string
> 
> def subopen():
>     u_e = subprocess.Popen(
>         'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\" -f2',
>         shell=True, stdout=subprocess.PIPE,)
>     os.waitpid(u_e.pid, 0)
>     USE = u_e.stdout.read().strip()
>     L = textwrap.wrap(USE, 80) # return a list
>     Lines = string.join(L, '\n')

Just one more comment, string.join is deprecated, yet join is a method
of str objects. So ...

  Lines = '\n'.join(L)

... or use textwrap.fill which returns a string with the newlines
already in place ...

  Lines = textwrap.fill(USE, 80)
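
To check that the two spellings really give the same result, here is a minimal sketch. The sample text is made up for illustration (it is not David's real USE string), and it is written for a current Python 3 interpreter.

```python
import textwrap

# Hypothetical stand-in for the long USE string.
USE = " ".join("flag%d" % i for i in range(200))

joined = "\n".join(textwrap.wrap(USE, 80))  # wrap() returns a list of lines
filled = textwrap.fill(USE, 80)             # fill() returns one string

print(joined == filled)
print(all(len(line) <= 80 for line in filled.splitlines()))
```

Both lines should print True, since fill() is documented as shorthand for joining the result of wrap() with newlines.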

HTH,
Marty

>     fname = 'usetest.txt'
>     fobj = open(fname, 'w')
>     fobj.write(Lines)
>     fobj.close()
> 
> subopen()
> 
> Here is the output;
> http://linuxcrazy.pastebin.com/m66105e3
> 


Re: [Tutor] Add newline's, wrap, a long string

2009-04-28 Thread Martin Walsh

David wrote:
> I am getting information from .txt files and posting them in fields on a
> web site. I need to break up single strings so they are around 80
> characters then a new line because when I enter the info to the form on
> the website it has fields and it errors out with such a long string.
> 
> here is a sample of the code;
> 
> #!/usr/bin/python
> import subprocess
> import os
> 
> u_e = subprocess.Popen(
> 'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\"-f2',

Did you copy and paste this faithfully? The 'cut -d\\"-f2' looks a bit
odd. Is the delimiter a " (double-quote)? Perhaps you left out a space
before the -f2?

> shell=True, stdout=subprocess.PIPE,)
> os.waitpid(u_e.pid, 0)

'u_e.wait()' would wait the way you intend, as well, I believe.

> USE = u_e.stdout.read().strip()

Or, if you use the communicate() method of the Popen object, the wait is
implicit. As in,

stdout, stderr = u_e.communicate()

... or perhaps ...

USE = u_e.communicate()[0].strip()
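
A small self-contained sketch of the communicate() approach, written for a current Python 3 interpreter. As an assumption for illustration, a tiny child Python process stands in for the grep/head/cut pipeline so that it runs anywhere.

```python
import subprocess
import sys

# Assumption: this child process replaces the real shell pipeline.
child = subprocess.Popen(
    [sys.executable, "-c", "print('  some output  ')"],
    stdout=subprocess.PIPE,
)

# communicate() reads all of stdout and waits for the child to exit,
# so no separate wait() or os.waitpid() call is needed.
out, err = child.communicate()
USE = out.strip()

print(child.returncode, repr(USE))
```

Since stderr was not redirected, err is None here. On Python 3 the output arrives as bytes unless a text mode is requested.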

... but, you don't need to use subprocess at all. How about (untested),

# grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\" -f2
infof = open('/tmp/comprookie2000/emerge_info.txt')
for line in infof:
    if 'USE' in line:
        USE = line.split('"')[1]
        break
else:
    USE = ''
infof.close()

> L = len(USE)
> print L
> print USE
> 
> L returns 1337

cosmic :)

HTH,
Marty

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Lie Ryan


Dinesh B Vadhia wrote:

> A.T. / Marty
>
> I'd prefer that the html parser didn't replace the missing tags as I
> want to know where and what the problems are.  Also, the source html
> documents were generated by another computer ie. they are not web page
> documents.


If the source document was generated by a computer, and it produces 
invalid markup, shouldn't that be considered a bug in the producing 
program? Is it possible to fix the producing program instead (i.e. is it 
under your control)? [Or are you trying to know where the errors are 
because you're currently debugging it?]



Re: [Tutor] Regular Expresions instances

2009-04-28 Thread Emile van Sebille


Emile van Sebille wrote:
> Emilio Casbas wrote:
>> Hi,
>>
>> following the example from
>> http://docs.python.org/3.0/howto/regex.html
>
> ...from version 3.0 docs...
>
>> If I execute the following code on the python shell (3.1a1):
>>
>> >>> import re
>> >>> p = re.compile('ab*')
>> >>> p
>>
>> I get the msg:
>> <_sre.SRE_Pattern object at 0x013A3440>
>
> ... is the same as I get on version 2.5.  Coincidence?


Nope.  I just installed 3.1a2 and it's the same there.

Possibly a case of the documentation not keeping up with the release...
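
Either way, the class name shown in the repr is an implementation detail. The compiled object behaves the same whatever it is called. A minimal sketch, written for a current Python 3 interpreter (the printed type name varies between versions, so no expected output is given for it):

```python
import re

p = re.compile('ab*')

# The name printed here differs between Python versions.
print(type(p).__name__)

# The pattern methods work the same regardless of that name.
m = p.match('abbbc')
print(m.group())
```

The second print shows the matched prefix, 'abbb'.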

Emile






>> instead of the msg from the example:
>>
>> Why do I get an SRE_Pattern object instead of a RegexObject instance?
>>
>> Regards
>> Emilio

Re: [Tutor] Regular Expresions instances

2009-04-28 Thread Emile van Sebille


Emilio Casbas wrote:
> Hi,
>
> following the example from
> http://docs.python.org/3.0/howto/regex.html

...from version 3.0 docs...

> If I execute the following code on the python shell (3.1a1):
>
> >>> import re
> >>> p = re.compile('ab*')
> >>> p
>
> I get the msg:
> <_sre.SRE_Pattern object at 0x013A3440>

... is the same as I get on version 2.5.  Coincidence?

> instead of the msg from the example:
>
> Why do I get an SRE_Pattern object instead of a RegexObject instance?
>
> Regards
> Emilio

[Tutor] Regular Expresions instances

2009-04-28 Thread Emilio Casbas


Hi,

following the example from
http://docs.python.org/3.0/howto/regex.html

If I execute the following code on the python shell (3.1a1):

>>> import re
>>> p = re.compile('ab*')
>>> p

I get the msg:
<_sre.SRE_Pattern object at 0x013A3440>

instead of the msg from the example:


Why do I get an SRE_Pattern object instead of a RegexObject instance?

Regards
Emilio



  

Re: [Tutor] Add newline's, wrap, a long string

2009-04-28 Thread David


vince spicer wrote:
> first, grabbing output from an external command try:
>
> import commands
>
> USE = commands.getoutput('grep USE /tmp/comprookie2000/emege_info.txt |head -n1|cut -d\\"-f2')
>
> then you can wrap strings,
>
> import textwrap
>
> Lines = textwrap.wrap(USE, 80) # return a list
>
> so in short:
>
> import commands, textwrap
> data = textwrap.wrap(commands.getoutput('my command'), 80)
>
> Vince
>
> On Tue, Apr 28, 2009 at 3:43 PM, David wrote:
>
>> I am getting information from .txt files and posting them in fields
>> on a web site. I need to break up single strings so they are around
>> 80 characters then a new line because when I enter the info to the
>> form on the website it has fields and it errors out with such a long
>> string.
>>
>> here is a sample of the code;
>>
>> #!/usr/bin/python
>> import subprocess
>> import os
>>
>> u_e = subprocess.Popen(
>> 'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\"-f2',
>> shell=True, stdout=subprocess.PIPE,)
>> os.waitpid(u_e.pid, 0)
>> USE = u_e.stdout.read().strip()
>> L = len(USE)
>> print L
>> print USE
>>
>> L returns 1337
>>
>> Here is what USE returns;
>> http://linuxcrazy.pastebin.com/m2239816f
>>
>> thanks
>> -david



Thanks Vince,
I could not get command to work, but I did not try very hard;
["cut: the delimiter must be a single character Try `cut --help' for 
more", 'information. head: write error: Broken pipe']


But textwrap did the trick, here is what I came up with;

#!/usr/bin/python

import subprocess
import os
import textwrap
import string

def subopen():
    u_e = subprocess.Popen(
        'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\" -f2',
        shell=True, stdout=subprocess.PIPE,)
    os.waitpid(u_e.pid, 0)
    USE = u_e.stdout.read().strip()
    L = textwrap.wrap(USE, 80) # return a list
    Lines = string.join(L, '\n')
    fname = 'usetest.txt'
    fobj = open(fname, 'w')
    fobj.write(Lines)
    fobj.close()  # call close(); a bare fobj.close does nothing

subopen()

Here is the output;
http://linuxcrazy.pastebin.com/m66105e3

--
Powered by Gentoo GNU/Linux
http://linuxcrazy.com

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Dinesh B Vadhia

Stefan / Alan et al

Thank you for all the advice and links.  A simple script using etree is 
scanning 500K+ xhtml files, and 2 files with mismatched tags have been found so 
far, which can be fixed manually.  I'll definitely look into "tidy" as it sounds 
pretty cool.  Because we are running data processing programs on a 64-bit 
Windows box (yes, I know, I know ...) using 64-bit Python, we can only use 
pure-Python libraries.  I believe that lxml uses C libraries.  Again, thanks to 
everyone - a terrific community as usual!

Message: 5
Date: Tue, 28 Apr 2009 19:39:17 +0200
From: Stefan Behnel 
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: tutor@python.org
Message-ID: 
Content-Type: text/plain; charset=ISO-8859-1

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired
> tags (by line number) in each file. I've read the ElementTree docs and
> cannot
> see anything obvious how to do this. I know this is a common problem but
> feeling a bit clueless here - any ideas?
> 
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan

Re: [Tutor] Add newline's, wrap, a long string

2009-04-28 Thread vince spicer

first, grabbing output from an external command try:

import commands

USE = commands.getoutput('grep USE /tmp/comprookie2000/emege_info.txt |head -n1|cut -d\\"-f2')

then you can wrap strings,

import textwrap

Lines = textwrap.wrap(USE, 80) # return a list

so in short:

import commands, textwrap
data = textwrap.wrap(commands.getoutput('my command'), 80)



Vince





On Tue, Apr 28, 2009 at 3:43 PM, David  wrote:

> I am getting information from .txt files and posting them in fields on a
> web site. I need to break up single strings so they are around 80 characters
> then a new line because when I enter the info to the form on the website it
> has fields and it errors out with such a long string.
>
> here is a sample of the code;
>
> #!/usr/bin/python
> import subprocess
> import os
>
> u_e = subprocess.Popen(
> 'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\"-f2',
> shell=True, stdout=subprocess.PIPE,)
> os.waitpid(u_e.pid, 0)
> USE = u_e.stdout.read().strip()
> L = len(USE)
> print L
> print USE
>
> L returns 1337
>
> Here is what USE returns;
> http://linuxcrazy.pastebin.com/m2239816f
>
> thanks
> -david
> --
> Powered by Gentoo GNU/Linux
> http://linuxcrazy.com

[Tutor] Add newline's, wrap, a long string

2009-04-28 Thread David

I am getting information from .txt files and posting it in fields on a 
web site. I need to break up single long strings into lines of around 80 
characters, each followed by a newline, because the form on the website 
errors out when I enter such a long string into one of its fields.


here is a sample of the code;

#!/usr/bin/python
import subprocess
import os

u_e = subprocess.Popen(
'grep USE /tmp/comprookie2000/emerge_info.txt |head -n1|cut -d\\"-f2', 
shell=True, stdout=subprocess.PIPE,)

os.waitpid(u_e.pid, 0)
USE = u_e.stdout.read().strip()
L = len(USE)
print L
print USE

L returns 1337

Here is what USE returns;
http://linuxcrazy.pastebin.com/m2239816f

thanks
-david
--
Powered by Gentoo GNU/Linux
http://linuxcrazy.com

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Stefan Behnel

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
> 
> Don't use elementTree, use BeautifulSoup instead.

Actually, now that the code is there anyway, the OP might be happier with
lxml.html. It's a lot faster than BeautifulSoup, uses less memory, and
often parses broken HTML better. It's also more user friendly for many HTML
tasks.

http://codespeak.net/lxml/lxmlhtml.html

This might also be worth a read:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

Stefan


Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Alan Gauld


"Dinesh B Vadhia" wrote

> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line
> 124, column 8"


IMHO the best way to cleanse HTML files is to use tidy.
It is available for *nix and Windows and has a wealth of
options to control its output. It can even convert html into
valid xhtml which ElementTree should be happy with.

http://tidy.sourceforge.net/

It may not be Python but it's fast and effective!
And there is a Python wrapper:

http://utidylib.berlios.de/

although I've never used it.

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/ 




Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Dinesh B Vadhia

Found the mismatched tag on line 94:

"My Name in Nelma Lois Thornton-S.S. No. sjn-yz-yokv/p>"

should be:

"My Name in Nelma Lois Thornton-S.S. No. sjn-yz-yokv</p>"

I'll run all the html files through a simple script to identify the mismatches 
using etree.  Thanks.

Dinesh

From: Kent Johnson 
Sent: Tuesday, April 28, 2009 8:17 AM
To: Dinesh B Vadhia 
Cc: tutor@python.org 
Subject: Re: [Tutor] finding mismatched or unpaired html tags

On Tue, Apr 28, 2009 at 10:41 AM, Dinesh B Vadhia
 wrote:
> This is the error and traceback:
>
> Unexpected error opening J:/F2/html: mismatched tag: line 124, column 8
>
> Traceback (most recent call last):
>   File "C:\py", line 492, in <module>
>     raw = extractText(xhtmlfile)
>   File "C:\py", line 334, in extractText
>     tree = make_tree(xhtmlfile)
>   File "py", line 169, in make_tree
>     return tree
> UnboundLocalError: local variable 'tree' referenced before assignment

This is inconsistent. The exception in the stack trace is from a
coding error in make_tree. It looks like maybe make_tree is
catching exceptions and printing them, and a bug in the exception
handling is causing the UnboundLocalError

> Here is line 124, col 8 and I cannot see any obvious missing/mismatched
> tags:
>
> "As to the present time I am unable physical and mentally to secure all
> this information at present."

If you look at a few more lines do you see anything untoward? Perhaps
there is a missing  before the , for example? I don't think 
is allowed inside every tag.

Kent

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Kent Johnson

On Tue, Apr 28, 2009 at 10:41 AM, Dinesh B Vadhia
 wrote:
> This is the error and traceback:
>
> Unexpected error opening J:/F2/html: mismatched tag: line 124, column 8
>
> Traceback (most recent call last):
>   File "C:\py", line 492, in <module>
>     raw = extractText(xhtmlfile)
>   File "C:\py", line 334, in extractText
>     tree = make_tree(xhtmlfile)
>   File "py", line 169, in make_tree
>     return tree
> UnboundLocalError: local variable 'tree' referenced before assignment

This is inconsistent. The exception in the stack trace is from a
coding error in make_tree. It looks like maybe make_tree is
catching exceptions and printing them, and a bug in the exception
handling is causing the UnboundLocalError
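
To illustrate how that can happen, here is a minimal sketch. The body of make_tree below is a guess (the real function was not posted), and make_tree_fixed is a hypothetical name. It is written for a current Python 3 interpreter.

```python
import io
import xml.etree.ElementTree as ET

def make_tree(source):
    # Hypothetical reconstruction of the buggy pattern.
    try:
        tree = ET.parse(source)
    except Exception as e:
        # The parse error is printed and then swallowed here ...
        print("Unexpected error opening %s: %s" % (source, e))
    # ... so when parsing failed, 'tree' was never assigned.
    return tree

def make_tree_fixed(source):
    # One possible repair: return None explicitly on a parse error.
    try:
        return ET.parse(source)
    except ET.ParseError as e:
        print("Parse error in %s: %s" % (source, e))
        return None

bad = "<a><b></a>"
try:
    make_tree(io.StringIO(bad))
except UnboundLocalError as e:
    print("second error:", e)
print(make_tree_fixed(io.StringIO(bad)))
```

The first call prints the "mismatched tag" message and then fails with UnboundLocalError, which matches the combination of output and traceback in the report.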

> Here is line 124, col 8 and I cannot see any obvious missing/mismatched
> tags:
>
> "As to the present time I am unable physical and mentally to secure all
> this information at present."

If you look at a few more lines do you see anything untoward? Perhaps
there is a missing  before the , for example? I don't think 
is allowed inside every tag.

Kent

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread spir

On Tue, 28 Apr 2009 07:41:36 -0700,
"Dinesh B Vadhia" wrote:

> This is the error and traceback:
> 
> Unexpected error opening J:/F2/html: mismatched tag: line 124, column 8
> 
> Traceback (most recent call last):
>   File "C:\py", line 492, in <module>
>     raw = extractText(xhtmlfile)
>   File "C:\py", line 334, in extractText
>     tree = make_tree(xhtmlfile)
>   File "py", line 169, in make_tree
>     return tree
> UnboundLocalError: local variable 'tree' referenced before assignment
>  
> 
> Here is line 124, col 8 and I cannot see any obvious missing/mismatched
> tags:
> 
> "As to the present time I am unable physical and mentally to secure all
> this information at present."
> 
> Dinesh

As with syntax errors in programming, the place where the parser detects an 
html error may well be one or more lines after the actual error: you should 
look at the lines before #124.
Also, the traceback looks strange: it seems that ElementTree reports a 
python UnboundLocalError for the variable 'tree' as being caused by an html 
error in the source. ???

denis
--
la vita e estrany

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Dinesh B Vadhia

This is the error and traceback:

Unexpected error opening J:/F2/html: mismatched tag: line 124, column 8

Traceback (most recent call last):
  File "C:\py", line 492, in <module>
    raw = extractText(xhtmlfile)
  File "C:\py", line 334, in extractText
    tree = make_tree(xhtmlfile)
  File "py", line 169, in make_tree
    return tree
UnboundLocalError: local variable 'tree' referenced before assignment
 

Here is line 124, col 8 and I cannot see any obvious missing/mismatched tags:

"As to the present time I am unable physical and mentally to secure all this 
information at present."

Dinesh




From: Kent Johnson 
Sent: Tuesday, April 28, 2009 7:13 AM
To: Dinesh B Vadhia 
Cc: tutor@python.org 
Subject: Re: [Tutor] finding mismatched or unpaired html tags


On Tue, Apr 28, 2009 at 8:54 AM, Dinesh B Vadhia
 wrote:
> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 124,
> column 8"
>
> I now want to scan each file and simply identify each mismatched or unpaired
> tags (by line number) in each file.  I've read the ElementTree docs and
> cannot see anything obvious how to do this.  I know this is a common problem
> but feeling a bit clueless here - any ideas?

It seems like the exception gives you the line number. What kind of
exception is raised? The exception object may contain the line and
column in a more accessible form, so you could catch the exception,
get the line number, then read that line out of the file and show it.

Kent

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Martin Walsh

Dinesh B Vadhia wrote:
> A.T. / Marty
>  
> I'd prefer that the html parser didn't replace the missing tags as I
> want to know where and what the problems are.  Also, the source html
> documents were generated by another computer ie. they are not web page
> documents.  My sense is that it is only a few files out of tens of
> thousands.  Cheers ...
>  
> Dinesh

If this is a one time task, write a script to iterate over the html
files, and collect the traceback info from those that throw a
'mismatched tag' error. Based on your example below, it appears to
contain the line number. You'd only get one error per file per run, but
you can run it until there are no errors remaining. I hope that makes
sense.
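
A rough sketch of such a script, written for a current Python 3 interpreter. The function name, the directory layout and the ".html" suffix are assumptions for illustration. ET.ParseError is what recent ElementTree versions raise; older releases raised an expat error instead.

```python
import os
import xml.etree.ElementTree as ET

def collect_parse_errors(root_dir, suffix=".html"):
    """Return {path: error message} for files that fail to parse."""
    errors = {}
    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                ET.parse(path)
            except ET.ParseError as e:
                # The message includes the line and column of the error.
                errors[path] = str(e)
    return errors
```

Calling collect_parse_errors on the top-level folder would give one entry per broken file. After fixing those, run it again until the result is empty.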

HTH,
Marty

>  
>  
> 
> Message: 7
> Date: Tue, 28 Apr 2009 08:54:33 -0500
> From: Martin Walsh 
> Subject: Re: [Tutor] finding mismatched or unpaired html tags
> To: "tutor@python.org" 
> Message-ID: <49f70a99.3050...@mwalsh.org>
> Content-Type: text/plain; charset=us-ascii
> 
> A.T.Hofkamp wrote:
>> Dinesh B Vadhia wrote:
>>> I'm processing tens of thousands of html files and a few of them
>>> contain mismatched tags and ElementTree throws the error:
>>>
>>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>>> line 124, column 8"
>>>
>>> I now want to scan each file and simply identify each mismatched or
>>> unpaired
>> tags (by line number) in each file. I've read the ElementTree docs and
>> cannot
>> see anything obvious how to do this. I know this is a common problem but
>> feeling a bit clueless here - any ideas?
>>>
>>
>> Don't use elementTree, use BeautifulSoup instead.
>>
>> elementTree expects perfect input, typically generated by another
> computer.
>> BeautifulSoup is designed to handle your everyday HTML page, filled with
>> errors of all possible kinds.
> 
> But it also modifies the source html by default, adding closing tags,
> etc. Important to know, I suppose, if you intend to re-write the html
> files you parse with BeautifulSoup.
> 
> Also, unless you're running python 3.0 or greater, use the 3.0.x series
> of BeautifulSoup -- otherwise you may run into the same issue.
> 
> http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
> 
> HTH,
> Marty
> 
> 
> 
> 

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Kent Johnson

On Tue, Apr 28, 2009 at 8:54 AM, Dinesh B Vadhia
 wrote:
> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 124,
> column 8"
>
> I now want to scan each file and simply identify each mismatched or unpaired
> tags (by line number) in each file.  I've read the ElementTree docs and
> cannot see anything obvious how to do this.  I know this is a common problem
> but feeling a bit clueless here - any ideas?

It seems like the exception gives you the line number. What kind of
exception is raised? The exception object may contain the line and
column in a more accessible form, so you could catch the exception,
get the line number, then read that line out of the file and show it.
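
As a sketch of that idea on a current Python 3 interpreter: ElementTree raises ParseError, whose position attribute holds a (line, column) tuple. The function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def locate_parse_error(text):
    """Return (line, column, source_line) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as e:
        line, column = e.position  # line counts from 1
        lines = text.splitlines()
        # The reported line can lie past the end for truncated input.
        source_line = lines[line - 1] if line <= len(lines) else ""
        return line, column, source_line
    return None

doc = "<html>\n<body>\n<p>some text</b>\n</body>\n</html>"
print(locate_parse_error(doc))
```

For the sample document this reports line 3, the line containing the stray closing tag.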

Kent

Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Dinesh B Vadhia

A.T. / Marty

I'd prefer that the html parser didn't replace the missing tags as I want to 
know where and what the problems are.  Also, the source html documents were 
generated by another computer ie. they are not web page documents.  My sense is 
that it is only a few files out of tens of thousands.  Cheers ...

Dinesh

Message: 7
Date: Tue, 28 Apr 2009 08:54:33 -0500
From: Martin Walsh 
Subject: Re: [Tutor] finding mismatched or unpaired html tags
To: "tutor@python.org" 
Message-ID: <49f70a99.3050...@mwalsh.org>
Content-Type: text/plain; charset=us-ascii

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired
> tags (by line number) in each file. I've read the ElementTree docs and
> cannot
> see anything obvious how to do this. I know this is a common problem but
> feeling a bit clueless here - any ideas?
>>
> 
> Don't use elementTree, use BeautifulSoup instead.
> 
> elementTree expects perfect input, typically generated by another computer.
> BeautifulSoup is designed to handle your everyday HTML page, filled with
> errors of all possible kinds.

But it also modifies the source html by default, adding closing tags,
etc. Important to know, I suppose, if you intend to re-write the html
files you parse with BeautifulSoup.

Also, unless you're running python 3.0 or greater, use the 3.0.x series
of BeautifulSoup -- otherwise you may run into the same issue.

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

HTH,
Marty


Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Martin Walsh

A.T.Hofkamp wrote:
> Dinesh B Vadhia wrote:
>> I'm processing tens of thousands of html files and a few of them
>> contain mismatched tags and ElementTree throws the error:
>>
>> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag:
>> line 124, column 8"
>>
>> I now want to scan each file and simply identify each mismatched or
>> unpaired tags (by line number) in each file. I've read the ElementTree
>> docs and cannot see anything obvious how to do this. I know this is a
>> common problem but feeling a bit clueless here - any ideas?
>>
> 
> Don't use elementTree, use BeautifulSoup instead.
> 
> elementTree expects perfect input, typically generated by another computer.
> BeautifulSoup is designed to handle your everyday HTML page, filled with
> errors of all possible kinds.

But it also modifies the source html by default, adding closing tags,
etc. Important to know, I suppose, if you intend to re-write the html
files you parse with BeautifulSoup.

Also, unless you're running python 3.0 or greater, use the 3.0.x series
of BeautifulSoup -- otherwise you may run into the same issue.

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

HTH,
Marty






Re: [Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread A.T.Hofkamp


Dinesh B Vadhia wrote:

> I'm processing tens of thousands of html files and a few of them contain
> mismatched tags and ElementTree throws the error:
>
> "Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 124,
> column 8"
>
> I now want to scan each file and simply identify each mismatched or unpaired
> tags (by line number) in each file. I've read the ElementTree docs and cannot
> see anything obvious how to do this. I know this is a common problem but
> feeling a bit clueless here - any ideas?




Don't use elementTree, use BeautifulSoup instead.

elementTree expects perfect input, typically generated by another computer.
BeautifulSoup is designed to handle your everyday HTML page, filled with 
errors of all possible kinds.



Sincerely,
Albert


[Tutor] finding mismatched or unpaired html tags

2009-04-28 Thread Dinesh B Vadhia

I'm processing tens of thousands of html files and a few of them contain 
mismatched tags and ElementTree throws the error:

"Unexpected error opening J:/F2/663/blahblah.html: mismatched tag: line 124, 
column 8"

I now want to scan each file and simply identify each mismatched or unpaired 
tags (by line number) in each file.  I've read the ElementTree docs and cannot 
see anything obvious how to do this.  I know this is a common problem but 
feeling a bit clueless here - any ideas?

Dinesh

Re: [Tutor] regular expression question

2009-04-28 Thread Kent Johnson

On Tue, Apr 28, 2009 at 4:03 AM, Kelie  wrote:
> Hello,
>
> The following code returns 'abc123abc45abc789jk'. How do I revise the pattern
> so that the return value will be 'abc789jk'? In other words, I want to find the
> pattern 'abc' that is closest to 'jk'. Here the string '123', '45' and '789'
> are just examples. They are actually quite different in the string that I'm
> working with.
>
> import re
> s = 'abc123abc45abc789jk'
> p = r'abc.+jk'
> lst = re.findall(p, s)
> print lst[0]

re.findall() won't work because it finds non-overlapping matches.

If there is a character in the initial match which cannot occur in the
middle section, change .+ to exclude that character. For example,
r'abc[^a]+jk' works with your example.
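
A quick sketch of that approach and of its limitation, written for a current Python 3 interpreter. The second string is made up to show where the character-class trick stops working.

```python
import re

s = 'abc123abc45abc789jk'
print(re.findall(r'abc[^a]+jk', s))

# Assumed counter-example: an 'a' in the middle section defeats [^a]+.
s2 = 'abc1a2jk'
print(re.findall(r'abc[^a]+jk', s2))
```

The first call finds only 'abc789jk'. The second finds nothing, because the middle section contains the excluded character.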

Another possibility is to look for the match starting at different
locations, something like this:
p = re.compile(r'abc.+jk')
lastMatch = None
i = 0
while i < len(s):
  m = p.search(s, i)
  if m is None:
    break
  lastMatch = m.group()
  i = m.start() + 1

print lastMatch

Kent

Re: [Tutor] regular expression question

2009-04-28 Thread Kent Johnson

2009/4/28 Marek Spociński:

>> import re
>> s = 'abc123abc45abc789jk'
>> p = r'abc.+jk'
>> lst = re.findall(p, s)
>> print lst[0]
>
> I suggest using r'abc.+?jk' instead.
>
> the additional ? makes the preceding '.+' non-greedy, so instead of matching
> as long a string as it can, it matches as short a string as possible.

Did you try it? It doesn't do what you expect, it still matches at the
beginning of the string.

The re engine searches for a match at a location and returns the first
one it finds. A non-greedy match doesn't mean "Find the shortest
possible match anywhere in the string", it means, "find the shortest
possible match starting at this location."
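
A small sketch that makes the point concrete. The second string, s2, is invented here to show the one thing non-greedy matching does change, which is where the match ends.

```python
import re

s = 'abc123abc45abc789jk'

greedy = re.search(r'abc.+jk', s)
lazy = re.search(r'abc.+?jk', s)

# Both matches begin at index 0, the first place a match can start.
print(greedy.start(), lazy.start())

# With a single 'jk' in the string, both also end at the same place.
print(greedy.group() == lazy.group())

# Laziness only changes where the match ends when there is a choice of ends.
s2 = 'abc1jk2abc3jk'
greedy2 = re.search(r'abc.+jk', s2).group()    # runs to the last 'jk'
lazy2 = re.search(r'abc.+?jk', s2).group()     # stops at the first 'jk'
print(greedy2)
print(lazy2)
```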

Kent

Re: [Tutor] regular expression question

2009-04-28 Thread Kelie

spir writes:

> To avoid that, use non-grouping parens (?:...). This also avoids the need for
> parens around the whole format:
> p = Pattern(r'abc(?:(?!abc).)+jk')
> print p.findall(s)
> ['abc789jk']
> 
> Denis


This one works! Thank you Denis. I'll try it out on the actual much longer
(multiline) string and see what happens.
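
One thing that may matter for a multiline string: by default '.' does not match a newline, so the pattern can fail when the text between 'abc' and 'jk' spans several lines. A sketch assuming that is the case, using the re.DOTALL flag; the sample text is invented for illustration.

```python
import re

text = 'abc123\nabc45\nabc7\n89jk'

# Without DOTALL, '.' stops at '\n', so nothing reaches 'jk'.
plain = re.compile(r'abc(?:(?!abc).)+jk')
print(plain.findall(text))

# With DOTALL, '.' also matches newlines and the last 'abc' is found.
multiline = re.compile(r'abc(?:(?!abc).)+jk', re.DOTALL)
print(multiline.findall(text))
```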


Re: [Tutor] regular expression question

2009-04-28 Thread Kelie

Andre Engels writes:

> 
> 2009/4/28 Marek Spociński, Poland:

> > I suggest using r'abc.+?jk' instead.
> >

> 
> That was my first idea too, but it does not work for this case,
> because Python will still try to _start_ the match as soon as
> possible. 

Yeah, I tried the '?' as well and realized it would not work.



Re: [Tutor] regular expression question

2009-04-28 Thread spir

On Tue, 28 Apr 2009 11:06:16 +0200,
Marek Spociński, Poland wrote:

> > Hello,
> > 
> > The following code returns 'abc123abc45abc789jk'. How do I revise the
> > pattern so that the return value will be 'abc789jk'? In other words, I
> > want to find the pattern 'abc' that is closest to 'jk'. Here the string
> > '123', '45' and '789' are just examples. They are actually quite
> > different in the string that I'm working with. 
> > 
> > import re
> > s = 'abc123abc45abc789jk'
> > p = r'abc.+jk'
> > lst = re.findall(p, s)
> > print lst[0]
> 
> I suggest using r'abc.+?jk' instead.
> 
> the additional ? makes the preceding '.+' non-greedy, so instead of
> matching as long a string as it can, it matches as short a string as possible.

Non-greedy repetition will not work in this case, I guess:

from re import compile as Pattern
s = 'abc123abc45abc789jk'
p = Pattern(r'abc.+?jk')
print p.match(s).group()
==>
abc123abc45abc789jk

(Someone explain why?)

My solution would be to explicitly exclude 'abc' from the sequence of chars
matched by '.+'. To do this, use negative lookahead (?!...) before '.':
p = Pattern(r'(abc((?!abc).)+jk)')
print p.findall(s)
==>
[('abc789jk', '9')]

But that is not exactly what you want: the inner () needed to express the
exclusion is treated by findall as a group to be returned, so you also get
the last char matched in there.
To avoid that, use non-grouping parens (?:...). This also removes the need for
parens around the whole pattern:
p = Pattern(r'abc(?:(?!abc).)+jk')
print p.findall(s)
['abc789jk']
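
An alternative sketch for when the capturing groups have to stay in the pattern: iterate with finditer and take group(0), which is always the whole match regardless of the groups inside it. The variable names are illustrative.

```python
import re

s = 'abc123abc45abc789jk'

# The pattern still has capturing groups, but group(0) is always
# the whole match, whatever groups the pattern contains.
pattern = re.compile(r'(abc((?!abc).)+jk)')
whole_matches = [m.group(0) for m in pattern.finditer(s)]
print(whole_matches)
```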

Denis
--
la vita e estrany

Re: [Tutor] regular expression question

2009-04-28 Thread Marek Spociński , Poland

On 28 April 2009 at 11:16, Andre Engels wrote:
> 2009/4/28 Marek Spociński, Poland:
> >> Hello,
> >>
> >> The following code returns 'abc123abc45abc789jk'. How do I revise the
> >> pattern so that the return value will be 'abc789jk'? In other words, I
> >> want to find the pattern 'abc' that is closest to 'jk'. Here the string
> >> '123', '45' and '789' are just examples. They are actually quite
> >> different in the string that I'm working with.
> >>
> >> import re
> >> s = 'abc123abc45abc789jk'
> >> p = r'abc.+jk'
> >> lst = re.findall(p, s)
> >> print lst[0]
> >
> > I suggest using r'abc.+?jk' instead.
> >
> > the additional ? makes the preceding '.+' non-greedy, so instead of
> > matching as long a string as it can, it matches as short a string as possible.
> 
> That was my first idea too, but it does not work for this case,
> because Python will still try to _start_ the match as soon as
> possible. To use .+? one would have to reverse the string, then use the
> reverse regular expression on the result, which looks like a rather
> roundabout way of doing things.

I don't have access to Python right now, so I cannot test my ideas,
and I don't want to give you a wrong idea either.

Re: [Tutor] regular expression question

2009-04-28 Thread Andre Engels

2009/4/28 Marek Spociński, Poland:
>> Hello,
>>
>> The following code returns 'abc123abc45abc789jk'. How do I revise the
>> pattern so that the return value will be 'abc789jk'? In other words, I want
>> to find the pattern 'abc' that is closest to 'jk'. Here the string '123',
>> '45' and '789' are just examples. They are actually quite different in the
>> string that I'm working with.
>>
>> import re
>> s = 'abc123abc45abc789jk'
>> p = r'abc.+jk'
>> lst = re.findall(p, s)
>> print lst[0]
>
> I suggest using r'abc.+?jk' instead.
>
> the additional ? makes the preceding '.+' non-greedy, so instead of matching
> as long a string as it can, it matches as short a string as possible.

That was my first idea too, but it does not work for this case,
because Python will still try to _start_ the match as soon as
possible. To use .+? one would have to reverse the string, then use the
reverse regular expression on the result, which looks like a rather
roundabout way of doing things.
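
For completeness, a sketch of what that roundabout route could look like, assuming the start and end markers are plain literal strings. The function name closest_by_reversal is made up for illustration.

```python
import re

def closest_by_reversal(text, start='abc', end='jk'):
    """Find start...end with the start marker nearest to the end marker."""
    # In the reversed text the end marker comes first, so a non-greedy
    # match from it stops at the nearest (reversed) start marker.
    rev_pattern = re.escape(end[::-1]) + '.+?' + re.escape(start[::-1])
    m = re.search(rev_pattern, text[::-1])
    if m is None:
        return None
    return m.group()[::-1]   # flip the match back to reading order

print(closest_by_reversal('abc123abc45abc789jk'))
```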



-- 
André Engels, andreeng...@gmail.com

Re: [Tutor] regular expression question

2009-04-28 Thread Marek Spociński

> Hello,
> 
> The following code returns 'abc123abc45abc789jk'. How do I revise the pattern
> so that the return value will be 'abc789jk'? In other words, I want to find the
> pattern 'abc' that is closest to 'jk'. Here the string '123', '45' and '789'
> are just examples. They are actually quite different in the string that I'm
> working with.
> 
> import re
> s = 'abc123abc45abc789jk'
> p = r'abc.+jk'
> lst = re.findall(p, s)
> print lst[0]

I suggest using r'abc.+?jk' instead.

The additional ? makes the preceding '.+' non-greedy, so instead of matching
as long a string as it can, it matches as short a string as possible.



Re: [Tutor] How to run a .py file or load a module?

2009-04-28 Thread Dayo Adewunmi

Denis, this mail was very comprehensive, and went a long way toward driving
it all home for me.
There are several different concepts involved in this simple problem that I
had, and you guys explaining them has really expanded my pythonic horizon,
especially the explanations of sys.argv, and also the idea of

   from <module> import <name> as <alias>


Thanks a lot, everybody. :-)

Dayo
--
spir wrote:

On Sun, 26 Apr 2009 22:35:36 +0100,
Dayo Adewunmi wrote:

  

How can I

a) Open my shell, and do something like: $ python countdown.py   
but have it take an argument and pass it to the function, and execute.



When your code is (nicely) organised as a set of funcs or class definitions, you also need a
"launcher", usually called "main()". Otherwise python only parses and records the
definitions into live objects that wait for someone to tell them what they're supposed to do. I'll
stick first to processes without any parameter, as if your func always counted down from 10.
There are several use patterns:

(1) Program launched from command line.
Just add a call to your func:
   countdown(10)

(2) Module imported from other prog
Nothing to add to your module.
Instead, the importing code needs to hold:
   import countdown # the module (file)
   ...
   countdown.countdown(n)   # the func itself
or
   from countdown import countdown  # the func, directly
   ...
   countdown(n)

(3) Both
You need to differentiate between launching and importing. Python provides a
rather esoteric idiom for that:
   
   if __name__ == "__main__":
       countdown(10)
The trick is that when a prog is launched directly (as opposed to imported), it silently 
gets a '__name__' attribute that is automatically set to "__main__". So that 
the one-line block above will only run when the prog is launched, like in case (1). While 
nothing will happen when the module is imported -- instead the importing code will have 
the countdown func available under name 'countdown' as expected, like in case (2). Clear?
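
A minimal sketch of case (3), assuming the file is saved as countdown.py. To keep the result easy to check, this version of countdown returns a list instead of printing and sleeping, which is a simplification of the function discussed in this thread.

```python
def countdown(n):
    """Return the countdown as a list so the result is easy to inspect."""
    return list(range(n, 0, -1)) + ['Blastoff!']

if __name__ == "__main__":
    # Runs only when the file is executed directly (python countdown.py),
    # not when another module does "import countdown".
    for item in countdown(3):
        print(item)
```

Importing the module from the interactive interpreter then gives access to countdown without triggering the loop at the bottom.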

  

b) Import the function in the interactive interpreter, and call it like so:

countdown(10)

without getting the abovementioned error.



In the case of an import, as your func definition has the proper parameter, you 
have nothing to change.
While for a launch from command-line, you need to get the parameter given by 
the user.
But how? Python provides a way to read the command-line arguments under an 
attribute called 'argv' of the 'sys' module.
argv is a list whose zeroth item is the name of the file. For instance, if called as
   python countdown.py 9
argv holds: ['countdown.py', '9']
Note that both are strings. Then you can catch and use the needed parameter, 
e.g.

from time import sleep as wait
from sys import argv as user_args

def countdown(n=10):
    if n <= 0:
        print 'Blastoff!'
    else:
        wait(0.333)
        print n
        countdown(n-1)

def launch():
    if len(user_args) == 1:
        countdown()
    else:
        n = int(user_args[1])
        countdown(n)

if __name__ == "__main__":
    launch()

(You can indeed put the content of launch() in the if block. But I find it 
clearer that way, and it happens to be a common practice.)
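
A possible refinement, sketched under the assumption that a non-numeric argument should produce a short usage message rather than a traceback. The helper name parse_count and the wording of the message are made up for illustration.

```python
import sys

def parse_count(args, default=10):
    """Return the countdown start taken from a sys.argv-style list."""
    if len(args) < 2:
        return default          # no argument given on the command line
    try:
        return int(args[1])     # argv items are always strings
    except ValueError:
        raise SystemExit('usage: %s [integer]' % args[0])

print(parse_count(['countdown.py', '9']))
```

In a real script the call would be parse_count(sys.argv); passing the list in explicitly just makes the function easy to try from the interpreter.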

Denis
--
la vita e estrany

Re: [Tutor] How to run a .py file or load a module?

2009-04-28 Thread Dayo Adewunmi


David wrote:

Norman Khine wrote:
On Mon, Apr 27, 2009 at 12:07 AM, Sander Sweers wrote:

Here is another one for fun, you run it like
python countdown.py 10

#!/usr/bin/env python

import sys
from time import sleep

times = int(sys.argv[1]) # The argument given on the command line

def countdown(n):
    try:
        while n != 1:
            n = n-1
            print n
            sleep(1)
    finally:
        print 'Blast Off!'

countdown(times)




Thank you all for all your  valuable input on this. I have learned so 
much on this particular subject
in such a short time. David, I ran your code, and noticed that given 
countdown(10) your countdown

starts at 9 and Blastoff takes place after 1, not 0. To fix that, I changed

  while n != 1

to
  
  while n != 0



and changed

  n = n - 1
  print n

to

  print n
  n = n - 1
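
Put together, the two changes could look like the sketch below. It collects the values in a list instead of printing and sleeping, so the order is easy to check, and it uses n > 0 rather than n != 0 as the loop test; that guard is an addition made here so that a negative argument does not loop forever.

```python
def countdown(n):
    """Return the values shown for countdown(n), ending with 'Blast Off!'."""
    shown = []
    while n > 0:
        shown.append(n)   # record the value first...
        n = n - 1         # ...then decrement
    shown.append('Blast Off!')
    return shown

print(countdown(3))
```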


Thanks for the time you guys have put into this. It's much appreciated. :-)

Dayo

Re: [Tutor] Working with lines from file and printing to another keeping sequential order

2009-04-28 Thread spir

On Mon, 27 Apr 2009 23:29:13 -0400,
Dan Liang wrote:

> Hi Bob, Shantanoo, Kent, and tutors,
> 
> Thank you Bob, Shantanoo, Kent for all the nice feedback. Exception
> handling, the concept of states in cs, and the use of the for loop with
> offset helped a lot. Here is the code I now have, based on your suggestions,
> and it does what I need:
> 
> ListLines = [ line.rstrip() for line in open('test.txt') ]
> 
> countYes = 0
> countNo = 0
> 
> for i in range(len(ListLines)):
>  if ListLines[i].endswith('yes'):
>  countYes+=1
>  print "countYes", countYes, "\t\t", ListLines[i]
> 
>  if not ListLines[i].endswith('yes'):
> continue
> 
>  for offset in (1, 2, 3, 4, 5, 6, 7, 8):
> if i+offset < len(ListLines) and ListLines[i+offset].endswith('no'):
> 
>countNo+=1
> 
>print "countNo", countNo, "\t\t", ListLines[i+offset]

It probably works, but the logic is needlessly complicated:
-1- case ends with 'yes', do
-2- case not ends with 'yes', do
-3- case ends with 'yes', again, do

You'd better group -1- and -3-, no? Moreover, as action -2- is to continue, 
further code is simplified if -2- is placed first:

for i in range(len(ListLines)):
    if not ListLines[i].endswith('yes'):
        continue
    # case line ends with 'yes': process it
    countYes += 1
    print "countYes", countYes, "\t\t", ListLines[i]
    for offset in (1, 2, 3, 4, 5, 6, 7, 8):
        if i+offset < len(ListLines) and ListLines[i+offset].endswith('no'):
            countNo += 1
            print "countNo", countNo, "\t\t", ListLines[i+offset]

Also, use more than 1 space for indent, and be consistent (set the value in 
your editor settings and use the TAB key to achieve that); and avoid too many 
useless blank lines.
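
A further variation, offered only as a sketch: enumerate and a slice can replace the index arithmetic, since a slice simply stops at the end of the list. The function name count_yes_no, the window parameter and the sample lines are assumptions made for illustration; the counts are returned rather than printed.

```python
def count_yes_no(lines, window=8):
    """Return (yes_count, no_count) using the same rules as the loop above."""
    yes_count = 0
    no_count = 0
    for i, line in enumerate(lines):
        if not line.endswith('yes'):
            continue
        yes_count += 1
        # Slicing stops at the end of the list, so no bounds check is needed.
        for follower in lines[i + 1:i + 1 + window]:
            if follower.endswith('no'):
                no_count += 1
    return yes_count, no_count

sample = ['a yes', 'b no', 'c no', 'd yes', 'e no']
print(count_yes_no(sample))
```

As in the original loop, a 'no' line that falls inside the window of two different 'yes' lines is counted once for each of them.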

Denis
--
la vita e estrany

[Tutor] regular expression question

2009-04-28 Thread Kelie

Hello,

The following code returns 'abc123abc45abc789jk'. How do I revise the pattern so
that the return value will be 'abc789jk'? In other words, I want to find the
pattern 'abc' that is closest to 'jk'. Here the string '123', '45' and '789' are
just examples. They are actually quite different in the string that I'm working
with. 

import re
s = 'abc123abc45abc789jk'
p = r'abc.+jk'
lst = re.findall(p, s)
print lst[0]

Thanks for your help!

