Re: Mining strings from a HTML document.

2006-01-26 Thread Derick van Niekerk

Thanks Guys!

I've written several functions yesterday to import from different types
of raw data including html and different text formats. In the end I
never used the extract function or the parser module, but your advice
put me on the right track. All these functions are now in a single
object and the inner workings are abstracted (as much as python
allows). So a single object can now import from any file without me
having to worry about what file it is!

Might not sound like much, but the whole OOP thing is new to me too, so
I am very happy with what python could do for me.

Now just to get this stuff into MySQL...new topic :)

Thanks for all your help!
-d-

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Mining strings from a HTML document.

2006-01-26 Thread Runsun Pan

> def extract(text,s1,s2):
>     ''' Extract strings wrapped between s1 and s2.
>
>     >>> t="""this is a test for extract()
>     that does multiple extract """
>     >>> extract(t,'','')
>     ['test', 'extract()', 'does multiple extract']
>
>     '''
>     beg = [1,0][text.startswith(s1)]
>     tmp = text.split(s1)[beg:]
>     end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
>     return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]

> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?

The idea is using .split( ) to cut the string in different manners.
For a string:

 -AderickBArunsunB--

first cut at A gives you  ['-', 'derickB', 'runsunB--'], and dropping the
leading chunk leaves  ['derickB', 'runsunB--']   (line-1,2)
2nd cut at B gives you  ['derick', 'runsun']   (line-3,4)

The function uses list comprehension heavily. As Magnus already explained,
line-1 is just a switch. Same as line-3. These two lines exist to solve the
difference between

 -AderickBArunsunB--
 AderickBArunsunB--

or

 -AderickBArunsunB--
 -AderickBArunsunB

That is, if the original raw string starts with s1 or ends with s2, special
consideration should be taken.

Line-2 and -4 are just common list slicing (plus one list comprehension in
line-4) that you should be able to find in any Python tutorial.
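
A minimal sketch of the two cuts described above, run on the same sample string. The extract() body is the function quoted at the top of this message with indentation restored; the delimiters "A" and "B" are just the stand-ins used in the example, not real markup.

```python
def extract(text, s1, s2):
    """Return the substrings of text that sit between s1 and s2."""
    beg = [1, 0][text.startswith(s1)]                    # line 1: switch
    tmp = text.split(s1)[beg:]                           # line 2: cut at s1
    end = [len(tmp), len(tmp) + 1][text.endswith(s2)]    # line 3: switch
    # line 4: cut each chunk at s2 and keep only chunks that contained s2
    return [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1][:end]


raw = "-AderickBArunsunB--"

first_cut = raw.split("A")
print(first_cut)        # ['-', 'derickB', 'runsunB--']
print(first_cut[1:])    # ['derickB', 'runsunB--']

print(extract(raw, "A", "B"))                    # ['derick', 'runsun']
print(extract("AderickBArunsunB--", "A", "B"))   # ['derick', 'runsun']
```

When the string already starts with s1, split() puts an empty string first; that chunk contains no s2, so the filter in line 4 drops it.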

Let us know if it's still not clear.

--
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
Runsun Pan, PhD
[EMAIL PROTECTED]
Nat'l Center for Macromolecular Imaging
http://ncmi.bcm.tmc.edu/ncmi/
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~

Re: Mining strings from a HTML document.

2006-01-26 Thread Magnus Lycka

Derick van Niekerk wrote:
> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?

It's not as strange as it looks. [1,0] is a list. If you put []
after a list, it's for indexing, right? (Unless there's one or
two ':' somewhere, in which case it's slicing.)

text.startswith(s1) evaluates to True or False, which is equivalent
to 1 or 0 in a numerical context. [1,0][0] is 1, and [1,0][1] is
0, so you could say that it's a somewhat contrived way of writing
"beg = int(not text.startswith(s1))" or "beg = 1 - text.startswith(s1)"
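
A small sketch to check that equivalence. The helper name start_index and the sample strings are made up for illustration; the conditional expression on the last line of the function is another spelling available in later Python versions (2.5 onwards).

```python
def start_index(text, s1):
    """Return 0 if text starts with s1, otherwise 1."""
    flag = text.startswith(s1)      # a bool: True or False
    by_indexing = [1, 0][flag]      # False picks position 0, True picks position 1
    # The two alternative spellings give the same number.
    assert by_indexing == int(not flag) == 1 - flag
    # A conditional expression says the same thing more directly.
    return 0 if flag else 1


print(start_index("AderickB", "A"))     # 0
print(start_index("-AderickB", "A"))    # 1
print(True == 1, False == 0, isinstance(True, int))    # True True True
```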

Re: Mining strings from a HTML document.

2006-01-26 Thread Cameron Laird

In article <[EMAIL PROTECTED]>,
Derick van Niekerk <[EMAIL PROTECTED]> wrote:
.
.
.
>I suppose very few books on python start off with HTML processing
>instead of 'hello world' :p
.
.
.
... very few, perhaps, but how many do you need when the
one example is so strong?  In any case, you'll want to look
into *Text Processing in Python* <http://gnosis.cx/TPiP/>.

Re: Mining strings from a HTML document.

2006-01-26 Thread Derick van Niekerk

Runsun Pan helped me out with the following:

You can also try the following very primitive solution that I sometimes
use to extract simple information in a quick and dirty way:

def extract(text,s1,s2):
    ''' Extract strings wrapped between s1 and s2.

    >>> t="""this is a test for extract()
    that does multiple extract """
    >>> extract(t,'','')
    ['test', 'extract()', 'does multiple extract']

    '''
    beg = [1,0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
    return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]


This will help out a *lot*! Thank you. This is a better bet than the
parser in this particular implementation because the data I need is not
encapsulated in tags! Field names are within  tags followed by
plain text data and ended with a  tag. This was my main problem
with a parser, but your extract function solves it beautifully!

I'm posting back to the NG just in case it is of value to anyone
else.

Could you/anyone explain the 4 lines of code to me though? A crash
course in Python shorthand? What does it mean when you use two sets of
brackets as in : beg = [1,0][text.startswith(s1)] ?

Thanks for the help!
-d-


Re: Mining strings from a HTML document.

2006-01-26 Thread Derick van Niekerk

I'm battling to understand this. I am switching to python while in a
production environment so I am tossed into the deep end. Python seems
easier to learn than other languages, but some of the conventions still
trip me up. Thanks for the link - I'll have to go through all the
previous chapters to understand this one though...

I suppose very few books on python start off with HTML processing
instead of 'hello world' :p

Could you give me an example of how to use it to extract basic
information from the web page? I need a bit of a hit-the-ground-running
approach to python. You'll see that the data in my example isn't
encapsulated in tags - is there still an easy way to extract it using
the parser module?

Thanks


Re: Mining strings from a HTML document.

2006-01-25 Thread Derick van Niekerk

Thanks, Jay!

I'll try this out today. Trying to write my own parser is such a pain.
This BeautifulSoup script is very nice! I'll give it a try.

If you can help me out with an example of how to do what I explained, I
would appreciate it. I actually finished doing an import last night,
but there is no way I'm creating another parser from scratch!

I tried figuring out what to do by going through the code, but I am
still waay too fresh to understand generators and some of the coding
conventions.

Thanks again


Re: Mining strings from a HTML document.

2006-01-25 Thread Chris Lasher

I think Jay's advice is solid: you shouldn't rule out HTML parsing.
It's not too scary and it's probably not overboard. Using a common HTML
parsing library saves you from having to write and debug your own
parser. Try looking at Dive Into Python's chapter on it, first.
http://www.diveintopython.org/html_processing/index.html
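
For a rough idea of what the parser route could look like, here is a sketch using html.parser from the standard library of current Python (not BeautifulSoup, and not the module covered in that chapter). The markup is an assumption, since the original tags did not survive in the posted example: field names are taken to sit in <b> tags and each field to end at a <br>. FieldCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class FieldCollector(HTMLParser):
    """Collect (name, data) pairs from markup like <b>Name</b>data<br>."""

    def __init__(self):
        super().__init__()
        self.fields = []         # finished (name, data) pairs
        self._in_name = False    # True while inside an assumed <b> element
        self._name = None        # name of the field being read, if any
        self._data = []          # text pieces seen since the name ended

    def handle_starttag(self, tag, attrs):
        if tag == "b":           # a new field name starts
            self._flush()
            self._in_name = True
            self._name = ""
        elif tag == "br":        # assumed end-of-field marker
            self._flush()

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self._name += data
        elif self._name is not None:
            self._data.append(data)

    def close(self):
        super().close()
        self._flush()            # keep a last field that has no closing <br>

    def _flush(self):
        if self._name is not None:
            self.fields.append((self._name.strip(), "".join(self._data).strip()))
        self._name = None
        self._data = []


sample = "<b>Field1</b>Data one<br><b>Field2</b>Data two<br>"
parser = FieldCollector()
parser.feed(sample)
parser.close()
print(parser.fields)    # [('Field1', 'Data one'), ('Field2', 'Data two')]
```

The point of the sketch is that data outside any tag is still delivered to handle_data(), so text that is not wrapped in tags can be picked up by remembering which tag was seen last.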


Re: Mining strings from a HTML document.

2006-01-25 Thread jay graves

Derick van Niekerk wrote:
> What are the string functions I would use and how would I use them? I
> saw something about html parsing in python, but that might be overkill.
> Babysteps.

Despite your reluctance, I would still recommend an HTML parsing
module. I like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/

...
jay


Mining strings from a HTML document.

2006-01-25 Thread Derick van Niekerk

Hi,

I am new to Python and have been doing most of my work with PHP until
now. I find Python to be *much* nicer for the development of local apps
(running on my machine) but I am very new to the Python way of thinking
and I don't really know where to start other than just by doing it...so
far I'm just through the tut :)

My problem is as follows:
I have an html file with a list of records from a database. The list of
records is delimited with a comment and the format is as follows:


Record 1
Field1Data data dataField2Data data
dataField3Data data dataField4Data data data

Record 2
Field1Data data dataField2Data data
dataField3Data data dataField4Data data data

Record 3
Field1Data data dataField2Data data
dataField3Data data dataField4Data data data


The data fields could be up to 2 or 3 paragraphs each. The number and
names of fields may differ between records (some info in one, but not
the other - ie null values do not show up in the html)

What are the string functions I would use and how would I use them? I
saw something about html parsing in python, but that might be overkill.
Babysteps.
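
As a rough illustration of the plain string-function route, the sketch below uses only str.split() and str.partition(). Everything about the markup is assumed, because the tags in the example above were lost: the record delimiter is taken to be the comment <!-- record -->, field names to sit in <b> tags, and each field to end at a <br>. The function name parse_records and the sample page are hypothetical.

```python
# Hypothetical page; the real delimiter comment and tags are not known.
page = (
    "<!-- record -->Record 1<br>"
    "<b>Field1</b>Data one<br><b>Field2</b>Data two<br>"
    "<!-- record -->Record 2<br>"
    "<b>Field1</b>Data three<br>"
)


def parse_records(html, delimiter="<!-- record -->"):
    """Return one dict of {field name: data} per record."""
    records = []
    # Everything before the first delimiter is not a record, so skip it.
    for chunk in html.split(delimiter)[1:]:
        fields = {}
        # Each piece after a "<b>" starts with a field name.
        for piece in chunk.split("<b>")[1:]:
            name, _, rest = piece.partition("</b>")
            value, _, _ = rest.partition("<br>")
            fields[name.strip()] = value.strip()
        records.append(fields)
    return records


for record in parse_records(page):
    print(record)
# {'Field1': 'Data one', 'Field2': 'Data two'}
# {'Field1': 'Data three'}
```

Because each record becomes its own dict, records with different numbers or names of fields need no special handling.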

Thanks

