Re: Mining strings from a HTML document.
Thanks, guys! Yesterday I wrote several functions to import from different types of raw data, including HTML and various text formats. In the end I never used the extract function or the parser module, but your advice put me on the right track.

All these functions are now in a single object, and the inner workings are abstracted (as much as Python allows). So a single object can now import from any file without me having to worry about what kind of file it is! It might not sound like much, but the whole OOP thing is new to me too, so I am very happy with what Python could do for me.

Now just to get this stuff into MySQL... new topic :)

Thanks for all your help!
-d-
--
http://mail.python.org/mailman/listinfo/python-list
Re: Mining strings from a HTML document.
>def extract(text,s1,s2):
>    ''' Extract strings wrapped between s1 and s2.
>
>    >>> t="""this is a test for extract()
>    that does multiple extract """
>    >>> extract(t,'','')
>    ['test', 'extract()', 'does multiple extract']
>
>    '''
>    beg = [1,0][text.startswith(s1)]
>    tmp = text.split(s1)[beg:]
>    end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
>    return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]
>
> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?

The idea is to use .split() to cut the string in two different ways. For a string

    -AderickBArunsunB--

the first cut, at A, gives you (lines 1 and 2)

    ['-', 'derickB', 'runsunB--']

and the second cut, at B, gives you (lines 3 and 4)

    ['derick', 'runsun']

The function uses list comprehension heavily. As Magnus already explained, line 1 is just a switch. Same as line 3. These two lines exist to handle the difference between

    -AderickBArunsunB--   and   AderickBArunsunB--

or between

    -AderickBArunsunB--   and   -AderickBArunsunB

That is, if the original raw string starts with s1 or ends with s2, special consideration should be taken. Lines 2 and 4 are just common list slicing that you should be able to find in any Python tutorial.

Let us know if it's still not clear.

--
Runsun Pan, PhD
[EMAIL PROTECTED]
Nat'l Center for Macromolecular Imaging
http://ncmi.bcm.tmc.edu/ncmi/
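The two cuts described above can be traced step by step. This is only a sketch: the function below is a trimmed copy of the quoted extract() that leaves out the final [:end] slice, since slicing a list to its own length or beyond returns it unchanged. The markers "A" and "B" come from the explanation above; the `<b>`/`</b>` pair in the last call is an assumed stand-in, because the delimiters in the docstring did not survive in this copy.

```python
def extract(text, s1, s2):
    """Return the substrings of text that sit between s1 and s2."""
    # First cut: split at s1. The piece before the first s1 is dropped,
    # unless the text starts with s1 (then that piece is just '').
    beg = [1, 0][text.startswith(s1)]
    tmp = text.split(s1)[beg:]
    # Second cut: split each piece at s2 and keep what comes before it.
    # Pieces that contain no s2 at all are skipped.
    return [x.split(s2)[0] for x in tmp if len(x.split(s2)) > 1]


raw = "-AderickBArunsunB--"

first_cut = raw.split("A")        # ['-', 'derickB', 'runsunB--']
names = extract(raw, "A", "B")    # ['derick', 'runsun']

# Same idea with HTML-like delimiters (assumed tag names).
words = extract("this is a <b>test</b> for <b>extract()</b>", "<b>", "</b>")

print(first_cut)
print(names)
print(words)
```

When the text starts with s1, split() yields an empty first piece, which the `len(...) > 1` filter discards because it contains no s2.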
Re: Mining strings from a HTML document.
Derick van Niekerk wrote:
> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?

It's not as strange as it looks. [1,0] is a list. If you put [] after a list, it's for indexing, right? (Unless there's one or two ':' somewhere, in which case it's slicing.) text.startswith(s1) evaluates to True or False, which is equivalent to 1 or 0 in a numerical context. [1,0][0] is 1, and [1,0][1] is 0, so you could say that it's a somewhat contrived way of writing

    beg = int(not text.startswith(s1))

or

    beg = 1 - text.startswith(s1)
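A quick way to convince yourself that these spellings agree is to run them side by side. The sample strings and the helper name `switches` are made up for illustration; the last variant, a conditional expression, is an extra alternative not mentioned above.

```python
def switches(text, s1):
    """Return the start index computed four equivalent ways."""
    starts = text.startswith(s1)          # a bool: True or False
    return (
        [1, 0][starts],                   # bool used as a list index
        int(not starts),                  # explicit conversion
        1 - starts,                       # bool used in arithmetic
        0 if starts else 1,               # conditional expression
    )


with_prefix = switches("<b>name</b> rest", "<b>")
without_prefix = switches("intro <b>name</b>", "<b>")

print(with_prefix)      # every entry is 0
print(without_prefix)   # every entry is 1
```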
Re: Mining strings from a HTML document.
In article <[EMAIL PROTECTED]>, Derick van Niekerk <[EMAIL PROTECTED]> wrote:
. . .
>I suppose very few books on python start off with HTML processing
>instead of 'hello world' :p
. . .
... very few, perhaps, but how many do you need when the one example is so strong? In any case, you'll want to look into *Text Processing in Python*:
http://gnosis.cx/TPiP/
Re: Mining strings from a HTML document.
Runsun Pan helped me out with the following:

You can also try the following very primitive solution that I sometimes use to extract simple information in a quick and dirty way:

    def extract(text,s1,s2):
        ''' Extract strings wrapped between s1 and s2.

        >>> t="""this is a test for extract()
        that does multiple extract """
        >>> extract(t,'','')
        ['test', 'extract()', 'does multiple extract']

        '''
        beg = [1,0][text.startswith(s1)]
        tmp = text.split(s1)[beg:]
        end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
        return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]

This will help out a *lot*! Thank you. This is a better bet than the parser in this particular implementation because the data I need is not encapsulated in tags! Field names are within tags, followed by plain text data and ended with a tag. This was my main problem with a parser, but your extract function solves it beautifully!

I'm posting back to the NG just in case it is of value to anyone else.

Could you/anyone explain the 4 lines of code to me though? A crash course in Python shorthand? What does it mean when you use two sets of brackets as in : beg = [1,0][text.startswith(s1)] ?

Thanks for the help!
-d-
Re: Mining strings from a HTML document.
I'm battling to understand this. I am switching to Python while in a production environment, so I am tossed into the deep end. Python seems easier to learn than other languages, but some of the conventions still trip me up.

Thanks for the link - I'll have to go through all the previous chapters to understand this one though... I suppose very few books on Python start off with HTML processing instead of 'hello world' :p

Could you give me an example of how to use it to extract basic information from the web page? I need a bit of a hit-the-ground-running approach to Python. You'll see that the data in my example isn't encapsulated in tags - is there still an easy way to extract it using the parser module?

Thanks
Re: Mining strings from a HTML document.
Thanks, Jay! I'll try this out today. Trying to write my own parser is such a pain. This BeautifulSoup script is very nice! I'll give it a try.

If you can help me out with an example of how to do what I explained, I would appreciate it. I actually finished doing an import last night, but there is no way I'm creating another parser from scratch! I tried figuring out what to do by going through the code, but I am still way too fresh to understand generators and some of the coding conventions.

Thanks again
Re: Mining strings from a HTML document.
I think Jay's advice is solid: you shouldn't rule out HTML parsing. It's not too scary and it's probably not overboard. Using a common HTML parsing library saves you from having to write and debug your own parser.

Try looking at Dive Into Python's chapter on it, first:
http://www.diveintopython.org/html_processing/index.html
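For a feel of what the parser route might look like when the data sits between tags rather than inside them, here is a minimal sketch using the standard library's html.parser module (the Python 3 name; it is not BeautifulSoup, and not the code from the Dive Into Python chapter). It assumes, purely for illustration, that each field name is wrapped in `<b>` tags and that the plain text up to the next `<b>` is that field's data; the class name FieldParser and the sample string are made up.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect [name, data] pairs; assumes field names sit inside <b> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []       # list of [name, data] pairs, in page order
        self.in_name = False   # True while inside a <b>...</b> element

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_name = True
            self.fields.append(["", ""])

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_name = False

    def handle_data(self, data):
        if not self.fields:
            return             # text before the first field name is ignored
        # Slot 0 holds the field name, slot 1 the text that follows it.
        self.fields[-1][0 if self.in_name else 1] += data


sample = "<b>Field1</b>Data one<br><b>Field2</b>Data two<br>"

parser = FieldParser()
parser.feed(sample)
parser.close()

fields = {name.strip(): value.strip() for name, value in parser.fields}
print(fields)
```

The parser never needs the data itself to be wrapped in a tag: handle_data() receives every run of text, and the flag records whether that text is a name or a value.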
Re: Mining strings from a HTML document.
Derick van Niekerk wrote:
> What are the string functions I would use and how would I use them? I
> saw something about html parsing in python, but that might be overkill.
> Babysteps.

Despite your reluctance, I would still recommend an HTML parsing module. I like BeautifulSoup.

http://www.crummy.com/software/BeautifulSoup/

... jay
Mining strings from a HTML document.
Hi,

I am new to Python and have been doing most of my work with PHP until now. I find Python to be *much* nicer for the development of local apps (running on my machine), but I am very new to the Python way of thinking and I don't really know where to start other than just by doing it... so far I'm just through the tut :)

My problem is as follows: I have an HTML file with a list of records from a database. The list of records is delimited with a comment, and the format is as follows:

Record 1
Field1Data data dataField2Data data dataField3Data data dataField4Data data data

Record 2
Field1Data data dataField2Data data dataField3Data data dataField4Data data data

Record 3
Field1Data data dataField2Data data dataField3Data data dataField4Data data data

The data fields could be up to 2 or 3 paragraphs each. The number and names of fields may differ between records (some info in one, but not the other - i.e. null values do not show up in the HTML).

What are the string functions I would use and how would I use them? I saw something about HTML parsing in Python, but that might be overkill. Babysteps.

Thanks
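As a rough answer to the "which string functions" question, the plain str methods split(), partition() and strip() can go a long way. The sketch below rests on several assumptions, because the markup of the sample above did not survive: the comment delimiter is taken to be `<!-- record -->`, field names are taken to be wrapped in `<b>` tags, and each field is taken to end at a `<br>`. The function names are hypothetical.

```python
DELIM = "<!-- record -->"   # assumed delimiter; the real comment text is unknown

page = (
    "<!-- record -->Record 1 <b>Field1</b>Data a<br>"
    "<!-- record -->Record 2 <b>Field1</b>Data b<br><b>Field3</b>Data c<br>"
)


def split_records(html, delim):
    """Return the non-empty chunks of html separated by the delimiter."""
    return [chunk.strip() for chunk in html.split(delim) if chunk.strip()]


def parse_fields(chunk):
    """Map each field name in one record to the text that follows it."""
    fields = {}
    # Everything before the first <b> (the record title) is skipped.
    for piece in chunk.split("<b>")[1:]:
        name, _, rest = piece.partition("</b>")
        fields[name.strip()] = rest.split("<br>")[0].strip()
    return fields


records = [parse_fields(chunk) for chunk in split_records(page, DELIM)]
print(records)
```

Using one dict per record means records with different sets of fields need no special handling: a field that is missing from the HTML is simply absent from that record's dict.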