Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single Unicode file that contains descriptions of hundreds of
objects. The file closely resembles the HTML-EXAMPLE pasted below.

I need to parse the file, extract the data from the HTML, and produce a
tab-separated file that looks like OUTPUT-FILE below.

Any tips, advice, or guidance would be greatly appreciated.

Thanks,

Egon




=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1	H2	DIV	Segment1	Segment2	Segment3
RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
PinkH1-2	PinkH2-2	PinkDIV2-2	PinkSegmentDIV1-2	No-Value	No-Value
BlackH1-3	BlackH2-3	BlackDIV2-3	BlackSegmentDIV1-3	No-Value	No-Value
YellowH1-4	YellowH2-4	YellowDIV2-4	YellowSegmentDIV1-4	YellowSegmentDIV2-4	No-Value
------Tab Separated Output File End------



=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>
------HTML Example End------
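
For reference, here is a rough sketch of the kind of conversion I have in
mind. It is only an illustration under a few assumptions, not a tested
solution: the file is assumed to be UTF-8, every object is assumed to start
with an <h1>, and because the <div "segment1"> tags are not valid HTML
attributes the sketch uses regular expressions from the standard "re" module
rather than a real HTML parser. The names parse_records, to_tsv, convert and
SAMPLE are made up for this example.

```python
import re

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]
MISSING = "No-Value"  # placeholder used when an object lacks a column

# Captures (tag name, raw attribute text, inner text) for h1, h2 and div.
ELEMENT_RE = re.compile(r"<(h1|h2|div)\b([^>]*)>(.*?)</\1\s*>",
                        re.IGNORECASE | re.DOTALL)
# Finds the segment number inside the raw attribute text, e.g. ' "segment2"'.
SEGMENT_RE = re.compile(r"segment(\d+)", re.IGNORECASE)


def parse_records(html):
    """Return one dict per object; an <h1> is assumed to start each object."""
    records = []
    current = None
    for tag, attrs, text in ELEMENT_RE.findall(html):
        tag = tag.lower()
        text = text.strip()
        if tag == "h1":
            current = {"H1": text}
            records.append(current)
        elif current is None:
            continue  # skip anything that appears before the first <h1>
        elif tag == "h2":
            current["H2"] = text
        else:
            # A div without a segment marker is treated as the plain DIV column.
            seg = SEGMENT_RE.search(attrs)
            column = "Segment" + seg.group(1) if seg else "DIV"
            current[column] = text
    return records


def to_tsv(records):
    """Render the records as tab-separated text with a header line."""
    lines = ["\t".join(COLUMNS)]
    for rec in records:
        lines.append("\t".join(rec.get(col, MISSING) for col in COLUMNS))
    return "\n".join(lines) + "\n"


def convert(in_path, out_path, encoding="utf-8"):
    """Read the HTML file and write the TSV file (encoding is an assumption)."""
    with open(in_path, encoding=encoding) as src:
        records = parse_records(src.read())
    with open(out_path, "w", encoding=encoding) as dst:
        dst.write(to_tsv(records))


# Small in-memory demo using two objects from the HTML example above.
SAMPLE = """<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<comment></comment>
</html>
"""
print(to_tsv(parse_records(SAMPLE)), end="")
```

Segments beyond Segment3 would be parsed but left out of the output, since
only the six columns listed in COLUMNS are written. If the real file turns
out to be well-formed HTML, a proper parser would probably be a safer choice
than regular expressions.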
-- 
http://mail.python.org/mailman/listinfo/python-list
