Hi all,

Thanks for your help! I have a regular expression that selects the data in 2 columns:

<table> <tr> <td>(.*?)</td> <td>(.*?)</td> </tr> </table>

The matches are saved in a variable called MyVar. I then loop through MyVar_X_Y to access the data, where X is the row and Y is the column. What is the equivalent XPath query?

Thanks,
Thanh

On Mon, Jan 24, 2011 at 4:31 PM, Deepak Shetty wrote:

> What is it that you want to select? All the columns that are not titles?
> That would be something like
> //tbody/tr/td/span (but this will flatten out the structure).
>
> regards
> deepak
>
> On Mon, Jan 24, 2011 at 10:08 AM, thanh nguyen wrote:
>
> > Felix,
> >
> > I'll have a look at XPath. It looks interesting, but I can't find any
> > example code for XPath.
> > Thank you
> > Thanh
> >
> > PS: this is the table I'm working on. The 1st row is the title row. The 2nd row
> > contains the data. I want to extract data1, data2, ... The regular expression reads
> > row by row. In the BeanShell script I do 2 loops: one over the rows and one over
> > the columns. There are odd-numbered rows and even-numbered rows.
> >
> > <table>
> > <tr><th class="sbListHeaderCellEnd" scope="col" valign="top" width="5"><img alt="" height="5" src="/assets/common/img/cnr_t_tl.gif" width="5"></th>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_name')" onclick="submitForm1023(event);return false;" title="Sort by column Title">Title1</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title2</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_deliveryType')" onclick="submitForm1024(event);return false;" title="Sort by column Delivery Type">Title3</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_regStartDate')" onclick="submitForm1025(event);return false;" title="Sort by column Registration Date">Title4</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_completionStatus')" onclick="submitForm1026(event);return false;" title="Sort by column Completion Status">Title5</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_completionDate')" onclick="submitForm1027(event);return false;" title="Sort by column Date Marked Complete">Title6</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title7</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_score')" onclick="submitForm1028(event);return false;" title="Sort by column Score">Title8</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_grade')" onclick="submitForm1029(event);return false;" title="Sort by column Grade">Title9</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title10</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title11</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title12</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title13</span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText"><a class="sbListHeaderText" href="javascript:void('sort_startDate')" onclick="submitForm1030(event);return false;" title="Sort by column Offering Start Date">Title14</a></span></th>
> > <td class="sbListColumnSpacer"><img alt="" border="0" height="1" src="/assets/common/img/1x1.gif" width="1"></td>
> > <th class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span class="sbListHeaderText">Title15</span></th>
> > <th align="right" class="sbListHeaderCellEnd" scope="col" valign="top" width="5"><img alt="" height="5" src="/assets/common/img/cnr_t_tr.gif" width="5"></th></tr>
> >
> > <tr><td class="sbListOddCellEnd"></td>
> > <td class="sbListOddCell"><span class="sbListText"><a class="sbLinkTableDisplay" doTruncate="false" href="javascript:void('titleLink')" onclick="submitForm1031(event);return false;" title="data1">data1</a></span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText"> </span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data2</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data3</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText" nowrap="nowrap"><span class="sbListText">data4</span><br><a class="sbLinkTableDisplay" doTruncate="false" href="javascript:void('blah')" onclick="submitForm1033(event);return false;" title="blah blah">blah</a></span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data5</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText"> </span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText"> </span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText"> </span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data6</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data7</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data8</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText">data8</span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell"><span class="sbListText"> </span></td><td class="sbListColumnSpacer"></td>
> > <td class="sbListOddCell" nowrap><a class="sbLinkTableDisplay" doTruncate="false" href="javascript:void('editLink')" onclick="submitForm1035(event);return false;" title="Edit">Edit</a><br><a class="sbLinkTableDisplay" doTruncate="false" href="javascript:void('deleteLink')" onclick="submitForm1036(event);return false;" title="Delete">Delete</a><br><br></td>
> > <td class="sbListOddCellEnd"></td></tr><tr>
> >
> > </table>
> >
> > On Mon, Jan 24, 2011 at 10:34 AM, Felix Frank wrote:
> >
> > > On 01/24/2011 04:27 PM, thanh nguyen wrote:
> > > > Hi everyone,
> > > >
> > > > I have a big HTML table from which I need to extract data. The table has
> > > > several columns. The regular expression required to do the extraction job
> > > > is very long and complex. The code is hard to debug and to maintain. I'd
> > > > like to know what the alternatives are. Is there an HTML parser that creates
> > > > DOM objects? I could program a postprocessor in BeanShell...
> > > >
> > > > Thanks a lot
> > >
> > > That would be the XPath Extractor, but maybe someone can help you build
> > > a simpler regex instead (you need to share more details for this to
> > > happen).
> > >
> > > Regards,
> > > Felix
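
For the row/column question at the top of this thread, here is a minimal sketch of how XPath-style selection can keep the row structure instead of flattening it. It is only an illustration in Python's standard library, not the JMeter XPath Extractor itself. The markup in SAMPLE is a simplified, well-formed stand-in for the table quoted above, and the second data row and the function name extract_rows are made up for the example. The real page has unclosed <br> and <img> tags, so it would likely need an HTML-tolerant parser before any XPath query can run on it.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the quoted table (assumption: the real
# page is not valid XML and would need tidying before it can be parsed).
SAMPLE = """<table>
  <tr>
    <th><span class="sbListHeaderText">Title1</span></th>
    <th><span class="sbListHeaderText">Title2</span></th>
  </tr>
  <tr>
    <td class="sbListOddCellEnd"></td>
    <td class="sbListOddCell"><span class="sbListText">data1</span></td>
    <td class="sbListColumnSpacer"></td>
    <td class="sbListOddCell"><span class="sbListText">data2</span></td>
  </tr>
  <tr>
    <td><span class="sbListText">data3</span></td>
    <td class="sbListColumnSpacer"></td>
    <td><span class="sbListText">data4</span></td>
  </tr>
</table>"""


def extract_rows(markup):
    """Return one list of cell texts per data row (hypothetical helper)."""
    root = ET.fromstring(markup)
    rows = []
    # Outer query: every <tr>, so the row structure is preserved.
    for tr in root.findall(".//tr"):
        # Inner query, relative to the row: spans that hold cell data.
        # Header cells (<th>) and spacer cells have no such span.
        spans = tr.findall("td/span[@class='sbListText']")
        cells = [(span.text or "").strip() for span in spans]
        if cells:  # skip the title row, which yields no data cells
            rows.append(cells)
    return rows


rows = extract_rows(SAMPLE)
# rows[x][y] plays the role of MyVar_X_Y, with zero-based x and y.
for x, row in enumerate(rows):
    for y, value in enumerate(row):
        print(x, y, value)
```

Selecting on the span class rather than the cell class means odd and even rows are handled by the same query, which avoids guessing the class name used for even rows.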

