The final solution I found is to break down my regular expression and
work with partial results. I applied the 'Divide and Conquer' strategy. It
takes more lines of code, but it's now more readable.
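
A minimal sketch of that split, in plain Java (it may need adapting before it runs in a BeanShell PostProcessor). It is not my exact code: the class and method names are made up for illustration, the "sbListEvenCell" class name is an assumption based on the odd rows shown in the table below, and the idea is simply one small pattern per step instead of one big pattern for the whole table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not taken from the thread.
class TableRegexSketch {
    // Step 1: one match per <tr>...</tr> block. DOTALL lets '.' cross line breaks.
    private static final Pattern ROW =
            Pattern.compile("<tr[^>]*>(.*?)</tr>", Pattern.DOTALL);
    // Step 2: inside one row, one match per data cell. The closing quote after
    // "Cell" skips the "sbListOddCellEnd" and "sbListColumnSpacer" cells.
    // "Even" is an assumed class name for the even rows.
    private static final Pattern CELL =
            Pattern.compile("<td class=\"sbList(?:Odd|Even)Cell\"[^>]*>(.*?)</td>",
                    Pattern.DOTALL);
    // Step 3: inside one cell, drop the remaining markup.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    static List<List<String>> extract(String html) {
        List<List<String>> table = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                String text = TAG.matcher(cell.group(1)).replaceAll("");
                cells.add(text.replace("&nbsp;", " ").trim());
            }
            // The header row has no data cells, so it is skipped here.
            if (!cells.isEmpty()) {
                table.add(cells);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String html = "<table><tr><th class=\"sbListHeaderCell\">Title1</th></tr>"
                + "<tr><td class=\"sbListOddCellEnd\"></td>"
                + "<td class=\"sbListOddCell\"><span class=\"sbListText\">data1</span></td>"
                + "<td class=\"sbListColumnSpacer\"></td>"
                + "<td class=\"sbListOddCell\"><span class=\"sbListText\">data2</span></td>"
                + "</tr></table>";
        List<List<String>> table = extract(html);
        // table.get(x).get(y) plays the role of MyVar_X_Y (zero-based here).
        System.out.println(table);
    }
}
```

Each pattern can be tested on its own, which is where the readability gain comes from.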

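For the XPath question further down the thread, here is a rough sketch of a row-by-row query using the standard javax.xml.xpath API. It assumes the response has already been turned into well-formed XML (the real page is not: it has unclosed img tags and &nbsp; entities, so a tolerant parser such as the Tidy option of the XPath Extractor would be needed first). The class name, the sample markup, and the two XPath expressions are my own illustration. Whether the path needs a tbody step depends on the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names and expressions are illustrative only.
class TableXPathSketch {
    // Rows that hold at least one data span; the header row has none.
    static final String ROWS = "//table/tr[td/span[@class='sbListText']]";
    // Evaluated relative to one row, so the row/column structure is kept
    // instead of being flattened into a single list.
    static final String CELLS = "td/span[@class='sbListText']";

    static List<List<String>> readCells(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate(ROWS, doc, XPathConstants.NODESET);
            List<List<String>> table = new ArrayList<>();
            for (int x = 0; x < rows.getLength(); x++) {
                NodeList cells = (NodeList) xpath.evaluate(
                        CELLS, rows.item(x), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int y = 0; y < cells.getLength(); y++) {
                    row.add(cells.item(y).getTextContent().trim());
                }
                table.add(row);
            }
            return table;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<table><tr><th>Title1</th><th>Title2</th></tr>"
                + "<tr><td class='sbListOddCell'><span class='sbListText'>data1</span></td>"
                + "<td class='sbListColumnSpacer'/>"
                + "<td class='sbListOddCell'><span class='sbListText'>data2</span></td>"
                + "</tr></table>";
        System.out.println(readCells(xhtml));
    }
}
```

Querying the cells relative to each row is what keeps the X/Y indexing that a single flat expression loses.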

On Tue, Jan 25, 2011 at 1:11 PM, thanh nguyen <[email protected]> wrote:

> Hi all,
>
> Thanks for your help! I have the regular expression to select data in 2
> columns.
>
> <table>
> <tr>
> <td>(.*?)</td>(.*?)<td></td>
> </tr>
> </table>
>
> The selection is saved in a variable called MyVar. Then I loop through
> MyVar_X_Y to access the data, where X is the row and Y the column.
>
> What's the equivalent as an XPath query?
>
> Thanks
> Thanh
>
>
> On Mon, Jan 24, 2011 at 4:31 PM, Deepak Shetty <[email protected]> wrote:
>
>> What is it that you want to select? All the columns that are not titles?
>> That would be something like
>> //tbody/tr/td/span (but this will flatten out the structure).
>>
>> regards
>> deepak
>>
>> On Mon, Jan 24, 2011 at 10:08 AM, thanh nguyen <[email protected]> wrote:
>>
>> > Felix,
>> >
>> > I'll have a look at XPath. It looks interesting, but I can't find any
>> > example code for XPath.
>> > Thank you
>> > Thanh
>> >
>> > PS: this is the table I'm working on. The 1st row is the title. The 2nd row
>> > contains data. I want to extract data1, data2, .... The regular expression
>> > reads row by row. In the BeanShell script I do 2 loops: one for each row and
>> > one for each column. There are odd-numbered rows and even-numbered rows.
>> >
>> >
>> > <table>
>> > <tr><th class="sbListHeaderCellEnd" scope="col" valign="top"
>> width="5"><img
>> > alt="" height="5" src="/assets/common/img/cnr_t_tl.gif"
>> width="5"></th><th
>> > class="sbListHeaderCell" nowrap="true" scope="col"><img alt=""
>> height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_name')"
>> onclick="submitForm1023(event);return
>> > false;" title="Sort by column Title">Title1</a></span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title2</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_deliveryType')"
>> > onclick="submitForm1024(event);return false;" title="Sort by column
>> > Delivery
>> > Type">Title3</a></span></th><td class="sbListColumnSpacer"><img alt=""
>> > border="0" height="1" src="/assets/common/img/1x1.gif"
>> width="1"></td><th
>> > class="sbListHeaderCell" nowrap="true" scope="col"><img alt=""
>> height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_regStartDate')"
>> > onclick="submitForm1025(event);return false;" title="Sort by column
>> > Registration Date">Title4</a></span></th><td
>> > class="sbListColumnSpacer"><img
>> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif"
>> > width="1"></td><th class="sbListHeaderCell" nowrap="true"
>> scope="col"><img
>> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_completionStatus')"
>> > onclick="submitForm1026(event);return false;" title="Sort by column
>> > Completion Status">Title5</a></span></th><td
>> > class="sbListColumnSpacer"><img
>> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif"
>> > width="1"></td><th class="sbListHeaderCell" nowrap="true"
>> scope="col"><img
>> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_completionDate')"
>> > onclick="submitForm1027(event);return false;" title="Sort by column Date
>> > Marked Complete">Title6</a></span></th><td
>> class="sbListColumnSpacer"><img
>> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif"
>> > width="1"></td><th class="sbListHeaderCell" nowrap="true"
>> scope="col"><img
>> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title7</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_score')"
>> onclick="submitForm1028(event);return
>> > false;" title="Sort by column Score">Title8</a></span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_grade')"
>> onclick="submitForm1029(event);return
>> > false;" title="Sort by column Grade">Title9</a></span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title10</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title11</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title12</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title13</span></th><td
>> > class="sbListColumnSpacer"><img alt="" border="0" height="1"
>> > src="/assets/common/img/1x1.gif" width="1"></td><th
>> > class="sbListHeaderCell"
>> > nowrap="true" scope="col"><img alt="" height="1"
>> > src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText"><a class="sbListHeaderText"
>> > href="javascript:void('sort_startDate')"
>> > onclick="submitForm1030(event);return false;" title="Sort by column
>> > Offering
>> > Start Date">Title14</a></span></th><td class="sbListColumnSpacer"><img
>> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif"
>> > width="1"></td><th class="sbListHeaderCell" nowrap="true"
>> scope="col"><img
>> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span
>> > class="sbListHeaderText">Title15</span></th><th align="right"
>> > class="sbListHeaderCellEnd" scope="col" valign="top" width="5"><img
>> alt=""
>> > height="5" src="/assets/common/img/cnr_t_tr.gif" width="5"></th></tr>
>> >
>> > <tr><td class="sbListOddCellEnd"></td><td class="sbListOddCell"><span
>> > class="sbListText"><a class="sbLinkTableDisplay" doTruncate="false"
>> > href="javascript:void('titleLink')"
>> onclick="submitForm1031(event);return
>> > false;" title="data1">data1</a></span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">&nbsp;</span></td><td
>> > class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">data2</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">data3</span></td><td
>> class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText" nowrap="nowrap"><span
>> > class="sbListText">data4</span><br><a class="sbLinkTableDisplay"
>> > doTruncate="false" href="javascript:void('blah')"
>> > onclick="submitForm1033(event);return false;" title="blah
>> > blah">blah</a></span></td><td class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">data5</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">&nbsp;</span></td><td
>> > class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">&nbsp;</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">&nbsp;</span></td><td
>> > class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">data6</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">data7</span></td><td
>> class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">data8</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">data8</span></td><td
>> class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">&nbsp;</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell" nowrap><a
>> > class="sbLinkTableDisplay" doTruncate="false"
>> > href="javascript:void('editLink')" onclick="submitForm1035(event);return
>> > false;" title="Edit">Edit</a><br><a class="sbLinkTableDisplay"
>> > doTruncate="false" href="javascript:void('deleteLink')"
>> > onclick="submitForm1036(event);return false;"
>> > title="Delete">Delete</a><br><br></td><td
>> > class="sbListOddCellEnd"></td></tr><tr>
>> >
>> > </table>
>> >
>> >
>> >
>> > On Mon, Jan 24, 2011 at 10:34 AM, Felix Frank <[email protected]> wrote:
>> >
>> > > On 01/24/2011 04:27 PM, thanh nguyen wrote:
>> > > > Hi everyone,
>> > > >
>> > > > I have a big HTML table from which I need to extract data. The table has
>> > > > several columns. The regular expression required to do the extraction
>> > > > job is very long and complex. The code is hard to debug and to maintain.
>> > > > I'd like to know what the alternatives are. Is there an HTML parser that
>> > > > creates DOM objects? I could program a postprocessor in BeanShell...
>> > > >
>> > > > Thanks a lot
>> > >
>> > > That would be the XPath Extractor, but maybe someone can help you build
>> > > a simpler regex instead (you need to share more details for this to
>> > > happen).
>> > >
>> > > Regards,
>> > > Felix
>> > >
>> > >
>> > >
>> >
>>
>
>
