The solution I settled on is to break my regular expression down and work with partial results, i.e. a 'Divide and Conquer' strategy: match one row at a time, then match the cells inside each row. It takes more lines of code, but the result is much more readable.
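For anyone finding this thread later, here is a minimal sketch of what I mean by working with partial results. It is plain Java, not my actual test plan code: the class name `TableExtractor`, the method `extract` and the three patterns are made up for illustration, and it assumes the table has no nested tables. In a BeanShell PostProcessor you would probably have to drop the generics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits the extraction into small regex steps
// instead of one long expression covering the whole table.
final class TableExtractor {
    private static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;

    // Step 1: one match per table row.
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    // Step 2: one match per data cell inside a single row (header <th> cells are ignored).
    private static final Pattern CELL = Pattern.compile("<td\\b[^>]*>(.*?)</td>", FLAGS);
    // Step 3: remove any markup left inside a cell (span, a, br, img ...).
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns rows.get(x).get(y): the text of non-empty cell y in row x.
    // Rows without any non-empty <td> (e.g. the title row) are skipped.
    static List<List<String>> extract(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher rowMatcher = ROW.matcher(html);
        while (rowMatcher.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cellMatcher = CELL.matcher(rowMatcher.group(1));
            while (cellMatcher.find()) {
                String text = TAG.matcher(cellMatcher.group(1)).replaceAll("").trim();
                if (!text.isEmpty()) {
                    cells.add(text);
                }
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }
}
```

Each pattern is short enough to debug on its own, which was the whole point. Note that this sketch drops empty and spacer cells, so the column index Y counts only cells that contain text; keep the empty ones if you need fixed column positions.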
On Tue, Jan 25, 2011 at 1:11 PM, thanh nguyen <[email protected]> wrote:

> Hi all,
>
> Thanks for your help! I have the regular expression to select data in 2
> columns.
>
> <table>
> <tr>
> <td>(.*?)<\td>(.*?)<td><\td>
> <\tr>
> <\table>
>
> the selection is saved in a variable called MyVar. Then I loop through
> MyVar_X_Y to access the data, where X is the row and Y the column.
>
> What's the equivalent in xpath query?
>
> Thanks
> Thanh
>
>
> On Mon, Jan 24, 2011 at 4:31 PM, Deepak Shetty <[email protected]> wrote:
>
>> what is it that you want to select? all the columns? that are not titles
>> would be something like
>> //tbody/tr/td/span (but this will flatten out the structure)?
>>
>> regards
>> deepak
>>
>> On Mon, Jan 24, 2011 at 10:08 AM, thanh nguyen <[email protected]> wrote:
>>
>> > Felix,
>> >
>> > I'll have look at the xpath. it looks interesting. But I can't find any
>> > example of code for xpath?
>> > Thank you
>> > Thanh
>> >
>> > ps: this is the table I'm working on. 1st row is the title. 2nd row contains
>> > data. I want to extract data1, data2....the regular expression reads row by
>> > row. In the beanshell I do 2 loop: for each row and for each column. There
>> > are rows number odd and rows number even.
>> > >> > >> > <table> >> > <tr><th class="sbListHeaderCellEnd" scope="col" valign="top" >> width="5"><img >> > alt="" height="5" src="/assets/common/img/cnr_t_tl.gif" >> width="5"></th><th >> > class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" >> height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_name')" >> onclick="submitForm1023(event);return >> > false;" title="Sort by column Title">Title1</a></span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title2</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_deliveryType')" >> > onclick="submitForm1024(event);return false;" title="Sort by column >> > Delivery >> > Type">Title3</a></span></th><td class="sbListColumnSpacer"><img alt="" >> > border="0" height="1" src="/assets/common/img/1x1.gif" >> width="1"></td><th >> > class="sbListHeaderCell" nowrap="true" scope="col"><img alt="" >> height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_regStartDate')" >> > onclick="submitForm1025(event);return false;" title="Sort by column >> > Registration Date">Title4</a></span></th><td >> > class="sbListColumnSpacer"><img >> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif" >> > width="1"></td><th class="sbListHeaderCell" 
nowrap="true" >> scope="col"><img >> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_completionStatus')" >> > onclick="submitForm1026(event);return false;" title="Sort by column >> > Completion Status">Title5</a></span></th><td >> > class="sbListColumnSpacer"><img >> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif" >> > width="1"></td><th class="sbListHeaderCell" nowrap="true" >> scope="col"><img >> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_completionDate')" >> > onclick="submitForm1027(event);return false;" title="Sort by column Date >> > Marked Complete">Title6</a></span></th><td >> class="sbListColumnSpacer"><img >> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif" >> > width="1"></td><th class="sbListHeaderCell" nowrap="true" >> scope="col"><img >> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title7</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_score')" >> onclick="submitForm1028(event);return >> > false;" title="Sort by column Score">Title8</a></span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_grade')" 
>> onclick="submitForm1029(event);return >> > false;" title="Sort by column Grade">Title9</a></span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title10</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title11</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title12</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title13</span></th><td >> > class="sbListColumnSpacer"><img alt="" border="0" height="1" >> > src="/assets/common/img/1x1.gif" width="1"></td><th >> > class="sbListHeaderCell" >> > nowrap="true" scope="col"><img alt="" height="1" >> > src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText"><a class="sbListHeaderText" >> > href="javascript:void('sort_startDate')" >> > onclick="submitForm1030(event);return false;" title="Sort by column >> > Offering >> > Start Date">Title14</a></span></th><td class="sbListColumnSpacer"><img >> > alt="" border="0" height="1" src="/assets/common/img/1x1.gif" >> > 
width="1"></td><th class="sbListHeaderCell" nowrap="true" >> scope="col"><img >> > alt="" height="1" src="/assets/common/img/1x1.gif" width="30"><br><span >> > class="sbListHeaderText">Title15</span></th><th align="right" >> > class="sbListHeaderCellEnd" scope="col" valign="top" width="5"><img >> alt="" >> > height="5" src="/assets/common/img/cnr_t_tr.gif" width="5"></th></tr> >> > >> > <tr><td class="sbListOddCellEnd"></td><td class="sbListOddCell"><span >> > class="sbListText"><a class="sbLinkTableDisplay" doTruncate="false" >> > href="javascript:void('titleLink')" >> onclick="submitForm1031(event);return >> > false;" title="data1">data1</a></span></td><td >> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span >> > class="sbListText"> </span></td><td >> > class="sbListColumnSpacer"></td><td >> > class="sbListOddCell"><span class="sbListText">data2</span></td><td >> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span >> > class="sbListText">data3</span></td><td >> class="sbListColumnSpacer"></td><td >> > class="sbListOddCell"><span class="sbListText" nowrap="nowrap"><span >> > class="sbListText">data4</span><br><a class="sbLinkTableDisplay" >> > doTruncate="false" href="javascript:void('blah')" >> > onclick="submitForm1033(event);return false;" title="blah >> > blah">blah</a></span></td><td class="sbListColumnSpacer"></td><td >> > class="sbListOddCell"><span class="sbListText">data5</span></td><td >> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span >> > class="sbListText"> </span></td><td >> > class="sbListColumnSpacer"></td><td >> > class="sbListOddCell"><span class="sbListText"> </span></td><td >> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span >> > class="sbListText"> </span></td><td >> > class="sbListColumnSpacer"></td><td >> > class="sbListOddCell"><span class="sbListText">data6</span></td><td >> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span >> > 
class="sbListText">data7</span></td><td class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText">data8</span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell"><span
>> > class="sbListText">data8</span></td><td class="sbListColumnSpacer"></td><td
>> > class="sbListOddCell"><span class="sbListText"> </span></td><td
>> > class="sbListColumnSpacer"></td><td class="sbListOddCell" nowrap><a
>> > class="sbLinkTableDisplay" doTruncate="false"
>> > href="javascript:void('editLink')" onclick="submitForm1035(event);return
>> > false;" title="Edit">Edit</a><br><a class="sbLinkTableDisplay"
>> > doTruncate="false" href="javascript:void('deleteLink')"
>> > onclick="submitForm1036(event);return false;"
>> > title="Delete">Delete</a><br><br></td><td
>> > class="sbListOddCellEnd"></td></tr><tr>
>> >
>> > </table>
>> >
>> >
>> > On Mon, Jan 24, 2011 at 10:34 AM, Felix Frank <[email protected]> wrote:
>> >
>> > > On 01/24/2011 04:27 PM, thanh nguyen wrote:
>> > > > Hi everyone,
>> > > >
>> > > > I have a big HTML table from which I need to extract data. The table has
>> > > > several columns. The regulation expression required to do the extraction job
>> > > > is very long and complex. The code is hard to debug and to maintain. I'd
>> > > > like to know what are the alternatives? Is there HTML parser that create DOM
>> > > > objects? I could program a postprocessor in beanshell...
>> > > >
>> > > > Thanks a lot
>> > >
>> > > That would be the XPath Extractor, but maybe someone can help you build
>> > > a simpler regex instead (you need to share more details for this to
>> > > happen).
>> > >
>> > > Regards,
>> > > Felix

