Not really.

I guess every page on this website has two links that are generated by a
JavaScript function. That is why Nutch can't find those URLs: Nutch does not
click a link to trigger the JS function the way a human does.
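
To illustrate the point, here is a minimal Java sketch of a link extractor
that only reads href attributes, which is roughly what a crawler that does
not execute JavaScript gets to see. It is not Nutch's actual parser; the
class name, the method name, and the simple regex are assumptions made up
for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified link extractor; NOT Nutch's real parsing code.
final class StaticLinkSketch {
    // Matches the double-quoted href value of an <a> tag.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final String JS_SCHEME = "javascript:";

    // Returns the href values that can be fetched without running any script.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // A javascript: pseudo-URL is code to execute, not an address to fetch.
            boolean isScript = href.regionMatches(true, 0, JS_SCHEME, 0, JS_SCHEME.length());
            if (!href.isEmpty() && !isScript) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:jump('en')\">English</a>"
                    + "<a href=\"detail.action?request_locale=en&id=10110\">English</a>";
        // Only the second, static link is returned.
        System.out.println(extractLinks(html));
    }
}
```

The jump('en') link is skipped because its target only exists after the
script has run, while the static link is returned as-is.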

I suggest generating those multilingual links on the server side, for
example in JSP. Your web pages will then contain static links that Nutch can
find.

For example, in your JSP page those two links currently look like this:

<a href="javascript:jump('en')">English</a>
<a href="javascript:jump('la')">La</a>

Nutch cannot find these two links, so you can change your JSP like this:
<%
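// getRequestURI() returns only the path, so re-attach the query string first.
String query = request.getQueryString();
String pageUrl = request.getRequestURI() + "?" + (query == null ? "" : query + "&");
String enUrl = pageUrl + "request_locale=en";
String laUrl = pageUrl + "request_locale=la";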
%>
<a href="<%=enUrl%>">English</a>
<a href="<%=laUrl%>">La</a>

Then you get static URLs in your pages when you browse them.
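
As a rough sketch of the same idea outside the JSP, the hypothetical helper
below builds a locale link from the request path and query string, and drops
any request_locale that is already present so the parameter is not
duplicated. The class and method names are assumptions; in a JSP you would
pass request.getRequestURI() and request.getQueryString().

```java
// Hypothetical helper for illustration; not part of Nutch or of the site's code.
final class LocaleLinkSketch {
    private static final String PARAM = "request_locale=";

    // Builds path?<other params>&request_locale=<locale>.
    // query may be null when the current request has no query string.
    static String withLocale(String path, String query, String locale) {
        StringBuilder url = new StringBuilder(path).append('?');
        if (query != null && !query.isEmpty()) {
            for (String pair : query.split("&")) {
                // Skip empty pieces and any locale parameter already in the URL.
                if (pair.isEmpty() || pair.startsWith(PARAM)) {
                    continue;
                }
                url.append(pair).append('&');
            }
        }
        return url.append(PARAM).append(locale).toString();
    }

    public static void main(String[] args) {
        System.out.println(withLocale("/ePortal/news/detail.action",
                "id=10110&from=ePortal_NewsDetail_FromHome", "en_US"));
    }
}
```

When writing the result into an href attribute, the & characters should be
HTML-escaped as &amp;.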

good luck

yanky

2009/3/20 陈琛 <kylin.chc...@gmail.com>

> Thanks very much!
>
> In other words, for now I should only put
>
> http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
> and
>
> http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
> in url.txt?
>
>
> 2009/3/20 yanky young <yanky.yo...@gmail.com>
>
> > I think my guess is right. I just looked at the source of that page.
> >
> > Those two URLs are generated by a JavaScript function:
> >
> > function jump(lan)
> >
> > In this case, Nutch is probably not smart enough to recognize this kind
> > of generated URL.
> >
> > But if you generate these two links on the server side, so that the URLs
> > in the web pages are static links, then Nutch can crawl them as usual.
> >
> > good luck
> >
> > yanky
> >
> >
> > 2009/3/20 陈琛 <kylin.chc...@gmail.com>
> >
> > > Thanks.
> > >
> > > You can go to
> > >
> > > http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome
> > >
> > > and look at the upper right corner: there are two translation links,
> > > and they lead to those two URLs.
> > >
> > > So I am worried.
> > >
> > > 2009/3/20 yanky young <yanky.yo...@gmail.com>
> > >
> > > > That should work, but it seems odd. Starting from the seed URL you
> > > > gave, Nutch follows links, so the crawled pages effectively form a
> > > > tree whose root node is the seed URL. If you cannot reach those two
> > > > URLs from the seed URL yourself, Nutch cannot either.
> > > >
> > > > yanky
> > > >
> > > >
> > > > 2009/3/20 陈琛 <kylin.chc...@gmail.com>
> > > >
> > > > > Thanks.
> > > > > The URL is http://www.laopdr.gov.la/...
> > > > > depth 15 topN1200 ...
> > > > >
> > > > > It seems I must put
> > > > >
> > > > > http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
> > > > >
> > > > > in the urls directory.
> > > > >
> > > > >
> > > > >
> > > > > 2009/3/19 yanky young <yanky.yo...@gmail.com>
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I guess the URLs you mentioned all point to the same JSP or
> > > > > > servlet; apparently they all begin with
> > > > > >
> > > > > > http://app02.laopdr.gov.la/ePortal/news/detail.action
> > > > > >
> > > > > > The difference is the request_locale parameter. I have no idea
> > > > > > how these two URLs with different request_locale parameters are
> > > > > > generated, but I guess Nutch just doesn't know about this
> > > > > > parameter because it may be added by JavaScript or by a backend
> > > > > > content management system. Maybe you can write these links in a
> > > > > > page that Nutch can crawl. The point is that these links must
> > > > > > appear somewhere in the pages of your website. If they don't,
> > > > > > Nutch cannot find them.
> > > > > >
> > > > > > good luck
> > > > > >
> > > > > > yanky
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2009/3/19 陈琛 <kylin.chc...@gmail.com>
> > > > > >
> > > > > > > Please help me, it is urgent and important. Thanks.
> > > > > > >
> > > > > > > ---------- Forwarded message ----------
> > > > > > > From: 陈琛 <kylin.chc...@gmail.com>
> > > > > > > Date: 2009/3/19
> > > > > > > Subject: index web
> > > > > > > To: nutch-user@lucene.apache.org
> > > > > > >
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Nutch indexes a URL like
> > > > > > >
> > > > > > > http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome
> > > > > > >
> > > > > > > but it does not index URLs like
> > > > > > >
> > > > > > > http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
> > > > > > >
> > > > > > > and
> > > > > > >
> > > > > > > http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
> > > > > > >
> > > > > > > Why are they not indexed? Are these pages different in some way?
> > > > > > >
> > > > > > > Please note the "request_locale=" parameter.
> > > > > > >
> > > > > > > Thanks
