[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901059#comment-14901059
 ] 

Asitang Mishra commented on NUTCH-2110:
---------------------------------------

Hi Sebastian,

Yes, using the CrawlDatum is a perfect idea.

This idea came up when we had a use case where the whole site was AJAX-based. 
The pagination was AJAX too (the URL wouldn't change on a pagination click), 
so we needed to fetch the whole site in one go. We thought there must be a way 
to identify an AJAX-based resource/page, because the URL alone was 
insufficient. That is when I thought that the URL plus a series of Selenium 
interaction info could be used as a unique identifier in such scenarios.
This is mostly theoretical right now, because some things still need to be 
discussed, like how the outlinks can be identified for the next fetch (I have 
some ideas, though).
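
To make the "URL plus interaction info as identifier" idea concrete, here is a 
rough sketch in plain Python. It is not Nutch or Selenium code: the function 
name interaction_key, the (action, xpath, value) step format, and the 
example.com URL are all assumptions made up for illustration.

```python
import hashlib
import json


def interaction_key(url, steps):
    """Build a stable identifier from a URL plus an ordered list of steps.

    `steps` is a hypothetical list of (action, xpath, value) tuples; this
    format is an assumption, not something Nutch defines.
    """
    payload = json.dumps(
        {"url": url, "steps": [list(step) for step in steps]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same URL, different interaction sequences, so different identifiers.
page1_key = interaction_key("http://example.com/list", [])
page2_key = interaction_key(
    "http://example.com/list",
    [("click", "//a[@id='next']", "")],
)
```

Hashing is only one option; the point is that two states of the same URL get 
distinct keys, and the same URL and steps always give the same key.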

And to answer your last questions, imagine this scenario: we have a starting 
page called page1. There are a bunch of AJAX clicks on it. We click all of 
them, the page changes, and we save all the info into the data of that page. 
Then we need to go to the next page, which is still not a different URL but a 
page interaction. So we 'somehow' save this interaction for the next round; 
how to do that is the open question. In the next round we come back to page1 
(because there is no other way to reach page2 except through page1, since 
page2 does not have a unique URL). This time we don't go through all the 
interactions on page1 and save no data for it; we only click the pagination 
link for page2, go to page2, click around again, and save its data.
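
The round-by-round flow described above could look roughly like the following 
sketch. It only simulates the site with a dictionary and uses no real browser. 
PendingFetch, FAKE_SITE, fetch_round, the XPath, and the page contents are all 
hypothetical names and data invented for illustration; how Nutch would 
actually carry the steps between rounds is still open.

```python
from dataclasses import dataclass, field


@dataclass
class PendingFetch:
    url: str
    # Clicks needed to reach the target state; replayed without saving data.
    replay_steps: list = field(default_factory=list)


# Hypothetical stand-in for an AJAX site: the page state depends on the
# sequence of pagination clicks, not on the URL.
FAKE_SITE = {
    (): {"content": "items 1-10", "next": "//a[@id='next']"},
    ("//a[@id='next']",): {"content": "items 11-20", "next": None},
}


def fetch_round(pending):
    """Replay stored clicks, save data for the final state, queue the next."""
    state = ()
    for xpath in pending.replay_steps:
        state = state + (xpath,)  # navigate only, nothing is saved here
    page = FAKE_SITE[state]
    saved = page["content"]  # data is saved only for the state we end on
    next_pending = None
    if page["next"] is not None:
        next_pending = PendingFetch(
            pending.url, pending.replay_steps + [page["next"]]
        )
    return saved, next_pending


seed = PendingFetch("http://example.com/list")
data1, round2 = fetch_round(seed)    # round 1: save page1, remember the click
data2, round3 = fetch_round(round2)  # round 2: replay the click, save page2
```

Each round replays only the stored pagination clicks to get back to the right 
state, then saves data for that state alone.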


> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium" 
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2110
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium" to be used by selenium 
> protocols/plugins as urls/flow to reach a specific ajax based page or save 
> the state of a selenium operation for the next fetching round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
