[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901059#comment-14901059 ]
Asitang Mishra commented on NUTCH-2110:
---------------------------------------

Hi Sebastian,

Yes, using the CrawlDatum is the perfect idea.

This thought came to my mind when we had a use case where the whole site was AJAX-based. The pagination was also AJAX (the URL wouldn't change on a pagination click), so we needed to fetch the whole site in one go. We thought there must be a way to identify an AJAX-based resource/page, because the URL alone was insufficient. That is when I thought that the URL plus a series of Selenium interaction info could be used as a unique identifier in such scenarios. This is mostly theoretical right now, because some things still need to be discussed, such as how the outlinks can be identified for the next fetch (I have some ideas, though).

And to answer your last question, imagine this scenario: we have a starting page called page1. There are a bunch of AJAX clicks here. We click all of them, the page changes, and we save all the info into the data of that page. Then we need to go to the next page, which is still not a different URL but a page interaction. So we 'somehow' save this for the next round. How do we do that? In the next round we come back to page1 (because there is no other way to reach page2 except through page1, since it does not have a unique URL). This time we don't go through all the interactions on page1 and we save no data for this page; we only click the pagination for page2 --> go to page2, click around again, and save data for it.
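
A rough sketch of the "URL + series of interactions" identifier described above, only to make the idea concrete. Everything in it is an assumption for illustration: the class name InteractionSeed, the tab separator, the use of XPath strings as steps, and the example.com URL are all hypothetical and are not part of Nutch or of the Selenium plugins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a page state identified by a URL plus ordered interaction steps. */
final class InteractionSeed {
    // Assumed separator; a raw tab should not occur inside a URL or in the XPaths used here.
    private static final String SEP = "\t";

    private final String url;
    private final List<String> steps; // e.g. the XPath of each element to click, in order

    InteractionSeed(String url, List<String> steps) {
        this.url = url;
        this.steps = Collections.unmodifiableList(new ArrayList<>(steps));
    }

    String url() { return url; }

    List<String> steps() { return steps; }

    /** Serializes the state as one string: the URL followed by each step. */
    String toKey() {
        StringBuilder sb = new StringBuilder(url);
        for (String step : steps) {
            sb.append(SEP).append(step);
        }
        return sb.toString();
    }

    /** Parses a key produced by toKey() back into a URL and its steps. */
    static InteractionSeed fromKey(String key) {
        String[] parts = key.split(SEP, -1);
        return new InteractionSeed(parts[0], Arrays.asList(parts).subList(1, parts.length));
    }

    /** Returns the state reached by one more interaction, e.g. a pagination click. */
    InteractionSeed then(String step) {
        List<String> next = new ArrayList<>(steps);
        next.add(step);
        return new InteractionSeed(url, next);
    }

    public static void main(String[] args) {
        InteractionSeed page1 = new InteractionSeed(
                "http://example.com/list", Collections.<String>emptyList());
        // page2 has no URL of its own, so it is page1 plus the pagination click.
        InteractionSeed page2 = page1.then("//a[@id='next']");
        System.out.println(page2.toKey());
    }
}
```

With something like this, the key for page2 could be what gets saved for the next round: the fetcher would load url(), replay steps() without storing any data for page1, and only then click around and save data for page2. Whether such a key belongs in the seed itself or in the CrawlDatum metadata is still open.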
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter search terms).selenium"
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2110
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter search terms).selenium" to be used by selenium
> protocols/plugins as urls/flow to reach to a specific ajax based page or save
> the state of a selenium operation for the next fetching round.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)