[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951411#comment-14951411 ]

Asitang Mishra commented on NUTCH-2110:
---

Ack

> Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
> --
>
>                 Key: NUTCH-2110
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2110
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: fetcher
>    Affects Versions: 1.10
>            Reporter: Asitang Mishra
>              Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter seach terms).selenium" to be used by selenium
> protocols/plugins as urls/flow to reach to a specific ajax based page or save
> the state of a selenium operation for the next fetching round.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951409#comment-14951409 ]

Chris A. Mattmann commented on NUTCH-2110:
--

Great, so can you link this to those issues and close this out?
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951407#comment-14951407 ]

Asitang Mishra commented on NUTCH-2110:
---

From the ideas in this issue I created two more issues that are clearer about what has to be done and what the scope is: NUTCH-2126 and NUTCH-2127. I described both of them in our last meeting. Will push them this weekend.
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951384#comment-14951384 ]

Chris A. Mattmann commented on NUTCH-2110:
--

Asitang, where are we on this?
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933845#comment-14933845 ]

Asitang Mishra commented on NUTCH-2110:
---

I think the question is whether to keep everything under one single URL in the end (which is how it is in practice) or under some newly concocted URL. I am not sure whether one ultimately needs to separate all this data into distinct parts. We need to think more about this, I guess.

Meanwhile, I created two more sub-tasks that do more specific things by passing standardized key-value pairs to the injector. Let us focus on them right now, and then we can come back to this issue, which is a little abstract.
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903524#comment-14903524 ]

Sebastian Nagel commented on NUTCH-2110:
---

Ok, understood. One point to consider: shall all paginated documents be kept under the same URL? As a batch crawler, Nutch uses the URL in many places to uniquely identify content, metadata, status information, indexed documents, etc. Of course, the outlinks generated for page1 could be modified by adding a suffix which makes the URL unique. The suffix would then be removed only inside protocol-selenium, to fetch the right page.

> Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
> --
>
> Create the capability to provide seeds in the form of "url+xpath(including
> option to enter seach terms).selenium" to be used by selenium
> protocols/plugins as urls/flow to reach to a specific ajax based page or save
> the state of a selenium operation for the next fetching round.
> At least, this should make nutch capable of distinguishing if a url should be
> opened using the basic http, httpclient or selenium protocols. And provide
> the selenium protocol with basic authentication capabilities based on the
> above ideas.
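The suffix idea in the comment above can be illustrated with a minimal sketch. The class name `PageSuffix` and the marker string `::selenium-page=` are hypothetical and were chosen only for this example; the thread does not specify a suffix format, and whether Nutch's URL normalizers and filters would leave such a suffix intact would need to be checked separately.

```java
/**
 * Sketch only: make per-page URLs unique by appending a suffix, and strip the
 * suffix again right before the page is fetched. Not Nutch code.
 */
final class PageSuffix {
    // Hypothetical marker; not a Nutch or Selenium convention.
    static final String MARKER = "::selenium-page=";

    /** Returns a unique key for the given page of an Ajax-paginated URL. */
    static String add(String url, int page) {
        return url + MARKER + page;
    }

    /** Returns the real URL to open in the browser (the suffix is removed). */
    static String strip(String url) {
        int at = url.lastIndexOf(MARKER);
        return at < 0 ? url : url.substring(0, at);
    }

    /** Returns the page number encoded in the suffix, or 1 if there is none. */
    static int pageOf(String url) {
        int at = url.lastIndexOf(MARKER);
        return at < 0 ? 1 : Integer.parseInt(url.substring(at + MARKER.length()));
    }
}
```

In this sketch the suffixed string would act as the key everywhere a URL identifies content, while only the fetching code calls `strip` and `pageOf` to decide which page to open and how far to paginate.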
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901798#comment-14901798 ]

Asitang Mishra commented on NUTCH-2110:
---

Also updated the description to tackle some basic problems with this idea first.
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901059#comment-14901059 ]

Asitang Mishra commented on NUTCH-2110:
---

Hi Sebastian,

Yes, using the CrawlDatum is the perfect idea. This thought came to my mind when we had a use case where the whole site was Ajax-based. The pagination was also Ajax (the URL wouldn't change with the pagination click), so we needed to fetch the whole site in one go. We thought there must be a way to identify an Ajax-based resource/page, because the URL alone was insufficient. That is when I thought that the URL plus a series of Selenium interaction info could be used as a unique identifier in such scenarios. This is mostly theoretical right now, because things still need to be discussed, like how the outlinks can be identified for the next fetch (I have some ideas, though).

And to answer your last question, imagine this scenario: we have a starting page called page1. There are a bunch of Ajax clicks here. We click all of them, the page changes, and we save all the info into the data of that page. Then we need to go to the next page, which is still not a different URL but a page interaction. So we 'somehow' save this for the next round. How do we do that? In the next round we come back to page1 (because there is no other way to reach page2 except through page1, since it does not have a unique URL). This time we don't go through all the interactions on page1 and save no data for this page, but only click the pagination for page2, go to page2, click around again, and save data for it.
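The replay scenario described in the comment above could be sketched roughly as follows. This is a standalone illustration, not Nutch or Selenium code: the names `ReplayPlan` and `Step` are hypothetical, and the sketch assumes that pagination is driven by one "next page" element and that the same content XPaths apply on every page, neither of which is stated in the thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: plan the clicks needed to harvest page N of an Ajax-paginated site. */
final class ReplayPlan {

    /** One click on the element at xpath; saveContent says whether the result is kept. */
    static final class Step {
        final String xpath;
        final boolean saveContent;

        Step(String xpath, boolean saveContent) {
            this.xpath = xpath;
            this.saveContent = saveContent;
        }
    }

    /**
     * Builds the plan for targetPage (1-based): first the pagination clicks that
     * lead from page1 to the target page, with nothing saved on the way, then the
     * content clicks on the target page, whose results are saved.
     */
    static List<Step> forPage(int targetPage, String nextPageXPath, List<String> contentXPaths) {
        List<Step> steps = new ArrayList<>();
        for (int page = 1; page < targetPage; page++) {
            steps.add(new Step(nextPageXPath, false)); // navigate only
        }
        for (String xpath : contentXPaths) {
            steps.add(new Step(xpath, true)); // harvest on the target page
        }
        return steps;
    }
}
```

A real protocol plugin would execute each step with a browser driver; the sketch only shows how "skip page1, click through to page2, then save" can be expressed as data that is stored between fetch rounds.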
[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"
[ https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899934#comment-14899934 ]

Sebastian Nagel commented on NUTCH-2110:
---

Hi Asitang,

the Injector is already able to store key-value pairs from the seed list in the CrawlDb within the CrawlDatum's metadata, see [1]. If the XPath statements are not too complex, this would be the easiest way: the protocol plugin could then read the XPath from the CrawlDatum.

Regarding the "state of a selenium operation": should a state be passed to the outlinks of a page, or is the same page fetched multiple times with varying Ajax/JavaScript actions to be performed?

[1] http://nutch.apache.org/apidocs/apidocs-1.10/org/apache/nutch/crawl/Injector.html
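A minimal sketch of the seed-list approach suggested above, assuming the seed format described in the Injector documentation (one URL per line, optionally followed by tab-separated `key=value` metadata). The class `SeedLine` and the keys `selenium.xpath` and `selenium.query` are hypothetical names used for illustration; this is standalone Java that only mimics the parsing step and does not touch the CrawlDb.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: split a tab-separated seed line into a URL and its metadata. */
final class SeedLine {
    final String url;
    final Map<String, String> metadata;

    private SeedLine(String url, Map<String, String> metadata) {
        this.url = url;
        this.metadata = metadata;
    }

    static SeedLine parse(String line) {
        String[] fields = line.split("\t");
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 1; i < fields.length; i++) {
            // Split on the first '=' only, because XPath values often contain '='.
            int eq = fields[i].indexOf('=');
            if (eq <= 0) {
                continue; // not a key=value field, skip it
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            metadata.put(key, value);
        }
        return new SeedLine(fields[0].trim(), metadata);
    }
}
```

With a seed line such as `http://example.com/search<TAB>selenium.xpath=//a[@id='next']`, a protocol plugin could look up the hypothetical `selenium.xpath` entry in the metadata it receives and use it to drive the browser. Keeping the XPath in one field works only while the expression contains no tab characters, which matches the "not too complex" caveat in the comment.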