[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-10-09 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951411#comment-14951411
 ] 

Asitang Mishra commented on NUTCH-2110:
---

Ack

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 1.10
> Reporter: Asitang Mishra
> Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium", to be used by Selenium 
> protocols/plugins as URLs/flows to reach a specific Ajax-based page or to save 
> the state of a Selenium operation for the next fetching round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-10-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951409#comment-14951409
 ] 

Chris A. Mattmann commented on NUTCH-2110:
--

Great, so can you link this to those issues and close this out?



[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-10-09 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951407#comment-14951407
 ] 

Asitang Mishra commented on NUTCH-2110:
---

From the ideas in this issue I created two more issues that are clearer about 
what has to be done and what the scope is: NUTCH-2126 and NUTCH-2127. I 
described both of them in our last meeting. I will push them this weekend.



[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-10-09 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951384#comment-14951384
 ] 

Chris A. Mattmann commented on NUTCH-2110:
--

Asitang, where are we on this?



[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-28 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933845#comment-14933845
 ] 

Asitang Mishra commented on NUTCH-2110:
---

The question, I think, is whether to keep everything under one single URL in 
the end (which is how it works in practice) or under some newly concocted URL. 
I am not sure whether, in the end, one needs to split all this data into 
separate parts or not. We need to think more about this, I guess.
Meanwhile, I created two more sub-tasks that can do more specific things by 
passing standardized key-value pairs to the injector. Let us focus on them 
right now and then come back to this issue, which is a little abstract.



[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903524#comment-14903524
 ] 

Sebastian Nagel commented on NUTCH-2110:


Ok, understood. One point to consider: shall all paginated documents be kept 
under the same URL? As a batch crawler, Nutch uses the URL in many places to 
uniquely identify content, metadata, status information, indexed documents, 
etc. Of course, the outlinks generated for page1 could be modified by adding a 
suffix which makes the URL unique. The suffix would be removed only inside 
protocol-selenium, in order to fetch the right page.
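
To make the suffix idea concrete, here is a minimal sketch in Python (not Nutch code). The marker string `~~page=` and both function names are assumptions made up for illustration; a real marker would have to survive Nutch's URL normalizers and filters.

```python
from typing import Optional, Tuple

# Hypothetical marker, invented for this sketch; not a Nutch convention.
PAGE_MARKER = "~~page="


def add_page_suffix(url: str, page: int) -> str:
    """Build a unique crawl key for one paginated view of the same URL."""
    if PAGE_MARKER in url:
        raise ValueError("URL already carries a page suffix")
    return f"{url}{PAGE_MARKER}{page}"


def strip_page_suffix(key: str) -> Tuple[str, Optional[int]]:
    """Recover the real URL (and page number, if any) before fetching."""
    base, sep, page = key.rpartition(PAGE_MARKER)
    if not sep:
        # No suffix present: the key is already a plain URL.
        return key, None
    return base, int(page)
```

The rest of the crawler would only ever see the suffixed key, so content, status and index entries stay distinct per page; only the protocol plugin would call something like `strip_page_suffix` to learn which URL to open and which page to navigate to.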

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
> Issue Type: Sub-task
> Components: fetcher
> Affects Versions: 1.10
> Reporter: Asitang Mishra
> Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter search terms).selenium", to be used by Selenium 
> protocols/plugins as URLs/flows to reach a specific Ajax-based page or to save 
> the state of a Selenium operation for the next fetching round.
> At least, this should make Nutch capable of distinguishing whether a URL 
> should be opened using the basic http, httpclient or selenium protocols, and 
> provide the selenium protocol with basic authentication capabilities based on 
> the above ideas.





[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-21 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901798#comment-14901798
 ] 

Asitang Mishra commented on NUTCH-2110:
---

Also updated the description to tackle some basic problems with this idea first.



[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-21 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901059#comment-14901059
 ] 

Asitang Mishra commented on NUTCH-2110:
---

Hi Sebastian,

Yes, using the CrawlDatum is the perfect idea.

This thought came to my mind when we had a use case where the whole site was 
Ajax-based. The pagination was also Ajax (the URL would not change with the 
pagination click), so we needed to fetch the whole site in one go. We thought 
there must be a way to identify an Ajax-based resource/page, because the URL 
alone was insufficient. That is when I thought that the URL plus a series of 
Selenium interactions could be used as a unique identifier in such scenarios.
This is mostly theoretical right now, because some things still need to be 
discussed, like how the outlinks can be identified for the next fetch (I have 
some ideas, though).
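
One possible reading of "URL plus a series of Selenium interactions as a unique identifier" is sketched below in Python. Everything here is an assumption for illustration: the step format (a list of dicts with `action` and `xpath`), the `interaction_key` name, and the choice to hash the steps rather than append them verbatim.

```python
import hashlib
import json
from typing import Dict, List


def interaction_key(url: str, steps: List[Dict[str, str]]) -> str:
    """Derive a stable identifier from a URL and an ordered list of steps."""
    # Serialize deterministically: dict keys are sorted, step order is kept.
    payload = json.dumps(
        {"url": url, "steps": steps}, sort_keys=True, separators=(",", ":")
    )
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:12]
    # Keep the URL readable and append a short fingerprint of the steps.
    return f"{url}|{digest}"
```

Two seeds with the same URL but different click sequences would then get different keys, while repeating the same sequence always yields the same key.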

And to answer your last question, imagine this scenario: we have a starting 
page called page1. There are a bunch of Ajax clicks here. We click all of them, 
the page changes, and we save all the info into the data of that page. Then we 
need to go to the next page, which is still not a different URL but a page 
interaction. So we 'somehow' save this for the next round. How do we do that? 
In the next round we come back to page1 (because there is no other way to 
reach page2 than through page1, since it does not have a unique URL). This 
time we do not go through all the interactions in page1 and save no data for 
that page; we only click the pagination for page2 --> go to page2, click 
around again, and save data for it.
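
The round-by-round replay described above could be planned roughly like this. It is a Python sketch only; the `Step` type, the action names ("open", "click", "collect") and `plan_for_page` are hypothetical and do not correspond to anything in protocol-selenium.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    action: str  # "open", "click" or "collect"
    target: str  # URL for "open", XPath for "click", empty for "collect"


def plan_for_page(start_url: str, next_xpath: str, page: int) -> List[Step]:
    """Plan one fetch round: replay pagination silently, then collect."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Always start from page1, since later pages have no URL of their own.
    steps = [Step("open", start_url)]
    # Click the pagination control once per earlier page, saving nothing.
    steps += [Step("click", next_xpath)] * (page - 1)
    # Only on the target page do we click around and save the data.
    steps.append(Step("collect", ""))
    return steps
```

What has to be saved "for the next round" is then just the page number (or the equivalent list of pagination clicks), which is small enough to travel with the URL as metadata.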




[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-20 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14899934#comment-14899934
 ] 

Sebastian Nagel commented on NUTCH-2110:


Hi Asitang, the Injector is already able to store key-value pairs from the seed 
list in the CrawlDb within the CrawlDatum's metadata, see 
[[1|http://nutch.apache.org/apidocs/apidocs-1.10/org/apache/nutch/crawl/Injector.html]].
 If the XPath statements are not too complex, this would be the easiest way: 
the protocol plugin could then read the XPath from the CrawlDatum.
Regarding the "state of a selenium operation": should a state be passed to the 
outlinks of a page, or is the same page fetched multiple times with varying 
Ajax/JavaScript actions to be performed?
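
For illustration, a seed line carrying an XPath as metadata could look like the one parsed below. The Injector's seed format is a URL followed by tab-separated `key=value` pairs; the metadata keys `selenium.xpath` and `selenium.query` are made up for this example, and the Python function only mimics the splitting, it is not the Injector's code. How the real Injector treats `=` inside a value should be checked before relying on it.

```python
from typing import Dict, Tuple


def parse_seed_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one seed line into its URL and its key=value metadata."""
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    meta: Dict[str, str] = {}
    for field in fields[1:]:
        # Split on the first '=' only, so XPath predicates such as
        # [@name='q'] stay intact inside the value.
        key, sep, value = field.partition("=")
        if not sep or not key.strip():
            continue  # skip fields that are not key=value pairs
        meta[key.strip()] = value.strip()
    return url, meta


# Hypothetical seed line: URL, then an XPath and a search term as metadata.
seed = "http://example.com/search\tselenium.xpath=//input[@name='q']\tselenium.query=nutch"
url, meta = parse_seed_line(seed)
```

A protocol plugin would then look up the same keys in the CrawlDatum's metadata instead of parsing them out of the URL.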
