[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907154#action_12907154
 ] 

Karl Wright commented on CONNECTORS-104:


Trying to limit to the seed domains automatically would, I think, cause more 
confusion than help.  I can, however, imagine introducing a checkbox on the 
Inclusions tab that, if checked, would limit the crawl to just the domains 
represented by the seeds, and even making it checked by default.  The implied 
regular expression would be:

^https?://domain([/?]|$)

for each seed, I believe.  (That's potentially a lot of regular expressions if 
the number of seeds is large, so obviously the logic wouldn't be using regexps 
in practice.)
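
As a rough illustration of the per-seed pattern, here is a minimal Java sketch. It reads the intent as: scheme http or https, then the seed's host, then '/', '?', or the end of the URL. The class and method names (SeedRegexSketch, patternForSeed, isIncluded) are hypothetical and are not part of the connector; ports and user-info in URLs are deliberately not handled.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not connector code): one include pattern per seed. */
final class SeedRegexSketch {

    /** Builds the implied include regex for the host of a single seed URL. */
    static Pattern patternForSeed(String seedUrl) {
        String host = URI.create(seedUrl).getHost();
        if (host == null) {
            throw new IllegalArgumentException("Seed has no host: " + seedUrl);
        }
        // Pattern.quote keeps the dots in the host name literal.
        // After the host we require '/', '?', or the end of the URL, so that
        // "example.com.evil.org" does not match a seed on "example.com".
        String regex = "^https?://" + Pattern.quote(host) + "(?:[/?]|$)";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Builds one pattern per seed; this list grows linearly with the seeds. */
    static List<Pattern> patternsForSeeds(List<String> seedUrls) {
        List<Pattern> patterns = new ArrayList<>();
        for (String seed : seedUrls) {
            patterns.add(patternForSeed(seed));
        }
        return patterns;
    }

    /** Returns true if the URL matches at least one seed pattern. */
    static boolean isIncluded(String url, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Pattern> patterns =
            patternsForSeeds(Arrays.asList("http://example.com/index.html"));
        System.out.println(isIncluded("https://example.com/news?id=1", patterns));
        System.out.println(isIncluded("http://example.com.evil.org/", patterns));
    }
}
```

Every candidate URL is tested against every pattern, which is the linear cost the parenthetical above alludes to.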


 Make it easier to limit a web crawl to a single site
 

 Key: CONNECTORS-104
 URL: https://issues.apache.org/jira/browse/CONNECTORS-104
 Project: Apache Connectors Framework
 Issue Type: Improvement
 Components: Web connector
 Reporter: Jack Krupansky
 Priority: Minor

 Unless the user carefully enters an explicit include regex, a web crawl can 
 quickly get out of control and start crawling the entire web, when all the 
 user may really want is to crawl a single web site or a portion of it. 
 It would therefore be preferable if, either by default or with a simple 
 button, the crawl could be limited to the seed web site(s).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Jack Krupansky (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907201#action_12907201
 ] 

Jack Krupansky commented on CONNECTORS-104:
---

Simple works best. This enhancement is primarily for the simple use case where 
a novice user tries to do what they think is obvious (crawl the web pages at 
this URL), but without considering all of the potential nuances or how to 
fully specify the details of their goal.

One nuance is whether subdomains are considered part of the domain. I would say 
no if a subdomain was specified by the user and yes if no subdomain was 
specified.

Another nuance is whether a path is specified to select a subset of a domain. 
It would be nice to handle that and (optionally) limit the crawl to that path 
(or sub-paths below it). An example would be to crawl the news archive for a 
site.
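
To make the two nuances concrete, here is a hedged Java sketch of a per-seed scope check. Everything in it is an assumption for illustration: the SeedScope class is hypothetical, the seed's path is treated as a directory prefix (so a seed of /news admits /news and anything under /news/), and the decision about whether subdomains count is passed in by the caller, because deciding whether a seed "specified a subdomain" needs knowledge of registrable domains that this sketch does not attempt.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical scope for one seed: host, optional subdomains, path prefix. */
final class SeedScope {
    private final String host;
    private final String pathPrefix;
    private final boolean includeSubdomains;

    SeedScope(String seedUrl, boolean includeSubdomains) {
        URI seed = URI.create(seedUrl);
        if (seed.getHost() == null) {
            throw new IllegalArgumentException("Seed has no host: " + seedUrl);
        }
        this.host = seed.getHost().toLowerCase(Locale.ROOT);
        this.pathPrefix = asDirectory(seed.getPath());
        this.includeSubdomains = includeSubdomains;
    }

    /** Returns true if the URL lies on the seed's host and under its path. */
    boolean contains(String url) {
        URI candidate;
        try {
            candidate = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // unparseable URLs are treated as out of scope
        }
        if (candidate.getHost() == null) {
            return false;
        }
        String h = candidate.getHost().toLowerCase(Locale.ROOT);
        boolean hostOk = h.equals(host)
            || (includeSubdomains && h.endsWith("." + host));
        // Comparing directory-style paths keeps "/newsletter" out of "/news".
        return hostOk && asDirectory(candidate.getPath()).startsWith(pathPrefix);
    }

    /** Normalizes a path so that it always ends with '/'. */
    private static String asDirectory(String path) {
        if (path == null || path.isEmpty()) {
            return "/";
        }
        return path.endsWith("/") ? path : path + "/";
    }

    public static void main(String[] args) {
        SeedScope news = new SeedScope("http://example.com/news", false);
        System.out.println(news.contains("http://example.com/news/2010/09.html"));
        System.out.println(news.contains("http://example.com/newsletter"));
    }
}
```

Under these assumptions, a seed that points at a single page (for example /news/index.html) scopes the crawl to that page only; a real implementation would need to decide whether to fall back to the parent directory instead.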





[jira] Commented: (CONNECTORS-104) Make it easier to limit a web crawl to a single site

2010-09-08 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907203#action_12907203
 ] 

Karl Wright commented on CONNECTORS-104:


For someone who is purportedly trying to make things simpler, you have 
specified a rather complex set of rules, many of which seem of questionable 
utility to me.

Since this is basically just a shortcut, I propose a simple feature that 
limits all URLs to hosts that are explicitly mentioned in the seeds.
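
A minimal Java sketch of such a host-only shortcut follows, assuming the seed hosts are collected once into a set so that each URL check is a single lookup. The SeedHostFilter class and its accepts method are hypothetical names used for illustration; they do not describe the connector's actual implementation.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical filter: accepts only URLs whose host appears in the seeds. */
final class SeedHostFilter {
    private final Set<String> seedHosts = new HashSet<>();

    SeedHostFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                seedHosts.add(host);
            }
        }
    }

    /** Returns true only for an exact (case-insensitive) seed host match. */
    boolean accepts(String url) {
        String host = hostOf(url);
        return host != null && seedHosts.contains(host);
    }

    /** Extracts the lower-cased host, or null if the URL has none. */
    private static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null; // unparseable URLs are rejected
        }
    }

    public static void main(String[] args) {
        SeedHostFilter filter = new SeedHostFilter(
            Arrays.asList("http://example.com/", "https://news.example.org/archive/"));
        System.out.println(filter.accepts("http://example.com/about"));
        System.out.println(filter.accepts("http://www.example.com/"));
    }
}
```

Because the match is on the exact host, www.example.com and example.com count as different sites unless both appear in the seeds, and the scheme and path of the URL are ignored.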

