[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Erlend Garåsen (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054304#comment-13054304
 ] 

Erlend Garåsen commented on CONNECTORS-214:
---

I agree. I suggest placing the post-extraction fields under the 
pre-extraction fields. This means that the include and exclude tabs will 
each have two fields.

Maybe the post-extraction fields should support more advanced filtering rules, 
for instance filtering based on mime types? This would make it easier to filter 
out video files without having to list every video file extension. 
What do you think?
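
To make the difference between the two sets of fields concrete, here is a minimal sketch in Python. It is not ManifoldCF code; the function names, the regular expressions and the example.org URLs are all assumptions made for illustration. Pre-extraction rules decide whether a URL is fetched (and its links followed), while post-extraction rules decide whether the fetched document is handed on for indexing.

```python
import re

def compile_rules(lines):
    """Compile one regular expression per non-empty line of a rule field."""
    return [re.compile(line) for line in lines if line.strip()]

def matches_any(rules, url):
    """True if at least one rule matches somewhere in the URL."""
    return any(rule.search(url) for rule in rules)

def passes(url, include, exclude):
    """A URL passes when it matches an include rule and no exclude rule."""
    return matches_any(include, url) and not matches_any(exclude, url)

# Hypothetical job configuration: crawl the whole site, index only PDFs.
pre_include = compile_rules([r"^https?://example\.org/"])
pre_exclude = compile_rules([r"\.(css|js)$"])
post_include = compile_rules([r"\.pdf$"])
post_exclude = compile_rules([])

def classify(url):
    """Return (fetch, index) for a URL under the rules above."""
    fetch = passes(url, pre_include, pre_exclude)
    index = fetch and passes(url, post_include, post_exclude)
    return fetch, index
```

With these rules, an HTML page is fetched so that its links can be followed but is not indexed, while a PDF reached through it is both fetched and indexed.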



 Add post-extraction inclusions and exclusions into the web connector
 

 Key: CONNECTORS-214
 URL: https://issues.apache.org/jira/browse/CONNECTORS-214
 Project: ManifoldCF
  Issue Type: Improvement
  Components: Web connector
Affects Versions: ManifoldCF 0.1, ManifoldCF 0.2
Reporter: Erlend Garåsen
Assignee: Erlend Garåsen
 Fix For: ManifoldCF next


 If HTML files are excluded for a job, links in these files will not be 
 followed. If we add post-extraction inclusion and exclusion filters, it will 
 be possible to index only certain types of documents, such as PDFs, while 
 still following links in HTML files.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054343#comment-13054343
 ] 

Karl Wright commented on CONNECTORS-214:


The web connector already filters by mime type, but it does so using the mime 
types accepted by the output connection.  This makes some degree of sense 
because presumably the output system determines which kinds of documents are 
acceptable for indexing.

This makes me wonder whether we'd be better off adding BOTH post-fetch indexing 
URL filtering and mime-type filtering to the Solr output connector.  Right now, 
the Solr output connector tells the world it accepts all mime types, but we can 
readily put that under user control.  The downside of that approach is that 
some repository connectors don't even know the mime types of the documents they 
are crawling, and thus this feature would be superfluous and confusing with 
those connectors.  URL filtering, though, would always be appropriate.
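
As a rough illustration of putting the accepted mime types under user control, here is a minimal Python sketch. The class and method names are hypothetical and do not reflect the actual ManifoldCF output connector API. It also shows one possible way to handle repository connectors that do not know the mime type: the check is simply skipped.

```python
from fnmatch import fnmatchcase

class OutputConnection:
    """Hypothetical stand-in for an output connection (not the ManifoldCF API)."""

    def __init__(self, accepted_patterns=("*/*",)):
        # The default "*/*" mirrors an output connection that accepts everything.
        self.accepted_patterns = tuple(p.lower() for p in accepted_patterns)

    def accepts_mime_type(self, mime_type):
        if mime_type is None:
            # Assumption: the repository connector could not determine the
            # mime type, so this criterion is not applied at all.
            return True
        # Drop parameters such as "; charset=utf-8" before matching.
        base = mime_type.split(";", 1)[0].strip().lower()
        return any(fnmatchcase(base, p) for p in self.accepted_patterns)
```

A wildcard pattern such as "video/*" covers every video format at once, so nobody has to enumerate file extensions.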






[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054355#comment-13054355
 ] 

Karl Wright commented on CONNECTORS-214:


Thinking about this further: in my mind, the mime-type (and length-based) 
filtering clearly belongs with the Solr connector.  An output connector should 
also have a say in which URLs it will accept.  Unless there are objections, I'm 
going to change this ticket to make it cover all three of these output 
filtering criteria (URL, mime type, and length).
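
The three criteria could be combined along the following lines. This is only a sketch in Python under stated assumptions: the OutputFilter class, its field names and its methods are invented for illustration and are not the framework's actual interface.

```python
import re
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Optional

@dataclass
class OutputFilter:
    """Hypothetical per-output filter covering URL, mime type and length."""
    url_exclude: List[str] = field(default_factory=list)    # regular expressions
    mime_patterns: List[str] = field(default_factory=lambda: ["*/*"])
    max_length: Optional[int] = None                         # bytes; None = no limit

    def check_url(self, url: str) -> bool:
        return not any(re.search(p, url) for p in self.url_exclude)

    def check_mime_type(self, mime_type: Optional[str]) -> bool:
        if mime_type is None:
            # Assumption: an unknown mime type is not filtered.
            return True
        base = mime_type.split(";", 1)[0].strip().lower()
        return any(fnmatchcase(base, p) for p in self.mime_patterns)

    def check_length(self, length: int) -> bool:
        return self.max_length is None or length <= self.max_length

    def accepts(self, url: str, mime_type: Optional[str], length: int) -> bool:
        """A document is indexable only if all three checks pass."""
        return (self.check_url(url)
                and self.check_mime_type(mime_type)
                and self.check_length(length))
```

Keeping the three checks as separate methods lets a repository connector call only the ones it can answer, for example the length check before downloading a document.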






[jira] [Assigned] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-214:
--

Assignee: Karl Wright  (was: Erlend Garåsen)





[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054440#comment-13054440
 ] 

Karl Wright commented on CONNECTORS-214:


I added the necessary infrastructure in the framework for all of these related 
pieces in r1139294.






[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054473#comment-13054473
 ] 

Karl Wright commented on CONNECTORS-214:


What remains: (a) Hooking up length-based filtering in all repository 
connectors; (b) Hooking up URL-based filtering in at least the web and RSS 
connectors, and maybe the rest as well; (c) Adding filtering support in the 
Solr connector.





[jira] [Commented] (CONNECTORS-214) Add post-extraction inclusions and exclusions into the web connector

2011-06-24 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054581#comment-13054581
 ] 

Karl Wright commented on CONNECTORS-214:


Also, r1139390 for changes to web and RSS connectors.

