[ 
https://issues.apache.org/jira/browse/CONNECTORS-552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480954#comment-13480954
 ] 

Karl Wright commented on CONNECTORS-552:
----------------------------------------

I looked in depth at what would be needed to make this a general framework 
feature.  The biggest concern is around the ingeststatus table, as managed by 
the class IncrementalIngester.java.  The performance of this subsystem is the 
biggest single contributor to overall performance, so it is critical that any 
changes don't significantly hurt performance.

If we did things in the most SQL-standard manner, we would need to add a second 
(child) table to the ingeststatus table, which would keep track of the 
name/value metadata pairs being forced along with each document.  
Unfortunately, this would add at least two additional SQL queries per 
ingeststatus row update, which would probably have a significant impact on 
overall crawler performance.  So we probably don't want to do it that way.

A second way to do this would involve adding another unlimited text column to 
the ingeststatus table, and serializing the name/value pairs.  I think this 
would impact table updates minimally, and is worth an experiment to see if my 
hunch is correct.  The experiment would necessarily require a new branch (let's 
call it CONNECTORS-552-2), and we'd be changing the schema and interfaces in 
the following way:

- Add a new child table to the jobs table to hold the name/value pairs, along 
with an associated manager class in org.apache.manifoldcf.crawler.jobs, support 
in the JobDescription class, and an associated UI tab.
- Modify the ingeststatus table schema to add a new column for containing the 
serialized name/value pairs.
- Modify the documentIngest method of IncrementalIngester.java to accept an 
additional argument of type Map<String,Set<String>>, containing the forced 
metadata parameters.
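
To make the serialization idea concrete, here is a minimal sketch of how a 
Map<String,Set<String>> could be packed into a single text value and unpacked 
again. The class name ForcedMetadataCodec, the method names pack/unpack, and 
the "name=value&name=value" layout with backslash escaping are all assumptions 
made for illustration; none of this exists in ManifoldCF, and the real column 
format would be decided on the branch.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of ManifoldCF: packs forced metadata into one
// string suitable for a single text column, and unpacks it again.
final class ForcedMetadataCodec {

    // Escape the three characters that have meaning in the packed form.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '=' || c == '&') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Names and values are sorted so that equal maps always produce the same
    // string, which keeps "did anything change?" comparisons cheap.
    static String pack(Map<String, Set<String>> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Set<String>> e : new TreeMap<>(metadata).entrySet()) {
            for (String value : new TreeSet<>(e.getValue())) {
                if (sb.length() > 0) {
                    sb.append('&');
                }
                sb.append(escape(e.getKey())).append('=').append(escape(value));
            }
        }
        return sb.toString();
    }

    // Inverse of pack(). A null or empty string yields an empty map.
    static Map<String, Set<String>> unpack(String packed) {
        Map<String, Set<String>> result = new TreeMap<>();
        if (packed == null || packed.isEmpty()) {
            return result;
        }
        StringBuilder name = new StringBuilder();
        StringBuilder value = new StringBuilder();
        StringBuilder current = name;
        for (int i = 0; i < packed.length(); i++) {
            char c = packed.charAt(i);
            if (c == '\\' && i + 1 < packed.length()) {
                // Escaped character: take the next one literally.
                current.append(packed.charAt(++i));
            } else if (c == '=' && current == name) {
                current = value;
            } else if (c == '&') {
                result.computeIfAbsent(name.toString(), k -> new TreeSet<>())
                      .add(value.toString());
                name.setLength(0);
                value.setLength(0);
                current = name;
            } else {
                current.append(c);
            }
        }
        // Flush the final pair, which has no trailing '&'.
        result.computeIfAbsent(name.toString(), k -> new TreeSet<>())
              .add(value.toString());
        return result;
    }
}
```

One limitation of this particular layout: a name whose value set is empty 
produces no output and so does not survive a round trip. That seems acceptable 
for forced metadata, where a name without a value has nothing to send.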

Thoughts?

                
> Forced solr attributes in job specification and/or configuration
> ----------------------------------------------------------------
>
>                 Key: CONNECTORS-552
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-552
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework crawler agent
>            Reporter: Maciej Lizewski
>
> Would be nice if there was a globally managed tab (like "connection" or 
> "scheduling") for job specification (or configuration) allowing to force some 
> solr attributes. It could look and work similar to "Solr Field Mapping" 
> allowing to specify name=value associations.
> I am thinking about such case:
> Index all documents from repository X, and set their "source" attribute to 
> "repository X". Then I could filter results to those that came from specified 
> source. But I think there can be other possibilities, like: index all 
> documents from windows share and set them field "client" to "Client X", 
> because all documents there are associated with one client and I would like 
> to have filters, facets on such field (and I cannot fetch such value from 
> documents because people never set meta tags...).
> Real life: I have three document sources: Samba share with some project 
> documents, internal wiki system, mantis bug tracker. I would like to query 
> Solr for "all documents from wiki, which contain phrase XXXX".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
