[
https://issues.apache.org/jira/browse/CONNECTORS-552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480954#comment-13480954
]
Karl Wright commented on CONNECTORS-552:
----------------------------------------
I looked in depth at what would be needed to make this a general framework
feature. The biggest concern is around the ingeststatus table, as managed by
the class IncrementalIngester.java. The performance of this subsystem is the
biggest single contributor to overall performance, so it is critical that any
changes don't significantly hurt performance.
If we did things in the most SQL-standard manner, we would need to add a
child table keyed to the ingeststatus table, which would keep track of the
name/value metadata pairs being forced along with each document.
Unfortunately, this would add at least two additional SQL queries per
ingeststatus row update, which would probably have a significant impact on
overall crawler performance. So we probably don't want to do it that way.
A second way to do this would involve adding another unlimited text column to
the ingeststatus table, and serializing the name/value pairs. I think this
would impact table updates minimally, and is worth an experiment to see if my
hunch is correct. The experiment would necessarily require a new branch (let's
call it CONNECTORS-552-2), and we'd be changing the schema and interfaces in
the following way:
- Add a new child table to the jobs table, which contains name/value pairs, and
an associated manager class in org.apache.manifoldcf.crawler.jobs, with support
in the JobDescription class, and an associated UI tab.
- Modify the ingeststatus table schema to add a new column for containing the
serialized name/value pairs.
- Modify the documentIngest method of IncrementalIngester.java to accept an
additional argument of type Map<String,Set<String>>, containing the forced
metadata parameters.
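
To make the serialized-column idea concrete, here is a minimal sketch of how
a Map<String,Set<String>> of forced metadata could be packed into one string
suitable for a single text column, and read back. The class name
ForcedMetadataCodec, the pack/unpack methods, and the backslash-escaped
"name=value+" encoding are all assumptions made for illustration; none of
this is existing ManifoldCF code, and the real experiment may well pick a
different format.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of ManifoldCF): packs forced metadata
// into a single string so it can live in one text column.
final class ForcedMetadataCodec {

    // Encode as name=value+name=value+... with '\', '=' and '+' escaped.
    // Names and values are sorted so equal maps yield equal strings.
    // A name whose value set is empty produces no output and is dropped.
    static String pack(Map<String, Set<String>> forced) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Set<String>> e : new TreeMap<>(forced).entrySet()) {
            for (String value : new TreeSet<>(e.getValue())) {
                appendEscaped(sb, e.getKey());
                sb.append('=');
                appendEscaped(sb, value);
                sb.append('+');
            }
        }
        return sb.toString();
    }

    // Inverse of pack(): rebuilds the name -> values map.
    static Map<String, Set<String>> unpack(String packed) {
        Map<String, Set<String>> result = new TreeMap<>();
        StringBuilder token = new StringBuilder();
        String name = null;
        for (int i = 0; i < packed.length(); i++) {
            char c = packed.charAt(i);
            if (c == '\\' && i + 1 < packed.length()) {
                // Escaped character: take the next one literally.
                token.append(packed.charAt(++i));
            } else if (c == '=' && name == null) {
                // Unescaped '=' ends the name.
                name = token.toString();
                token.setLength(0);
            } else if (c == '+' && name != null) {
                // Unescaped '+' ends the value and the pair.
                result.computeIfAbsent(name, k -> new TreeSet<>()).add(token.toString());
                name = null;
                token.setLength(0);
            } else {
                token.append(c);
            }
        }
        return result;
    }

    private static void appendEscaped(StringBuilder sb, String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' || c == '=' || c == '+') {
                sb.append('\\');
            }
            sb.append(c);
        }
    }
}
```

With something along these lines, the ingeststatus update would write one
extra column value per row instead of issuing additional queries against a
child table, which is the property the experiment is meant to measure.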
Thoughts?
> Forced solr attributes in job specification and/or configuration
> ----------------------------------------------------------------
>
> Key: CONNECTORS-552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-552
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Framework crawler agent
> Reporter: Maciej Lizewski
>
> Would be nice if there was a globally managed tab (like "connection" or
> "scheduling") for job specification (or configuration) allowing to force some
> solr attributes. It could look and work similar to "Solr Field Mapping"
> allowing to specify name=value associations.
> I am thinking about such case:
> Index all documents from repository X, and set their "source" attribute to
> "repository X". Then I could filter results to those that came from specified
> source. But I think there can be other possibilities, like: index all
> documents from a Windows share and set their "client" field to "Client X",
> because all documents there are associated with one client and I would like
> to have filters, facets on such field (and I cannot fetch such value from
> documents because people never set meta tags...).
> Real life: I have three document sources: Samba share with some project
> documents, internal wiki system, mantis bug tracker. I would like to query
> Solr for "all documents from wiki, which contain phrase XXXX".