Karl,

I'm importing data from a number of sources, including SharePoint, file
shares, and an ORACLE database. The files/records are indexed by SOLR.

Right now, some of the import is done through custom SOLR Data Import
Handler facilities. I'm hoping to move away from that in the future.

We are also aggregating some of the file share data into custom views on
the web client. Lots of preprocessing.

All of this is stored in the SOLR index, along with metadata describing how
to display it within our custom web client. If the result is of a certain
type, we have custom templates that are displayed for it.

Manifold is a good solution for the SharePoint data. We don't really do any
custom processing on it other than strip HTML from the text.
It's the database and file share information that adds some challenges.
I'm hoping to get SOLR out of the text processing pipeline, and just
let it index data. We are moving to Pentaho at some point, and we'll
probably handle most of the custom metadata processing there.
At some point, we'll possibly integrate Pentaho as an output connection in
Manifold.
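
For illustration only, here is a minimal sketch of the kind of metadata
preprocessing discussed in this thread: stripping HTML from text, turning a
file path into a URL that a retrieval servlet could serve, and conditionally
populating an extra field. It is not what is actually deployed. The servlet
URL, the field names (file_path, content_type, download_url,
display_template) and all function names are assumptions made up for the
example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical servlet endpoint; not taken from the original thread.
SERVLET_BASE = "http://search.example.com/docs/fetch"


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of `html` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def file_path_to_url(file_path, base=SERVLET_BASE):
    """Map a file share path to a servlet URL carrying the path as a query parameter."""
    normalized = file_path.replace("\\", "/")  # accept Windows-style separators
    return base + "?" + urlencode({"path": normalized})


def build_solr_doc(metadata, body_html):
    """Build a plain dict of fields that could later be posted to an index."""
    doc = {"id": metadata["id"], "text": strip_html(body_html)}
    # Only add a retrieval URL when the source supplied a file path.
    if "file_path" in metadata:
        doc["download_url"] = file_path_to_url(metadata["file_path"])
    # Conditionally populate a display hint based on incoming metadata.
    if metadata.get("content_type") == "application/pdf":
        doc["display_template"] = "pdf"
    return doc
```

Calling build_solr_doc with a metadata dict and the raw HTML body returns a
dict of fields. Keeping this step outside the indexer is what would let the
index just store what it is given.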

Thanks,

Matt

On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright <daddy...@gmail.com> wrote:

> Please see my response interleaved below.
>
> On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
> <mpar...@apogeeintegration.com> wrote:
> > I'm trying to push data into SOLR.
> >
> > Is there a way to transform the metadata coming in from different data
> > sources like SharePoint, and the File Share, prior to posting it into
> > SOLR?
> >
>
> In general, ManifoldCF does not have data transformation abilities.
> With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
> extract content from documents and to perform transformations to
> document metadata etc.  At some point ManifoldCF may gain more
> transformation abilities in order to support search engines that
> don't have a pipeline, but that is currently not available.
>
> > For instance, documents have metadata specifying their file path. I need
> > to transform that to a URL I can use within SOLR to retrieve that document
> > through a servlet that I wrote.
> >
>
> The ManifoldCF model is that a connector creates a URL for each
> document that it indexes, using whatever makes sense for that
> particular repository to get you back to the document in question.
> So, for instance, Documentum documents will use URLs that point at
> Documentum's Webtop web application.
>
> It would be helpful to understand more precisely what you are trying
> to do.  You could, for instance, modify your servlet to redirect to
> the ManifoldCF-generated URL.  It gets indexed into Solr as the "id"
> field.
>
> > Also, based on specific metadata that I'm seeing in the documents, I
> > might want to conditionally populate other fields in the SOLR index.
> >
>
> That sounds like a job for the Tika pipeline to me.
>
> Thanks,
> Karl
>
> > ------------------------------
> > This e-mail and any files transmitted with it may be proprietary.  Please
> > note that any views or opinions presented in this e-mail are solely those
> > of the author and do not necessarily represent those of Apogee Integration.
> >
>
