Re: Cannot find OracleDriver

2012-02-27 Thread Matthew Parker
type: JDBC
Authority: None
Database Type: ORACLE
Database and Host: 21:16:18:145:1521
Instance/Database: main
User Name: 
Password: X

On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote:

 I haven't seen this one.  I'd love to know what the connect
 descriptor it refers to is.

 Can you tell me what the parameters all look like for the JDBC
 connection you are setting up?  Are you specifying, for instance, the
 port as part of the server name?

 Karl

 On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  Karl,
 
  That fixed the driver issue. I just updated my start.jar file by hand for
  now.
 
  The problem I have now is connecting to ORACLE. I can do it through
 NetBeans
  on my machine, but
  I cannot connect through ManifoldCF with the same settings. I get the
  following error:
 
  Error getting connection. Listener refused the connection with the
 following
  error.
 
  ORA-12514. TNS:Listener does not currently know of service requested in
  connect descriptor.
 
  This might be more of an ORACLE issue than Manifold issue, but I was
  wondering whether
  you've encountered the same thing during testing?
 
  Regards,
 
  Matt
 
  On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker
  mpar...@apogeeintegration.com wrote:
 
  Thanks Karl.
 
  On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com
 wrote:
 
  The problem has been fixed on trunk.  Basically, the instructions
  changed as did some of the build files.  It turned out to be extremely
  challenging to get JDBC drivers to run when they were loaded by
  anything other than the system classloader, so that's what I was
  forced to ensure.
 
  Thanks,
  Karl
 
 
  On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com
 wrote:
   The ticket for this problem is CONNECTORS-390.
  
   Karl
  
   On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker
   mpar...@apogeeintegration.com wrote:
   Many thanks. I'll give that a try.
  
   On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com
   wrote:
  
   The problem is that the JDBC driver is using a pool driver that is
 in
   common with the core of ManifoldCF.  So the connector-lib path,
 which
   only the connectors know about, won't do.  That's a bug which I'll
   create a ticket for.
  
   A temporary fix, which is slightly involved, requires you to put
 the
   ojdbc6.jar in the example/lib area, as you already tried, but in
   addition you will need to explicitly include the jar in your
   classpath.  Normally the start.jar's manifest describes all the
 jars
   in the initial classpath.  I thought it was possible to also
 include
   additional classpath info through the normal --classpath mechanism,
   but that doesn't seem to work, so you may be stuck with modifying
 the
   root build.xml file to add the jar to the manifest.
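
One way to confirm that the workaround took effect is to read the Class-Path attribute back out of the start.jar manifest. This is a minimal sketch, assuming the jar is named start.jar and sits in the current directory; the class and method names are hypothetical and are not part of ManifoldCF.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper, not part of ManifoldCF: inspects a jar's manifest Class-Path.
class ManifestClassPath {
    // Class-Path entries are separated by whitespace; match on the file name.
    static boolean listsJar(String classPathValue, String jarName) {
        if (classPathValue == null) {
            return false;
        }
        for (String entry : classPathValue.trim().split("\\s+")) {
            if (entry.equals(jarName) || entry.endsWith("/" + jarName)) {
                return true;
            }
        }
        return false;
    }

    // Returns the raw Class-Path value, or null if the jar has no manifest or no such attribute.
    static String readClassPath(String jarPath) throws IOException {
        JarFile jar = new JarFile(jarPath);
        try {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        } finally {
            jar.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "start.jar"; // assumed location
        if (!new File(jarPath).isFile()) {
            System.out.println("No such jar: " + jarPath);
            return;
        }
        String classPath = readClassPath(jarPath);
        System.out.println("Class-Path: " + classPath);
        System.out.println("Lists ojdbc6.jar: " + listsJar(classPath, "ojdbc6.jar"));
    }
}
```

If the last line prints false, the driver jar is not on the initial classpath described by the manifest.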
  
   I'm going to experiment a bit and see if I can come up with
 something
   quickly.
  
   Karl
  
  
   On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com
   wrote:
I was able to reproduce the problem.  I'll get back to you when I
figure out what the issue is.
Karl
   
On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
I've used the jar file in NetBeans to connect to the database
without
any
issue.
   
Seems more like a class loader issue.
   
   
On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
   
I have the latest release from the Apache Manifold site (i.e.
0.3-incubating).
   
I checked the driver jar file with winzip, and the driver name
 is
still
the same (oracle.jdbc.OracleDriver).
   
I'm running java 1.6.0_18-b7 on Windows XP SP 3.
   
On Thu, Jan 19, 2012 at 2:27 PM, Karl Wright 
 daddy...@gmail.com
wrote:
   
MCF's Oracle support was written against earlier versions of
 the
Oracle driver.  It is possible that they have changed the
 driver
class.  If the driver winds up in the dist/connector-lib
directory
(I'm assuming you are using trunk or 0.4-incubating), then it
should
be accessible.
   
Could you please try the following:
   
jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver
   
... assuming you are using Linux?
   
If the driver class IS found, then the other possibility is
 that
the
jar is compiled against a later version of Java than the one
 you
are
using to run MCF.
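
The grep pipeline above assumes Linux. On Windows, roughly the same check can be done from Java itself. This is a minimal sketch, assuming ojdbc6.jar is in the current directory; the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks whether a jar contains a given class.
class JarClassCheck {
    // Converts a class name such as oracle.jdbc.OracleDriver to its jar entry name.
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    static boolean containsClass(String jarPath, String className) throws IOException {
        JarFile jar = new JarFile(jarPath);
        try {
            return jar.getEntry(entryNameFor(className)) != null;
        } finally {
            jar.close();
        }
    }

    public static void main(String[] args) throws IOException {
        String jarPath = args.length > 0 ? args[0] : "ojdbc6.jar"; // assumed location
        if (!new File(jarPath).isFile()) {
            System.out.println("No such jar: " + jarPath);
            return;
        }
        System.out.println(containsClass(jarPath, "oracle.jdbc.OracleDriver"));
    }
}
```

This only confirms that the class is present in the jar. It says nothing about which classloader will load it or which Java version the jar was compiled for.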
   
Please let me know what you find.
   
Karl
   
On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
 I downloaded MCF and started playing with the default setup
 under
 Jetty
 and
 Derby. It starts up without any issue.

 I would like to connect to our ORACLE database and import
 data
 into
 SOLR.

 I placed the ojdbc6.jar file in the
 connectors/jdbc/jdbc-drivers
 directory
 as stated 

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

2012-02-27 Thread Karl Wright
Please see my response interleaved below.

On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
mpar...@apogeeintegration.com wrote:
 I'm trying to push data into SOLR.

 Is there a way to transform the metadata coming in from different data
 sources like SharePoint, and the File Share, prior to posting it into SOLR?


In general, ManifoldCF does not have data transformation abilities.
With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
extract content from documents and to perform transformations to
document metadata etc.  At some point ManifoldCF may be able to do more
transformations itself in order to support search engines that don't
have a pipeline, but that is currently not available.

 For instance, documents have metadata specifying their file path. I need to
 transform that to a URL I can use within SOLR to retrieve that document
 through a servlet that I wrote.


The ManifoldCF model is that a connector creates a URL for each
document that it indexes, using whatever makes sense for that
particular repository to get you back to the document in question.
So, for instance, Documentum documents will use URLs that point at
Documentum's Webtop web application.

It would be helpful to understand more precisely what you are trying
to do.  You could, for instance, modify your servlet to redirect to
the ManifoldCF-generated URL, which gets indexed into Solr as the id
field.

 Also, based on specific metadata that I'm seeing in the documents, I might
 want to conditionally populate other fields in the SOLR index.


That sounds like a job for the Tika pipeline to me.

Thanks,
Karl

 --
 This e-mail and any files transmitted with it may be proprietary.  Please
 note that any views or opinions presented in this e-mail are solely those of
 the author and do not necessarily represent those of Apogee Integration.



Re: Cannot find OracleDriver

2012-02-27 Thread Karl Wright
So if the Database and Host field really is 21:16:18:145:1521, try
21.16.18.145:1521 instead. ;-)
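
A quick sanity check for this kind of typo can be sketched in a few lines. It assumes the field is meant to hold a single host (dotted IPv4 address or DNS name) followed by one colon and a port; the class name and the pattern are illustrative and are not how ManifoldCF validates the field.

```java
import java.util.regex.Pattern;

// Hypothetical check: "host:port" with exactly one colon, before the port.
class HostPortCheck {
    private static final Pattern HOST_PORT =
        Pattern.compile("^[A-Za-z0-9.-]+:\\d{1,5}$");

    static boolean looksLikeHostPort(String value) {
        return HOST_PORT.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHostPort("21:16:18:145:1521")); // colons inside the host part
        System.out.println(looksLikeHostPort("21.16.18.145:1521")); // dotted address plus port
    }
}
```

The first call returns false and the second returns true, because colons are not allowed in the host part of the pattern.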

Karl

On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker
mpar...@apogeeintegration.com wrote:
 type: JDBC
 Authority: None
 Database Type: ORACLE
 Database and Host: 21:16:18:145:1521
 Instance/Database: main
 User Name: 
 Password: X


 On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote:

 I haven't seen this one.  I'd love to know what the connect
 descriptor it refers to is.

 Can you tell me what the parameters all look like for the JDBC
 connection you are setting up?  Are you specifying, for instance, the
 port as part of the server name?

 Karl

 On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  Karl,
 
  That fixed the driver issue. I just updated my start.jar file by hand
  for
  now.
 
  The problem I have now is connecting to ORACLE. I can do it through
  NetBeans
  on my machine, but
  I cannot connect through ManifoldCF with the same settings. I get the
  following error:
 
  Error getting connection. Listener refused the connection with the
  following
  error.
 
  ORA-12514. TNS:Listener does not currently know of service requested in
  connect descriptor.
 
  This might be more of an ORACLE issue than Manifold issue, but I was
  wondering whether
  you've encountered the same thing during testing?
 
  Regards,
 
  Matt
 
  On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker
  mpar...@apogeeintegration.com wrote:
 
  Thanks Karl.
 
  On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com
  wrote:
 
  The problem has been fixed on trunk.  Basically, the instructions
  changed as did some of the build files.  It turned out to be extremely
  challenging to get JDBC drivers to run when they were loaded by
  anything other than the system classloader, so that's what I was
  forced to ensure.
 
  Thanks,
  Karl
 
 
  On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com
  wrote:
   The ticket for this problem is CONNECTORS-390.
  
   Karl
  
   On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker
   mpar...@apogeeintegration.com wrote:
   Many thanks. I'll give that a try.
  
   On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com
   wrote:
  
   The problem is that the JDBC driver is using a pool driver that is
   in
   common with the core of ManifoldCF.  So the connector-lib path,
   which
   only the connectors know about, won't do.  That's a bug which I'll
   create a ticket for.
  
   A temporary fix, which is slightly involved, requires you to put
   the
   ojdbc6.jar in the example/lib area, as you already tried, but in
   addition you will need to explicitly include the jar in your
   classpath.  Normally the start.jar's manifest describes all the
   jars
   in the initial classpath.  I thought it was possible to also
   include
   additional classpath info through the normal --classpath
   mechanism,
   but that doesn't seem to work, so you may be stuck with modifying
   the
   root build.xml file to add the jar to the manifest.
  
   I'm going to experiment a bit and see if I can come up with
   something
   quickly.
  
   Karl
  
  
   On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com
   wrote:
I was able to reproduce the problem.  I'll get back to you when
I
figure out what the issue is.
Karl
   
On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
I've used the jar file in NetBeans to connect to the database
without
any
issue.
   
Seems more like a class loader issue.
   
   
On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
   
I have the latest release from the Apache Manifold site (i.e.
0.3-incubating).
   
I checked the driver jar file with winzip, and the driver name
is
still
the same (oracle.jdbc.OracleDriver).
   
I'm running java 1.6.0_18-b7 on Windows XP SP 3.
   
On Thu, Jan 19, 2012 at 2:27 PM, Karl Wright
daddy...@gmail.com
wrote:
   
MCF's Oracle support was written against earlier versions of
the
Oracle driver.  It is possible that they have changed the
driver
class.  If the driver winds up in the dist/connector-lib
directory
(I'm assuming you are using trunk or 0.4-incubating), then it
should
be accessible.
   
Could you please try the following:
   
jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver
   
... assuming you are using Linux?
   
If the driver class IS found, then the other possibility is
that
the
jar is compiled against a later version of Java than the one
you
are
using to run MCF.
   
Please let me know what you find.
   
Karl
   
On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
 I downloaded MCF and started playing with the default setup
 under
 Jetty
 and
 

Re: Cannot find OracleDriver

2012-02-27 Thread Matthew Parker
Sorry. I used the wrong character. It is configured for 21.16.18.145:1521

On Mon, Feb 27, 2012 at 10:27 AM, Karl Wright daddy...@gmail.com wrote:

 So if the Database and Host field really is 21:16:18:145:1521, try
 21.16.18.145:1521 instead. ;-)

 Karl

 On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  type: JDBC
  Authority: None
  Database Type: ORACLE
  Database and Host: 21:16:18:145:1521
  Instance/Database: main
  User Name: 
  Password: X
 
 
  On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote:
 
  I haven't seen this one.  I'd love to know what the connect
  descriptor it refers to is.
 
  Can you tell me what the parameters all look like for the JDBC
  connection you are setting up?  Are you specifying, for instance, the
  port as part of the server name?
 
  Karl
 
  On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker
  mpar...@apogeeintegration.com wrote:
   Karl,
  
   That fixed the driver issue. I just updated my start.jar file by hand
   for
   now.
  
   The problem I have now is connecting to ORACLE. I can do it through
   NetBeans
   on my machine, but
   I cannot connect through ManifoldCF with the same settings. I get the
   following error:
  
   Error getting connection. Listener refused the connection with the
   following
   error.
  
   ORA-12514. TNS:Listener does not currently know of service requested
 in
   connect descriptor.
  
   This might be more of an ORACLE issue than Manifold issue, but I was
   wondering whether
   you've encountered the same thing during testing?
  
   Regards,
  
   Matt
  
   On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker
   mpar...@apogeeintegration.com wrote:
  
   Thanks Karl.
  
   On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com
   wrote:
  
   The problem has been fixed on trunk.  Basically, the instructions
   changed as did some of the build files.  It turned out to be
 extremely
   challenging to get JDBC drivers to run when they were loaded by
   anything other than the system classloader, so that's what I was
   forced to ensure.
  
   Thanks,
   Karl
  
  
   On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com
   wrote:
The ticket for this problem is CONNECTORS-390.
   
Karl
   
On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
Many thanks. I'll give that a try.
   
On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com
 
wrote:
   
The problem is that the JDBC driver is using a pool driver that
 is
in
common with the core of ManifoldCF.  So the connector-lib path,
which
only the connectors know about, won't do.  That's a bug which
 I'll
create a ticket for.
   
A temporary fix, which is slightly involved, requires you to put
the
ojdbc6.jar in the example/lib area, as you already tried, but in
addition you will need to explicitly include the jar in your
classpath.  Normally the start.jar's manifest describes all the
jars
in the initial classpath.  I thought it was possible to also
include
additional classpath info through the normal --classpath
mechanism,
but that doesn't seem to work, so you may be stuck with
 modifying
the
root build.xml file to add the jar to the manifest.
   
I'm going to experiment a bit and see if I can come up with
something
quickly.
   
Karl
   
   
On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright 
 daddy...@gmail.com
wrote:
 I was able to reproduce the problem.  I'll get back to you
 when
 I
 figure out what the issue is.
 Karl

 On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
 I've used the jar file in NetBeans to connect to the database
 without
 any
 issue.

 Seems more like a class loader issue.


 On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:

 I have the latest release from the Apache Manifold site
 (i.e.
 0.3-incubating).

 I checked the driver jar file with winzip, and the driver
 name
 is
 still
 the same (oracle.jdbc.OracleDriver).

 I'm running java 1.6.0_18-b7 on Windows XP SP 3.

 On Thu, Jan 19, 2012 at 2:27 PM, Karl Wright
 daddy...@gmail.com
 wrote:

 MCF's Oracle support was written against earlier versions
 of
 the
 Oracle driver.  It is possible that they have changed the
 driver
 class.  If the driver winds up in the dist/connector-lib
 directory
 (I'm assuming you are using trunk or 0.4-incubating), then
 it
 should
 be accessible.

 Could you please try the following:

 jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver

 ... assuming you are using Linux?

 If the driver class IS found, then the other possibility is
 that
 the
 jar is compiled against a 

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

2012-02-27 Thread Matthew Parker
Karl,

I'm importing data from a number of sources, including SharePoint, file
shares, and an ORACLE database. The files/records are indexed by SOLR.

Right now, some of the import is done through custom SOLR Data Import
Handler facilities. I'm hoping to move away from that in the future.

We are also aggregating some of the file share data into custom views on
the web client. Lots of preprocessing.

All of this is stored in the SOLR index with metadata describing how to
display it within our custom web client. If the result is a certain type,
we have custom templates that are displayed for it.

Manifold is a good solution for the SharePoint data. We don't really do any
custom processing on it other than stripping HTML from the text.
It's the database and file share information that adds some challenges.
I'm hoping to get SOLR out of the text processing pipeline, and just
let it index data. We are moving to Pentaho at some point, and we'll
probably handle most of the custom metadata processing there.
At some point, we'll possibly integrate Pentaho as an output connection in
Manifold.

Thanks,

Matt

On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright daddy...@gmail.com wrote:

 Please see my response interleaved below.

 On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  I'm trying to push data into SOLR.
 
  Is there a way to transform the metadata coming in from different data
  sources like SharePoint, and the File Share, prior to posting it into
 SOLR?
 

 In general, ManifoldCF does not have data transformation abilities.
 With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
 extract content from documents and to perform transformations to
 document metadata etc.  It is possible that at some point it will be
 possible to do more transformations in ManifoldCF in order to support
 search engines that don't have a pipeline, but that is currently not
 available.

  For instance, documents have metadata specifying their file path. I need
 to
  transform that to a URL I can use within SOLR to retrieve that document
  through a servlet that I wrote.
 

 The ManifoldCF model is that a connector creates a URL for each
 document that it indexes, using whatever makes sense for that
 particular repository to get you back to the document in question.
 So, for instance, Documentum documents will use URLs that point at
 Documentum's Webtop web application.

 It would be helpful to understand more precisely what you are trying
 to do.  You could, for instance, modify your servlet to redirect to
 the ManifoldCF-generated URL.  It gets indexed into Solr as the id
 field.

  Also, based on specific metadata that I'm seeing in the documents, I
 might
  want to conditionally populate other fields in the SOLR index.
 

 That sounds like a job for the Tika pipeline to me.

 Thanks,
 Karl



Re: Cannot find OracleDriver

2012-02-27 Thread Karl Wright
The connect URL it will use given those parameters is the following:

String dburl = "jdbc:" + providerName + "//" + host + "/" +
database + ((instanceName==null)?"":";instance="+instanceName);

Or, filled in with your parameters:

jdbc:oracle:thin:@//21.16.18.145:1521/main

The "main" at the end is what I would wonder about.  Oracle's default
is "database"; if you leave the database/instance name field blank,
that's what you'll get.
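
The construction above can be reproduced as a small standalone sketch. The class and method names are hypothetical, and "oracle:thin:@" is the provider prefix implied by the filled-in example rather than a value taken from the ManifoldCF source.

```java
// Hypothetical standalone version of the URL construction described above.
class ConnectUrlSketch {
    static String buildUrl(String providerName, String host,
                           String database, String instanceName) {
        // The instance suffix is appended only when an instance name is given.
        return "jdbc:" + providerName + "//" + host + "/" + database
            + ((instanceName == null) ? "" : ";instance=" + instanceName);
    }

    public static void main(String[] args) {
        // Provider prefix assumed from the filled-in example in this message.
        System.out.println(
            buildUrl("oracle:thin:@", "21.16.18.145:1521", "main", null));
    }
}
```

With the Oracle thin driver, the @//host:port/name form treats the trailing name as a service name, while the older @host:port:SID form names a SID. If the NetBeans connection uses the SID form, that difference could account for ORA-12514 here. That is a guess, not something this thread confirms.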

I also recommend turning on connector debugging, in properties.xml, by adding:

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

... and restarting ManifoldCF.  Try viewing the connection in the UI;
you should see the connect string logged, as well as possibly a more
detailed response.

Thanks,
Karl

On Mon, Feb 27, 2012 at 11:12 AM, Matthew Parker
mpar...@apogeeintegration.com wrote:
 Sorry. I used the wrong character. It is configured for 21.16.18.145:1521


 On Mon, Feb 27, 2012 at 10:27 AM, Karl Wright daddy...@gmail.com wrote:

 So if the Database and Host field really is 21:16:18:145:1521, try
 21.16.18.145:1521 instead. ;-)

 Karl

 On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  type: JDBC
  Authority: None
  Database Type: ORACLE
  Database and Host: 21:16:18:145:1521
  Instance/Database: main
  User Name: 
  Password: X
 
 
  On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote:
 
  I haven't seen this one.  I'd love to know what the connect
  descriptor it refers to is.
 
  Can you tell me what the parameters all look like for the JDBC
  connection you are setting up?  Are you specifying, for instance, the
  port as part of the server name?
 
  Karl
 
  On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker
  mpar...@apogeeintegration.com wrote:
   Karl,
  
   That fixed the driver issue. I just updated my start.jar file by hand
   for
   now.
  
   The problem I have now is connecting to ORACLE. I can do it through
   NetBeans
   on my machine, but
   I cannot connect through ManifoldCF with the same settings. I get the
   following error:
  
   Error getting connection. Listener refused the connection with the
   following
   error.
  
   ORA-12514. TNS:Listener does not currently know of service requested
   in
   connect descriptor.
  
   This might be more of an ORACLE issue than Manifold issue, but I was
   wondering whether
   you've encountered the same thing during testing?
  
   Regards,
  
   Matt
  
   On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker
   mpar...@apogeeintegration.com wrote:
  
   Thanks Karl.
  
   On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com
   wrote:
  
   The problem has been fixed on trunk.  Basically, the instructions
   changed as did some of the build files.  It turned out to be
   extremely
   challenging to get JDBC drivers to run when they were loaded by
   anything other than the system classloader, so that's what I was
   forced to ensure.
  
   Thanks,
   Karl
  
  
   On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com
   wrote:
The ticket for this problem is CONNECTORS-390.
   
Karl
   
On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker
mpar...@apogeeintegration.com wrote:
Many thanks. I'll give that a try.
   
On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright
daddy...@gmail.com
wrote:
   
The problem is that the JDBC driver is using a pool driver that
is
in
common with the core of ManifoldCF.  So the connector-lib path,
which
only the connectors know about, won't do.  That's a bug which
I'll
create a ticket for.
   
A temporary fix, which is slightly involved, requires you to
put
the
ojdbc6.jar in the example/lib area, as you already tried, but
in
addition you will need to explicitly include the jar in your
classpath.  Normally the start.jar's manifest describes all the
jars
in the initial classpath.  I thought it was possible to also
include
additional classpath info through the normal --classpath
mechanism,
but that doesn't seem to work, so you may be stuck with
modifying
the
root build.xml file to add the jar to the manifest.
   
I'm going to experiment a bit and see if I can come up with
something
quickly.
   
Karl
   
   
On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright
daddy...@gmail.com
wrote:
 I was able to reproduce the problem.  I'll get back to you
 when
 I
 figure out what the issue is.
 Karl

 On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
 I've used the jar file in NetBeans to connect to the
 database
 without
 any
 issue.

 Seems more like a class loader issue.


 On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker
 mpar...@apogeeintegration.com wrote:

 I have the latest release from the Apache Manifold site
 (i.e.
 0.3-incubating).

 I 

Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR

2012-02-27 Thread Matthew Parker
Thanks for the insights Karl. I'll have to give this a little more thought.

On Mon, Feb 27, 2012 at 1:22 PM, Karl Wright daddy...@gmail.com wrote:

 If you've got a mix of data and only some of it comes through
 ManifoldCF, you can still use the ManifoldCF-generated URL for those
 that originate with ManifoldCF.  This should even work for documents
 from the JCIFS connector - even though the default urls from this
 connector are file: style, there's a mapping you can set up for
 documents from that connector that maps to a URL format of your
 choice.  Similarly, most JDBC document urls can readily be constructed
 as part of the database queries that you provide for the job.  So it
 does not sound like your servlet would have to do anything custom for
 any of the data that comes from ManifoldCF at this time, as long as
 you define your connections and jobs with some care as to the URLs
 they will produce.
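
 The idea of mapping a file: style URL onto a URL format of your choice can be illustrated with a plain regular expression. This is only a sketch of the idea: the class name, the search.example.com host, and the fetch servlet path are made up, and it does not show the JCIFS connector's actual mapping syntax.

```java
// Hypothetical illustration of rewriting a file: URL into a servlet URL.
class FileUrlMapSketch {
    static String toServletUrl(String fileUrl) {
        // Group 1 is the server name, group 2 is the share plus path.
        return fileUrl.replaceFirst(
            "^file:/+([^/]+)/(.*)$",
            "http://search.example.com/fetch?server=$1&path=/$2");
    }

    public static void main(String[] args) {
        System.out.println(toServletUrl("file://///fileserver/docs/report.pdf"));
    }
}
```

 If the mapping is done on the ManifoldCF side in this way, the URL that reaches Solr already points at the servlet, so the servlet needs no per-source logic.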

 Thanks,
 Karl


 On Mon, Feb 27, 2012 at 11:25 AM, Matthew Parker
 mpar...@apogeeintegration.com wrote:
  Karl,
 
  I'm importing data from a number of sources, including SharePoint, file
  shares, and an ORACLE database. The files/records are indexed by SOLR.
 
  Right now, some of the import is done through custom SOLR Data Import
  Handler facilities. I'm hoping to move away from that in the future.
 
  We are also aggregating some of the file share data into custom views on
  the web client. Lots of preprocessing.
 
  All of this is stored in the SOLR index with metadata describing how to
  display it within our custom web client. If the result is a certain type,
  we have custom templates that are displayed for it.
 
  Manifold is a good solution for the SharePoint data. We don't really do
  any custom processing on it other than stripping HTML from the text.
  It's the database and file share information that adds some challenges.
  I'm hoping to get SOLR out of the text processing pipeline, and just
  let it index data. We are moving to Pentaho at some point, and we'll
  probably handle most of the custom metadata processing there.
  At some point, we'll possibly integrate Pentaho as an output connection
  in Manifold.
 
  Thanks,
 
  Matt
 
  On Mon, Feb 27, 2012 at 10:04 AM, Karl Wright daddy...@gmail.com
 wrote:
 
  Please see my response interleaved below.
 
  On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker
  mpar...@apogeeintegration.com wrote:
   I'm trying to push data into SOLR.
  
   Is there a way to transform the metadata coming in from different data
   sources like SharePoint, and the File Share, prior to posting it into
   SOLR?
  
 
  In general, ManifoldCF does not have data transformation abilities.
  With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to
  extract content from documents and to perform transformations to
  document metadata etc.  It is possible that at some point it will be
  possible to do more transformations in ManifoldCF in order to support
  search engines that don't have a pipeline, but that is currently not
  available.
 
   For instance, documents have metadata specifying their file path. I
 need
   to
   transform that to a URL I can use within SOLR to retrieve that
 document
   through a servlet that I wrote.
  
 
  The ManifoldCF model is that a connector creates a URL for each
  document that it indexes, using whatever makes sense for that
  particular repository to get you back to the document in question.
  So, for instance, Documentum documents will use URLs that point at
  Documentum's Webtop web application.
 
  It would be helpful to understand more precisely what you are trying
  to do.  You could, for instance, modify your servlet to redirect to
  the ManifoldCF-generated URL.  It gets indexed into Solr as the id
  field.
 
   Also, based on specific metadata that I'm seeing in the documents, I
   might
   want to conditionally populate other fields in the SOLR index.
  
 
  That sounds like a job for the Tika pipeline to me.
 
  Thanks,
  Karl
 