Re: When running on MySQL initialize.sh causes access denied error
This sounds like a reasonable fix. Would you be so kind as to create a ticket, and attach your proposed change? Adding a special property for MySQL is also reasonable.

Karl

On Mon, May 21, 2012 at 5:34 AM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

Hi guys.

I suppose some people use multiple servers to create an MCF-MySQL environment. Well, I'm one of them, but I found that initialize.sh causes an access denied error if the DB server is separate from MCF's. Suppose the servers' IPs are as follows:

  MySQL server IP: A
  MCF server IP:   B

and properties.xml has the following parameters and values:

  <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
  <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
  <property name="org.apache.manifoldcf.dbsuperuserpassword" value="password"/>
  <property name="org.apache.manifoldcf.database.name" value="manifoldcf"/>
  <property name="org.apache.manifoldcf.mysql.server" value="A"/>

Then executing initialize.sh causes the following error:

  Caused by: java.sql.SQLException: Access denied for user 'manifoldcf'@'B' (using password: YES)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1073)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3609)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3541)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:943)
    at com.mysql.jdbc.MysqlIO.secureAuth411(MysqlIO.java:4113)
    at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1308)
    at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2336)
    at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2369)
    at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2153)
    at com.mysql.jdbc.ConnectionImpl.init(ConnectionImpl.java:792)
    at com.mysql.jdbc.JDBC4Connection.init(JDBC4Connection.java:47)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

The problem is that MCF asks MySQL to create the new manifoldcf user with localhost access. In this case, since MCF is on the server with IP B, MySQL should of course create the new user with IP address B. The following method is the one that creates the new MySQL user in the user table; modifying the part where the host information is added solves this problem:

  JAR NAME:    mcf-core.jar
  PACKAGE:     org.apache.manifoldcf.core.database
  CLASS NAME:  DBInterfaceMySQL
  METHOD NAME: public void createUserAndDatabase

  if (userName != null)
  {
    try
    {
      list.clear();
      list.add(userName);
      // list.add("localhost");
      list.add(IP_ADDRESS_B);
      list.add(password);
      ...
    }
    ...
  }

I guess it would be nice if properties.xml could take a new property holding the MCF server's IP, so that MySQL creates the manifoldcf user with that IP. What do you think?

Regards,

Shigeki
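For readers following along: the host part of a MySQL account ('user'@'host') controls which machine the account may connect from, which is exactly what the stack trace above is complaining about. Illustrative SQL only - the exact statement ManifoldCF issues may differ - but the effect of the proposed change is the difference between these two grants:

  -- What initialize.sh effectively sets up today: the manifoldcf user
  -- may only connect from the MySQL server itself.
  GRANT ALL PRIVILEGES ON manifoldcf.* TO 'manifoldcf'@'localhost' IDENTIFIED BY 'password';

  -- What a split-server setup needs: the manifoldcf user may connect
  -- from the MCF server's address (B, in the example above).
  GRANT ALL PRIVILEGES ON manifoldcf.* TO 'manifoldcf'@'B' IDENTIFIED BY 'password';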
Re: When running on MySQL initialize.sh causes access denied error
I checked a fix into trunk. The property is:

  org.apache.manifoldcf.mysql.client

... which defaults to localhost.

Karl

On Mon, May 21, 2012 at 9:17 PM, Shigeki Kobayashi shigeki.kobayas...@g.softbank.co.jp wrote:

OK, I posted a ticket as CONNECTORS-476. Thanks.

Shigeki

2012/5/21 Karl Wright daddy...@gmail.com

[...]
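On a split-server setup like the one described above, the new property would be set to the MCF server's address in properties.xml alongside the existing server entry (the value B is the example address from this thread):

  <property name="org.apache.manifoldcf.mysql.server" value="A"/>
  <property name="org.apache.manifoldcf.mysql.client" value="B"/>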
Re: Proposed first graduation step: Moving the repository
INFRA-4802.

Karl

On Thu, May 17, 2012 at 2:33 AM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

2012/5/17 Karl Wright daddy...@gmail.com

  Looks like I don't have permissions to do this. I suppose I would need to open an infra ticket?

Yes, I think so.

Tommaso

On Wed, May 16, 2012 at 10:25 PM, Karl Wright daddy...@gmail.com wrote:

Folks,

Heads up: Now that we've graduated, I'd like to move the repository from https://svn.apache.org/repos/asf/incubator/lcf to https://svn.apache.org/repos/asf/manifoldcf. This, of course, will mean that all workspaces will need to do an svn switch operation to change their path; "svn help switch" should give you sufficient hints as to how.

I'm planning to do the move tomorrow morning, as soon as Abe-san is done producing a 0.5.1 RC0 release candidate. Please object if you want me to hold off.

[I'm hoping, of course, that I now have the proper permissions to do this. We'll see shortly.]

Karl
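Since the repository stays on the same SVN server and only the path changes, a plain svn switch from inside an existing working copy should be all that is needed; a sketch, assuming a trunk checkout (the directory name is a placeholder):

  cd lcf-trunk
  svn switch https://svn.apache.org/repos/asf/manifoldcf/trunk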
Re: Proposed first graduation step: Moving the repository
I've found the documents on PMC chair responsibilities, and have started the process of setting up ManifoldCF as a TLP in the foundation documents. I've opened a second ticket (INFRA-4806) for creating a root site under /www/manifoldcf.apache.org.

But I've not found a checklist of all the tasks that need to be done to complete graduation, and Google is not helpful. Does anyone have a link I can use?

Karl

On Thu, May 17, 2012 at 6:31 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: Proposed first graduation step: Moving the repository
Since he's already done, I'm going to give this a try right now.

Karl

On Wed, May 16, 2012 at 10:25 PM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Shigeki,

There are dozens of individual kinds of error that the Web Connector detects and retries for; it would of course be possible to allow users to set parameters to control all of them, but it seems to me that would be almost too much freedom. And, as I said initially, one prime reason for the retry strategy of each error type is to avoid having ManifoldCF behave badly and get blocked by the webmaster of the site being crawled.

Having said that, if you have a case for changing the strategy for any particular kind of error, we can certainly look into that. In the case of connect exceptions, because there is a fairly long socket timeout when trying to connect (it's measured in minutes), and because attempting to connect ties up a worker thread for that whole time, you really don't want to retry too frequently. You could make the case for retrying over a longer period of time (say, 12 or 24 hours), or slightly more frequently (1 hour instead of 2 hours). If you have a case for doing that, please go ahead and create a ticket.

Thanks,
Karl

On Thu, May 10, 2012 at 10:09 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Karl,

  There should be a Scheduled value also listed which is *when* the URL will be retried

So, I see values in Scheduled and Retry Limit. The next re-crawl is two hours later and the final crawl is six hours later. That sounds like too much waiting. Are you planning a new feature that lets you change these waiting periods, or does such a thing already exist?

Thanks for sharing your knowledge.

Best regards,
Shigeki

2012/5/10 Karl Wright daddy...@gmail.com

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Waiting for Processing means that the URL will be retried. There should be a Scheduled value also listed, which is *when* the URL will be retried, and a Scheduled action column that says Process. If you see these things, you only need to wait until the time specified and the document will be recrawled.

Karl

On Wed, May 9, 2012 at 9:54 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Karl,

Thanks for the reply.

  For web crawling, no single URL failure will cause the job to abort;

OK, so I understand that if I want it stopped, I need to manually abort the job.

  You can check on the status of an individual URL by using the Document Status report.

The Document Status report says the seed URL is Waiting for Processing, which makes sense because the connection is refused. The report does not show a retry count. The MCF log outputs an exception. Is this also expected behavior?

  DEBUG 2012-05-10 10:10:48,215 (Worker thread '34') - WEB: Fetch exception for 'http://xxx.xxx.xxx/index.html'
  java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(Unknown Source)
    at org.apache.commons.httpclient.HttpConnection.open(Unknown Source)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(Unknown Source)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(Unknown Source)
    at org.apache.commons.httpclient.HttpClient.executeMethod(Unknown Source)
    at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection$ExecuteMethodThread.run(ThrottledFetcher.java:1244)
  WARN 2012-05-10 10:10:48,216 (Worker thread '34') - Pre-ingest service interruption reported for job 1335340623530 connection 'WEB': Timed out waiting for a connection for 'http://xxx.xxx.xxx/index.html': Connection refused

Regards,
Shigeki

2012/5/9 Karl Wright daddy...@gmail.com

[...]
Re: [ManifoldCF 0.5] The web crawler remains running after a network connection refused
Hi,

ManifoldCF's web connector is, in general, very cautious about not offending the owners of sites. If it concludes that the site has blocked access to a URL, it may remove the URL from its queue for politeness, which would prevent further crawling of that URL for the duration of the current job. Under most cases, however, if a URL is temporarily unavailable, it will be requeued for crawling at a later time. The typical pattern is to attempt to recrawl the URL periodically (e.g. every 5 minutes) for many hours before giving up on it.

For web crawling, no single URL failure will cause the job to abort; it will continue running until all the other URLs have been processed, or forever (if the job is continuous).

You can check on the status of an individual URL by using the Document Status report. This report should tell you what ManifoldCF intends to do with a specific document. If you locate one such URL and try out this report, what does it say?

Karl

On Tue, May 8, 2012 at 10:04 PM, 小林 茂樹(情報システム本部 / サービス企画部) shigeki.kobayas...@g.softbank.co.jp wrote:

Hi guys.

I need some advice on stopping the MCF web crawler from a running state when a network connection is refused. I use MCF 0.5 with Solr 3.5.

I was testing what would happen to the web crawler when shutting down the web site that is to be crawled. I checked the Simple History and saw "Connection refused" with a status code of "-1", which looked fine. But as I waited, the job status never changed and remained Running. The crawler never crawls in this situation, but when I brought the web site back up, the crawler never started crawling again either.

At the least, I somehow want the crawler to stop running when a network connection is refused, but I don't know how. Does anyone have any ideas?
Re: manifoldcf 0.5 from Windows Dev machine to Debian Server
Did you use Tomcat on Windows? There is a -D switch you need to use when starting Tomcat, which tells the ManifoldCF web applications where to find the properties.xml file. It may be that you'd need to modify the Tomcat startup (/etc/init.d/tomcat6) to set that property.

The other thing to note is that, unless you change something explicitly, under Debian Tomcat runs as the tomcat6 user. So your synch directory has to be both readable and writable by that user, as well as by the user that runs the agents process. Indeed, at MetaCarta we gave up and ran everything as the tomcat6 user - it seemed easier.

Karl

On Wed, May 9, 2012 at 6:14 AM, Marcus Kröller kroel...@igd-r.fraunhofer.de wrote:

Hello everybody,

for searching internal resources (MySQL DBs, wiki, filesystem, our own JDBC-based connector) we have created a ManifoldCF and Solr instance (0.5 and 3.4). Development happened on Windows (7) machines and everything is running as desired. Now I am facing the challenge of getting it all to run on a Debian server with Tomcat 6 and PostgreSQL 8.4 on the same machine. Configuration paths have been adjusted accordingly. The webapps and the agent start individually. We are using scripts calling the Java command API, as well as the servlet API via curl (easier for connection/job creation using JSON files), to initialize the DB and register the agent, connectors, connections, etc. (these have been translated to bash, including EOL character conversion).

The problem is: following the script, the servlet API does not respond, and the Crawler UI hangs on the empty template when requesting any of the lists (Connections, Jobs, etc.). When restoring a database dump from the Windows machine these lists are accessible, but starting the agents process leads to the same unresponsive behavior.

I imagine I am facing permission issues with the synch directory, but I was unable to find documentation or similar issues, and I would unfortunately not consider myself a Linux professional. Any input would be highly appreciated.

Regards and thank you,
Marcus Kröller
Student Research Assistant - Fraunhofer IGD Rostock
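A minimal sketch of the two changes Karl describes, assuming Debian's stock tomcat6 package and placeholder paths for properties.xml and the synch directory (org.apache.manifoldcf.configfile is the standard ManifoldCF define for locating properties.xml):

  # /etc/default/tomcat6 - the Debian init script reads JAVA_OPTS from here:
  JAVA_OPTS="$JAVA_OPTS -Dorg.apache.manifoldcf.configfile=/etc/manifoldcf/properties.xml"

  # Make the synch directory readable and writable by the tomcat6 user,
  # which also runs the agents process in the everything-as-tomcat6 setup:
  chown -R tomcat6:tomcat6 /var/lib/manifoldcf/synch
  chmod -R u+rwX /var/lib/manifoldcf/synch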
Re: JDBC Connection Exception
FWIW, the ticket is CONNECTORS-96. I've created a branch to work on it. I'll let you know when I think it's ready to try out.

Karl

On Mon, May 7, 2012 at 5:53 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: ManifoldCF 0.5 / SharePoint 2010 connector
Hi Prem,

In the future, questions like this should go to the connectors-user list, not my personal email.

If you search the users list you will find that a number of people have successfully used ManifoldCF to crawl SharePoint recently. You can see this yourself by searching the archive here: http://incubator.apache.org/connectors/en_US/mail.html . I do not remember what version they are using, but we have made no functional changes to the SharePoint connector between version 0.4 and 0.5; internationalization of the UI was the only change that was done. The users include at least one other who is crawling secure governmental systems.

If you would like assistance in diagnosing your particular problems, please provide some details as to the exact problems you are having. Are you able to establish a working connection to SharePoint? What version of SharePoint are you trying to connect to? If version 3 or above, did you deploy the ManifoldCF user permissions web service?

Thanks,
Karl

On Tue, May 8, 2012 at 8:58 AM, prem bangle prem...@gmail.com wrote:

Hi Karl,

We are unable to successfully crawl a SharePoint 2010 repository using ManifoldCF ver 0.5. Do you have feedback from others successfully crawling SharePoint 2010? Your opinion on this will help us go forward. Google searches and issues recorded in the Apache Jira did not help us come to a conclusion. We are evaluating ManifoldCF in the context of one of the Dept. of Homeland Security (DHS) programs. Any feedback from you is much appreciated.

thanks
Prem
Re: ManifoldCF 0.5 / SharePoint 2010 connector
Hi Daniel,

Here's the story. The SharePoint connector works using SharePoint web services, and has been explicitly tested against both SharePoint 2003 and SharePoint 2007. It has not been explicitly tested against SharePoint 2010, because none of the developers have a working SharePoint 2010 instance to test against. The MetaCarta Permissions web service was designed to provide access to folder and file permissions, which appeared in SharePoint 2007 and are no doubt also present in SharePoint 2010. So, I would expect the following for SharePoint 2010:

- The basic web services should continue to work as they did in SharePoint 2007. If you can connect and get "Connection working" it basically confirms this picture.
- The MetaCarta Permissions service is a greater risk. It may not work, because it is compiled against SharePoint.dll from SharePoint 2007, not SharePoint 2010. It's technically still required, so if it *doesn't* work we're going to need to make some changes to support SharePoint 2010.

So I'd suggest that you try the following, in order:

(1) First, try connecting to SharePoint 2010, specifying SharePoint 2.0 in the connection parameters. Do not try deploying the MetaCarta Permissions service for this test. If you can connect, and crawl, then we're in pretty good shape.

(2) If (1) works, then try deploying the MC Permissions service on the SharePoint 2010 server. If it deploys correctly, then try connecting to it by specifying a SharePoint 3.0 connection. If you get back "Connection working" from that, then it is functioning, and everything should be working.

Please let me know exactly how far you get in this process, and what errors you see both in manifoldcf.log and for the connection status.

Thanks!
Karl

On Tue, May 8, 2012 at 10:06 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote:

Hi Karl,

When upgrading to SharePoint 2010, will we still need to install the MetaCarta Permissions web service on the SharePoint instance?

Thanks,
Dan Silvia

From: Karl Wright [daddy...@gmail.com]
Sent: Tuesday, May 08, 2012 9:06 AM
To: prem bangle; connectors-user@incubator.apache.org
Subject: Re: ManifoldCF 0.5 / SharePoint 2010 connector

[...]
Re: ManifoldCF 0.5 / SharePoint 2010 connector
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Target service: StsAdapterSoap
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Enter: SOAPPart::getAsSOAPEnvelope()
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(currForm)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - current form is FORM_SOAPENVELOPE
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(addHeader00)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Adding header to message...
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(addHeader00)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Adding header to message...
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - MessageContext: setTargetService(http://schemas.microsoft.com/sharepoint/dsp/queryRequest)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - org.apache.axis.i18n.resource::handleGetObject(noService10)
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - Exception:
org.apache.axis.ConfigurationException: No service named http://schemas.microsoft.com/sharepoint/dsp/queryRequest is available
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper$ResourceProvider.getService(SPSProxyHelper.java:2208)
  at org.apache.axis.AxisEngine.getService(AxisEngine.java:311)
  at org.apache.axis.MessageContext.setTargetService(MessageContext.java:756)
  at org.apache.axis.transport.http.HTTPTransport.setupMessageContextImpl(HTTPTransport.java:89)
  at org.apache.axis.client.Transport.setupMessageContext(Transport.java:46)
  at org.apache.axis.client.Call.invoke(Call.java:2738)
  at org.apache.axis.client.Call.invoke(Call.java:2443)
  at org.apache.axis.client.Call.invoke(Call.java:2366)
  at org.apache.axis.client.Call.invoke(Call.java:1812)
  at com.microsoft.schemas.sharepoint.dsp.StsAdapterSoapStub.query(StsAdapterSoapStub.java:317)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SPSProxyHelper.getDocuments(SPSProxyHelper.java:540)
  at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:906)
  at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)
  [the same stack frames repeat in the log output]
DEBUG 2012-05-07 18:45:21,429 (Worker thread '30') - MessageContext: setServiceHandler(null)

On Tue, May 8, 2012 at 11:14 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: JDBC Connection Exception
What database are you using? (Not the JDBC database, the underlying one...) If PostgreSQL, what version? What version of ManifoldCF? If you could also post some of the long-running queries, that would be good as well.

Depending on the database, ManifoldCF periodically re-analyzes/reindexes the underlying database during the crawl, which when the table is large can cause some warnings about long-running queries, because during the reindex process database performance is slowed. That's not usually a problem, other than briefly slowing the crawl. However, it's also possible that there's a point where PostgreSQL's plan is poor, and we should see that because the warning also dumps the plan.

Truncating the jobqueue table is not recommended, since then ManifoldCF has no idea of what it has crawled and what it hasn't, and its incremental properties tend to suffer.

Karl

On Mon, May 7, 2012 at 1:25 AM, Michael Le michael.aaron...@gmail.com wrote:

Hello,

Using a JDBC repository connection to an Oracle 11g database, I've had issues where, in the initial seeding stage, the connection to the database is closed in the middle of processing the result set. The original data table I'm trying to index is about 10 million records, and with the original code I could never get past about 750K records. I spent some time with the pooling parameters of the bitmechanic database pooling library, but its API and source don't seem to be available any more; even the original author doesn't have the code or specs. The parameter modifications to the pool allowed me to get through the first stage of processing a 2M-row subset, but during the second stage, where it's trying to obtain the documents, the connections again started being closed. I ended up just replacing the connection pool code with an Oracle implementation, and it's churning through the documents happily. As a footnote, on my sample subset of about 400K documents, the throughput went from about 10 documents/s to 19 docs/s, but this may just be a side effect of Oracle database load or network traffic. Has anyone else had issues processing a large Oracle repository? I've noted the benchmarks were done with 300K documents, and even in our initial testing with about 500K documents, no issues arose.

The second and more pressing issue is the jobqueue table. In the process of debugging the database connection issues, jobs were started, stopped, deleted, and aborted, and various WHERE clauses were applied to the seeding queries/jobs. MCF is now reporting that there are long-running queries against this table. In the past, I've just truncated the jobqueue table, but this had the side effect of stuffing a document into Solr (the output connector) multiple times. What API calls, or SQL, can I run to clean up the jobqueue table? Should I just wait for all jobs to finish and then truncate the table? I've broken my data into several smaller subsets of around 1-2 million rows, but that has the side effect of a jobqueue table that is 6-8 million rows.

Any support would be greatly appreciated.

Thanks,
-Michael Le
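As an illustration only - this is not Michael's actual code, and the connect string and credentials are placeholders - a pool replacement along these lines can lean on the connection cache built into the ojdbc driver's OracleDataSource:

  import java.sql.Connection;
  import java.util.Properties;
  import oracle.jdbc.pool.OracleDataSource;

  public class OraclePoolSketch
  {
    public static Connection openPooledConnection()
      throws Exception
    {
      // OracleDataSource ships with the ojdbc driver; in the 11g era it
      // offered an implicit connection cache, i.e. a built-in pool.
      OracleDataSource ods = new OracleDataSource();
      ods.setURL("jdbc:oracle:thin:@dbhost:1521:orcl"); // placeholder
      ods.setUser("mcfuser");                           // placeholder
      ods.setPassword("secret");                        // placeholder
      ods.setConnectionCachingEnabled(true);

      // Bound the pool so long crawls cannot exhaust server connections.
      Properties cacheProps = new Properties();
      cacheProps.setProperty("MinLimit", "2");
      cacheProps.setProperty("MaxLimit", "10");
      ods.setConnectionCacheProperties(cacheProps);

      // Connections come from the cache; close() returns them to it.
      return ods.getConnection();
    }
  }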
Re: JDBC Connection Exception
Also, there has been a long-running ticket to replace the JDBC pool driver with something more modern for a while. Many of the off-the-shelf pool drivers are inadequate for various reasons, so I have one that I wrote myself, but it is not yet committed. So I am curious - which connections are timing out? The Oracle connections or the PostgreSQL ones?

Karl

On Mon, May 7, 2012 at 5:34 AM, Karl Wright daddy...@gmail.com wrote:

[...]
Re: Adding more exclusions and document deletion
If the job is run in continuous mode, you would have to wait until the document went away. But if you are using a job meant to be run to the end, then it should pick up changes to the spec on each run.

Karl

On Mon, May 7, 2012 at 7:11 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

How sophisticated is MCF when it comes to document deletion?

I previously entered a lot of URLs into the "exclude from crawl" list in order to exclude them from my web crawl. Now, a couple of months later, I want to exclude a bunch of other URLs as well, since these are now handled by our CMS instead. Will MCF delete these new URLs/documents from our Solr server at the next job run, or will they only be deleted once they have become outdated?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Ingestion API socket timeout exception waiting for response code
Thanks for the update!

Karl

On Mon, May 7, 2012 at 7:15 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

Document deletion works perfectly after I reinstalled the SSL certificate and re-entered the username and password for our Solr server. So I think this issue has been solved.

Erlend

On 27.04.12 12.11, Erlend Garåsen wrote:

Many thanks for your suggestions and help, Karl. Using a filesystem crawl was actually a good idea for debugging/testing. Installing a new version of Solr is not that easy on our test server, for many reasons, mainly because it is under the control of another division dealing with servers at the university, even though I can get root access.

Anyway, according to the logs on our Solr 3.2 server, it seems that MCF successfully managed to delete one test document I removed:

  [2012-04-27 11:18:33.092] {delete=[file:/tmp/mcf/docs/app_lasso.pdf]} 0 7
  [2012-04-27 11:18:33.092] [] webapp=/solr path=/update params={} status=0 QTime=7

The result code is 200 according to Simple History in MCF. I entered the passwords for the Solr servers into the Solr output configuration once again, and deleted and re-uploaded our SSL certificate, before I did the filesystem test. I should have performed the tests prior to the password updates. The crawl will start again later today at 6 pm on our production server, so I will try to figure out later whether we still have problems. I'm going to Scotland later this evening for some days without my laptop, so I cannot check the status of my crawl before I'm back, but I'll let my colleague watch the logs.

Erlend

On 26.04.12 21.14, Karl Wright wrote:

Hi Erlend,

I had some time today and was able to verify that everything worked fine against what I have currently on my laptop, which is Solr 3.2. The second job run looks like this:

  04-26-2012 15:11:44.154  job end                   1335467343879(test)                0  1
  04-26-2012 15:11:34.159  document deletion (solr)  file:/C:/testcrawl/there.txt  200  0  117
  04-26-2012 15:11:24.690  read document             C:\testcrawl                  OK   0  1
  04-26-2012 15:11:24.494  job start                 1335467343879(test)                0  1

So it appears that either something changed in Solr, or SSL support is broken, or your network is not permitting a valid HTTP response for some reason.

Karl

On Thu, Apr 26, 2012 at 11:10 AM, Karl Wright daddy...@gmail.com wrote:

Hi Erlend,

Can you try the following:

(1) Make a fresh Solr checkout of 3.6, or whatever Solr version you are using, and build it
(2) Start it
(3) Run a simple filesystem crawl using a Solr connection that is created with the default values
(4) Delete a file in your filesystem that was crawled
(5) Crawl again

Does the deletion happen OK? AFAIK, nothing has changed in the Solr connector that should affect the ability to delete. This test will confirm that it is still working.

Thanks,
Karl

On Thu, Apr 26, 2012 at 10:19 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote:

It seems that MCF cannot delete documents from Solr. A timeout occurs, and the job stops after a while. This is what I can see from the log:

  WARN 2012-04-20 18:24:30,373 (Worker thread '16') - Service interruption reported for job 1327930125433 connection 'Web crawler': Ingestion API socket timeout exception waiting for response code: Read timed out; ingestion will be retried again later

If I take a further look in Simple History, it seems that this error is related to document deletion. I have tried to delete the document manually by using curl from the same server MCF is installed on, in case we have some access restrictions, and curl succeeded. We do not have any problems with adding; the timeout only occurs while deleting documents.

I have checked our Solr configuration. MCF does use the correct path for document deletion, i.e. /update. The correct realm, username and password for our Solr server are entered correctly, and the SSL certificate is valid as well.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
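For reference, the manual deletion Erlend mentions can be issued against Solr 3.x's /update handler with curl; a sketch with a placeholder host and the document id taken from the log lines above:

  curl "http://localhost:8983/solr/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "<delete><id>file:/tmp/mcf/docs/app_lasso.pdf</id></delete>"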
Re: Can we have location indexed as a field into solr through ManifoldCF
You want to look at the end-user documentation where it describes the Metadata tab for the Windows Share connector.

Karl

On Thu, May 3, 2012 at 4:14 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote:

Hi,

I am using ManifoldCF to crawl a Windows Share repository and index documents from it into Solr. I have a requirement where I want the location of the document to be indexed as a field in Solr. Can we achieve this from ManifoldCF?

For example, when I define a Windows Share repository connection with server1, and a job with path name path1 (present on server1), I want all the documents indexed into Solr with this job to have a field which tells me the location, which is \\server1\path1. Hope I am clear in what I want.

Can we achieve this by defining some parameters at the ManifoldCF end, or is there any other possible way? Can someone please let me know of any ideas to get this done?

Thanks and Regards,
Swapna.
Re: MCF Crawler UI doesn't load
You need a patch. See CONNECTORS-467.

Karl

On Tue, May 1, 2012 at 12:25 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote:

Hi,

Until recently I had been using the ManifoldCF 0.4 version with Tomcat 7.0, and it worked perfectly fine. I am trying to switch to the ManifoldCF 0.5 version; I have built it from source and configured everything as I did for the earlier version. But I am not able to browse to the page http://localhost:8080/mcf-crawler-ui. It says:

  HTTP Status 404 - /mcf-crawler-ui
  type: Status report
  message: /mcf-crawler-ui
  description: The requested resource (/mcf-crawler-ui) is not available.
  Apache Tomcat/7.0.22

And I am able to use the other services, like http://localhost:8080/mcf-api-service/json/outputconnectors, which gives me the JSON object listing all connectors.

Can someone please help me out in resolving this issue?

Thanks and Regards,
Swapna.
Re: Output Connector for SearchBlox
The right guy to look at this is on Easter vacation at the moment. I'm sure he will respond when he is back.

Thanks,
Karl

On Sat, Apr 7, 2012 at 5:36 PM, Timo Selvaraj tselva...@searchblox.com wrote:

Hello,

Is anyone available (on a paid basis) to create an output connector for SearchBlox? I am interested in creating an output connector for SearchBlox (through the REST API, http://www.searchblox.com/developers/api ) for contribution to the ManifoldCF project. Please message me directly: tselva...@searchblox.com

Thanks,
--
Timo Selvaraj
SearchBlox Software, Inc.
http://www.searchblox.com/
Re: Running 2 jobs to update same document Index but different fields
I did not see that you tried creating a filesystem connection and job. Did you do that, and did it work for you without sending a deletion? If not, please go back to using the ManifoldCF id field and try that first.

Here is the patch I'd like you to apply:

  ===
  --- framework/agents/src/main/java/org/apache/manifoldcf/agents/incrementalingest/IncrementalIngester.java (revision 1307149)
  +++ framework/agents/src/main/java/org/apache/manifoldcf/agents/incrementalingest/IncrementalIngester.java (working copy)
  @@ -697,6 +697,8 @@
     {
       IOutputConnection connection = connectionManager.load(outputConnectionName);
  
  +    Logging.ingest.error("Deleting documents!", new Exception("Deletion stack trace"));
  +
       if (Logging.ingest.isDebugEnabled())
       {
         int i = 0;

Then, rebuild ManifoldCF. Every document that is deleted from the index will generate a trace in the log. Run your crawl and send me one of those traces.

Karl

On Fri, Mar 30, 2012 at 6:06 AM, Anupam Bhattacharya anupam...@gmail.com wrote:

I checked the ManifoldCF logs and there were no exceptions. Additionally, I changed the id (uniqueKey) in SOLR to the Documentum-specific unique id, i.e. r_object_id, and ran the job. This time I could easily create the indexes. For (4), please provide the places for which I need to enable logging.

On Thu, Mar 29, 2012 at 6:56 PM, Karl Wright daddy...@gmail.com wrote:

  But as per my observation the deletion happens only when uniqueKey in SOLR schema is set to id.

The SOLR setup cannot influence the flow in ManifoldCF unless it causes SOLR to reject the ManifoldCF requests. So I suspect that the delete request is happening in both cases, and it is not getting acted upon by SOLR in the case where uniqueKey is not set to id. That's because the delete request from ManifoldCF will be for a key that Solr doesn't recognize as such. Please do try recommendations (3) and (4).

Karl
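For anyone else following the thread: applying such a patch from the source tree root and rebuilding would look roughly like this (the patch file name and checkout directory are placeholders; "ant build" is the usual ManifoldCF build invocation):

  cd manifoldcf-trunk
  patch -p0 < deletion-trace.patch
  ant build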
Re: Running 2 jobs to update same document Index but different fields
The REJECTED result is because the document has the wrong mime type, or is too long according to your length restriction.

Do you have just one job, or do you still have two? If you have two jobs covering the same overall documents with different document criteria, this is the kind of thing that happens when you run one job after the other: the documents belonging to the first job get removed by the second. You will only need one job if you try the plan I was talking about, but it should include the PDFs as well as the XML documents. If you only have one job, then I can't explain it, unless you changed the document criteria and ran the job a second time.

Karl

On Thu, Mar 29, 2012 at 3:39 AM, Anupam Bhattacharya anupam...@gmail.com wrote:

Okay. I tried to use the id which is formed by the ManifoldCF Documentum connector. I ran the job, and I could see from the SOLR admin screen that documents were getting indexed in between. But just after the end of the job, I see all my created indexes get deleted. A snippet from Simple History is given below. Why does this document deletion activity get added, deleting all my created indexes, when I keep the unique id as id in the schema.xml file of SOLR?

  Start Time               Activity                          Identifier                                                                                           Result Code  Bytes  Time
  03-29-2012 13:00:26.837  document deletion (Solr_TEST_QA)  http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENTobjectId=09d905e78000676d  200          0      110
  03-29-2012 12:55:37.869  fetch                             09d905e78000676d                                                                                     REJECTED     86823  4184
  03-29-2012 12:55:34.934  document ingest (Solr_TEST_QA)    http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENTobjectId=09d905e78000676d  200          8158   235

On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright daddy...@gmail.com wrote:

  So do you find this design appropriate and feasible?

It sounds like you are still trying to merge records in Solr, but this time using Solr Cell to somehow do this. Since Solr Cell is a pipeline, I don't think you will find it easy to keep data from one job aligned with data from another. That's why I suggested just allowing both kinds of documents to be indexed as-is, and just making sure that you include a metadata reference to the main document in each.

Karl

On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya anupam...@gmail.com wrote:

The second option seems to be more useful, as it will allow me to add any business logic. So, similar to Solr Cell (/update/extract), my new RequestHandler will be added in solrconfig.xml, and it will do all the manipulations. Later, I need to get all field values into a temp variable by first searching by id in the Lucene indexes, and then add those values into the incoming new field-values list. So do you find this design appropriate and feasible?

Anupam

On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright daddy...@gmail.com wrote:

Thanks - now I understand what you are trying to do more clearly.

The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead.

First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML-document field which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far?

The only other piece you might need is manipulation of the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed.

Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote:

I would have been happy if I had to index PDF and XML separately. But for my use case, the XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it holds a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if any search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x version SOLR ...
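To make Karl's schema suggestion concrete: only the idea (a stored string field holding the parent XML document's URL, or the document's own URL for XML documents) comes from the thread; the field name below is invented for illustration. In schema.xml this would be something like:

  <!-- Every document (PDF or XML) carries the URL of the bibliographic
       XML record that should be shown as the search result. -->
  <field name="parent_xml_url" type="string" indexed="true" stored="true"/>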
Re: Running 2 jobs to update same document Index but different fields
Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr. But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF indexing job, which worked perfectly well for a particular id. Later, when I ran the 2nd XML indexing job for the same id, I lost all fields indexed by the 1st job and was left with only the field indexes created by this 2nd job. I thought that it would combine field values for a specified doc id. The Lucene developers mention that by design Lucene doesn't support this. Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837 -Anupam

On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright daddy...@gmail.com wrote: The Solr handler that you are using should not matter here. Can you look at the Simple History report, and do the following:
- Look for a document that is being indexed in both PDF and XML.
- Find the ingestion activity for that document for both PDF and XML.
- Compare the IDs (which for the ingestion activity are the URLs of the documents in Webtop).
If the URLs are in fact different, then you should be able to make this work. You need to look at how you configured your Solr instance, and which fields you are specifying in your Solr output connection. You want those Webtop urls to be indexed as the unique document identifier in Solr, not some other ID. Thanks, Karl

On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya anupam...@gmail.com wrote: Today I ran the 2 jobs one by one, but it seems that since we are using the /update/extract request handler, the field values for a common id get overridden by the latest job. I want to update certain fields in the lucene indexes for the doc, rather than completely updating with new values and losing the other field value entries.

On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright daddy...@gmail.com wrote: For Documentum, content length is in bytes, I believe. It does not set the length, it filters out all documents greater than the specified length. Leaving the field blank will perform no filtering. Document types in Documentum are specified by mime type, so you'd want to select all that apply. The actual one used will depend on how your particular instance of Documentum is configured, but if you pick them all you should have no problem. Karl

On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Thanks!! It seems from your explanation that I can update the same document's other field values. I inquired about this because I have two different documents with a parent-child relationship which need to be indexed as one document in the lucene index. As you must have understood by now, I am trying to do this for Documentum CMS. I have seen the configuration screen for setting the Content length and, second, for filtering document type. So my question is: what unit does the Content length accept (bits, bytes, KB, MB, etc.), and does this configuration set the length for a document's full-text indexing? Additionally, to scan only one kind of document, e.g. PDF, what should be added to filter those documents? Is it application/pdf OR PDF?
Regards Anupam

On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright daddy...@gmail.com wrote: The document key in Solr is the url of the document, as constructed by the connector you are using. If you are using the same document to construct two different Solr documents, ManifoldCF by definition cannot be aware of this. But if these are different files from the point of view of ManifoldCF, they will have different URLs and be treated differently. The jobs can overlap in this case with no difficulty. Karl

On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I want to configure two jobs to index in SOLR using ManifoldCF, using the /update/extract requestHandler: the 1st to synchronize only the XML files, the 2nd to synchronize the PDF files. If both these documents share a unique id, can I combine the indexes for both in one SOLR schema without overriding the details added by the previous job? Suppose:
xmldoc indexes: field0(id), field1, field2, field3
pdfdoc indexes: field0(id), field4, field5, field6
Output docindex == (xml+pdf doc): field0(id), field1, field2, field3, field4, field5, field6
Regards Anupam -- Thanks Regards Anupam Bhattacharya
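To illustrate the point about the document key, here is a minimal schema.xml sketch, assuming the Solr output connection is configured to map the ManifoldCF-generated document URL into a field named "id" (the field name is an assumption, not something stated in this thread):

    <!-- schema.xml: the ManifoldCF document URL acts as the unique key -->
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <uniqueKey>id</uniqueKey>

Because the XML and PDF documents have different Webtop URLs, they keep distinct keys, and neither job overwrites the other's documents.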
Re: Running 2 jobs to update same document Index but different fields
Thanks - now I understand what you are trying to do more clearly. The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead. First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML document field, which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far? The only other piece you might need is manipulation of either the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed. Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I would have been happy if I had to index PDF and XML separately. But for my use-case, XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it contains a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if a search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x the SOLR-Join plugin is introduced (http://wiki.apache.org/solr/Join), but it works like an inner query. Again, the main requirement is that the PDF should be searchable, but its master details from the XML should be presented in order to request the actual PDF. -Anupam

On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright daddy...@gmail.com wrote: This doesn't sound like a problem a connector can solve. The problem sounds like severe misuse of Solr/Lucene to me. You are using the wrong document key, and Lucene does not let you modify a document index once it is created; no matter what you do to ManifoldCF, it can't get around that restriction. So it sounds like you need to fundamentally rethink your design. If all you want to do is index XML and PDF as separate documents, just change your Solr output connection specification to change the selected id field appropriately. Then, BOTH documents will be indexed by Solr, each with different metadata as you originally specified. I'm frankly having a really hard time seeing why this is so hard. Karl

On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Should I write a new Documentum Connector for our specific use-case to go forward? I guess your book will be helpful to understand the connector framework in ManifoldCF.

On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright daddy...@gmail.com wrote: Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr.
But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF indexing job, which worked perfectly well for a particular id. Later, when I ran the 2nd XML indexing job for the same id, I lost all fields indexed by the 1st job and was left with only the field indexes created by this 2nd job. I thought that it would combine field values for a specified doc id. The Lucene developers mention that by design Lucene doesn't support this. Please see the following URL: https://issues.apache.org/jira/browse/LUCENE-3837 -Anupam

On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright daddy...@gmail.com wrote: The Solr handler that you are using should not matter here. Can you look at the Simple History report, and do the following:
- Look for a document that is being indexed in both PDF and XML.
- Find the ingestion activity for that document for both PDF and XML.
- Compare the IDs (which for the ingestion activity are the URLs of the documents in Webtop).
If the URLs are in fact different, then you should be able to make this work. You need to look at how you configured your Solr instance, and which fields you are specifying in your Solr output connection
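To make the suggested schema addition concrete, here is a sketch of the extra field; the field name "parent_xml_url" is hypothetical, not from the thread:

    <!-- schema.xml: URL of the parent XML document; for XML documents, the document's own URL -->
    <field name="parent_xml_url" type="string" indexed="true" stored="true"/>

Each PDF would then carry its Documentum metadata reference mapped into this field, so a hit on PDF content can be presented to the user via the parent XML record.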
Re: Running 2 jobs to update same document Index but different fields
So do you find this design appropriate and feasible? It sounds like you are still trying to merge records in Solr, but this time using Solr Cell to somehow do this. Since SolrCell is a pipeline, I don't think you will find it easy to keep data from one job aligned with data from another. That's why I suggested just allowing both kinds of documents to be indexed as-is, and just making sure that you include a metadata reference to the main document in each. Karl

On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya anupam...@gmail.com wrote: The second option seems to be more useful, as it will allow me to add any business logic. So, similar to SOLR Cell (/update/extract), my new RequestHandler will be added in solrconfig.xml, which will do all the manipulations. Later, I need to get all field values into a temp variable by first searching by id in the lucene indexes and then adding these values into the incoming new field values list. So do you find this design appropriate and feasible? Anupam

On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright daddy...@gmail.com wrote: Thanks - now I understand what you are trying to do more clearly. The Documentum connector is going to pick up the XML document and the PDF document as separate entities. Thus, they'd also be indexed in Solr separately. So if we use that as a starting point, let's see where it might lead. First, you'd want each PDF document to have metadata that refers back to the XML parent document. I'm not sure how easy it is to set up such a metadata reference in Documentum, but I vaguely recall there was indeed some such field. So let's presume you can get that. Then, you'd want to make sure your Solr schema included an XML document field, which had the URL of the parent XML document (or, for XML documents, the document's own URL) as content. That would be the ID you'd use to present a result item to a user. Does this sound reasonable so far? The only other piece you might need is manipulation of either the PDF's metadata, or the XML document's metadata, or both. For that, I'd use Solr Cell to perform whatever mappings and manipulations made sense before the documents actually get indexed. Karl

On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I would have been happy if I had to index PDF and XML separately. But for my use-case, XML is the main document, containing bibliographic information (which needs to be presented as the search result), and it contains a reference to a child/supporting document, which is the actual PDF file. I need to index the PDF text, and if a search matches the PDF content, then the parent XML's bibliographic information needs to be presented. I am trying to call the SOLR search engine with one single query to show the corresponding XML detail for a search term present in the PDF. I checked that from SOLR 4.x the SOLR-Join plugin is introduced (http://wiki.apache.org/solr/Join), but it works like an inner query. Again, the main requirement is that the PDF should be searchable, but its master details from the XML should be presented in order to request the actual PDF. -Anupam

On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright daddy...@gmail.com wrote: This doesn't sound like a problem a connector can solve. The problem sounds like severe misuse of Solr/Lucene to me. You are using the wrong document key, and Lucene does not let you modify a document index once it is created; no matter what you do to ManifoldCF, it can't get around that restriction. So it sounds like you need to fundamentally rethink your design.
If all you want to do is index XML and PDF as separate documents, just change your Solr output connection specification to change the selected id field appropriately. Then, BOTH documents will be indexed by Solr, each with different metadata as you originally specified. I'm frankly having a really hard time seeing why this is so hard. Karl

On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Should I write a new Documentum Connector for our specific use-case to go forward? I guess your book will be helpful to understand the connector framework in ManifoldCF.

On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright daddy...@gmail.com wrote: Right, LUCENE never did allow you to modify a document's indexes, only replace them. What I'm trying to tell you is that there is no reason to have the same document ID for both documents. ManifoldCF will support treating the XML document and PDF document as different documents in Solr. But if you want them to in fact be the same document, just combined in some way, neither ManifoldCF nor Lucene will support that at this time. Karl

On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya anupam...@gmail.com wrote: I saw the index getting created by the 1st PDF
Re: Running 2 jobs to update same document Index but different fields
For Documentum, content length is in bytes, I believe. It does not set the length, it filters out all documents greater than the specified length. Leaving the field blank will perform no filtering. Document types in Documentum are specified by mime type, so you'd want to select all that apply. The actual one used will depend on how your particular instance of Documentum is configured, but if you pick them all you should have no problem. Karl

On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Thanks!! It seems from your explanation that I can update the same document's other field values. I inquired about this because I have two different documents with a parent-child relationship which need to be indexed as one document in the lucene index. As you must have understood by now, I am trying to do this for Documentum CMS. I have seen the configuration screen for setting the Content length and, second, for filtering document type. So my question is: what unit does the Content length accept (bits, bytes, KB, MB, etc.), and does this configuration set the length for a document's full-text indexing? Additionally, to scan only one kind of document, e.g. PDF, what should be added to filter those documents? Is it application/pdf OR PDF? Regards Anupam

On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright daddy...@gmail.com wrote: The document key in Solr is the url of the document, as constructed by the connector you are using. If you are using the same document to construct two different Solr documents, ManifoldCF by definition cannot be aware of this. But if these are different files from the point of view of ManifoldCF, they will have different URLs and be treated differently. The jobs can overlap in this case with no difficulty. Karl

On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya anupam...@gmail.com wrote: I want to configure two jobs to index in SOLR using ManifoldCF, using the /update/extract requestHandler: the 1st to synchronize only the XML files, the 2nd to synchronize the PDF files. If both these documents share a unique id, can I combine the indexes for both in one SOLR schema without overriding the details added by the previous job? Suppose:
xmldoc indexes: field0(id), field1, field2, field3
pdfdoc indexes: field0(id), field4, field5, field6
Output docindex == (xml+pdf doc): field0(id), field1, field2, field3, field4, field5, field6
Regards Anupam
RE: SmbException
Hi, The problem you are seeing is a server-side error of some kind. The jcifs connector will retry documents that fail to fetch properly after some period of time, usually five minutes. Warning messages are nevertheless recorded in the log for every retry. So, unless the job aborts, or ceases to make progress, everything may actually be OK. If the job does abort or stops moving forward, it means that certain documents are having consistent errors. You can find an example of such a document easily enough by reviewing the Simple History. Then, see if you can reach the doc via Windows, or whether you get errors there also. Please let us know what you find. Karl Sent from my Windows Phone

From: takagig Sent: 3/17/2012 1:48 AM To: connectors-user@incubator.apache.org Subject: SmbException

Hi, everyone. I am using ManifoldCF (from a trunk build, 0.5?) to crawl a Windows share folder for our application. When I run ManifoldCF, I sometimes get an SmbException. SmbExceptions occur more often once the crawl goes past about 70,000 files. Please suggest a method of solving this problem.

1) My Environment
- Windows 2003 R2 SE SP2
- ManifoldCF (from trunk | 2012-03-01)

2) Case
crawling target = 100,000 files. max complete files = 79503 files.

3) ManifoldCF Status (showjobstatus.jsp)
Error: SmbException thrown: No process is on the other end of the pipe.

4) Log
++
WARN 2012-03-16 17:58:55,731 (Worker thread '8') - JCIFS: Possibly transient exception detected on attempt 1 while getting share security: 0x
jcifs.dcerpc.DcerpcException: 0x
at jcifs.dcerpc.DcerpcBind.getResult(DcerpcBind.java:40)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:249)
at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2342)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1003)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:546)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
WARN 2012-03-16 17:58:55,731 (Worker thread '8') - JCIFS: Possibly transient exception detected on attempt 2 while getting share security: No process is on the other end of the pipe.
jcifs.smb.SmbException: No process is on the other end of the pipe.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:563)
at jcifs.smb.SmbTransport.send(SmbTransport.java:663)
at jcifs.smb.SmbSession.send(SmbSession.java:238)
at jcifs.smb.SmbTree.send(SmbTree.java:119)
at jcifs.smb.SmbFile.send(SmbFile.java:775)
at jcifs.smb.SmbFile.open0(SmbFile.java:989)
at jcifs.smb.SmbFile.open(SmbFile.java:1006)
at jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
at jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
at jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
at jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2943)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2342)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.describeDocumentSecurity(SharedDriveConnector.java:1003)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:546)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
++
What kind of other information should I provide?
Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions
Hi Shigeki, A service interruption means that a connector (either a repository connector like the web connector, or an output connector like the Solr connector) could not communicate with the configured service. Repeated service interruptions means that certain URLs failed to fetch properly even after a pattern of retries which lasted many hours. ManifoldCF connectors deal with such errors in one of several ways, depending on the exact details of the error:
- ignore it and proceed
- retry periodically for some time interval, and then give up and proceed
- retry periodically for some time interval, and then shut down the job
It sounds like your job has encountered one of the latter errors. The Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 indicates that the problem is due to communication with Solr. Apparently certain documents you are indexing are causing Solr to return an error code 500, which is an internal server error, and is usually associated with a Solr exception. You will need to diagnose why this is, and take corrective steps, in order for your ManifoldCF job to complete successfully. Job no longer active is harmless - it's a side effect of the job shutting down. When a job is shutting down, active document processing cannot always be interrupted within a connector, but the framework helps it to stop quickly by throwing this exception. Thanks, Karl

2012/3/16 Shigeki Kobayashi (Information Systems Division / Service Planning Dept.) shigeki.kobayas...@g.softbank.co.jp: I was crawling web sites with links to html and pdf files on the provided multiprocess-example agent for a few hours, when the Simple History started showing a -104 result code with a message saying Interrupted: Job no longer active. After the same error occurred repeatedly, around 40 times, the job status became Aborting and then ended up with Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job was interrupted and stopped. Does anyone know what situation brings about Repeated service interruptions and stops jobs? Also, in what circumstance does an error status code -104 occur? What is the meaning of the code -104? If you have any ideas, please advise me on how to avoid this error. I am using the following:
Solr 1.4 (Extracting Request Handler is set)
ManifoldCF 0.4 (multiprocess-example) - Repository connector: WEB - Output connector: Solr
Tomcat 6.0.29
PostgreSQL 9.1.3
Here is MCF's debug log right before the job was interrupted:
DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a
Re: Transforming Manifold Metadata Prior to Pushing the Data into SOLR
Please see my response interleaved below.

On Mon, Feb 27, 2012 at 9:53 AM, Matthew Parker mpar...@apogeeintegration.com wrote: I'm trying to push data into SOLR. Is there a way to transform the metadata coming in from different data sources, like SharePoint and the File Share, prior to posting it into SOLR?

In general, ManifoldCF does not have data transformation abilities. With Solr, we rely on Solr Cell, which is a pipeline built on Tika, to extract content from documents and to perform transformations to document metadata etc. It is possible that at some point it will be possible to do more transformations in ManifoldCF in order to support search engines that don't have a pipeline, but that is currently not available.

For instance, documents have metadata specifying their file path. I need to transform that to a URL I can use within SOLR to retrieve that document through a servlet that I wrote.

The ManifoldCF model is that a connector creates a URL for each document that it indexes, using whatever makes sense for that particular repository to get you back to the document in question. So, for instance, Documentum documents will use URLs that point at Documentum's Webtop web application. It would be helpful to understand more precisely what you are trying to do. You could, for instance, modify your servlet to redirect to the ManifoldCF-generated URL.

It gets indexed into Solr as the id field. Also, based on specific metadata that I'm seeing in the documents, I might want to conditionally populate other fields in the SOLR index.

That sounds like a job for the Tika pipeline to me. Thanks, Karl
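As an illustration of the kind of Solr Cell mapping described above, here is a minimal solrconfig.xml sketch; the specific mappings and field names are assumptions, not a tested configuration for SharePoint or file-share metadata:

    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" startup="lazy">
      <lst name="defaults">
        <!-- lowercase incoming metadata field names -->
        <str name="lowernames">true</str>
        <!-- map Tika's extracted body text into the schema's "text" field -->
        <str name="fmap.content">text</str>
        <!-- prefix metadata fields that have no schema match, instead of failing -->
        <str name="uprefix">attr_</str>
      </lst>
    </requestHandler>

Conditional logic beyond simple renames would have to go into a custom UpdateRequestProcessor or, as suggested, the Tika pipeline itself.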
Re: Cannot find OracleDriver
So if the Database and Host field really is 21:16:18:145:1521, try 21.16.18.145:1521 instead. ;-) Karl

On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker mpar...@apogeeintegration.com wrote:
Type: JDBC
Authority: None
Database Type: ORACLE
Database and Host: 21:16:18:145:1521
Instance/Database: main
User Name:
Password: X

On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I haven't seen this one. I'd love to know what the connect descriptor it refers to is. Can you tell me what the parameters all look like for the JDBC connection you are setting up? Are you specifying, for instance, the port as part of the server name? Karl

On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Karl, That fixed the driver issue. I just updated my start.jar file by hand for now. The problem I have now is connecting to ORACLE. I can do it through NetBeans on my machine, but I cannot connect through ManifoldCF with the same settings. I get the following error: Error getting connection. Listener refused the connection with the following error. ORA-12514. TNS:Listener does not currently know of service requested in connect descriptor. This might be more of an ORACLE issue than a Manifold issue, but I was wondering whether you've encountered the same thing during testing? Regards, Matt

On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Thanks Karl.

On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com wrote: The problem has been fixed on trunk. Basically, the instructions changed, as did some of the build files. It turned out to be extremely challenging to get JDBC drivers to run when they were loaded by anything other than the system classloader, so that's what I was forced to ensure. Thanks, Karl

On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com wrote: The ticket for this problem is CONNECTORS-390. Karl

On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Many thanks. I'll give that a try.

On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com wrote: The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath. Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest. I'm going to experiment a bit and see if I can come up with something quickly. Karl

On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl

On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. Seems more like a class loader issue.

On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache Manifold site (i.e. 0.3-incubating).
I checked the driver jar file with WinZip, and the driver name is still the same (oracle.jdbc.OracleDriver). I'm running Java 1.6.0_18-b7 on Windows XP SP3.

On Thu, Jan 19, 2012 at 2:27 PM, Karl Wright daddy...@gmail.com wrote: MCF's Oracle support was written against earlier versions of the Oracle driver. It is possible that they have changed the driver class. If the driver winds up in the dist/connector-lib directory (I'm assuming you are using trunk or 0.4-incubating), then it should be accessible. Could you please try the following:

jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver

... assuming you are using Linux? If the driver class IS found, then the other possibility is that the jar is compiled against a later version of Java than the one you are using to run MCF. Please let me know what you find. Karl

On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I downloaded MCF and started playing with the default setup under Jetty
Re: Cannot find OracleDriver
The connect URL it will use given those parameters is the following:

String dburl = "jdbc:" + providerName + "//" + host + "/" + database + ((instanceName==null)?"":";instance="+instanceName);

Or, filled in with your parameters: jdbc:oracle:thin:@//21.16.18.145:1521/main

The main at the end is what I would wonder about. Oracle's default is database; if you leave the database/instance name field blank, that's what you'll get. I also recommend turning on connector debugging, in properties.xml, by adding:

<property name="org.apache.manifoldcf.connectors" value="DEBUG"/>

... and restarting ManifoldCF. Try viewing the connection in the UI; you should see the connect string logged, as well as possibly a more detailed response. Thanks, Karl

On Mon, Feb 27, 2012 at 11:12 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Sorry. I used the wrong character. It is configured for 21.16.18.145:1521

On Mon, Feb 27, 2012 at 10:27 AM, Karl Wright daddy...@gmail.com wrote: So if the Database and Host field really is 21:16:18:145:1521, try 21.16.18.145:1521 instead. ;-) Karl

On Mon, Feb 27, 2012 at 9:22 AM, Matthew Parker mpar...@apogeeintegration.com wrote:
Type: JDBC
Authority: None
Database Type: ORACLE
Database and Host: 21:16:18:145:1521
Instance/Database: main
User Name:
Password: X

On Sun, Feb 26, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I haven't seen this one. I'd love to know what the connect descriptor it refers to is. Can you tell me what the parameters all look like for the JDBC connection you are setting up? Are you specifying, for instance, the port as part of the server name? Karl

On Sat, Feb 25, 2012 at 1:22 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Karl, That fixed the driver issue. I just updated my start.jar file by hand for now. The problem I have now is connecting to ORACLE. I can do it through NetBeans on my machine, but I cannot connect through ManifoldCF with the same settings. I get the following error: Error getting connection. Listener refused the connection with the following error. ORA-12514. TNS:Listener does not currently know of service requested in connect descriptor. This might be more of an ORACLE issue than a Manifold issue, but I was wondering whether you've encountered the same thing during testing? Regards, Matt

On Fri, Jan 20, 2012 at 10:28 AM, Matthew Parker mpar...@apogeeintegration.com wrote: Thanks Karl.

On Thu, Jan 19, 2012 at 9:44 PM, Karl Wright daddy...@gmail.com wrote: The problem has been fixed on trunk. Basically, the instructions changed, as did some of the build files. It turned out to be extremely challenging to get JDBC drivers to run when they were loaded by anything other than the system classloader, so that's what I was forced to ensure. Thanks, Karl

On Thu, Jan 19, 2012 at 3:33 PM, Karl Wright daddy...@gmail.com wrote: The ticket for this problem is CONNECTORS-390. Karl

On Thu, Jan 19, 2012 at 3:05 PM, Matthew Parker mpar...@apogeeintegration.com wrote: Many thanks. I'll give that a try.

On Thu, Jan 19, 2012 at 3:01 PM, Karl Wright daddy...@gmail.com wrote: The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath.
Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest. I'm going to experiment a bit and see if I can come up with something quickly. Karl

On Thu, Jan 19, 2012 at 2:48 PM, Karl Wright daddy...@gmail.com wrote: I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl

On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. Seems more like a class loader issue.

On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache Manifold site (i.e. 0.3-incubating). I
Re: ManifoldCF's dist/shapoint-integration dir
Hi Daniel, I have not personally tried ManifoldCF on JBoss, but since both Jetty and Tomcat work without modification, I would wonder if there is a JBoss classloader option you might be setting incorrectly. The reason this is likely is because the web container specification is pretty clear about the hierarchical order of resolution of classes for web applications, and it is this characteristic which will determine whether JDBC DriverManager registration works properly or not. Jetty has two possible settings, for instance - one that makes it conform to the spec, and one that is useful for single-process deployments. Perhaps other users on this list might have some hints? Karl

On Thu, Feb 23, 2012 at 7:47 PM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, I have been trying to configure ManifoldCF to run on JBoss. When I run Manifold on JBoss, the connection pool can't be created. Do we need to set the datasource through the web console of JBoss? I believe the code is in the DatabaseFactory. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Monday, February 13, 2012 10:10 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

The SharePoint connector only looks at documents within libraries, and documents within folders in those libraries. I don't know how SharePoint is structuring your Wiki content, though. If it is individual documents within libraries, it should be accessible by the SharePoint Connector. If it is some other construct, then it likely won't be found by that connector. The Simple History is going to list the URLs that the SharePoint connector fetches. If you know the URL of a piece of Wiki content and that URL does not appear in the Simple History, it's not being fetched. Similarly, if the URL of that piece of Wiki content has no library name in the path, it's not something the SharePoint Connector will be able to index. If the SharePoint connector is not going to do it for you, and your wiki content is being rendered in a manner that supports standard Wiki API calls, you can use the Wiki Connector to index it. If that too isn't going to work, then we should analyze exactly what SharePoint is presenting, with a view towards extending the SharePoint connector. Karl

On Mon, Feb 13, 2012 at 9:51 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, Does the SharePoint connector only pull files from the SharePoint instance, and not content like Wiki content? As mentioned in the previous e-mail, I am able to see the xml content in the log file for the wikis, with an element similar to <someWiki><someNameWiki_row>[some other elements]<WikiField>content.</WikiField></someNameWiki_row></someWiki>. However, I do not see information in the Simple History Report pulling Wiki information or the .aspx pages. Does this report only produce information on files, and not content pulled from SharePoint? I am just trying to figure out if I need to configure another connector to pull content from SharePoint other than the SharePoint connector. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Sunday, February 12, 2012 12:08 PM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

Hi Daniel, If you are seeing fetches in the Simple History that include the wiki URLs you are trying to capture, the SharePoint job is likely correct. Are you seeing Document ingest activities for the same documents?
If so, they are being sent to Solr, and you'd have to look into the Solr configuration to figure out why they aren't being indexed. Thanks,

On Sun, Feb 12, 2012 at 11:37 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, A quick question regarding SharePoint Wikis and ingesting them into Solr. I have been trying to get the Wikis, created in SharePoint, to be ingested into Solr. I am able to see the Wikis in the logging where the SharePoint Connector pulls everything from the site; however, I do not see the Wikis' content in the Solr instance. When creating a job to run, do I need to indicate a path similar to *Wiki* for the entire site, or do I need to configure the Solr metadata in the job to capture the WikiField element in the xml being passed to the Solr connector? Thanks for your help. Dan

From: Karl Wright [daddy...@gmail.com] Sent: Tuesday, January 31, 2012 10:52 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule
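For reference, the two Jetty settings mentioned above are controlled by the webapp classloader priority. A sketch, assuming a Jetty 6-era context XML (the class name dates from that generation of Jetty):

    <Configure class="org.mortbay.jetty.webapp.WebAppContext">
      <!-- false = servlet-spec behavior (webapp classes resolved first);
           true = system classloader first, the setting useful for single-process deployments -->
      <Set name="parentLoaderPriority">false</Set>
    </Configure>

A JBoss deployment would need the equivalent knob in its own classloader configuration.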
Re: Need Help on setting up ManifoldCF
Hi Anupam, I did not see a ticket from you about the DOCUMENTUM environment variable and the dmcl.ini vs. dfc.properties file. I've created an issue at https://issues.apache.org/jira/browse/CONNECTORS-410 to track this problem. It would be great if you could confirm that: (a) the DOCUMENTUM environment variable is still needed at all by DFC, and (b) that when it is set properly, the file dfc.properties can be found at $DOCUMENTUM\dfc.properties (on Windows, at least). Thanks, Karl

On Tue, Feb 14, 2012 at 3:23 PM, Karl Wright daddy...@gmail.com wrote: Hi Anupam, Please post emails like this directly to connectors-user@incubator.apache.org. See below for responses.

On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Hello Karl, I am a software programmer at DuPont, Gurgaon, India. Recently, due to the economic instability all over the world, the company has decided to go for cheaper search engine applications. Thus we are getting rid of many costly proprietary search applications and will be replacing with FAST. I recently came across the SOLR search engine and the ManifoldCF connector framework. Thus, I am currently driving this effort within my company, as I am a big supporter of open source technologies. I started my career in Alfresco CMS and now work on search technologies. Currently I am facing lots of initial building/deploying/installing issues. I have already referred to the URL http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html and read it multiple times, but still face many issues. I downloaded the latest 0.4 version, and it seems the documentation at the above link is not up to date.

The online documentation is pertinent to trunk. The documentation you want to use is contained within the 0.4-incubating release. Go to dist/doc and you will see it there.

A few issues which took me a long time to resolve, and which could be added to the ManifoldCF wiki as learnings for others, are listed below:

a. No single example is given for running executecommand.bat with proper arguments; only a list of commands is given, with parameters defined.

I'm not entirely sure I get this. Do you just want an example in the documentation?

b. Setting where and which file for the property manifoldcf.configfile when deploying the war on Tomcat with a PostgreSQL database.

The documentation already tells you that you need to add an appropriate -D to your tomcat invocation to point to your properties.xml file. Tomcat documentation differs from version to version and platform to platform on how best to do that, and if you run under Windows there's even a service wrapper with a configuration UI that allows you to set these parameters. So it's way beyond ManifoldCF's mission to describe all that, I think.

c. I am trying to build the Documentum Connector, but came to know that some additional environment variables need to be added for DOCUMENTUM. Additionally, the latest version of Documentum uses a dfc.properties file, while run.bat looks for a dmcl.ini file.

Could you open a ticket in Jira for this issue? https://issues.apache.org/jira. It should not be a problem if you modify the script temporarily, but we can readily make the script look for either of these.

d. The postgresql driver is JDBC3, thus it creates problems with JVM 6 or above.

We use JDK 6 all the time without problems, so I don't know what you are talking about here.

e. I was getting errors during the ant build, which tries to delete jar files from the lib directory. I don't have the source code with me right now, thus I can't provide the full path.
It sounds like you were trying to run ant while you still had ManifoldCF processes running from the same tree.

f. It was advised in the documentation to set MCF_Home for the example_multiprocess project, but it seems the build of the Documentum connector refers to this property differently from run.bat.

Yes, this was noticed and fixed on trunk recently.

Can you please update the Apache ManifoldCF website with the latest installation procedures? Also, it would be very kind of you in the meanwhile to send a few notes to give me a head start on the configuration of ManifoldCF with the SOLR and Documentum connectors.

The documentation online has been updated to be consistent with trunk, so if you want to use the trunk version this might be a good opportunity to help clarify the documentation. Either that, or you will need to stick with the 0.4-incubating release and the 0.4-incubating documentation that is part of it; we cannot at this time update documentation that has already been released. Thanks, Karl

Looking forward to your help. Thanks and Regards, Anupam Bhattacharya
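For what it's worth, the -D mentioned above would look something like this on Tomcat (the path shown is an assumption, not a required location):

    -Dorg.apache.manifoldcf.configfile=C:\manifoldcf\properties.xml

... added to JAVA_OPTS/CATALINA_OPTS, or entered in the Java tab of the Windows service configuration UI.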
Re: Need Help on setting up ManifoldCF
By all means, please go ahead. Solr has a tutorial - maybe something like that would be appropriate? Karl

On Tue, Feb 14, 2012 at 6:54 PM, Hitoshi Ozawa ozawa_hito...@ogis-ri.co.jp wrote: Hi, I agree with Anupam on getting started with ManifoldCF. I'm thinking of writing up a simple quick guide, because many people are having trouble. I think it would help others if there were a simple example with ManifoldCF + Solr + local files + jsp, to crawl some files in a local directory (the ManifoldCF documents in PDF?) and search and display the results. H.Ozawa

(2012/02/15 5:23), Karl Wright wrote: Hi Anupam, Please post emails like this directly to connectors-user@incubator.apache.org. See below for responses.

On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya anupam...@gmail.com wrote: Hello Karl, I am a software programmer at DuPont, Gurgaon, India. Recently, due to the economic instability all over the world, the company has decided to go for cheaper search engine applications. Thus we are getting rid of many costly proprietary search applications and will be replacing with FAST. I recently came across the SOLR search engine and the ManifoldCF connector framework. Thus, I am currently driving this effort within my company, as I am a big supporter of open source technologies. I started my career in Alfresco CMS and now work on search technologies. Currently I am facing lots of initial building/deploying/installing issues. I have already referred to the URL http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html and read it multiple times, but still face many issues. I downloaded the latest 0.4 version, and it seems the documentation at the above link is not up to date.

The online documentation is pertinent to trunk. The documentation you want to use is contained within the 0.4-incubating release. Go to dist/doc and you will see it there.

A few issues which took me a long time to resolve, and which could be added to the ManifoldCF wiki as learnings for others, are listed below:

a. No single example is given for running executecommand.bat with proper arguments; only a list of commands is given, with parameters defined.

I'm not entirely sure I get this. Do you just want an example in the documentation?

b. Setting where and which file for the property manifoldcf.configfile when deploying the war on Tomcat with a PostgreSQL database.

The documentation already tells you that you need to add an appropriate -D to your tomcat invocation to point to your properties.xml file. Tomcat documentation differs from version to version and platform to platform on how best to do that, and if you run under Windows there's even a service wrapper with a configuration UI that allows you to set these parameters. So it's way beyond ManifoldCF's mission to describe all that, I think.

c. I am trying to build the Documentum Connector, but came to know that some additional environment variables need to be added for DOCUMENTUM. Additionally, the latest version of Documentum uses a dfc.properties file, while run.bat looks for a dmcl.ini file.

Could you open a ticket in Jira for this issue? https://issues.apache.org/jira. It should not be a problem if you modify the script temporarily, but we can readily make the script look for either of these.

d. The postgresql driver is JDBC3, thus it creates problems with JVM 6 or above.

We use JDK 6 all the time without problems, so I don't know what you are talking about here.

e. I was getting errors during the ant build, which tries to delete jar files from the lib directory.
I don't have the source code with me right now, thus I can't provide the full path.

It sounds like you were trying to run ant while you still had ManifoldCF processes running from the same tree.

f. It was advised in the documentation to set MCF_Home for the example_multiprocess project, but it seems the build of the Documentum connector refers to this property differently from run.bat.

Yes, this was noticed and fixed on trunk recently.

Can you please update the Apache ManifoldCF website with the latest installation procedures? Also, it would be very kind of you in the meanwhile to send a few notes to give me a head start on the configuration of ManifoldCF with the SOLR and Documentum connectors.

The documentation online has been updated to be consistent with trunk, so if you want to use the trunk version this might be a good opportunity to help clarify the documentation. Either that, or you will need to stick with the 0.4-incubating release and the 0.4-incubating documentation that is part of it; we cannot at this time update documentation that has already been released. Thanks, Karl

Looking forward to your help. Thanks and Regards, Anupam Bhattacharya
Re: ManifoldCF's dist/shapoint-integration dir
The SharePoint connector only looks at documents within libraries, and documents within folders in those libraries. I don't know how SharePoint is structuring your Wiki content, though. If it is individual documents within libraries, it should be accessible by the SharePoint Connector. If it is some other construct, then it likely won't be found by that connector. The Simple History is going to list the URLs that the SharePoint connector fetches. If you know the URL of a piece of Wiki content and that URL does not appear in the Simple History, it's not being fetched. Similarly, if the URL of that piece of Wiki content has no library name in the path, it's not something the SharePoint Connector will be able to index. If the SharePoint connector is not going to do it for you, and your wiki content is being rendered in a manner that supports standard Wiki API calls, you can use the Wiki Connector to index it. If that too isn't going to work, then we should analyze exactly what SharePoint is presenting, with a view towards extending the SharePoint connector. Karl

On Mon, Feb 13, 2012 at 9:51 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, Does the SharePoint connector only pull files from the SharePoint instance, and not content like Wiki content? As mentioned in the previous e-mail, I am able to see the xml content in the log file for the wikis, with an element similar to <someWiki><someNameWiki_row>[some other elements]<WikiField>content.</WikiField></someNameWiki_row></someWiki>. However, I do not see information in the Simple History Report pulling Wiki information or the .aspx pages. Does this report only produce information on files, and not content pulled from SharePoint? I am just trying to figure out if I need to configure another connector to pull content from SharePoint other than the SharePoint connector. Thanks, Dan

From: Karl Wright [daddy...@gmail.com] Sent: Sunday, February 12, 2012 12:08 PM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

Hi Daniel, If you are seeing fetches in the Simple History that include the wiki URLs you are trying to capture, the SharePoint job is likely correct. Are you seeing Document ingest activities for the same documents? If so, they are being sent to Solr, and you'd have to look into the Solr configuration to figure out why they aren't being indexed. Thanks,

On Sun, Feb 12, 2012 at 11:37 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl, A quick question regarding SharePoint Wikis and ingesting them into Solr. I have been trying to get the Wikis, created in SharePoint, to be ingested into Solr. I am able to see the Wikis in the logging where the SharePoint Connector pulls everything from the site; however, I do not see the Wikis' content in the Solr instance. When creating a job to run, do I need to indicate a path similar to *Wiki* for the entire site, or do I need to configure the Solr metadata in the job to capture the WikiField element in the xml being passed to the Solr connector? Thanks for your help. Dan

From: Karl Wright [daddy...@gmail.com] Sent: Tuesday, January 31, 2012 10:52 AM To: Silvia, Daniel [USA] Cc: connectors-user@incubator.apache.org Subject: Re: ManifoldCF's dist/shapoint-integration dir

It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule type, and an action.
The actions are Include and Exclude. The rule type tells the connection what kind of SharePoint entity it is allowed to exactly match. For example, a File rule will only exactly match SharePoint paths that represent files - it cannot exactly match sites or libraries. The path itself is just a sequence of characters, where the * character has the special meaning of being able to match any number of any kind of characters, and the ? character matches exactly one character of any kind. The rule matcher extends strict, exact matching by introducing a concept of implicit inclusion rules. If your rule action is Include, and you specify (say) a File rule, the matcher presumes implicit inclusion rules for the corresponding site and library. So, if you create an Include File rule that matches (for example) /MySite/MyLibrary/MyFile, there is an implied Site Include rule for /MySite, and an implied Library Include rule for /MySite/MyLibrary. Similarly, if you create a Library Include rule, there is an implied Site Include rule that corresponds to it. Note that these shortcuts only apply to Include rules - there are no corresponding implied Exclude rules. What this means is that you should probably be declaring file rules with * as the file name for each
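As an illustration of that advice (a hypothetical rule, not taken from the documentation): to index every file in a library, you would typically declare

    Path: /MySite/MyLibrary/*    Rule type: File    Action: Include

and the matcher then implies a Site Include rule for /MySite and a Library Include rule for /MySite/MyLibrary.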
Re: Unable to index Windows share repositories
Nothing has changed as far as the connectors are concerned. Is your domain controller now upgraded to a different version of Windows too? If so, you may need to play around with the fields that are used for authorization, e.g. the form of the username and/or the domain name. Windows is not an open platform and they change stuff all the time, but to the best of my knowledge they have not introduced any new authentication modes in Windows 7, so something should work. If not, the guy to talk with is Michael Allen, who maintains the jcifs library. Karl

On Fri, Feb 10, 2012 at 2:17 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Until recently, I had been using ManifoldCF trunk code (before 0.4 was released) on Windows XP. I was able to index files from Windows share repositories successfully into Solr. Now, I have started using the ManifoldCF 0.4 version on Windows 7. With the new setup, I am able to index files from the file system repository with no issue, but I have problems indexing data from the Windows share repository. The job starts and ends with Result Description: Authorization: Access is denied. in the Simple History. The log file has the message JCIFS: Authorization exception reading document/directory smb://nhance29/TestMails/ - skipping. Can you please tell me what needs to be done to resolve this? I tried enabling Debug from properties.xml, and this is what I get in the log file:

DEBUG 2012-02-10 12:34:37,869 (Startup thread) - Connecting to: smb://GLOBAL;stgserver:password@nhance29/
DEBUG 2012-02-10 12:34:37,907 (Startup thread) - Seed = 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:39,781 (Worker thread '1') - JCIFS: getVersions(): documentIdentifiers[0] is: smb://nhance29/TestMails/
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: In checkInclude for 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Matching startpoint 'smb://nhance29/TestMails/' against actual 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Startpoint found!
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Startpoint: always included
DEBUG 2012-02-10 12:34:44,417 (Worker thread '1') - JCIFS: Leaving checkInclude for 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,421 (Worker thread '1') - JCIFS: Processing 'smb://nhance29/TestMails/'
DEBUG 2012-02-10 12:34:44,421 (Worker thread '1') - JCIFS: 'smb://nhance29/TestMails/' is a directory
WARN 2012-02-10 12:34:44,425 (Worker thread '1') - JCIFS: Possibly transient exception detected on attempt 1 while listing files: Access is denied.
jcifs.smb.SmbAuthException: Access is denied.
at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:546)
at jcifs.smb.SmbTransport.send(SmbTransport.java:640)
at jcifs.smb.SmbSession.send(SmbSession.java:238)
at jcifs.smb.SmbTree.send(SmbTree.java:119)
at jcifs.smb.SmbFile.send(SmbFile.java:775)
at jcifs.smb.SmbFile.doFindFirstNext(SmbFile.java:1986)
at jcifs.smb.SmbFile.doEnum(SmbFile.java:1738)
at jcifs.smb.SmbFile.listFiles(SmbFile.java:1715)
at jcifs.smb.SmbFile.listFiles(SmbFile.java:1704)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileListFiles(SharedDriveConnector.java:2224)
at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:701)
at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:561)
Thanks and Regards, Swapna.
Re: Unable to index Windows share repositories
Good to hear. The connector, by the way, is resigned to the fact that sometimes various things fail when talking to Windows, which is why you see the transient-failure notification; it will retry on its own eventually without killing the job, and only give up when things don't work for an extended period of time. Karl On Fri, Feb 10, 2012 at 5:08 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, Not sure why, but now I am able to index data from Windows Share repositories into Solr. I don't get the Access denied messages any more, although I haven't changed anything. Sorry for the inconvenience caused. Will get back again if I see any issue. Thanks and Regards, Swapna.
Re: Web Crawl using ManifoldCF
On Wed, Feb 8, 2012 at 8:24 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl I want to thank you for your help regarding the SharePoint to Solr connections; everything seems to be working properly after getting the Viewers and Home Owners group permissions set properly by our SharePoint Admins. That's great news! Thanks for sticking with it. ;-) However, I have another question regarding pulling site content from the SharePoint instance, and not the files stored on the SharePoint instance. When creating a Repository connection, would you use the Web connection type to pull site content? If that is the case, when creating the job, do you indicate just the site URL you want to crawl to pull site content in the Seed tab? Are we using the correct repository connection? Is there a repository type we can use to just crawl websites for the content and not files? I think that's the right approach, if there's a document you can crawl somewhere that has a reference to the other documents, or the documents all refer to each other. You need such a document or documents at the root of a document web, otherwise a web crawler has no way of locating the documents in question. That would be how you identify your seed document. For typical (non-SharePoint) sites, that's usually the main URL of the site. So, for example, if you wanted to crawl cnn.com you'd probably use a seed of http://www.cnn.com, because that's a good place to start to get to all of CNN's content. If no such document(s) exist, then web crawling is not going to do it. If this site is served by SharePoint, then some kind of enhancement to the SharePoint connector would be a better approach. Thanks, Karl As you can see, I hope I have explained myself properly; we are trying to crawl just the site content. Thanks Dan
Re: ManifoldCF's dist/shapoint-integration dir
Ok, let's do one thing at a time. First: For the Path tab where there are Path Rules, are these the paths we want ManifoldCF to follow? Each site, and each Library like Documents and Shared Documents. And in the Metadata tab, this is the tab where you indicate for each Site and Library you want to include specific metadata or include all metadata? For SharePoint, there are Path Rules and Metadata Rules. The Path Rules describe what documents you want to include or exclude. The Metadata Rules describe what metadata you want to include or exclude. For right now I would ignore the Metadata Rules and just make sure you have Path Rules that actually include documents. As I run the report, I see Documents, Active, and Processed, where the numbers change under the Active column as well as the Documents and Processed columns (these just get larger, where Active changes). This report we actually call the Job Status screen. The fact that the numbers get larger and the job doesn't just end indicates that you are successfully crawling your SharePoint, and you have set up the job to include at least some documents. This is good news. However, this is NOT the Simple History report I was alluding to earlier. To get to that report, click on the Simple History link in the left-hand navigation area. This report will show the events of your choice (default: ALL recorded events) over a given time window (default: the last hour). If you've done this right you should at least see a Job start event. The events you are most interested in are fetch (which describes all attempts to fetch documents from SharePoint) and document ingest (which describes attempts to get documents into Solr). You can refresh the displayed events by clicking the Go button in the middle of the screen whenever you wish. I'd like you to delete your job, create it again, and start it. Then, while it is running, I'd like you to go to the Simple History screen, select the appropriate connection (your SharePoint repository connection), and click the Go button. So as not to skip anything basic: (1) What event types do you see? (2) Are there fetch events? (3) Are there document ingest events? If you see no fetch events, that implies you have either not specified any documents to include in your job, OR your Solr connection is configured to reject too many document types, so they are all getting filtered out. If you see document ingest events, but those have errors, it implies that the configuration of your Solr connection is incorrect and does not match the way your Solr is configured. If you send me a specific error code and/or text, I can help you figure out what is happening. If you see document ingest events with NO errors, but the Solr instance is not getting documents, you are describing an impossible situation. While your Solr instance may not be configured to have the Extracting Update Handler active, or it may be at a different URL than what you pointed at, that would definitely yield errors or notifications in the Simple History. Please let me know what you actually see. Karl On Tue, Jan 31, 2012 at 7:53 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl I am trying to figure out why I can't see anything being indexed into our Solr index. I was looking at another post where you were working with Martijn, and that individual was not able to see info getting into Solr. In the job that I have set up, I have included all metadata associated with each site, Shared Documents, and Documents.
In the Solr Field Mapping, I am associating metadata fields that are indicated in the Metadata tab with fields that exist in our Solr index. For the Path tab where there are Path Rules, are these the paths we want ManifoldCF to follow? Each site, and each Library like Documents and Shared Documents. And in the Metadata tab, this is the tab where you indicate for each Site and Library you want to include specific metadata or include all metadata? As I run the report, I see Documents, Active, and Processed, where the numbers change under the Active column as well as the Documents and Processed columns (these just get larger, where Active changes). While I was researching why I may not be seeing something over on the Solr side, I saw your communication with another individual indicating that I should see something like literal.xxx=yyy in the Solr log. This is an older post, so there may be something else I should see. But the only thing I see when I look at the Solr log is [ ] webapp=/solr path=/update/extract params={commit=true} status=0 QTime=0. Any ideas? Thanks From: Karl Wright [daddy...@gmail.com] Sent: Monday, January 30, 2012 10:40 AM To: Silvia, Daniel [USA] Subject: Re: ManifoldCF's dist/shapoint-integration dir The default time range for the Simple History is the last hour. I suspect you are unaware of that. If you
Re: ManifoldCF's dist/shapoint-integration dir
I should clarify that the reason for deleting and recreating the job is that ManifoldCF crawls incrementally. If you just run a job a second time, you may well not get any documents if none have changed since the first time the job was run. Thanks, Karl
Re: ManifoldCF's dist/shapoint-integration dir
When I select only the fetch activity, I don't see anything in the events; when I select the Document Ingest activity, I don't see anything in the events. So either you've already run the job and the documents were accessed the first time (and won't be accessed again until they change), or the problem is likely that your SharePoint Path Rules are not including any documents. It would be very helpful at this point to include a screen shot of the job you've created. Since you are not on the net, perhaps you can jot down your SharePoint path rules for me to have a look at, as they are displayed when you view the job. Thanks, Karl On Tue, Jan 31, 2012 at 9:44 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl Ok, I have created a new job, ran the job, and went to the Simple History Report. I see the Events. If all the Activities in the Simple History Report (Document Deletion (SolrPipeline), Document Ingest (SolrPipeline), and Fetch) are selected, I see a start job and an end job for events. When I get to the Simple History Report I can select the Connection; I don't have an option to select the Activities until I run the report first. When I select only the fetch activity, I don't see anything in the events; when I select the Document Ingest activity, I don't see anything in the events. My Solr output connection has the following information:
Protocol: http
Server: the server name
Port: 8080 (we are running Solr on JBoss port 8080)
Web Application Name: solr
Core Name: collection1
Update Handler: update/extract
Remove Handler: /update
Status Handler: /admin/ping
Re: ManifoldCF's dist/shapoint-integration dir
It's been a while since I've set up a SharePoint job, but I think what you are missing is a file rule (instead of just a library rule). Here's what the end-user documentation says on the matter: Each rule consists of a path, a rule type, and an action. The actions are Include and Exclude. The rule type tells the connection what kind of SharePoint entity it is allowed to exactly match. For example, a File rule will only exactly match SharePoint paths that represent files - it cannot exactly match sites or libraries. The path itself is just a sequence of characters, where the * character has the special meaning of being able to match any number of any kind of characters, and the ? character matches exactly one character of any kind. The rule matcher extends strict, exact matching by introducing a concept of implicit inclusion rules. If your rule action is Include, and you specify (say) a File rule, the matcher presumes implicit inclusion rules for the corresponding site and library. So, if you create an Include File rule that matches (for example) /MySite/MyLibrary/MyFile, there is an implied Site Include rule for /MySite, and an implied Library Include rule for /MySite/MyLibrary. Similarly, if you create a Library Include rule, there is an implied Site Include rule that corresponds to it. Note that these shortcuts apply only to Include rules - there are no corresponding implied Exclude rules. What this means is that you should probably be declaring file rules with * as the file name for each library, rather than a library rule. You might want to just try this (see the sketch at the end of this message). If you still have trouble, you can try setting the org.apache.manifoldcf.connectors property to DEBUG in the properties.xml file and restarting ManifoldCF before your next crawl. The manifoldcf.log file will then have output describing the decisions the SharePoint connector made about each site, library, file, or folder it encountered. Thanks, Karl On Tue, Jan 31, 2012 at 10:27 AM, Silvia, Daniel [USA] silvia_dan...@bah.com wrote: Hi Karl The Path Rules are:
Path Match: /Shared Documents Type: library Action: include
Path Match: /IDD/Shared Documents Type: library Action: include
Path Match: /IDD/Documents Type: library Action: include
Path Match: /manifoldcf/Shared Documents Type: library Action: include
I hope this helps. I really appreciate your help.
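To make that concrete, here is a sketch of what Dan's rules might become under Karl's suggestion. The paths are the ones Dan listed, with a wildcard file name appended and the type changed to file; whether each library actually needs this depends on the site layout, so treat these as illustrative rather than definitive:
  Path Match: /Shared Documents/* Type: file Action: include
  Path Match: /IDD/Shared Documents/* Type: file Action: include
  Path Match: /IDD/Documents/* Type: file Action: include
  Path Match: /manifoldcf/Shared Documents/* Type: file Action: include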
Re: Cannot find OracleDriver
MCF's Oracle support was written against earlier versions of the Oracle driver. It is possible that they have changed the driver class. If the driver winds up in the dist/connector-lib directory (I'm assuming you are using trunk or 0.4-incubating), then it should be accessible. Could you please try the following: jar -tf ojdbc6.jar | grep oracle/jdbc/OracleDriver ... assuming you are using Linux? If the driver class IS found, then the other possibility is that the jar is compiled against a later version of Java than the one you are using to run MCF. Please let me know what you find. Karl On Thu, Jan 19, 2012 at 1:43 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I downloaded MCF and started playing with the default setup under Jetty and Derby. It starts up without any issue. I would like to connect to our Oracle database and import data into Solr. I placed the ojdbc6.jar file in the connectors/jdbc/jdbc-drivers directory, as stated in the README instruction file, to use the Oracle driver. I ran ant build from the main directory, and restarted the example in dist/example using Jetty. When I set up a connection, MCF throws an exception stating that it cannot find the oracle.jdbc.OracleDriver class. Looking in the connector-lib directory, the Oracle jar is there. I also tried placing the ojdbc6.jar in the dist/example/lib directory, but that didn't fix the problem either. Can anyone point me in the right direction? TIA -- This e-mail and any files transmitted with it may be proprietary. Please note that any views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of Apogee Integration.
Re: Cannot find OracleDriver
I was able to reproduce the problem. I'll get back to you when I figure out what the issue is. Karl On Thu, Jan 19, 2012 at 2:47 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I've used the jar file in NetBeans to connect to the database without any issue. It seems more like a class-loader issue. On Thu, Jan 19, 2012 at 2:41 PM, Matthew Parker mpar...@apogeeintegration.com wrote: I have the latest release from the Apache ManifoldCF site (i.e. 0.3-incubating). I checked the driver jar file with WinZip, and the driver name is still the same (oracle.jdbc.OracleDriver). I'm running Java 1.6.0_18-b7 on Windows XP SP 3.
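A small aside: Karl's pipe assumes Linux, and Matthew turned out to be on Windows. The equivalent check there would be something like the line below (findstr in place of grep; jar comes from the JDK's bin directory, which is assumed to be on the PATH):
  jar -tf ojdbc6.jar | findstr oracle/jdbc/OracleDriver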
Re: Cannot find OracleDriver
The problem is that the JDBC driver is using a pool driver that is in common with the core of ManifoldCF. So the connector-lib path, which only the connectors know about, won't do. That's a bug, which I'll create a ticket for. A temporary fix, which is slightly involved, requires you to put the ojdbc6.jar in the example/lib area, as you already tried, but in addition you will need to explicitly include the jar in your classpath. Normally the start.jar's manifest describes all the jars in the initial classpath. I thought it was possible to also include additional classpath info through the normal --classpath mechanism, but that doesn't seem to work, so you may be stuck with modifying the root build.xml file to add the jar to the manifest (see the sketch below). I'm going to experiment a bit and see if I can come up with something quickly. Karl
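For illustration, the manifest change would be along these lines in Ant. This is a hedged sketch, not the actual ManifoldCF build.xml (the real target layout and jar list differ); the point is simply that ojdbc6.jar has to be appended to the Class-Path attribute carried by start.jar's manifest:
  <jar destfile="dist/example/start.jar">
    <manifest>
      <!-- append ojdbc6.jar to the jars the build already lists here -->
      <attribute name="Class-Path" value="lib/mcf-core.jar lib/ojdbc6.jar"/>
    </manifest>
  </jar>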
Re: Programmatic Interaction with ManifoldCF
Hi Swapna, Passwords are stored in obfuscated form. There's a different method call to set passwords accordingly, which performs the obfuscation. See org.apache.manifoldcf.core.interfaces.ConfigParams.setObfuscatedParameter(String key, String value). Thanks, Karl On Mon, Jan 16, 2012 at 6:23 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I have looked at the examples you suggested and they have been very helpful. I am using the API to get/put different jobs, repository connections, etc., and everything is working fine. But I have an issue when creating a Windows Share repository connection. I am using code something like the below:
Configuration connectionConfiguration = new Configuration();
addParameterNode(connectionConfiguration, "Server", serverName);
addParameterNode(connectionConfiguration, "Domain/Realm", "GLOBAL");
addParameterNode(connectionConfiguration, "User Name", userName);
addParameterNode(connectionConfiguration, "Password", password);
Then I am using this connectionConfiguration to create the repository connection. The connection is getting created without any issue, but when I check the status using the crawler UI, it shows that the connection is not working. When I edit the connection (from the crawler UI) to see the details, the password is shown as empty. Can you please tell me how to create a Windows Share repository connection using this API such that it works by using the credentials that are sent as arguments? Thanks and Regards, Swapna. On Thu, Jan 5, 2012 at 11:43 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl, I have started looking into the API Service. Will get back to you with more specific questions once into it. Thanks and Regards, Swapna. On Tue, Jan 3, 2012 at 7:47 PM, daddy...@gmail.com daddy...@gmail.com wrote: The preferred way to do this is via the API Service. See Chapter 3 of ManifoldCF in Action. There are examples at http://manifoldcfinaction.googlecode.com/svnroot/trunk from the book. Karl Sent from my Nokia phone -Original Message- From: Swapna Vuppala Sent: 03/01/2012, 6:55 AM To: connectors-user@incubator.apache.org Subject: Programmatic Interaction with ManifoldCF Hi, I am looking for the best way to interact with ManifoldCF programmatically for my purposes. My target is to develop a small command-line tool which can read an XML file to get the list of locations that have to be crawled, and create repository connections and jobs that use the created repository connections, with paths that have been read from the XML file. If I write a Java program for this, which API should I be using? Earlier, I looked at scripts that can be run to create repository connections, jobs, etc. Should I run such scripts from my Java program, or is there a better way to approach this? Or is it possible to use the classes of ManifoldCF in my program to achieve this? If so, how? Can you please direct me to the ideal approach? Thanks and Regards, Swapna.
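A hedged sketch of the call Karl names, using the in-process Java interfaces. Note that Swapna's snippet goes through the API-service Configuration objects; this sketch only shows where setObfuscatedParameter fits. The parameter names mirror her snippet, and obtaining the IRepositoryConnection and saving it afterwards are assumed to happen elsewhere:
  import org.apache.manifoldcf.core.interfaces.ConfigParams;
  import org.apache.manifoldcf.crawler.interfaces.IRepositoryConnection;

  // Sketch: fill in a Windows Share connection's parameters, storing the
  // password via setObfuscatedParameter() so it round-trips correctly.
  public class SetShareCredentials {
      public static void setCredentials(IRepositoryConnection connection,
              String serverName, String userName, String password) {
          ConfigParams cp = connection.getConfigParams();
          cp.setParameter("Server", serverName);
          cp.setParameter("Domain/Realm", "GLOBAL");
          cp.setParameter("User Name", userName);
          cp.setObfuscatedParameter("Password", password); // not setParameter()
      }
  }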
Re: required attribute of solr-integration security fields
required=true affects the update handler, though, and ManifoldCF does not send __nosecurity__ as a value; it expects Solr to add it. So without the default value, the Solr 3.x and Solr 4.x components do not work. ManifoldCF in Action has its own example, which doesn't use __nosecurity__, but is slower. The book is now out of date in this regard, though. You should not mix the schema.xml from the book with code from the ManifoldCF tree. Karl On Tue, Jan 10, 2012 at 3:44 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: The tests for the components would not pass if required was true. I ran ant test for the components and it passed when adding required=true. If someone forgets to set the default value of __nosecurity__, required=true is effective. In fact I forgot to add the default value, so I couldn't search anything. That is because I used the security fields described in ManifoldCF in Action, which don't have the default attribute. So I think setting required=true is helpful. Regards, Shinichiro Abe
Re: required attribute of solr-integration security fields
I might as well clearly specify the necessity of field value by using required=true. What do you think? If you do that, the Solr Connector will cease to work. Try it if you do not believe me. The current contract is that the connector sends in tokens of each type to Solr, and will send zero tokens if it has none. In that case, if you set required=true, Solr will reject the document with an error. Karl On Tue, Jan 10, 2012 at 7:43 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi, So without default value, the solr 3.x and solr 4.x components do not work. Okay, I understand we must have the default value. Then do we need the required attribute itself? I don't mind having the required attribute or not (because we have the default value), but I might as well clearly specify the necessity of a field value by using required=true. What do you think? Thank you, Shinichiro Abe
Re: required attribute of solr-integration security fields
The fields should be required=false but with a default value of __nosecurity__. I believe that means that if there is no field value attached to the document when it is sent to Solr, Solr will make sure it has the value __nosecurity__. The tests for the components would not pass if required was true, so I am a little puzzled as to why you feel there is a problem here? Here's what the tests use for the schema:
<!-- MCF Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true" default="__nosecurity__"/>
Here's how the test documents are added:
assertU(adoc("id", "da12", "allow_token_document", "token1", "allow_token_document", "token2"));
assertU(adoc("id", "da13-dd3", "allow_token_document", "token1", "allow_token_document", "token3", "deny_token_document", "token3"));
assertU(adoc("id", "sa123-sd13", "allow_token_share", "token1", "allow_token_share", "token2", "allow_token_share", "token3", "deny_token_share", "token1", "deny_token_share", "token3"));
assertU(adoc("id", "sa3-sd1-da23", "allow_token_document", "token2", "allow_token_document", "token3", "allow_token_share", "token3", "deny_token_share", "token1"));
assertU(adoc("id", "notoken"));
Karl On Mon, Jan 9, 2012 at 11:12 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. README[1] of solr-integration says that you will need to add security fields, and to specify required=false. I think I should specify required=true, because MCF connectors do not always return tokens, and we can't search anything if these fields have no tokens (that is, null; the fields don't even have the __nosecurity__ value that stands for no security token) when using the MCF security plugin. May I open a JIRA ticket for modifying the README? Is there a reason it should be required=false? [1]https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/solr/integration/README-3.x.txt Regards, Shinichiro Abe
Re: Jetty configuration
The single-process example was originally conceived as just a quick-and-dirty way to get ManifoldCF running, and nobody thought it would ever become a serious deployment model. But having said that, I believe it is pretty straightforward to add support for more Jetty configuration options. Please consider opening a ticket and specifying what you'd like to see as far as Jetty configuration support. I suppose just supporting a jetty.xml would be sufficient? Karl On Wed, Dec 28, 2011 at 1:17 PM, M Kelleher mj.kelle...@gmail.com wrote: In spite of the class that starts Jetty and MCF, is there still a way to configure Jetty to use any of the supported OPTIONS, or specify a jetty.xml? I would like to enable JMX and configure BASIC authentication for the container, and also enable a plugin that will allow me to specify what IP addresses Jetty will respond to. Before I write my own wrapper replacement for start.jar and the invocation of the ManifoldCF Jetty starter class, I was hoping that there was a built-in way to specify these kinds of configurations. Thanks. Sent from my iPad
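For context, the kind of jetty.xml support being discussed would let the quick start accept a stanza like the one below. This is a hedged sketch in the Jetty 6.x configuration style of the era, not anything ManifoldCF currently reads; the host value is an illustrative assumption addressing Michael's IP-restriction point, and 8345 is the quick-start's default port:
  <Configure id="Server" class="org.mortbay.jetty.Server">
    <Call name="addConnector">
      <Arg>
        <New class="org.mortbay.jetty.nio.SelectChannelConnector">
          <!-- bind to a single address so Jetty only answers there -->
          <Set name="host">127.0.0.1</Set>
          <Set name="port">8345</Set>
        </New>
      </Arg>
    </Call>
  </Configure>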
Re: Incubator status, Manning email, is any action needed by users?
is the continued Incubator status a problem? It's only a problem in that some potential users of the software may avoid it due to this status, which is unfortunate. It also limits book sales. Is there something more we, as a group, need to do to push this forward? The decision not to pursue graduation at this time has to do with nothing more than the percentage of commits that are done by all the active committers, and their distribution. The ASF wants its projects well-covered, and ManifoldCF has not yet achieved that. In all other respects, I believe ManifoldCF would have no problem being able to graduate. For people who follow the project, this means basically that we need your contributions and your continued involvement. If you contribute consistently and well, you may be asked to become a committer, and that would certainly help the project towards graduation. Karl On Mon, Dec 12, 2011 at 11:40 PM, Mark Bennett mbenn...@ideaeng.com wrote: I got the email from Manning mentioning that the print book would be delayed, in favor of updates to the electronic copy, since the incubation period has been extended. This seems quite reasonable. This is NOT a post about the book - if it were, I'd post to that board. But more importantly, is the continued Incubator status a problem? Is it something we need to do something about? Is there something more we, as a group, need to do to push this forward? For example, something with docs or unit tests to comply with ASF? Or starting a blog campaign to lobby the ASF? Or is this even a problem to be worked on? Maybe there are good reasons for it, and Karl is content? I'm a newbie to these lists, so wanted to ask before doing anything. Only saw one recent post on the topic. Thanks, Mark -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: URL modification
The existing canonicalization logic is complex and cannot correctly be represented as a simple regexp (or I would have simply done it that way in the first place ;-) ). But we could certainly entertain the notion of adding arbitrary parameter removal/addition as one of the kinds of canonicalization that could be done. If you think you need this enhancement, please create a ticket for it and we'll mull it over. Thanks, Karl On Wed, Dec 7, 2011 at 8:59 AM, Michael Kelleher mj.kelle...@gmail.com wrote: Is it possible to modify the URLs at all at collection time or before fetch time? There is a URL parameter I would like to remove before the URL is fetched. Canonicalization seems to do that, but the modification types are fixed (remove JSP sessions, ASP sessions, PHP sessions, BV sessions). It does not seem to allow a regex to transform the URL. thanks
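As an aside, the parameter stripping Michael describes is simple enough in isolation; what is hard, per Karl, is combining it correctly with the connector's existing canonicalization. A hedged sketch of just the stripping step in plain Java, where the parameter name sessionid and the URL are made-up examples and this is not the web connector's canonicalizer:
  public class StripParam {
      public static void main(String[] args) {
          String url = "https://example.com/page?a=1&sessionid=XYZ&b=2";
          String cleaned = url
              .replaceAll("([?&])sessionid=[^&]*&?", "$1") // drop the parameter
              .replaceAll("[?&]$", "");                    // tidy a trailing ? or &
          System.out.println(cleaned); // https://example.com/page?a=1&b=2
      }
  }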
Re: WEB: Illegal seed URL
The URL as stated is fine and is pretty standard. I don't think there's a problem there, unless you inadvertently fixed something when you changed the hostname. Can you look at the log - there may well be a stack trace, especially if you have <property name="org.apache.manifoldcf.connectors" value="DEBUG"/> set. I'd love to see what the trace is. Karl On Tue, Dec 6, 2011 at 1:52 PM, Michael Kelleher mj.kelle...@gmail.com wrote: Here is my seed URL (minus the hostname): https://hostname.com/vwebv/search?searchArg=dvd&searchCode=SALL&searchType=1&recCount=100 I am using a Web Crawler connection that has been tested with the NullOutputConnector, so I don't think the issue can be there. I am also using the Solr Output Connector - this had been throwing an exception till I fixed the core name - this is the first time I have used it. So maybe I don't have things configured correctly here. However, there are no exceptions in the log. Also, I am not using authentication at all on Solr. I looked at the class connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors\webcrawler\WebcrawlerConnector.java and it was not obvious what the issue is. Also, in logging.ini I changed the logging level to DEBUG and restarted before I tested the crawl, which further obscures the logic in WebcrawlerConnector.java to me. Is there somewhere else I can set logging levels? I am not sure my change to logging.ini is having any effect. Also, is there some other test you might suggest? Thanks. --mike
Re: Problem crawling windows share
About your capture - Michael Allen says the following: Actually this has nothing to do with DFS. JCIFS does not get to the point where it does DFS anything. The capture shows a vanilla STATUS_LOGON_FAILURE when GLOBAL\swapna.vuppala tries to auth with l-carx01.global.arup.com. So the possible causes for this are 1) the account name is not valid, 2) the supplied password is incorrect, 3) some security policy is deliberately blocking that user or particular type of auth, or 4) some server configuration is incompatible with JCIFS. I only mention this last option because I noticed the target server has security signatures disabled. That's strange. If they're messing around with things like that, who knows what their clients are expected to do. Try a Windows client that uses NTLM instead of Kerberos. Meaning, try a machine that is not joined to the domain, so that when you try to access the target it asks you for credentials, at which point you can test with GLOBAL\swapna.vuppala. Then it will use NTLM and you can actually compare captures. If the operator doesn't have a laptop or something not joined to the domain, it might be sufficient to log into a workstation using machine credentials and not domain credentials. Also, when testing JCIFS you should use a simple stand-alone program like examples/ListFiles.java. In other words: (a) Since JCIFS does not use Kerberos for authentication, you need to try to log into the recalcitrant server via Windows without using Kerberos to be able to do a side-by-side comparison. Michael has some ways of doing that, above. (b) You may find that it doesn't work, in which case JCIFS is not going to work either. (c) If it *does* work, then try to generate your side-by-side comparisons using a simpler example rather than ManifoldCF in toto; you can see how at jcifs.samba.org, or I can help you further. He also mentions that there is some bizarreness in the response that indicates that the server is configured in a way that he's never seen before. And believe me, Michael has seen a *lot* of strange configurations... Hope this helps. Karl On Mon, Nov 28, 2011 at 4:12 AM, Karl Wright daddy...@gmail.com wrote: That should read properties.xml, not properties.ini. It looks like this page needs updating. The debug property in the XML form is: <property name="org.apache.manifoldcf.connectors" value="DEBUG"/> I don't think it will provide you with any additional information that is useful for debugging your authentication issue, however, if that is why you are looking at it. There may be some jcifs.jar debugging switches that might be of more help, but in the end I suspect you will need a packet capture of both a successful connection (via Windows) and an unsuccessful one (via MCF). The person you will need to talk with after that is the jcifs author, Michael Allen; I can give you his email address if you get that far. Karl On Mon, Nov 28, 2011 at 1:30 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I was planning to debug the jCIFS repository connection using Wireshark and I came across this: https://cwiki.apache.org/CONNECTORS/debugging-connections.html Here, I see something like add org.apache.manifoldcf.connectors=DEBUG to the properties.ini file. Is it the properties.xml file that is being referred to here? If not, where do I find the properties.ini file? Thanks and Regards, Swapna. On Thu, Nov 17, 2011 at 1:31 PM, Karl Wright daddy...@gmail.com wrote: See http://jcifs.samba.org/src/docs/api/overview-summary.html#scp.
The properties jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity are the ones you may want to change. These two properties go together, so certain combinations make sense and others don't; there are really only a few combinations you need, but I'll need to look at what they are and get back to you later today. As far as setting the switches is concerned, if you are using the Quick Start you do this trivially by: java -Dxxx -Dyyy -jar start.jar If you are using the multi-process configuration, that is what the defines directory is for; you only need to create files in that directory with the names jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity, containing the values you want to set. Karl On Thu, Nov 17, 2011 at 1:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I am able to access the folders on the problem server through Windows Explorer (\\server3\Folder1). I tried a couple of things with the credentials form, changing username, domain, etc., but I keep getting the same error: Couldn't connect to server: Logon failure: unknown user name or bad password Can you tell me more about the -D switch you were talking about? Thanks and Regards, Swapna. On Tue, Nov 15, 2011 at 12:40 PM, Karl Wright daddy...@gmail.com wrote: Glad you chased it down this far. First thing to try is whether you can get into the problem server using Windows
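Concretely, Karl's Quick Start pattern with those two properties filled in would look like the line below. The values shown are illustrative assumptions; as he notes, only certain combinations of the two make sense together:
  java -Djcifs.smb.lmCompatibility=3 -Djcifs.smb.client.useExtendedSecurity=true -jar start.jar
And here is a hedged sketch of the kind of simple stand-alone jcifs test Michael Allen recommends, in the spirit of (but not a copy of) examples/ListFiles.java. The domain, user, password, and share are placeholders drawn from this thread:
  import jcifs.smb.NtlmPasswordAuthentication;
  import jcifs.smb.SmbFile;

  // Minimal jcifs connectivity check: authenticate and list one share.
  public class ListShare {
      public static void main(String[] args) throws Exception {
          NtlmPasswordAuthentication auth =
              new NtlmPasswordAuthentication("GLOBAL", "swapna.vuppala", "password");
          SmbFile dir = new SmbFile("smb://server3/Folder1/", auth);
          for (SmbFile f : dir.listFiles())
              System.out.println(f.getName());
      }
  }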
Re: Export crawled URLs
Well, the history comes from the repohistory table, yes - but you may not be able to construct a query with entityid=jobs.id, first of all because that is incorrect (what the entity field contains is dependent on the activity type), and secondly because that column is potentially long and only some kinds of queries can be done against it. Specifically, it cannot be built into an index on PostgreSQL. Karl On Sun, Dec 4, 2011 at 7:50 PM, Hitoshi Ozawa ozawa_hito...@ogis-ri.co.jp wrote: Is history just entries in the repohistory table with entityid = jobs.id? H.Ozawa (2011/12/03 1:43), Karl Wright wrote: The best place to get this from is the simple history. A command-line utility to dump this information to a text file should be possible with the currently available interface primitives. If that is how you want to go, you will need to run ManifoldCF in multiprocess mode. Alternatively you might want to request the info from the API, but that's problematic because nobody has implemented report support in the API as of now. A final alternative is to get this from the log. There is an [INFO] level line from the web connector for every fetch, I seem to recall, and you might be able to use that. Thanks, Karl On Fri, Dec 2, 2011 at 11:18 AM, M Kelleher mj.kelle...@gmail.com wrote: Is it possible to export / download the list of URLs visited during a crawl job?
Re: Exception while processing document
Hmm, this is a new one for me. Can you include the entire trace? Karl On Sat, Nov 26, 2011 at 2:43 PM, Michael Kelleher mj.kelle...@gmail.com wrote: I get the following exception: java.lang.RuntimeException: Unexpected error: java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty Anyone know what this relates to and how to fix it? I currently have nutch 1.3 crawling the same site without exceptions.
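For what it's worth, the trustAnchors message generally means the JVM found no CA certificates at all - an empty, missing, or unreadable truststore - rather than anything crawl-specific. Assuming the Quick Start invocation, pointing the JVM at a known-good cacerts file explicitly is a quick way to confirm; javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword are standard JSSE properties, and the path below is a placeholder:

  java -Djavax.net.ssl.trustStore=/path/to/jre/lib/security/cacerts -Djavax.net.ssl.trustStorePassword=changeit -jar start.jar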
Re: Question about deploy to Tomcat
On Sat, Nov 26, 2011 at 8:02 PM, Michael Kelleher mj.kelle...@gmail.com wrote: I have been reading file: http://incubator.apache.org/connectors/how-to-build-and-deploy.html#Running+ManifoldCF If I am reading this correctly, it seems that the Database initialization is completely manual, and does not happen at MCF startup time. Is this correct? Yes - there is an example set of steps listed in the how-to-build-and-deploy page. This is necessary because of the multiprocess nature of the model. The standalone instance does not seem to work for me, and apparently neither does deploying this to Tomcat. The standalone instance seems fine here, and passes the tests. Can you be more explicit about what is happening for you? Karl
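For reference, a typical manual initialization sequence under the multiprocess model looks roughly like the following. The command class names are as of the 0.3/0.4 timeframe and the superuser arguments and connector choices here are placeholders - the how-to-build-and-deploy page is the authoritative list:

  executecommand.bat org.apache.manifoldcf.core.DBCreate <dbsuperusername> <dbsuperuserpassword>
  executecommand.bat org.apache.manifoldcf.agents.Install
  executecommand.bat org.apache.manifoldcf.agents.Register org.apache.manifoldcf.crawler.system.CrawlerAgent
  executecommand.bat org.apache.manifoldcf.agents.RegisterOutput org.apache.manifoldcf.agents.output.solr.SolrConnector "Solr"
  executecommand.bat org.apache.manifoldcf.crawler.RegisterConnector org.apache.manifoldcf.crawler.connectors.filesystem.FileConnector "FileSystem"
  executecommand.bat org.apache.manifoldcf.agents.AgentRun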
Re: Question about deploy to Tomcat
On Sun, Nov 27, 2011 at 5:23 PM, Adam LaPila adam.lap...@lmal.com.au wrote: Hi Michael, I too am having trouble getting MCF with Tomcat to work successfully; if I get it completed I'll be sure to reply with details on how I got it to work. What happened for me: I followed the steps several times on the how-to page. In my browser I could open up the main MCF page, but if I clicked on any of the links (output, repository, job, etc.) it would just open a blank page. It would be the mcf-crawler-ui layout but there would be nothing displayed apart from the default template. Did you have something similar? It sounds like something is wrong with database communication. What database are you using, and are there any exceptions in the Tomcat logs pertaining to ManifoldCF? I bet there are. Karl Cheers, Adam. -Original Message- From: Michael Kelleher [mailto:mj.kelle...@gmail.com] Sent: Sunday, 27 November 2011 12:03 PM To: connectors-user@incubator.apache.org Subject: Question about deploy to Tomcat [original message snipped - quoted in full in the previous thread]
Re: Authority Connection works unpredictably
Hi Swapna, There should be manifoldcf log output that contains the actual stack trace of the exception. That would be very helpful; I need the line numbers. The code is quite simple, and indicates that the LDAP server is refusing a connection:

protected void getSession()
  throws ManifoldCFException
{
  if (ctx == null)
  {
    // Calculate the ldap url first
    String ldapURL = "ldap://" + domainControllerName + ":389";
    Hashtable env = new Hashtable();
    env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.SECURITY_AUTHENTICATION,authentication);
    env.put(Context.SECURITY_PRINCIPAL,userName);
    env.put(Context.SECURITY_CREDENTIALS,password);
    //connect to my domain controller
    env.put(Context.PROVIDER_URL,ldapURL);
    //specify attributes to be returned in binary format
    env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");
    // Now, try the connection...
    try
    {
      ctx = new InitialLdapContext(env,null);
    }
    catch (AuthenticationException e)
    {
      // This means we couldn't authenticate!
      throw new ManifoldCFException("Authentication problem authenticating admin user '"+userName+"': "+e.getMessage(),e);
    }
    catch (CommunicationException e)
    {
      // This means we couldn't connect, most likely
      throw new ManifoldCFException("Couldn't communicate with domain controller '"+domainControllerName+"': "+e.getMessage(),e);
    }
    catch (NamingException e)
    {
      throw new ManifoldCFException(e.getMessage(),e);
    }
  }
  else
  {
    // Attempt to reconnect. I *hope* this is efficient and doesn't do unnecessary work.
    try
    {
      ctx.reconnect(null);
    }
    catch (AuthenticationException e)
    {
      // This means we couldn't authenticate!
      throw new ManifoldCFException("Authentication problem authenticating admin user '"+userName+"': "+e.getMessage(),e);
    }
    catch (CommunicationException e)
    {
      // This means we couldn't connect, most likely
      throw new ManifoldCFException("Couldn't communicate with domain controller '"+domainControllerName+"': "+e.getMessage(),e);
    }
    catch (NamingException e)
    {
      throw new ManifoldCFException(e.getMessage(),e);
    }
  }
  expiration = System.currentTimeMillis() + expirationInterval;
  try
  {
    responseLifetime = Long.parseLong(this.cacheLifetime) * 60L * 1000L;
    LRUsize = Integer.parseInt(this.cacheLRUsize);
  }
  catch (NumberFormatException e)
  {
    throw new ManifoldCFException("Cache lifetime or Cache LRU size must be an integer: "+e.getMessage(),e);
  }
}

Your problem description indicates that it is possible that the ctx.reconnect() call is failing to reconnect, but a new connection works OK on your setup. A stack trace should tell me everything. Thanks, Karl On Wed, Nov 23, 2011 at 12:58 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Even after reducing the max connections to 3, the connection fails abruptly for me. Currently, the domain controller I am using is mapped to only one IP address, and that responds on ping, and the max connections are 3. It was working yesterday and it fails suddenly, throwing different exceptions like those below: Threw exception: 'Couldn't communicate with domain controller 'globalad1': null' Threw exception: 'Couldn't communicate with domain controller 'globalad1.global.arup.com': null' Threw exception: 'globalad1.global.arup.com:389; socket closed' Sometimes, it works when I change the cache lifetime parameter. What other factors do you think can cause this to fail? Thanks and Regards, Swapna. On Tue, Nov 22, 2011 at 11:56 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: OK..
Thanks for the information On Mon, Nov 21, 2011 at 6:31 PM, Karl Wright daddy...@gmail.com wrote: The sAMAccountName and UserPrincipalName LDAP fields were used by different versions of Windows at different points in time. Some backwards compatibility was maintained; however, Microsoft has apparently decided to deprecate one of them (can't remember which), and thus you need support for both. Karl On Mon, Nov 21, 2011 at 6:39 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Yes, my Active Directory authority connection is configured to talk to only one IP address and that particular one is responding to ping always. Earlier, the max connections parameter was set to 10; now I have reduced it to 3. It's working as of now and I'll keep checking whether it's going to throw an exception. Thanks a lot for the inputs. Also, I was wondering what the difference was between the 2 options for the Login name AD attribute, sAMAccountName and UserPrincipalName? Thanks and Regards, Swapna. On Mon, Nov 21, 2011 at 4:57 PM, Karl
Re: Authority Connection works unpredictably
To clarify, what I think may be happening is this. (1) The Java LDAP context is keeping a socket connection to the AD controller. (2) The AD controller must be configured to close connections forcibly after a certain period of time. (3) The LDAP context's reconnect() operation doesn't recover from a socket that was closed by the server. (4) The authority code won't release the LDAP context until 5 idle minutes go by. So basically, a connection winds up in a busted state and doesn't recover, if the server closes the socket out from under the ldap connection. It's easy to fix, so I've opened a ticket (CONNECTORS-291), and will commit code changes to trunk shortly. What version of MCF are you using? Karl On Wed, Nov 23, 2011 at 5:23 AM, Karl Wright daddy...@gmail.com wrote: [quoted getSession() code and earlier messages snipped - they appear in full in the previous thread]
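A sketch of the general shape such a fix could take - not the committed CONNECTORS-291 patch, just an illustration of point (3), with env assumed to be the same environment Hashtable used to create the original context:

  import java.util.Hashtable;
  import javax.naming.CommunicationException;
  import javax.naming.NamingException;
  import javax.naming.ldap.InitialLdapContext;
  import javax.naming.ldap.LdapContext;

  public class LdapReconnect {
    // If reconnect() dies because the server already closed the socket,
    // discard the stale context and build a fresh one from the same environment.
    public static LdapContext reconnectOrRebuild(LdapContext ctx, Hashtable<String,String> env)
        throws NamingException {
      try {
        ctx.reconnect(null);
        return ctx;
      } catch (CommunicationException e) {
        try { ctx.close(); } catch (NamingException ignored) {}
        return new InitialLdapContext(env, null);
      }
    }
  }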
Re: Authority Connection works unpredictably
I've attached a patch to the ticket so even MCF 0.3 users should be able to apply it. Karl On Wed, Nov 23, 2011 at 5:40 AM, Karl Wright daddy...@gmail.com wrote: [quoted messages snipped - they appear in full in the two threads above]
Re: MCF - Oracle to Solr
Hi Adam, Like I said before, the Simple History shows clearly that you have a perfectly reasonable URL for your documents. That is NOT the problem. The URL does not even need to be real; it's just an identifier of sorts as far as ManifoldCF and Solr are concerned. As I said before, you will probably want to make it real eventually, because otherwise there's no way to link back to display the content of your search results, but that's not important for indexing. Many people have indexed JDBC content successfully. But Solr is, on the other hand, very highly configurable, and depending on how you have set up your solrconfig.xml and/or schema.xml file you can certainly get back 500 errors or 400 errors from it when ManifoldCF tries to index something. When that happens, all that usually needs to be done is that either the configuration of the output connection needs to be changed, or the solrconfig.xml and/or schema.xml needs to be changed. So let's start by exploring how you have set up your Solr. Are you running the Solr example without modification? Or have you (or someone else) set Solr up specifically for your search problem? Can you find out where the Solr standard error and standard output is going? If so, you should see output for each document that ManifoldCF tries to index. Do you see this output, and what does it say? I should also mention that several versions of Solr returned 400 errors for zero-length documents indexed through the extracting update handler, which is what ManifoldCF uses. This is not usually a problem anyhow because, although it is noisy, there would not be any content for the document anyway. But is there any possibility that the database field you are indexing as the content field has nothing in it some or all of the time? Karl On Mon, Nov 21, 2011 at 1:01 AM, Adam LaPila adam.lap...@lmal.com.au wrote: Hi Karl, Still no luck. You wouldn't happen to have a link to any good resources on how to index a DB to Solr with MCF, other than the end-user examples from the website? Perhaps some of your own work with the use of a database - it can be Oracle, MySQL, etc. Do you know of anyone who has tried this before and was successful? Any design documents? I've been googling non-stop; surely someone has done this before. With the CONCAT('http://localhost:8080/solr?id=',AIRCRAFT_ID) AS $(URLCOLUMN)..FROM...WHERE) Does the URL need to be the link to the solr example? As the end-user documentation says of URLCOLUMN, "The name of an expected resultset column containing a URL". This is what I have been getting in the Simple History since my last email. 11-21-2011 16:53:28.378 document ingest (Solr) localhost:8080/solr?id=AC004 400 48 1 Bad Request 11-21-2011 16:53:28.347 document ingest (Solr) localhost:8080/solr?id=AC003 400 43 1 Bad Request 11-21-2011 16:53:28.331 document ingest (Solr) localhost:8080/solr?id=AC002 400 34 1 Bad Request 11-21-2011 16:53:28.300 document ingest (Solr) localhost:8080/solr?id=AC001 400 27 1 Bad Request Sorry for any troubles, just a little confused with it all. Regards, Adam. -Original Message- From: Karl Wright [mailto:daddy...@gmail.com] Sent: Monday, 21 November 2011 11:43 AM To: connectors-user@incubator.apache.org Subject: Re: MCF - Oracle to Solr Hi Adam, The 500 error is coming from Solr, so the place to look is in the Solr logs and output. If you are running the Solr example, you should be seeing stack traces which may shed light on what is happening.
FWIW, I doubt very much that this has anything to do with your URL construction, which looks good based on what the Simple History indicates. Thanks, Karl On Sun, Nov 20, 2011 at 7:02 PM, Adam LaPila adam.lap...@lmal.com.au wrote: Hello, I'm trying to get MCF to index my Oracle repository into my Solr output repository. I have been following the end-user documentation and I'm still having trouble getting things to work. I also have Solr installed and running on a Tomcat server on port 8080. I have set up my output and repository connectors. These seem to be fine, as each has the Connection Working status. I am sure the problem is how I'm setting up my job to extract the database table data into my Solr index. I received an email from Karl a couple of days ago in regards to the queries provided. SELECT CONCAT('http://myserver.com?id=',Aircraft_ID) AS $(URLCOLUMN), ... FROM ... WHERE ... I have changed my query to be more like this. This is what I have as my Data Query: SELECT AIRCRAFT_ID AS $(IDCOLUMN), AIRCRAFT_INFO AS $(DATACOLUMN), CONCAT('http://localhost:8080/solr?id=',AIRCRAFT_ID) AS $(URLCOLUMN) FROM AIRCRAFT WHERE AIRCRAFT_ID IN $(IDLIST) When I run the job, I find that in the simple history I get something like this: document ingest (Solr) http://localhost:8080/solr?id=AC001 500 27 16 Internal Server Error AC001 is one of the IDs
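One way to take ManifoldCF out of the loop entirely is to post a document straight to Solr's extracting update handler and watch what comes back. This assumes the Solr instance from this thread at localhost:8080/solr and the stock /update/extract handler path from the Solr example solrconfig.xml; the id and file name are placeholders:

  curl "http://localhost:8080/solr/update/extract?literal.id=AC001&commit=true" -F "myfile=@test.txt"

A 400 on an empty test.txt but success on a non-empty one would point at the zero-length-document behavior Karl describes above; a 400 on both points at the schema or handler configuration instead.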
Re: Authority Connection works unpredictably
So let me get this straight - your Active Directory authority connection is configured to talk to only one IP address? And that IP address responds to ping even when you are receiving an error back from the authority connection? Another possibility is that the DC can only accept a limited number of connections at a time. What is the max connections parameter for your authority connection? Try reducing it to no more than 3-4 and see if that helps. Karl On Mon, Nov 21, 2011 at 5:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I think I see many domain controllers for the domain I am using. But I see only one IP address mapped to the domain controller name that I am using in the credentials form. As I told you, it's working sometimes and throwing an exception sometimes. But ping always works fine on the domain controller name that I am using, from which I assume that it is not unreachable. Can you tell me what else I should be checking, or what other factors could be causing this to fail? Thanks and Regards, Swapna. On Thu, Nov 17, 2011 at 1:18 PM, Karl Wright daddy...@gmail.com wrote: Try doing nslookup on the domain controller. In some larger companies there are many domain controllers, all with the same name but different IPs. These *should* all be in synch, but it may be the case that they are not - or some of them are unreachable or offline. This can also be the cause of intermittent authorization failures during crawling. If that is the case, you have the option of setting the local machine's /etc/hosts file to point to a couple of domain controller instances that are local and in good working order, rather than relying on DNS to find one. Karl On Thu, Nov 17, 2011 at 1:32 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I seem to have some problem with my Authority Connection. When I define an Authority Connection specifying all the parameters like Domain Controller, username, password etc., the connection status shows Connection Working and everything works fine: crawling and sending docs to Solr, using mcf-authority-service to get only those docs that a user has got permission to see, etc. But suddenly, the connection status for the Authority Connection throws an exception, and when I play around with the credentials form - toggling the Login name AD attribute, or changing the domain controller name, or authentication, or sometimes even with the same settings that threw an exception earlier - the status shows Connection working again. I cannot define when it fails and when it works and for what settings it works. Can someone help me in understanding why this is happening and what needs to be done to make it work always? Thanks and Regards, Swapna.
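If nslookup does show several controllers behind one name, the /etc/hosts pinning Karl describes is just an ordinary hosts entry. The address below is a placeholder for a known-good controller from the nslookup output:

  # /etc/hosts (or C:\Windows\System32\drivers\etc\hosts on Windows)
  10.0.0.21   globalad1.global.arup.com globalad1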
Re: Problem crawling windows share
See http://jcifs.samba.org/src/docs/api/overview-summary.html#scp. The properties jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity are the ones you may want to change. These two properties go together, so certain combinations make sense and others don't; there are really only a few combinations you need, but I'll need to look at what they are and get back to you later today. As far as setting the switches is concerned, if you are using the Quick Start you do this trivially by: java -Dxxx -Dyyy -jar start.jar If you are using the multi-process configuration, that is what the defines directory is for; you only need to create files in that directory with the names jcifs.smb.lmCompatibility and jcifs.smb.client.useExtendedSecurity, containing the values you want to set. Karl On Thu, Nov 17, 2011 at 1:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I am able to access the folders on the problem server through Windows Explorer (\\server3\Folder1). I tried a couple of things with the credentials form, changing username, domain etc., but I keep getting the same error: Couldn't connect to server: Logon failure: unknown user name or bad password Can you tell me more about the -D switch you were talking about? Thanks and Regards, Swapna. On Tue, Nov 15, 2011 at 12:40 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the following threads]
Re: Problem crawling windows share
There's two kinds of problem you might be having. The first is intermittent, and the second is not intermittent but would have something to do with specific directories. Intermittent problems might include a domain controller that is not always accessible. In such cases, the crawl will proceed but will tend to fail unpredictably. On the other hand, if you have a directory that is handled by a DFS redirection, it is possible that the redirection is indicating a new server (let's call it server3) which may not like the precise form of your login credentials. Can you determine which scenario you are seeing? Karl On Mon, Nov 14, 2011 at 3:11 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I have been using a windows share repository connection to crawl and get data from a particular server (server 1). It's working perfectly fine. However, I am having trouble when I try with data from another server (server 2). When I define a repository connection of type windows share and specify the server name (server 2) with my credentials, the connection status shows Connection working. But when I run a job to use this repository connection and index data from a location on this server 2, I keep getting the exception below: JCIFS: Possibly transient exception detected on attempt 3 while checking if file exists: Logon failure: unknown user name or bad password.
jcifs.smb.SmbAuthException: Logon failure: unknown user name or bad password.
  at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:544)
  at jcifs.smb.SmbTransport.send(SmbTransport.java:661)
  at jcifs.smb.SmbSession.sessionSetup(SmbSession.java:390)
  at jcifs.smb.SmbSession.send(SmbSession.java:218)
  at jcifs.smb.SmbTree.treeConnect(SmbTree.java:176)
  at jcifs.smb.SmbFile.doConnect(SmbFile.java:911)
  at jcifs.smb.SmbFile.connect(SmbFile.java:954)
  at jcifs.smb.SmbFile.connect0(SmbFile.java:880)
  at jcifs.smb.SmbFile.queryPath(SmbFile.java:1335)
  at jcifs.smb.SmbFile.exists(SmbFile.java:1417)
  at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.fileExists(SharedDriveConnector.java:2064)
  at org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getDocumentVersions(SharedDriveConnector.java:521)
  at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
I am able to access this location from Windows Explorer. What else should I be checking, or what could be the reasons/factors causing this to fail? Thanks and Regards, Swapna.
Re: Problem crawling windows share
Glad you chased it down this far. First thing to try is whether you can get into the problem server using Windows Explorer. Obviously ManifoldCF is not going to be able to do it if Windows can't. If you *can* get in, then just playing with the form of the credentials in the MCF connection might do the trick. Some Windows or net appliance servers are picky about this. Try various things, like leaving the domain blank and specifying the user as a...@domain.com, for instance. There's also a different NTLM mode you can operate jcifs in that some servers may be configured to require; this would need you to set a -D switch on the command line to enable. Karl On Tue, Nov 15, 2011 at 12:10 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Thanks for the input. It looks like my problem is related to the second one that you specified. One of the directories in the path I am trying to index is actually redirecting to a different server. And when I specify this new server in defining the repository connection, with my credentials, the connection fails with the message: Couldn't connect to server: Logon failure: unknown user name or bad password I'll look into why I am not able to connect to this server. Thanks and Regards, Swapna. On Mon, Nov 14, 2011 at 4:56 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the thread above]
Re: Authorization for Ubuntu server and Windows WS not in a domain
Hi, File ACLs in ManifoldCF would normally be handled by the repository connector, during the process of indexing documents. The corresponding Linux authority information would include what Unix groups a user was part of. There is currently no authority connector I am aware of that does that. You could perhaps try to write your own - I doubt it would be very hard to write. Karl On Sat, Nov 12, 2011 at 9:39 AM, mi...@grf.bg.ac.rs wrote: Hello, So I could use MCF as a service provider for authorization. That is nice, but only if the formats agree (I'll check that). But still there is one question left: What authorization component should I use if the users and indexed files are not on Windows servers, but on an Ubuntu server with Unix-style file rights of type rwxr--r-- ? Also, could I use the AD component for Windows machines not in a domain? The format of a ManifoldCF access token is a collaboration between an authority connector and repository connectors designed to work with that authority connector. If you've already indexed documents using another mechanism, you can still use ManifoldCF's authority service to obtain access tokens for authenticated users. This service is a web application accessible by http. You can see what it returns (after defining an authority connection or two in the ManifoldCF UI) by simply using curl: curl http://localhost:8345/mcf-authority-service/UserACLs?username=myn...@mydomain.com ... and noting what is returned. The access tokens indexed in Solr by your crawler will have to match the access token format returned by the authority service, or the Solr query modification components we supply will not work. Hope this helps, Karl On Fri, Nov 11, 2011 at 2:12 PM, mi...@grf.bg.ac.rs wrote: Hello, I would like to test ManifoldCF (MCF) in order to achieve doc-level security in my Solr search app. Actually, I have already developed my own document crawler apart from the MCF framework. My test Solr app is located on an Ubuntu server (the indexed docs are located on that server). When I tried to use the MCF Quick Start app I didn't know what authorization connector to use for this case (of course I would use MCF crawlers in this case to retrieve documents). I need MCF only for authorization. My crawler already uses Java 7 capabilities to retrieve file ACLs (POSIX attributes). What classes do I need to use from the MCF libraries to perform authorization based on Ubuntu server usernames? Finally, is it possible to perform authorization against Windows accounts on workstations not in a domain (local users)? Thank you, Milos
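To give a flavor of what "writing your own" would involve, here is only a rough sketch against what the 0.x authority API looks like from the shipped connectors. BaseAuthorityConnector and AuthorizationResponse are real MCF classes, but the exact signatures should be checked against the source, and the group lookup here is a hypothetical stub:

  import org.apache.manifoldcf.authorities.authorities.BaseAuthorityConnector;
  import org.apache.manifoldcf.authorities.interfaces.AuthorizationResponse;

  // Hypothetical Unix-groups authority: maps a user name to access tokens
  // derived from the Unix groups the user belongs to.
  public class UnixGroupAuthority extends BaseAuthorityConnector {
    public AuthorizationResponse getAuthorizationResponse(String userName) {
      // lookupGroups() is a stub - e.g. parse /etc/group, or shell out to getent
      String[] tokens = lookupGroups(userName);
      return new AuthorizationResponse(tokens, AuthorizationResponse.RESPONSE_OK);
    }
    private String[] lookupGroups(String userName) {
      return new String[]{"users"}; // placeholder
    }
  }

The repository side would then have to index matching tokens (the group names of each file's rwx bits) for the Solr query modification components to line up, as Karl notes above.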
Re: Solr - ManifoldCFSecurityFilter
It doesn't have to be in the URL, but it has to be in the solr request object somehow. If you want another source for the parameter, please describe what you are trying to do and maybe we can come up with something different. Karl On Fri, Oct 21, 2011 at 4:34 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey guys, I’ve got a question concerning the security search component for Solr. To authorize a user I have to send the parameter “AuthenticatedUserName=…”. Does this always have to happen directly in the URL, or is there another way to send this parameter?
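If the search client is Java/SolrJ rather than a hand-built URL, the parameter can be set on the request object instead. A sketch against the SolrJ 3.x API, with the server URL as a placeholder:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SecureSearch {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery query = new SolrQuery("*:*");
      // Any mechanism that puts the parameter into the Solr request works;
      // it does not have to be appended to the URL by hand.
      query.set("AuthenticatedUserName", "username@domain");
      QueryResponse rsp = server.query(query);
      System.out.println(rsp.getResults().getNumFound());
    }
  }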
Re: Using Active Directory
) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:254) at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:372) at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:98) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4584) at org.apache.catalina.core.StandardContext$2.call(StandardContext.java:5262) at org.apache.catalina.core.StandardContext$2.call(StandardContext.java:5257) at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.ClassNotFoundException: org.apache.solr.mcf.ManifoldCFSearchComponent at Have I missed any steps? What else should I be doing for Solr integration? Thanks and Regards, Swapna. On Fri, Oct 14, 2011 at 2:53 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks a lot for the info Shinichiro Abe, I'll look into it. Thanks and Regards, Swapna. On Fri, Oct 14, 2011 at 2:21 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. If you can use ManifoldCF 0.4 trunk, you can use the solr integration components. Recently the plugin was added. Please see: http://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/solr/integration/README-3.x.txt You can get the results depending on user access tokens on the Solr side. curl 'http://localhost:8983/solr/select?q=*:*&AuthenticatedUserName=username@domain' Regards, Shinichiro Abe On 2011/10/14, at 16:39, Swapna Vuppala wrote: Hi Karl, Thanks for the reply. I built the jCIFS connector, registered it, created a repository connection of type Windows Share, and created a job using the Solr connection and the Windows share connection. I modified the Solr schema to include fields <field name="allow_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="allow_token_share" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_share" type="string" indexed="true" stored="true" multiValued="true"/> I set the stored attribute to true just for testing purposes. Now when I run the job, I see these tokens in the indexed data as expected. My next job would be to make the search from Solr secure. Do I have to make any changes on the Solr side to make use of these tokens and present only those docs to the user that he's entitled to see? Can you please direct me as to how to filter the search results depending upon the user's credentials? Thanks and Regards, Swapna. On Thu, Oct 13, 2011 at 1:22 PM, Karl Wright daddy...@gmail.com wrote: Hi, First, it is DOCUMENT access tokens that are sent to Solr, not user access tokens. You must therefore be crawling a repository that has some notion of security. The File System connector does not do that; you probably want to use the CIFS connector instead. Thanks, Karl On Thu, Oct 13, 2011 at 3:19 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I am trying to use an Active Directory authority connection to address Solr security. I created an Authority Connection of type Active Directory (the connection status shows Connection Working) and used it in creating a File System repository connection. Then, I created a job with Solr as the output connection and the above created repository connection.
As per my understanding (I might be totally wrong, please correct me if so), ManifoldCF now sends the user's access tokens along with the documents to be indexed to Solr. I should be able to see the access tokens in Solr's indexed data, either by extending the schema with fields <field name="allow_token_document" type="string" indexed="true" stored="true" multiValued="true"/> <field name="deny_token_document" type="string" indexed="true" stored="true" multiValued="true"/> or they come as some automatic fields that Solr creates, with the attr_ prefix, as specified at http://www.mail-archive.com/connectors-user@incubator.apache.org/msg00462.html But I am not able to see any access tokens with/without modifying the Solr schema. Have I missed configuring anything else, or how do I check if my Active Directory connection is working properly? I am using the ManifoldCF 0.3 version
Re: Trouble accessing mcf-api-service
Glad you worked it out! Thanks, Karl On Mon, Oct 10, 2011 at 5:36 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: I could resolve the issue by following the commands at http://incubator.apache.org/connectors/programmatic-operation.html#Control+by+Servlet+API Thanks and Regards, Swapna. On Mon, Oct 10, 2011 at 12:18 PM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi, I am able to access and use the ManifoldCF crawler (configured on my Windows machine on Tomcat). But I am not able to access http://localhost:8080/mcf-api-service/json. (I was trying to follow the last section at http://www.searchworkings.org/blog/-/blogs/344989). It says {error:Unrecognized resource.} I copied all the war files (mcf-api-service.war, mcf-authority-service.war, mcf-crawler-ui.war) to the webapps folder of my Tomcat installation folder (C:\Program Files\Apache Software Foundation\Tomcat 7.0\webapps) and configured properties.xml to use the PostgreSQL database. Can you please help me resolve this issue? Thanks and Regards, Swapna.
Re: MCF 0.3 - WebCrawlerConnector - Ingestion Problems
Hi Tobias, Sorry for the delay. There are a number of reasons a document can be rejected for indexing. They are: (1) URL criteria, as specified in the Web job's specification information (2) Maximum document length, as controlled by the output connection (you never told us what that was) (3) Mime type criteria, as controlled by the output connection So I bet this is a mime type issue. What content-type does the page have? What output connector are you using? Karl On Thu, Oct 6, 2011 at 7:18 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey guys, I try to crawl a website generated with a Mediawiki extension and always get the message: “[WebcrawlerConnector.java:1312] - WEB: Decided not to ingest 'http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0' because it did not match ingestability criteria” Seed url: 'http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0' Inclusions (crawl and index): .* Exclusions: none Other sites are crawled without problems, so I’m wondering what those ingestability criteria exactly are. Best regards, Tobias
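A quick way to answer Karl's content-type question from the command line is a HEAD request against the rejected page (URL taken from the thread); the Content-Type response header is what gets compared against the output connection's mime-type criteria:

  curl -I "http://wiki.host/index.php?title=Spezial%3AAlle+Seiten&from=p&to=s&namespace=0"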
Re: config files for ManifoldCF
Did you remember to start the agents process? Karl On Wed, Oct 5, 2011 at 5:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, I installed the PostgreSQL database and changed properties.xml accordingly, used executecommand.bat to initialize the database, install the schema, and register the solr, filesystem and active directory connectors, and ran the agents process. I am able to access the crawler UI at http://localhost:8080/mcf-crawler-ui and define a Solr output connection, a file system repository connection, and also a job. But my problem is that when I run the job, the status shows Starting up and does not change after that. Connection status for the Solr connection shows Connection working. I see nothing in manifoldcf.log. Can you please direct me as to where to look for any errors or how to resolve this? Thanks and Regards, Swapna. On Tue, Oct 4, 2011 at 3:38 PM, Karl Wright daddy...@gmail.com wrote: [earlier messages snipped - they appear in full in the following threads]
Re: config files for ManifoldCF
I worked with Swapna directly to resolve this. Turned out he'd been using ^C to kill the agents process, so a LockClean procedure fixed it. Karl On Wed, Oct 5, 2011 at 6:37 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: I used this property: <property name="org.apache.manifoldcf.synchdirectory" value="c:/mysynchdir"/> Thanks and Regards, Swapna. On Wed, Oct 5, 2011 at 4:05 PM, Karl Wright daddy...@gmail.com wrote: What do you have set for your synch directory? Karl On Wed, Oct 5, 2011 at 6:09 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Yes, I used executecommand.bat org.apache.manifoldcf.agents.AgentRun Thanks and Regards, Swapna. On Wed, Oct 5, 2011 at 3:37 PM, Karl Wright daddy...@gmail.com wrote: Did you remember to start the agents process? Karl On Wed, Oct 5, 2011 at 5:47 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: [earlier messages snipped - they appear in full in the previous thread]
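The LockClean procedure mentioned above is itself just another of the same command-style invocations, run while all MCF processes are stopped (class name assumed from the 0.x command set; check the how-to page if it has moved):

  executecommand.bat org.apache.manifoldcf.core.LockClean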
Re: config files for ManifoldCF
How you add the -D switch for tomcat depends on what platform you are running tomcat on. On Windows, there is an application that allows you to add commands to the java invocation. On linux, the /etc/init.d/tomcat script allows you to set options - depending on version, you can even put these in a directory that the script scrapes to put them together. As for what else you need: - a properties.xml file that specifies a synch directory - you will need to initialize the database, register the crawler agent, and register the connectors using commands, as described in how-to-build-and-deploy - You'll need to run the agents process, and any of the sidecar processes needed by the connectors you have registered. There are scripts for all of these, which require you to set MCF_HOME and JAVA_HOME environment variables first. Karl On Tue, Oct 4, 2011 at 4:15 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Thanks Karl and Piergiorgio, I need one more clarification, but that's regarding deploying ManifoldCF on Tomcat. I have built ManifoldCF 0.3 and have been running it so far on Jetty, and everything works fine. But now I want to use Tomcat instead of Jetty. I tried the instructions at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. I already have Tomcat installed on my machine. So I copied the war files (mfc-api-service, mfc-authority-service, crawler-ui) into Tomcat's webapps directory, and copied all contents of the dist directory of manifoldcf into a separate directory (which I set as the MFC_HOME environment variable). Now I am trying to access the crawler UI at http://localhost:8080/mcf-crawler-ui/ But I get the exception org.apache.jasper.JasperException: javax.servlet.ServletException: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Initialization failed: Could not read configuration file 'C:\lcf\properties.xml' I understand that the property org.apache.manifoldcf.configfile is not set. How do I set this, and what else do I have to do for proper and complete deployment on Tomcat? Thanks a lot in advance, Swapna. On Mon, Oct 3, 2011 at 5:13 AM, Karl Wright daddy...@gmail.com wrote: Hi Swapna, To clarify Piergiorgio's answer a little, ManifoldCF uses a properties.xml file for its basic configuration information. However, everything else is kept in the database. That includes connection definitions and job definitions. I recommend that you start by using the Quick-Start example, which uses an embedded Apache Derby database instance by default. You can change this later, of course. For real work we recommend PostgreSQL. You can find more information at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. Have a look at the quick-start instructions. Karl On Sun, Oct 2, 2011 at 1:43 PM, Piergiorgio Lucidi piergior...@apache.org wrote: Hi Swapna, 2011/10/2 Swapna Vuppala swapna.kollip...@gmail.com Hi, I am new to using ManifoldCF and I have got a couple of doubts about using it. I am interested in knowing what config files are used in ManifoldCF, where they are located, and how they are used. Also, I was wondering where all the information about output connection definitions, repository definitions and job definitions, defined by a user using the crawler UI, is stored. The only config file is properties.xml, for which you need to add a new JVM parameter: -Dorg.apache.manifoldcf.configfile=<configuration file path> This is only needed if you are deploying ManifoldCF in an application server. Otherwise you can leave properties.xml in your user home/lcf folder.
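As a concrete sketch of the two pieces Karl lists (paths and values below are placeholders, and the synchdirectory property name should be checked against the how-to-build-and-deploy page), a minimal properties.xml that names a synch directory might look like:

  <?xml version="1.0" encoding="UTF-8" ?>
  <configuration>
    <!-- directory used to synchronize the web applications with the agents process -->
    <property name="org.apache.manifoldcf.synchdirectory" value="/var/manifoldcf/synch"/>
  </configuration>

and on Linux one common way to hand Tomcat the -D switch is a bin/setenv.sh file, which catalina.sh picks up automatically:

  # $CATALINA_HOME/bin/setenv.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Dorg.apache.manifoldcf.configfile=/etc/manifoldcf/properties.xml"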
Re: config files for ManifoldCF
In the Apache Tomcat group of programs, there's a Configure Tomcat application. Click on that, and you will find within it the ability to add switches to your Tomcat instance. Since you've never used Java before, I'm afraid this group is not likely to be your best resource. You might try googling for some web resources that will likely help you more than I can in that regard. Karl On Tue, Oct 4, 2011 at 6:34 AM, Swapna Vuppala swapna.kollip...@gmail.com wrote: Hi Karl, Thanks for the quick response. I am running Tomcat on Windows, sorry that I didn't mention it earlier. How do I do this on Windows? Also, this is the first time I am working in a Java environment, and my questions may look too trivial. Please bear with me. Thanks and Regards, Swapna.
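For the record, the Windows route Karl describes: the Configure Tomcat application has a Java tab with a Java Options box, and each -D switch goes there on its own line, e.g. (the path is a placeholder):

  -Dorg.apache.manifoldcf.configfile=C:\manifoldcf\properties.xml

After adding the option, restart the Tomcat service so the new JVM argument takes effect.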
Re: Not able to create sharepoint connection
You might be interested to know that we now prebuild the SharePoint MCPermissions web service, and will be including this as part of the release for ManifoldCF 0.4-incubating. You can pick it up on trunk now; it's delivered (along with installation instructions and .bat scripts) under dist/sharepoint-integration when you build. Karl On Wed, Jul 20, 2011 at 6:12 AM, Karl Wright daddy...@gmail.com wrote: Hi Pravin, The .NET piece of the SharePoint connector is necessary only if you select the SharePoint 3.0 radio button when you set up your SharePoint repository connection. If you don't want to build and deploy the MCPermissions plugin on the server-side SharePoint, you can certainly avoid that by simply selecting the SharePoint 2.0 radio button instead. However, if you choose to do it this way, then SharePoint's file and folder permissions will not be accessible to the connector. (These were added in 3.0, but Microsoft overlooked the web service methods that would allow external access to them.) It sounds like, with this change, you were probably successful in connecting to the second system you tried. For the first system, can you tell me a bit more about it? For example, what version of SharePoint was it? And, did you follow the instructions and use your browser to determine the proper connection URL? I'm actually very interested to talk with someone who has access to a functioning SharePoint setup, since I lost access to my SharePoint testbed a while ago and would dearly love to bring that connector up to snuff for SharePoint 2010. I'm hoping Microsoft actually corrected the missing feature that required the MCPermissions plugin, for instance. Please let me know if you are willing to experiment a bit to help us with this connector. Thanks! Karl On Wed, Jul 20, 2011 at 5:20 AM, Pravin Agrawal pravin_agra...@persistent.co.in wrote: Hi All, I was trying out the SharePoint repository connector provided by ManifoldCF; following are the steps that I carried out to get it up and running. For building and deploying ManifoldCF, I followed the procedure given at http://incubator.apache.org/connectors/how-to-build-and-deploy.html I have built the connector on my Linux machine and deployed the application using the quick-start process. The question that came to me during the build process was: are .NET and MS Visual Studio absolutely necessary for building the SharePoint connector, or is it sufficient to provide only those 5 WSDL files mentioned in the guide? And does that step work on Windows only? I am able to see the ManifoldCF UI in my browser by simply deploying the quick-start version along with the SharePoint repository connection. I started to create a SharePoint repository connection to crawl one of our SharePoint sites, and following are the problems I encountered while creating it: 1. After creating the SharePoint repository connection, the status shows me a message as follows: Connection status: The site at http://portal.mydomain.co.in/sites/Documents/Forms did not exist 2. With a different SharePoint site, the connection status shows me the following message: Connection status: ManifoldCF's MCPermissions web service may not be installed on the target SharePoint server. MCPermissions service is needed for SharePoint repositories version 3.0 or higher, to allow access to security information for files and folders. Consult your system administrator. Can anyone tell me whether I am missing some steps? Thanks in advance.
-Regards, Pravin
Re: config files for ManifoldCF
Hi Swapna, To clarify Piergiorgio's answer a little, ManifoldCF uses a properties.xml file for its basic configuration information. However, everything else is kept in the database. That includes connection definitions and job definitions. I recommend that you start by using the Quick-Start example, which uses an embedded Apache Derby database instance by default. You can change this later, of course. For real work we recommend PostgreSQL. You can find more information at http://incubator.apache.org/connectors/how-to-build-and-deploy.html. Have a look at the quick-start instructions. Karl On Sun, Oct 2, 2011 at 1:43 PM, Piergiorgio Lucidi piergior...@apache.org wrote: Hi Swapna, 2011/10/2 Swapna Vuppala swapna.kollip...@gmail.com Hi, I am new to using ManifoldCF and I have got a couple of doubts about using it. I am interested in knowing what config files are used in ManifoldCF, where they are located, and how they are used. Also, I was wondering where all the information about output connection definitions, repository definitions and job definitions, defined by a user using the crawler UI, is stored. The only config file is properties.xml, which you point to by adding a new JVM parameter: -Dorg.apache.manifoldcf.configfile=<configuration file path> This is only needed if you are deploying ManifoldCF in an application server; otherwise you can leave properties.xml in your user home/lcf folder. You can find an example of the properties.xml file in the dist/example folder of the distribution bundle. All the information managed by the crawler UI is stored in a database, HSQL by default, but you can configure a PostgreSQL DBMS by changing the properties.xml file. For more information about all the parameters you can visit the following page: http://incubator.apache.org/connectors/how-to-build-and-deploy.html#The+ManifoldCF+configuration+file Hope this helps. Piergiorgio Can you please help me in clarifying these doubts? Thanks and Regards, Swapna. -- Piergiorgio Lucidi http://about.me/piergiorgiolucidi
Re: Indexing Wikipedia/MediaWiki
This looked easy enough that I just went ahead and implemented it. If you check out trunk, and add site map document URLs to the Feed URLs tab for an RSS job, it should locate the documents the sitemap points at. Furthermore, it should not chase links within those documents unless the documents are also site map documents or RSS feeds in their own right. Karl On Fri, Sep 16, 2011 at 5:31 AM, Karl Wright daddy...@gmail.com wrote: It might be worth exploring sitemaps. http://en.wikipedia.org/wiki/Site_map It may be possible to create a connector, much like the RSS connector, that you can point at a site map and it would just pick up the pages. In fact, I think it would be straightforward to modify the RSS connector to understand the sitemap format. If you can do a little research to figure out whether this might work for you, I'd be willing to do some work and try to implement it. Karl On Fri, Sep 16, 2011 at 3:53 AM, Wunderlich, Tobias tobias.wunderl...@igd-r.fraunhofer.de wrote: Hey folks, I am currently working on a project to create a basic search platform using Solr and ManifoldCF. One of the content repositories I need to index is a wiki (MediaWiki), and that's where I ran into a wall. I tried using the web connector, but simply crawling the sites resulted in a lot of content I don't need (navigation links, ...) and not all the information I wanted was gathered (author, last modified, ...). The only metadata I got was the metadata included in head/meta, which wasn't relevant. Is there another way to get the wiki's data, and more importantly, is there a way to get the right data into the right field? I know that there is a way to export the wiki sites as XML with wiki syntax, but I don't know how that would help me. I could simply use Solr's DataImportHandler to index a complete wiki dump, but it would be nice to use the same framework for every repository, especially since Manifold manages all the recrawling. Does anybody have some experience in this direction, or any idea for a solution? Thanks in advance, Tobias
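For anyone trying this: the sitemap format in question is presumably the standard sitemaps.org XML, so a minimal file that the Feed URLs tab of an RSS job could point at would look like the following (URLs and dates are placeholders):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>http://wiki.example.com/index.php/Main_Page</loc>
      <lastmod>2011-09-16</lastmod>
    </url>
    <url>
      <loc>http://wiki.example.com/index.php/Some_Article</loc>
    </url>
  </urlset>

MediaWiki can generate such files itself via its maintenance/generateSitemap.php script, which makes it a natural fit for this approach.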
Re: Last-modified from Web crawler
If I recall, the Solr output connector has a tab that will let you map incoming metadata to whatever Solr field name you want. It's called the Solr Field Mapping tab, and you set it on each job that indexes to a Solr output connection. Give it a try and see if it works for you. Karl On Wed, Aug 24, 2011 at 4:38 AM, Jan Høydahl jan@cominvent.com wrote: Wow, that was quick :) So, how can we now configure it so that Last-Modified is sent to the Solr output connector as e.g. literal.last_modified? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 22. aug. 2011, at 17.09, Jan Høydahl wrote: CONNECTORS-243 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 22. aug. 2011, at 16.38, Karl Wright wrote: It would have to be sent as a metadata field. This should not be difficult to implement. Can you create a JIRA ticket for it please? Thanks, Karl On Mon, Aug 22, 2011 at 10:35 AM, Jan Høydahl jan@cominvent.com wrote: Hi, How can we have the Web connector send the last-modified value from a page's HTTP header to the output connector? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com
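To make the mapping concrete: with Solr's extracting request handler, a mapped metadata value ends up on the request as a literal parameter, so mapping Last-Modified to last_modified is roughly equivalent to the following hand-built request (URL, core layout and field names are placeholders):

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&literal.last_modified=2011-08-24T08:38:00Z" \
    -F "myfile=@page.html"

The target field (last_modified here) must of course exist in your Solr schema, or be covered by a dynamic field.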
Re: Setting heapsize of agent
Good point. We should probably have an environment variable or script parameter for this. Would you like to create a ticket? Karl On Fri, Jun 24, 2011 at 1:51 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. When using ./executecommand.sh org.apache.manifoldcf.agents.AgentRun, where do we set the JVM heap size of the agent (-Xms1024m -Xmx1024m)? We cannot use files in the processes/define folder, since those only add -D switches. Regards, Shinichiro Abe
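Until such a parameter exists, the pragmatic workaround is to edit the java invocation inside executecommand.sh directly; something along these lines (the exact invocation line varies by version, so treat this purely as a sketch):

  # inside executecommand.sh: add heap options ahead of the existing -D switches
  "$JAVA_HOME/bin/java" -Xms1024m -Xmx1024m -Dorg.apache.manifoldcf.configfile="$MCF_HOME/properties.xml" org.apache.manifoldcf.agents.AgentRun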
Travel assistance, ApacheCon NA 2011
The Apache Software Foundation (ASF)'s Travel Assistance Committee (TAC) is now accepting applications for ApacheCon North America 2011, 7-11 November in Vancouver BC, Canada. The TAC is seeking individuals from the Apache community at-large --users, developers, educators, students, Committers, and Members-- who would like to attend ApacheCon, but need some financial support in order to be able to get there. There are limited places available, and all applicants will be scored on their individual merit. Financial assistance is available to cover flights/trains, accommodation and entrance fees either in part or in full, depending on circumstances. However, the support available for those attending only the BarCamp (7-8 November) is less than that for those attending the entire event (Conference + BarCamp 7-11 November). The Travel Assistance Committee aims to support all official ASF events, including cross-project activities; as such, it may be prudent for those in Asia and Europe to wait for an event geographically closer to them. More information can be found at http://www.apache.org/travel/index.html including a link to the online application and detailed instructions for submitting. Applications will close on 8 July 2011 at 22:00 BST (UTC/GMT +1). We wish good luck to all those who will apply, and thank you in advance for tweeting, blogging, and otherwise spreading the word. Regards, The Travel Assistance Committee
ManifoldCF now officially requires Java 1.5
Hi everyone, I've checked in changes that move ManifoldCF from mostly the Java 1.4 world into the Java 1.5 world. This should introduce no compilation errors in user connector code, but most people will need to do a clean recompile to get a working system again. Please let me know ASAP if anyone finds any problems. Thanks! Karl
My ManifoldCF talk has been accepted for ApacheCon North America 2011 in Vancouver
I'll be giving a 45-minute introductory talk in Vancouver at ApacheCon North America, some time between November 9 and November 11, 2011. If anyone has any particular detail or issue they would like to see in the talk, I'd be happy to entertain your suggestion. Please let me know. Karl
Re: Re-sending docs to output connector
More thoughts: Including this functionality as a general feature of ManifoldCF would allow one to use ManifoldCF as a repository of content in its own right. In this model, the data would probably be keyed by the output connection name, and if integrated at this level, in theory this would work with any output connection. The UI modifications would be modest and would consist of additional buttons on the output connection view page to re-feed documents to the connection rather than recrawl. Advantages: it would leverage multiple output connectors transparently, and would support the refeed-everything-to-Solr model. Guaranteed commit on the part of a target search engine would no longer be a requirement. Downsides: First, lots of storage would be required that probably can't live in PostgreSQL, complicating the deployment model. Second, depending on the details of implementation, there may not be feedback available at crawl time from the output connection about the acceptability of a document for indexing. Third, for many repository connectors the benefit of reading from the file system might well be zero. Fourth, the entire process of keeping the target repository managed properly is a manual one, and thus prone to errors. Karl
Re: Re-sending docs to output connector
I've been thinking about this further. First, it seems clear to me that both Solr AND ManifoldCF would need access to the document cache. If the cache lives under ManifoldCF, I cannot see a good way towards a Solr integration that works the way I'd hope it would. Furthermore, the cache is not needed by many (or even most) ManifoldCF targets, so adding this as a general feature of ManifoldCF doesn't make sense to me. On the other hand, while Solr can certainly use this facility, I can well imagine other situations where it would be very useful as well. So I am now leaning towards having a wholly separate service which functions as both a cache and a transaction log. A ManifoldCF output connector would communicate with the service, and Solr also would - or, rather, some automatic Solr-specific push process would query for changes between a specified time range and push those into Solr. Other such processes would be possible too. The list of moving parts would therefore be:
- a configuration file containing details on how to communicate with Solr
- a stand-alone web application which accepts documents and metadata via HTTP, and can also respond to HTTP transaction log queries and commands
- a number of command classes (processes) which provide a means of pushing the transaction log contents into Solr, using the HTTP API mentioned above.
I'd be interested in working on the development of such a widget, but I probably wouldn't have the serious time necessary to do much until July 1 given my current schedule. Anybody else interested in collaborating? Other thoughts? Karl On Tue, May 24, 2011 at 7:28 PM, Karl Wright daddy...@gmail.com wrote: The only requirement you may have overlooked is the requirement that Solr be able to take advantage of the item cache automatically if it happens to be restarted in the middle of an indexing pass. If you think about it, you will realize that this cannot be done externally to Solr, unless Solr learns how to pull documents from the item cache, and keeps track somehow of the last item/operation it successfully committed. That's why I proposed putting the whole cache under Solr auspices. Deletions also would need to be enumerated in the cache, so it would not really be a cache but more like a transaction log. But I agree that the right place for such a transaction log is effectively between MCF and Solr. Obviously the cache would also need to be disk-based, or once again guaranteed delivery would not be possible. Compression might be useful, as would be checkpoints in case the data got large. This is very database-like, so CouchDB might be a reasonable way to do it, especially if this code is considered to be part of Solr. If part of ManifoldCF, we should try to see if PostgreSQL would suffice, since it will likely be already installed and ready to go. Karl On Tue, May 24, 2011 at 5:01 PM, Jan Høydahl jan@cominvent.com wrote: The Refetch all ingested documents button works, but with Web crawling the problem is that it will take almost as long as a new crawl to re-feed. The solutions could be:
A) Add a stand-alone cache in front of Solr
B) Add a caching proxy in front of MCF - will allow speedy re-crawl (but clunky to administer)
C) Extend MCF with an optional item cache. This could allow a refeed from cache button somewhere...
The cache in C could be realized externally to MCF, e.g. as a CouchDB cluster. To enable it, you'd add the CouchDB access info to properties.xml. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Re-sending docs to output connector
On a refeed from cache request, send all objects to Solr - this should probably be per Job, not per output connector This is where your proposal gets in trouble, I think. There is no infrastructure mechanism in ManifoldCF to do either of these things at this time. Connections are not aware of what jobs are using them, and there is no way to send a signal to a connector to tell it to refeed, nor is there a button in the crawler UI for it. You're basically proposing significant infrastructure changes in ManifoldCF to support a missing feature in Solr, it seems to me. Also, I'm pretty sure we want to try to solve the guaranteed delivery problem using the same mechanism, whatever it turns out to be. The problems are almost identical, and the overhead of having two independent solutions for the same issue is very high. So let us try to make this work for both cases. Karl On Wed, May 25, 2011 at 9:55 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Definitely, Solr also needs some sort of guaranteed delivery mechanism, but it's probably not the same thing as this cache; I imagine more like a message queue or callback mechanism. But that's a separate discussion all together :) So if we don't shoot for a 100% solution, but try to solve the need to re-feed a bunch of documents from MCF really quickly after some schema change or other processing change on the output (may be any output really), then we'd have a simpler case: not a standalone server but a lightweight library (jar) which knows how to talk to a persistent object store (CouchDB), supporting simple put(), get(), delete() operations as well as querying for objects within time stamps etc. An output connector that wishes to support caching could then inject calls to this library in all the places it talks with Solr:
* On add: put() the object into the cache along with a timestamp for sequence, then send the doc directly to Solr
* On delete: delete the document from the cache, then add a delete meta object with timestamp as a transaction-log feature, then delete from Solr
* On a refeed from cache request, send all objects to Solr - this should probably be per Job, not per output connector
* A refeed from cache since timestamp X request would be useful after Solr downtime.
The command would use the cache as a transaction log. The cache will always be a mirror of what the output (Solr) SHOULD look like, thus it would also be possible to support a consistency check feature, in which we compare all IDs from the cache with all IDs in Solr, and if they are not equal, get back in sync. Doing this as a lightweight library would then provide a tool for programmers of other clients. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
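For concreteness, the lightweight library Jan describes might amount to an interface like the following - a sketch with invented names, since the thread never pins down an API beyond put()/get()/delete() plus time-range queries:

  import java.util.Iterator;
  import java.util.Map;

  // Hypothetical cache / transaction-log API for the design discussed above.
  public interface DocumentTransactionLog {
    // Record an add or replace; returns the entry's log timestamp.
    long put(String documentId, byte[] content, Map<String,String> metadata);
    // Record a deletion as a transaction-log entry of its own.
    long delete(String documentId);
    // Latest cached copy of a document, or null if absent or deleted.
    CachedDocument get(String documentId);
    // Replay every entry with timestamp in [fromTime, toTime), in order;
    // this is what a "refeed since timestamp X" command would iterate over.
    Iterator<LogEntry> replay(long fromTime, long toTime);

    // Placeholder value types for the sketch.
    class CachedDocument { public byte[] content; public Map<String,String> metadata; }
    class LogEntry { public long timestamp; public String documentId; public boolean isDelete; public CachedDocument document; }
  }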
Re: Re-sending docs to output connector
ManifoldCF is designed to deal with the problem of repeated or continuous crawling, doing only what is needed on subsequent crawls. It is thus a true incremental crawler. But in order for this to work for you, you need to let ManifoldCF do its job of keeping track of what documents (and what document versions) have been handed to the output connection. For the situation where you change something in Solr, the ManifoldCF solution is the refetch all ingested documents button in the Crawler UI. This is on the view page for the output connection. Clicking that button will cause ManifoldCF to re-index all documents - but it will also require ManifoldCF to recrawl them, because ManifoldCF does not keep copies of the documents it crawls anywhere. If you need to avoid recrawling at all costs when you change Solr configurations, you may well need to put some sort of software of your own devising between ManifoldCF and Solr. You basically want to develop a content repository which ManifoldCF outputs to, and which can be scanned to send documents to your Solr instance. I actually proposed this design for a Solr guaranteed delivery mechanism, because until Solr commits a document it can still be lost if the Solr instance is shut down. Clearly something like this is needed, and it would also likely solve your problem too. The main issue, though, is that it would need to be integrated with Solr itself, because you'd really want it to pick up where it left off if Solr is cycled etc. In my opinion this functionality really can't function as part of ManifoldCF for that reason. Karl On Tue, May 24, 2011 at 8:57 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Is there an easy way to separate fetching from ingestion? I'd like to first run a crawl for several days, and then feed it to my Solr output as fast as possible. Also, after schema changes in Solr, there is a need to re-feed all docs. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com
Re: Treatment of protected files
This should be enough. I'll open a ticket. The changes to the Solr connector are trivial; I can do them and check them in, if someone is willing to try it out for real. Karl On Thu, May 19, 2011 at 6:11 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Here's what I found in my simple history logs: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types So, yes, Tika exceptions are stored in the MCF logs, so I guess it should be possible to find a workaround for this. Erlend On 19.05.11 12.00, Karl Wright wrote: There was a Solr ticket created, I believe by Shinichiro. The question is whether the Solr 500 response has anything in its body that could help ManifoldCF recognize a Tika exception. If not, there is little the Solr connector can do to detect this case. The problem is that you need to look in the Simple History to see what the response actually is, and I don't think Shinichiro did that. Karl On Thu, May 19, 2011 at 4:42 AM, Erlend Garåsen e.f.gara...@usit.uio.no wrote: Do we have an MCF ticket for this issue yet? Or is it rather a Solr issue? I agree with Karl. We should look for a TikaException and then tell MCF to skip the affected documents. But maybe this should just be a temporary fix until it has been fixed in Solr Cell. Exactly the same happens if Tika cannot parse a document type which it does not support. Solr/Solr Cell returns a 500 server error, causing MCF to retry over and over again: [2011-05-18 17:39:34.104] [] webapp=/solr path=/update/extract params={literal.id=http://foreninger.uio.no/akademikerne/Tillitsvalgte_i_akademikerforeninger_files/themedata.thmx} status=500 QTime=5 [2011-05-18 17:39:39.102] {} 0 4 [2011-05-18 17:39:39.103] org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-418: RuntimeException while getting content for thmx and xps file types And finally, the job just aborts: Exception tossed: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500 at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:630) Caused by: org.apache.manifoldcf.core.interfaces.ManifoldCFException: Ingestion HTTP error code 500 at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:1362) I guess I can find a workaround, since I have created my own ExtractingRequestHandler in order to support language detection etc., but I think MCF should act differently when the underlying cause is a TikaException. Erlend On 27.04.11 12.25, Karl Wright wrote: If I recall, it treats the 400 response as meaning this document should be skipped, and it treats the 500 response as meaning this document should be retried because I have absolutely no idea what happened. However, we could modify the code for the 500 response to look at the content of the response as well, and look for a string in it that would give us a clue, such as TikaException. If we see a TikaException, we could have it conclude this document should be skipped. That was what I was thinking. Karl On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. Thank you for your reply. It seems that Solr's ExtractingRequestHandler returns the same HTTP response (SERVER_ERROR, 500) any time an error occurs. I'll try to open a ticket for Solr.
Is it correct that MCF retries processing when it receives a 500-level response, but not a 400-level response? Thank you. Shinichiro Abe On 2011/04/27, at 14:45, Karl Wright wrote: So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text, I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are pdf
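A sketch of the 500-response detection Karl describes - names and constants here are invented for illustration, not the actual Solr connector API:

  // Classify an ingestion response: accepted, permanently skipped, or retried.
  public final class IngestResponseClassifier {
    public static final int ACCEPTED = 0; // indexed successfully
    public static final int SKIP = 1;     // permanent failure; do not retry
    public static final int RETRY = 2;    // transient failure; retry later

    public static int classify(int httpCode, String responseBody) {
      if (httpCode == 200) return ACCEPTED;
      // 4xx means Solr rejected this particular document; retrying won't help.
      if (httpCode >= 400 && httpCode < 500) return SKIP;
      // 5xx normally means "unknown server trouble", so retry - unless the body
      // reveals a Tika parse failure, which is permanent for this document.
      if (responseBody != null && responseBody.contains("TikaException")) return SKIP;
      return RETRY;
    }
  }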
Re: Which version of Solr have implements the Document Level Access Control
OK, if you try what I sent and it works, I will check it in. Karl On Thu, May 5, 2011 at 6:29 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I'm assuming that since this is a domain logon name, we don't need to add any escaping sequence; otherwise the OS would reject it during authentication. Yes, you are right, the user SID is needed if the user is not part of any group but still has access to a document. On Thu, May 5, 2011 at 6:23 PM, Karl Wright daddy...@gmail.com wrote: Thanks - we do need the user sid, so I will put that back. Also, I'd like to ask what you know about escaping the user name in this expression: String searchFilter = "(&(objectClass=user)(sAMAccountName=" + userName + "))"; It seems to me that there is probably some escaping needed, but I don't know what style. Do you think it is the same (C-style, with \ escape) as for the other case? Karl On Thu, May 5, 2011 at 6:20 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, String returnedAtts[] = {"tokenGroups"}; is ONLY returning the member groups:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-1-0
but String returnedAtts[] = {"tokenGroups","objectSid"}; is returning the member groups AND the SID for that user:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480 TOKEN:TEQA-DC:S-1-1-0
Since we are only interested in the member groups, tokenGroups is sufficient, but if you also need the user SID then you might keep the objectSid as well. Thanks Kadri On Thu, May 5, 2011 at 6:01 PM, Karl Wright daddy...@gmail.com wrote: I am curious about the following change, which does not seem correct:
//Specify the attributes to return
- String returnedAtts[] = {"tokenGroups","objectSid"};
+ String returnedAtts[] = {"tokenGroups"};
searchCtls.setReturningAttributes(returnedAtts);
Karl On Thu, May 5, 2011 at 5:36 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, The ActiveDirectoryAuthority.java is attached. I'm not sure about clicking Grant ASF License, or how to do that from Tortoise. But you have my consent for granting the ASF license. Thanks Kadri On Thu, May 5, 2011 at 5:28 PM, Karl Wright daddy...@gmail.com wrote: You may attach the whole ActiveDirectoryAuthority.java file to the ticket if you prefer. But you must click the Grant ASF License button. Karl On Thu, May 5, 2011 at 5:24 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, I'm using TortoiseSVN, and I'm new to SVN. Do you know how to do this with Tortoise? Otherwise, I can just send the source code directly to you. BTW, there are some changes in the ParseUser method also; you can see them all when you run the diff. Thanks Kadri
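To answer the escaping question for the archive: LDAP search filters use RFC 4515 hex escapes rather than C-style backslash escapes. A minimal helper in the style the connector could use (illustrative only; the committed fix may differ) is:

  // Escape a value for use inside an LDAP search filter (RFC 4515).
  // The five characters below are the only ones special in a filter value.
  public static String ldapEscape(String value) {
    StringBuilder sb = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '\\': sb.append("\\5c"); break;
        case '*':  sb.append("\\2a"); break;
        case '(':  sb.append("\\28"); break;
        case ')':  sb.append("\\29"); break;
        case '\0': sb.append("\\00"); break;
        default:   sb.append(c);
      }
    }
    return sb.toString();
  }

With this, the filter above becomes "(&(objectClass=user)(sAMAccountName=" + ldapEscape(userName) + "))".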
Re: Which version of Solr have implements the Document Level Access Control
It must mean we're somehow throwing an exception in the case where the user is missing. I bet I know why - the CN lookup is failing instead. I'll see if I can change it. Karl On Thu, May 5, 2011 at 6:43 PM, Kadri Atalay atalay.ka...@gmail.com wrote: It works; the only difference I see from the previous one is that if a domain is reachable, the USERNOTFOUND message makes a better indicator, and somehow we lost that.
C:\OPT>testauthority
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
Previous one:
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com;
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-32-544 TOKEN:TEQA-DC:S-1-5-32-555 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1124 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-512 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480 TOKEN:TEQA-DC:S-1-1-0
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=kata...@teqa.filetek.com;
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-5-32-545 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-513 TOKEN:TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1473 TOKEN:TEQA-DC:S-1-1-0
C:\OPT>curl http://localhost:8345/mcf-authority-service/UserACLs?username=katalay@fakedomain;
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
On Thu, May 5, 2011 at 6:29 PM, Karl Wright daddy...@gmail.com wrote: I've cleaned things up slightly to restore the objectSid, and also to fix an infinite loop if you have more than one comma in the escape expression. I've attached the file; can you see if it works? Thanks, Karl
Re: Which version of Solr have implements the Document Level Access Control
Try this. Karl
Re: Which version of Solr have implements the Document Level Access Control
I think yours was working because it was returning cn=null, cn=users, which was a result of the fact that cn was null and the expression was assembled using the + operator. When I separated the LDAP escape out, it caused a null pointer exception to be thrown instead. It should be fixed now. Karl On Thu, May 5, 2011 at 7:19 PM, Kadri Atalay atalay.ka...@gmail.com wrote: FYI, the file I sent you was returning USERNOTFOUND. Sent from my iPhone
Re: Which version of Solr implements the Document Level Access Control
I thought you were using the Quick Start, which does not have a sync directory. Karl On Tue, May 3, 2011 at 6:16 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Note: Did that, still didn't help, but deleting the contents of mysyncdir worked. On Tue, May 3, 2011 at 5:48 PM, Karl Wright daddy...@gmail.com wrote: Never seen that before. Do you have more than one instance running? Only one instance can run at a time or the database is unhappy. If that still doesn't seem to be the problem, try "ant clean" and then "ant build" again. It will clean out the existing database instance. Karl On Tue, May 3, 2011 at 5:34 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, You are right, somehow I still had the OLD 195 code.. Just got the latest, compiled, but this one doesn't start after the message "Configuration file successfully read". Any ideas? Thanks Kadri On Tue, May 3, 2011 at 3:12 PM, Karl Wright daddy...@gmail.com wrote: The latest CONNECTORS-195 branch code doesn't use sAMAccountName. It uses ObjectSid. Your schema has ObjectSid. The version of ActiveDirectoryAuthority in trunk looks up ObjectSid too. Indeed, the only change is the addition of the following: if (theGroups.size() == 0) return userNotFoundResponse; This CANNOT occur for an existing user, because all existing users must have at least one SID. And, if existing users returned the proper SIDs before, this should not change anything. So I cannot see how you could be getting the result you claim. Are you SURE you synched up the CONNECTORS-195 branch and built that? I have not checked this code into trunk yet. Karl On Tue, May 3, 2011 at 2:46 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, Got the latest one, built and tried, but same result.. In the meantime I took a look at my user account with an AD browser, and as you can see (attached) it does have a sAMAccountName attribute. BTW, do we have to use objectClass = user for the search filter? May need to check into this.. Thanks Kadri On Tue, May 3, 2011 at 1:16 PM, Karl Wright daddy...@gmail.com wrote: I tried locating details of DSID-031006E0 on MSDN, to no avail. Microsoft apparently doesn't document this error. But I asked around, and there are two potential avenues forward. Avenue 1: There is a Windows tool called LDP, which should allow you to browse AD's LDAP. What you would need to do is confirm that each user has a sAMAccountName attribute. If they *don't*, it is possible that the domain was not set up in compatibility mode, which means we'll need to find a different attribute to query against. Avenue 2: Just change the string sAMAccountName in the ActiveDirectoryAuthority.java class to uid, and try again. The uid attribute should exist on all AD installations after Windows 2000. Thanks, Karl On Tue, May 3, 2011 at 12:52 PM, Karl Wright daddy...@gmail.com wrote: I removed the object scope from the user lookup - it's worth another try. Care to synch up and run again? Karl On Tue, May 3, 2011 at 12:36 PM, Karl Wright daddy...@gmail.com wrote: As I feared, the new user-exists-check code is not correct in some way. Apparently we can't retrieve the attribute I'm looking for by this kind of query. The following website seems to have some suggestions as to how to do better, with downloadable samples, but I'm not going to be able to look at it in any detail until this evening.
http://www.techtalkz.com/windows-server-2003/424352-get-samaccountnames-all-users-active-directory-group.html Karl On Tue, May 3, 2011 at 12:12 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, Here is the first round of tests with CONNECTORS-195: Now we are getting all responses as TEQA-DC:DEAD_AUTHORITY.. even with valid users. Please take a look at the 2 bitmap files I have attached (they have the screenshots from the debug screens).

invalid user and invalid domain:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

invalid user and valid (full) domain name:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

valid user and valid domain (please see bitmap file katalay_ad...@teqa.bmp) - this name gets a similar error to the first fakeuser, even though it's a valid user:
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
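For anyone reproducing this thread's lookups, a self-contained JNDI search by sAMAccountName looks roughly like the sketch below. The host, base DN, credentials, and user name are placeholders, not values from this thread, and "simple" authentication is an assumption made to keep the example minimal.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class AdUserLookup {
  public static void main(String[] args) throws NamingException {
    Hashtable<String,String> env = new Hashtable<String,String>();
    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, "ldap://dc.example.com:389");   // placeholder DC
    env.put(Context.SECURITY_AUTHENTICATION, "simple");
    env.put(Context.SECURITY_PRINCIPAL, "admin@example.com");     // placeholder bind user
    env.put(Context.SECURITY_CREDENTIALS, "password");            // placeholder
    DirContext ctx = new InitialDirContext(env);

    SearchControls ctls = new SearchControls();
    ctls.setSearchScope(SearchControls.SUBTREE_SCOPE);
    ctls.setReturningAttributes(new String[]{"sAMAccountName"});

    // Look for one user object by its sAMAccountName
    NamingEnumeration<SearchResult> answer =
      ctx.search("DC=example,DC=com", "(&(objectClass=user)(sAMAccountName=jdoe))", ctls);
    System.out.println(answer.hasMore() ? "user found" : "user not found");
    ctx.close();
  }
}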
Re: Which version of Solr implements the Document Level Access Control
I went back over these emails. It appears that at no time have you actually received SIDs, either user or group, back from any Authority Connector inquiry:

response to actual domain account call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

I could have sworn that you had seen SIDs other than S-1-1-0 back for existing users on your setup, but I can find no evidence that was ever the case. Given that, it seems perfectly reasonable that the change in CONNECTORS-195 would convert ALL of these responses to USERNOTFOUND ones. Other recent users of the AD connector had no difficulty getting SIDs back, most notably Mr. Abe, who worked closely with me on getting the AD connector working with caching. The conclusion I have is that either your domain controller configuration or your connection credentials/credential permissions are incorrect. (I'd look carefully at the permissions of the account you are giving to the connection, because on the face of it that sounds most likely.) But the fix for non-existent users seems to be right nevertheless, so I will go ahead and commit to trunk. Thanks, Karl On Tue, May 3, 2011 at 7:38 PM, Karl Wright daddy...@gmail.com wrote: Ok, can you try the trunk code? If that works, I'll be shocked. I think something must have changed in your environment since you began this experiment. Karl On Tue, May 3, 2011 at 6:19 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Karl, This is the result from the latest 195 branch.. I'll run it in the debugger to see the actual error messages later on. Is there anyone else who can verify this code against their Active Directory? Thanks Kadri

C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeu...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_ad...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY
C:\OPT>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=kata...@teqa.filetek.com"
USERNOTFOUND:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

On Tue, May 3, 2011 at 5:48 PM, Karl Wright daddy...@gmail.com wrote: Never seen that before. Do you have more than one instance running? Only one instance can run at a time or the database is unhappy. If that still doesn't seem to be the problem, try "ant clean" and then "ant build" again. It will clean out the existing database instance. Karl On Tue, May 3, 2011 at 5:34 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, You are right, somehow I still had the OLD 195 code.. Just got the latest, compiled, but this one doesn't start after the message "Configuration file successfully read". Any ideas? Thanks Kadri On Tue, May 3, 2011 at 3:12 PM, Karl Wright daddy...@gmail.com wrote: The latest CONNECTORS-195 branch code doesn't use sAMAccountName. It uses ObjectSid. Your schema has ObjectSid. The version of ActiveDirectoryAuthority in trunk looks up ObjectSid too. Indeed, the only change is the addition of the following: if (theGroups.size() == 0) return userNotFoundResponse; This CANNOT occur for an existing user, because all existing users must have at least one SID.
And, if existing users returned the proper SIDs before, this should not change anything. So I cannot see how you could be getting the result you claim. Are you SURE you synched up the CONNECTORS-195 branch and built that? I have not checked this code into trunk yet. Karl On Tue, May 3, 2011 at 2:46 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, Got the latest one, built and tried, but same result.. In the meantime I took a look at my user account with an AD browser, and as you can see (attached) it does have a sAMAccountName attribute. BTW, do we have to use objectClass = user for the search filter? May need to check into this.. Thanks Kadri On Tue, May 3, 2011 at 1:16 PM, Karl Wright daddy...@gmail.com wrote: I tried locating details of DSID-031006E0 on MSDN, to no avail. Microsoft apparently doesn't document this error. But I asked around, and there are two potential avenues forward. Avenue 1: There is a Windows tool called LDP, which should allow you to browse AD's LDAP. What you would need to do is confirm that each user has a sAMAccountName attribute. If they *don't*, it is possible that the domain was not set up in compatibility mode, which means we'll need to find a different attribute
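The SID tokens in these transcripts (S-1-1-0, S-1-5-21-...) are decoded from the binary objectSid/tokenGroups attribute values. A sketch of the standard decoding follows, assuming the usual on-wire layout (revision byte, sub-authority count byte, 48-bit big-endian identifier authority, then little-endian 32-bit sub-authorities); it is illustrative and not copied from the connector's own implementation.

public class SidDecoder {
  // Decode a binary Windows SID into its "S-1-..." string form.
  public static String sidToString(byte[] sid) {
    StringBuilder sb = new StringBuilder("S-");
    sb.append(sid[0] & 0xFF);                  // revision
    int count = sid[1] & 0xFF;                 // number of sub-authorities
    long authority = 0;
    for (int i = 2; i < 8; i++)                // 48-bit big-endian authority
      authority = (authority << 8) | (sid[i] & 0xFFL);
    sb.append('-').append(authority);
    for (int i = 0; i < count; i++) {          // little-endian 32-bit sub-authorities
      long sub = 0;
      for (int j = 3; j >= 0; j--)
        sub = (sub << 8) | (sid[8 + 4*i + j] & 0xFFL);
      sb.append('-').append(sub);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // S-1-1-0 ("Everyone"): revision 1, one sub-authority, authority 1, sub-authority 0
    byte[] everyone = {1, 1, 0,0,0,0,0,1, 0,0,0,0};
    System.out.println(sidToString(everyone)); // prints S-1-1-0
  }
}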
Re: Which version of Solr implements the Document Level Access Control
NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. I see the logging trace; it looks like the ldap code is eating the exception and returning a blank list. This is explicitly NOT what is supposed to happen, nor did it happen on JDK 1.5, I am certain. You might find that this behavior has changed between Java releases. Also, what is the reason for adding the "everyone" group to each response? I added this in because the standard treatment of Active Directory 2000 and 2003 was to exclude the public ACL. Since all users have it, if the user exists (which was the case if the NameNotFound exception was not being thrown), it was always safe to add it in. If JDK xxx, which is eating the internal exception, gives back SOME signal that the user does not exist, we can certainly check for that. What signal do you recommend looking for, based on the trace? Is there any way to get at errEx PartialResultException (id=7962) from the NamingEnumeration answer? Karl On Mon, May 2, 2011 at 3:31 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I noticed in the code that the NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. (see below) Also, what is the reason for adding the "everyone" group to each response? theGroups.add("S-1-1-0"); When no groups or SIDs are returned, the following return code is still used.. return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK); Should I assume this code was tested against an Active Directory and working, or should I start from the beginning, checking every parameter that is entered? (see below) For example, in the following code, DIGEST-MD5 GSSAPI is used for security authentication, but the user name and password are passed as clear text.. and not in the format they suggest in their documentation.
Thanks Kadri
http://download.oracle.com/javase/jndi/tutorial/ldap/security/gssapi.html

if (ctx == null)
{
  // Calculate the ldap url first
  String ldapURL = "ldap://" + domainControllerName + ":389";
  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
  env.put(Context.SECURITY_AUTHENTICATION,"DIGEST-MD5 GSSAPI");
  env.put(Context.SECURITY_PRINCIPAL,userName);
  env.put(Context.SECURITY_CREDENTIALS,password);
  //connect to my domain controller
  env.put(Context.PROVIDER_URL,ldapURL);
  //specify attributes to be returned in binary format
  env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");

fakeuser@teqa

//Search for objects using the filter
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);

answer LdapSearchEnumeration (id=6635)
  cleaned false
  cont Continuation (id=6674)
  entries Vector<E> (id=6675)
  enumClnt LdapClient (id=6676)
    authenticateCalled true
    conn Connection (id=6906)
    isLdapv3 true
    pcb null
    pooled false
    referenceCount 1
    unsolicited Vector<E> (id=6907)
  errEx PartialResultException (id=6677)
    cause PartialResultException (id=6677)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

ArrayList theGroups = new ArrayList();
// All users get certain well-known groups
theGroups.add("S-1-1-0");

answer LdapSearchEnumeration (id=7940)
  cleaned false
  cont Continuation (id=7959)
  entries Vector<E> (id=7960)
  enumClnt LdapClient (id=7961)
  errEx PartialResultException (id=7962)
    cause PartialResultException (id=7962)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK);

On Tue, Apr 26, 2011 at 12:54 PM, Karl Wright daddy...@gmail.com wrote: If a completely unknown user still comes back as existing, then it's time to look at how your domain controller is configured. Specifically, what do you have it configured to trust? What version of Windows is this? The way LDAP tells you a user does not exist in Java is by an exception. So this statement: NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls); will throw the NameNotFoundException if the name doesn't exist, which the Active Directory connector then catches: catch (NameNotFoundException e) { // This means that the user doesn't exist return userNotFoundResponse; } Clearly this is not working at all for your setup. Maybe you can look at the DC's event logs, and see what kinds of decisions it is making here? It's not making much sense to me
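On the question of what signal to look for: with Sun's LDAP provider, a deferred referral failure like the one in this trace typically surfaces only when the enumeration is drained, so one hedged approach is to iterate fully and catch PartialResultException rather than treating an empty enumeration as a missing user. This is a sketch under that assumption; as the thread notes, the behavior can vary between Java releases.

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.PartialResultException;
import javax.naming.directory.SearchResult;

public class ReferralAwareCheck {
  // Returns true only if the search produced at least one entry; a deferred
  // referral error (e.g. "RefErr: DSID-031006E0") propagates instead of
  // being silently read as "no groups".
  static boolean hasEntries(NamingEnumeration<SearchResult> answer)
    throws NamingException {
    boolean found = false;
    try {
      while (answer.hasMore()) {   // hasMore() is where the provider may
        answer.next();             // throw the deferred PartialResultException
        found = true;
      }
    } catch (PartialResultException e) {
      throw e;                     // referral problem, not USERNOTFOUND
    }
    return found;
  }
}

The other common workaround for LDAP error code 10 is to set env.put(Context.REFERRAL, "follow") so that the provider chases the referral instead of deferring the failure.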
Re: Which version of Solr implements the Document Level Access Control
I opened a ticket, CONNECTORS-195, and added what I think is an explicit check for the existence of the user as a patch. Can you apply the patch and let me know if it seems to fix the problem? Thanks, Karl On Mon, May 2, 2011 at 3:51 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I see, thanks for the response. I'll look into it a little deeper before making a suggestion on how to check for this internal exception.. If the JDK 1.6 behavior is different from JDK 1.5 for LDAP, this may not be the only problem we may encounter.. Maybe any exception generated by the JDK during this request should be evaluated.. We'll see. Thanks. Kadri On Mon, May 2, 2011 at 3:44 PM, Karl Wright daddy...@gmail.com wrote: NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. I see the logging trace; it looks like the ldap code is eating the exception and returning a blank list. This is explicitly NOT what is supposed to happen, nor did it happen on JDK 1.5, I am certain. You might find that this behavior has changed between Java releases. Also, what is the reason for adding the "everyone" group to each response? I added this in because the standard treatment of Active Directory 2000 and 2003 was to exclude the public ACL. Since all users have it, if the user exists (which was the case if the NameNotFound exception was not being thrown), it was always safe to add it in. If JDK xxx, which is eating the internal exception, gives back SOME signal that the user does not exist, we can certainly check for that. What signal do you recommend looking for, based on the trace? Is there any way to get at errEx PartialResultException (id=7962) from the NamingEnumeration answer? Karl On Mon, May 2, 2011 at 3:31 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I noticed in the code that the NameNotFound exception is never being reached because the process is throwing an internal exception, and this is never checked. (see below) Also, what is the reason for adding the "everyone" group to each response? theGroups.add("S-1-1-0"); When no groups or SIDs are returned, the following return code is still used.. return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK); Should I assume this code was tested against an Active Directory and working, or should I start from the beginning, checking every parameter that is entered? (see below) For example, in the following code, DIGEST-MD5 GSSAPI is used for security authentication, but the user name and password are passed as clear text.. and not in the format they suggest in their documentation.
Thanks Kadri
http://download.oracle.com/javase/jndi/tutorial/ldap/security/gssapi.html

if (ctx == null)
{
  // Calculate the ldap url first
  String ldapURL = "ldap://" + domainControllerName + ":389";
  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,"com.sun.jndi.ldap.LdapCtxFactory");
  env.put(Context.SECURITY_AUTHENTICATION,"DIGEST-MD5 GSSAPI");
  env.put(Context.SECURITY_PRINCIPAL,userName);
  env.put(Context.SECURITY_CREDENTIALS,password);
  //connect to my domain controller
  env.put(Context.PROVIDER_URL,ldapURL);
  //specify attributes to be returned in binary format
  env.put("java.naming.ldap.attributes.binary","tokenGroups objectSid");

fakeuser@teqa

//Search for objects using the filter
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);

answer LdapSearchEnumeration (id=6635)
  cleaned false
  cont Continuation (id=6674)
  entries Vector<E> (id=6675)
  enumClnt LdapClient (id=6676)
    authenticateCalled true
    conn Connection (id=6906)
    isLdapv3 true
    pcb null
    pooled false
    referenceCount 1
    unsolicited Vector<E> (id=6907)
  errEx PartialResultException (id=6677)
    cause PartialResultException (id=6677)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

ArrayList theGroups = new ArrayList();
// All users get certain well-known groups
theGroups.add("S-1-1-0");

answer LdapSearchEnumeration (id=7940)
  cleaned false
  cont Continuation (id=7959)
  entries Vector<E> (id=7960)
  enumClnt LdapClient (id=7961)
  errEx PartialResultException (id=7962)
    cause PartialResultException (id=7962)
    detailMessage [LDAP: error code 10 - 202B: RefErr: DSID-031006E0, data 0, 1 access points\n\tref 1: 'teqa'\n

return new AuthorizationResponse(tokens,AuthorizationResponse.RESPONSE_OK);

On Tue, Apr 26, 2011 at 12:54 PM, Karl Wright daddy...@gmail.com wrote
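For context, a hedged sketch of the shape an explicit user-existence check can take; this illustrates the idea behind the CONNECTORS-195 patch, not the patch itself, and the scope and attribute choices here are assumptions (the thread above shows the scope choice was itself being experimented with).

import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class UserExistenceProbe {
  // Probe the computed search base for a user object before asking for
  // group SIDs; if nothing comes back, the caller can return its
  // userNotFoundResponse instead of an empty-but-"authorized" answer.
  static boolean userExists(DirContext ctx, String searchBase)
    throws NamingException {
    SearchControls ctls = new SearchControls();
    ctls.setSearchScope(SearchControls.SUBTREE_SCOPE);
    ctls.setReturningAttributes(new String[]{"objectSid"});
    NamingEnumeration<SearchResult> probe =
      ctx.search(searchBase, "(objectClass=user)", ctls);
    return probe.hasMore();
  }
}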
Re: Treatment of protected files
If I recall, it treats the 400 response as meaning "this document should be skipped", and it treats the 500 response as meaning "this document should be retried because I have absolutely no idea what happened". However, we could modify the code for the 500 response to look at the content of the response as well, and look for a string in it that would give us a clue, such as TikaException. If we see a TikaException, we could have it conclude that the document should be skipped. That was what I was thinking. Karl On Wed, Apr 27, 2011 at 6:00 AM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hi. Thank you for your reply. It seems that Solr's ExtractingRequestHandler responds with the same HTTP response (SERVER_ERROR (500)) any time an error occurs. I'll try to open a ticket for Solr. Is it correct that MCF retries crawling when it receives a 500-level response, but not a 400-level response? Thank you. Shinichiro Abe On 2011/04/27, at 14:45, Karl Wright wrote: So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are PDF and Office files that are protected by a read password. We do not need to read those files if we do not know their password. Now, an MCF job starts to crawl the filesystem repository and post to Solr. Document ingestion of non-protected files succeeds, but ingestion of a protected file does not; it fails repeatedly until the job is processed beyond the retry limit. During that time, a 500 result code is logged in the Simple History. (Solr throws a TikaException, caused by PDFBox or Apache POI, because it cannot read protected documents.) When I ran that test with continuous crawling, not with a simple one-time crawl, the job stopped halfway and logged the following: Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job tried to crawl those files many times. It seems that a job spends a lot of time and resources on protected files, so I want to find a way to skip reading those files quickly. In my survey: hop filters are not relevant (right?). Tika, PDFBox, and POI each have a mechanism to decrypt protected files, but each throws a different exception when given an invalid password. One idea, feasible or not: Solr could return a different result code when protected files are posted. Do you have any ideas? Regards, Shinichiro Abe
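A sketch of the retry decision Karl describes; handleIngestResult, SKIP_DOCUMENT, and RETRY_DOCUMENT are illustrative names, not the actual ManifoldCF output-connector API, and matching on the string "TikaException" in the response body is exactly the heuristic proposed above.

public class IngestRetryPolicy {
  static final int SKIP_DOCUMENT = 0;   // permanent: never retry this document
  static final int RETRY_DOCUMENT = 1;  // transient: retry later

  // 400 means the document itself is bad; 500 normally means "no idea,
  // retry" - unless the response body names a TikaException, in which
  // case the document (e.g. a password-protected PDF) is skipped.
  static int handleIngestResult(int httpCode, String responseBody) {
    if (httpCode == 400)
      return SKIP_DOCUMENT;
    if (httpCode == 500 && responseBody != null
        && responseBody.contains("TikaException"))
      return SKIP_DOCUMENT;
    return RETRY_DOCUMENT;
  }
}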
Re: Which version of Solr implements the Document Level Access Control
So you are trying to extend the example in the book, correct, to run against Active Directory and the JCIFS connector? And this is with Solr 3.1? The book was written for Solr 1.4.1, so it's entirely possible that something in Solr changed in relation to the way search components are used. So I think we're going to need to do some debugging. (1) First, to confirm sanity, try using curl against the mcf authority service. Try some combination of users to see how that works, e.g.:

curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe"

...and

curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe@fakedomain"

...and also the real domain name, whatever that is. See if the access tokens that come back look correct. If they don't, then we know where there's an issue. If they *are* correct, let me know and we'll go to the next stage, which would be to make sure the authority service is actually getting called and the proper query is being built and run under Solr 3.1. Thanks, Karl On Tue, Apr 26, 2011 at 11:59 AM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, I followed the instructions, and for testing purposes set stored="true" to be able to see the ACL values stored in Solr. But when I run the search in the following format, I get peculiar results: http://10.1.200.155:8080/solr/select/?q=*%3A*&AuthenticatedUserName=username Any user name without a domain name, i.e. AuthenticatedUserName=joe, does not return any results (which is correct). But any user name with ANY domain name, i.e. AuthenticatedUserName=joe@fakedomain, returns all the indexes (which is not correct). Any thoughts? Thanks Kadri On Sun, Apr 24, 2011 at 7:08 PM, Karl Wright daddy...@gmail.com wrote: Solr 3.1 is being clever here; it's seeing arguments coming in that do not correspond to known schema fields, and presuming they are automatic fields. So when the schema is unmodified, you see these fields that Solr creates for you, with the attr_ prefix. They are created as being stored, which is not good for access tokens since then you will see them in the response. I don't know if they are indexed or not, but I imagine not, which is also not good. So following the instructions is still the right thing to do, I would say. Karl On Fri, Apr 22, 2011 at 3:24 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Hi Karl, There is one thing I noticed while following the example in chapter 4: prior to making any changes to the schema.xml, I was able to see the following security information in query responses, e.g.:

<doc>
  <arr name="attr_allow_token_document">
    <str>TEQA-DC:S-1-3-0</str>
    <str>TEQA-DC:S-1-5-13</str>
    <str>TEQA-DC:S-1-5-18</str>
    <str>TEQA-DC:S-1-5-32-544</str>
    <str>TEQA-DC:S-1-5-32-545</str>
    <str>TEQA-DC:S-1-5-32-547</str>
  </arr>
  <arr name="attr_allow_token_share">
    <str>TEQA-DC:S-1-1-0</str>
    <str>TEQA-DC:S-1-5-2</str>
    <str>TEQA-DC:S-1-5-21-1212545812-2858578934-3563067286-1480</str>
  </arr>
  <arr name="attr_content">
    <str>Autonomy ODBC Fetch Technical Brief 0506 Technical Brief

But, after I modified the schema.xml and added the following fields,

<!-- Security fields -->
<field name="allow_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_document" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="allow_token_share" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="deny_token_share" type="string" indexed="true" stored="false" multiValued="true"/>

I no longer see either the attr_allow_token_document or the allow_token_document fields..
Since the same fields exist with the attr_ prefix, do we need to add these new field names to the schema file, or can we simply change ManifoldSecurity to use the attr_ fields? Also, when Solr is running under Tomcat, I have to restart the Solr app, or restart Tomcat, to see the newly added indexes.. Any thoughts? Thanks Kadri On Fri, Apr 22, 2011 at 12:53 PM, Karl Wright daddy...@gmail.com wrote: I don't believe Solr has yet officially released document access control, so you will need to use the patch for ticket 1895. Alternatively, the ManifoldCF in Action chapter 4 example has an implementation based on this ticket. You can get the code for it at https://manifoldcfinaction.googlecode.com/svn/trunk/edition_1/security_example. Thanks, Karl On Fri, Apr 22, 2011 at 11:45 AM, Kadri Atalay atalay.ka...@gmail.com wrote: Hello, Does anyone know which version of Solr implements the Document Level Access Control, or has it implemented (partially or fully)? Particularly issue #s 1834, 1872, 1895. Thanks Kadri
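For context on what the search component does with the tokens the authority service returns: it folds them into the query as mandatory allow clauses and prohibited deny clauses. A simplified Lucene 3.x-style sketch follows; the real chapter-4 code also handles documents that carry no security fields at all, which is omitted here, and the share-level fields would get a parallel clause.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class SecurityFilterSketch {
  // Visible documents must carry at least one of the user's allow tokens
  // and none of the deny tokens.
  static BooleanQuery buildDocumentClause(String[] userTokens) {
    BooleanQuery filter = new BooleanQuery();
    BooleanQuery allows = new BooleanQuery();
    for (String token : userTokens)
      allows.add(new TermQuery(new Term("allow_token_document", token)),
                 BooleanClause.Occur.SHOULD);
    filter.add(allows, BooleanClause.Occur.MUST);
    for (String token : userTokens)
      filter.add(new TermQuery(new Term("deny_token_document", token)),
                 BooleanClause.Occur.MUST_NOT);
    return filter;
  }
}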
Re: Which version of Solr implements the Document Level Access Control
If a completely unknown user still comes back as existing, then it's time to look at how your domain controller is configured. Specifically, what do you have it configured to trust? What version of Windows is this? The way LDAP tells you a user does not exist in Java is by an exception. So this statement: NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls); will throw the NameNotFoundException if the name doesn't exist, which the Active Directory connector then catches: catch (NameNotFoundException e) { // This means that the user doesn't exist return userNotFoundResponse; } Clearly this is not working at all for your setup. Maybe you can look at the DC's event logs, and see what kinds of decisions it is making here? It's not making much sense to me at this point. Karl On Tue, Apr 26, 2011 at 12:45 PM, Kadri Atalay atalay.ka...@gmail.com wrote: I get the same result with a user that doesn't exist:

C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=fakeuser@fakedomain"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

BTW, is there a command to get all users available in Active Directory from the mcf-authority service, or other test commands to see if it's working correctly? Also, I set the logging level to finest from the Solr Admin for ManifoldCFSecurityFilter, but I don't see any logs created.. Are there any other settings that need to be tweaked? Thanks Kadri On Tue, Apr 26, 2011 at 12:38 PM, Karl Wright daddy...@gmail.com wrote: One other quick note. You might want to try a user that doesn't exist and see what you get. It should be a USERNOTFOUND response. If that's indeed what you get back, then this is a relatively minor issue with Active Directory. Basically the S-1-1-0 SID is added by the active directory authority, so the DC is actually returning an empty list of SIDs for the user with an unknown domain. It *should* tell us the user doesn't exist, I agree, but that's clearly a problem only Active Directory can solve; we can't make that decision in the active directory connector because the DC may be just one node in a hierarchy. Perhaps there's a Microsoft knowledge-base article that would clarify things further. Please let me know what you find. Karl On Tue, Apr 26, 2011 at 12:27 PM, Karl Wright daddy...@gmail.com wrote: The method code from the Active Directory authority that handles the LDAP query construction is below. It looks perfectly reasonable to me:

/** Parse a user name into an ldap search base. */
protected static String parseUser(String userName)
  throws ManifoldCFException
{
  //String searchBase = "CN=Administrator,CN=Users,DC=qa-ad-76,DC=metacarta,DC=com";
  int index = userName.indexOf("@");
  if (index == -1)
    throw new ManifoldCFException("Username is in unexpected form (no @): '"+userName+"'");
  String userPart = userName.substring(0,index);
  String domainPart = userName.substring(index+1);
  // Start the search base assembly
  StringBuffer sb = new StringBuffer();
  sb.append("CN=").append(userPart).append(",CN=Users");
  int j = 0;
  while (true)
  {
    int k = domainPart.indexOf(".",j);
    if (k == -1)
    {
      sb.append(",DC=").append(domainPart.substring(j));
      break;
    }
    sb.append(",DC=").append(domainPart.substring(j,k));
    j = k+1;
  }
  return sb.toString();
}

So I have to conclude that your Active Directory domain controller is simply not caring what the DC= fields are, for some reason. No idea why.
If you want to confirm this picture, you might want to create a patch to add some Logging.authorityConnectors.debug statements at appropriate places so we can see the actual query it's sending to LDAP. I'm happy to commit this debug output patch eventually if you also want to create a ticket. Thanks, Karl On Tue, Apr 26, 2011 at 12:17 PM, Kadri Atalay atalay.ka...@gmail.com wrote: Yes, ManifoldCF is running with the JCIFS connector, and using Solr 3.1.

response to first call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe"
UNREACHABLEAUTHORITY:TEQA-DC TOKEN:TEQA-DC:DEAD_AUTHORITY

response to fake domain call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=joe@fakedomain"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

response to actual domain account call:
C:\OPT\security_example>curl "http://localhost:8345/mcf-authority-service/UserACLs?username=katalay_admin@teqa"
AUTHORIZED:TEQA-DC TOKEN:TEQA-DC:S-1-1-0

Looks like as long as there is a domain suffix, the return is positive.. Thanks Kadri On Tue, Apr 26, 2011 at 12:10 PM, Karl Wright daddy...@gmail.com wrote: So you are trying to extend the example in the book, correct, to run against Active Directory and the JCIFS connector
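A sketch of what the suggested debug statements might look like around the search call quoted above; Logging.authorityConnectors is the logger Karl names, but the exact placement and message text here are assumptions, not the committed patch.

// Inserted just before the ctx.search(...) call shown earlier.
if (Logging.authorityConnectors.isDebugEnabled())
{
  Logging.authorityConnectors.debug("AD user lookup: search base = '" + searchBase + "'");
  Logging.authorityConnectors.debug("AD user lookup: search filter = '" + searchFilter + "'");
}
NamingEnumeration answer = ctx.search(searchBase, searchFilter, searchCtls);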
Re: Treatment of protected files
So the 500 error is occurring because Solr is throwing an exception at indexing time, is that correct? If this is correct, then here's my take. (1) A 500 error is a nasty error that Solr should not be returning under normal conditions. (2) A password-protected PDF is not what I would consider exceptional, so Tika should not be throwing an exception when it sees it, merely (at worst) logging an error and continuing. However, having said that, output connectors in ManifoldCF can make the decision to never retry the document, by returning a certain status, provided the connector can figure out that the error warrants this treatment. My suggestion is therefore the following. First, we should open a ticket for Solr about this. Second, if you can see the error output from the Simple History for a TikaException being thrown in Solr, we can look for that text in the response from Solr and perhaps modify the Solr Connector to detect the case. If you could open a ManifoldCF ticket and include that text I'd be very grateful. Thanks! Karl On Tue, Apr 26, 2011 at 10:53 PM, Shinichiro Abe shinichiro.ab...@gmail.com wrote: Hello. There are PDF and Office files that are protected by a read password. We do not need to read those files if we do not know their password. Now, an MCF job starts to crawl the filesystem repository and post to Solr. Document ingestion of non-protected files succeeds, but ingestion of a protected file does not; it fails repeatedly until the job is processed beyond the retry limit. During that time, a 500 result code is logged in the Simple History. (Solr throws a TikaException, caused by PDFBox or Apache POI, because it cannot read protected documents.) When I ran that test with continuous crawling, not with a simple one-time crawl, the job stopped halfway and logged the following: Error: Repeated service interruptions - failure processing document: Ingestion HTTP error code 500. The job tried to crawl those files many times. It seems that a job spends a lot of time and resources on protected files, so I want to find a way to skip reading those files quickly. In my survey: hop filters are not relevant (right?). Tika, PDFBox, and POI each have a mechanism to decrypt protected files, but each throws a different exception when given an invalid password. One idea, feasible or not: Solr could return a different result code when protected files are posted. Do you have any ideas? Regards, Shinichiro Abe