Re: Job stuck without message

2018-11-28 Thread Karl Wright
The database row indicates there is no reason that the document should not
be queued and processed.
As for getting a thread dump, there's a "force" option (-F).
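
As a hedged sketch of what that looks like (using the PID reported later in this thread; note that jstack must attach to the java process itself, and PID 1233 belongs to the start-agents.sh bash wrapper, so the child java PID may need to be found first):

  # locate the actual java process started by start-agents.sh (pattern is an example)
  pgrep -f org.apache.manifoldcf

  # force a thread dump when the normal attach mechanism fails
  sudo jstack -F <java_pid> > /tmp/jstack_start_agent.log

  # alternatively, SIGQUIT makes the JVM print a thread dump to its stdout
  kill -3 <java_pid>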

The only other reason stuff may not run is if the query plan for
identifying documents to process has gone horribly wrong.  We should see
that in the thread dump however.

I will unfortunately need to be offline for the next 24 hours due to an
emergency situation, but if it turns out that your agents process is busy
executing a long-running query, then I suggest analyzing the jobqueue table
to get a better plan.  This happens automatically but there are conditions
under which it doesn't happen frequently enough.  If the job is waiting for
locks, then the stack trace will tell me where.
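
For reference, forcing a fresh plan by hand looks something like this (a sketch assuming the usual PostgreSQL backend; the database and user names are examples, not taken from this thread):

  psql -U manifoldcf -d manifoldcf_db -c "ANALYZE jobqueue;"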

Thanks,
Karl


On Wed, Nov 28, 2018 at 11:07 AM Bisonti Mario 
wrote:

> I attached a row that corresponds to one of these documents in
> this mail
>
>
>
>
>
>
>
> I obtain the pid of:
> "/bin/bash -e
> /opt/manifoldcf/multiprocess-zk-example-proprietary/start-agents.sh"
>
> The pid is 1233
>
>
>
> I tried to use
>
> sudo jstack -l 1233 > /tmp/jstack_start_agent.log
>
>
>
> but I obtain:
>
> 1233: Unable to open socket file /proc/1233/cwd/.attach_pid1233: target
> process 1233 doesn't respond within 10500ms or HotSpot VM not loaded
>
>
>
> Perhaps this isn't the right way to obtain a thread dump?
>
> Excuse me, but I am not a Linux expert.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Wednesday, November 28, 2018 16:36
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Job stuck without message
>
>
>
> Another thing you could do is get a thread dump of the agents process.
>
>
>
> Karl
>
>
>
>
>
> On Wed, Nov 28, 2018 at 10:35 AM Karl Wright  wrote:
>
> Can you look into the database jobqueue table and provide a row that
> corresponds to one of these documents?
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Wed, Nov 28, 2018 at 10:26 AM Bisonti Mario 
> wrote:
>
> Hello.
>
> Repository has Max connection=10
>
>
>
> In the Document Status report I see many items with:
>
> State="Not yet processed"
>
> Status="Ready for processing"
>
> Scheduled="01-01-1970 01:00:00.000"
>
> Scheduled Action="Process"
>
>
>
>
>
>
>
>
>
> But the job makes no more progress.
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Wednesday, November 28, 2018 16:03
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Job stuck without message
>
>
>
> "Pipe instances are busy" occurs because you are overloading the SMB
> access to your servers.  How many connections do you have allocated for
> your repository connection?  You probably want to limit this to 2-3 if you
> see this error a lot, and it appears you do.
>
> " Tika Server: Tika Server rejects: Tika Server rejected document with
> the following reason: Unprocessable Entity" means the document is not
> properly formed XML.  The rejection will mean the document isn't indexed,
> but this will not stop the job.
>
> If nothing is happening and you don't know why, I'd suggest looking at the
> Document Status report to figure out what documents are not being processed
> and why.  It is quite possible they are all in the process of being retried
> because of the "Pipe instances" issue above.
>
>
>
> Karl
>
>
>
> On Wed, Nov 28, 2018 at 9:46 AM Bisonti Mario 
> wrote:
>
> Hello Karl.
>
> I am taking up this thread because, now that I use ZooKeeper, my job worked for 7
> hours and is now hanging.
>
> It shows as running but seems to be hanging; no log output for an hour
>
>
>
> These are the last manifoldcf.log lines:
>
>
>
>
>
> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
> ~[jcifs-1.3.18.3.jar:?]
>
>  

Re: Job stuck without message

2018-11-28 Thread Karl Wright
Another thing you could do is get a thread dump of the agents process.

Karl


On Wed, Nov 28, 2018 at 10:35 AM Karl Wright  wrote:

> Can you look into the database jobqueue table and provide a row that
> corresponds to one of these documents?
>
> Thanks,
> Karl
>
>
> On Wed, Nov 28, 2018 at 10:26 AM Bisonti Mario 
> wrote:
>
>> Hello.
>>
>> Repository has Max connection=10
>>
>>
>>
>> In the Document Status report I see many items with:
>>
>> State="Not yet processed"
>>
>> Status="Ready for processing"
>>
>> Scheduled="01-01-1970 01:00:00.000"
>>
>> Scheduled Action="Process"
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> But the job makes no more progress.
>>
>>
>>
>>
>>
>> *From:* Karl Wright
>> *Sent:* Wednesday, November 28, 2018 16:03
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: Job stuck without message
>>
>>
>>
>> "Pipe instances are busy" occurs because you are overloading the SMB
>> access to your servers.  How many connections do you have allocated for
>> your repository connection?  You probably want to limit this to 2-3 if you
>> see this error a lot, and it appears you do.
>>
>> " Tika Server: Tika Server rejects: Tika Server rejected document with
>> the following reason: Unprocessable Entity" means the document is not
>> properly formed XML.  The rejection will mean the document isn't indexed,
>> but this will not stop the job.
>>
>> If nothing is happening and you don't know why, I'd suggest looking at
>> the Document Status report to figure out what documents are not being
>> processed and why.  It is quite possible they are all in the process of
>> being retried because of the "Pipe instances" issue above.
>>
>>
>>
>> Karl
>>
>>
>>
>> On Wed, Nov 28, 2018 at 9:46 AM Bisonti Mario 
>> wrote:
>>
>> Hello Karl.
>>
>> I am taking up this thread because, now that I use ZooKeeper, my job worked for 7
>> hours and is now hanging.
>>
>> It shows as running but seems to be hanging; no log output for an hour
>>
>>
>>
>> These are the last manifoldcf.log lines:
>>
>>
>>
>>
>>
>> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at
>> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at
>> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at
>> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at
>> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2951)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2446)
>> [mcf-jcifs-connector.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1222)
>> [mcf-jcifs-connector.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
>> [mcf-jcifs-connector.jar:?]
>>
>> at
>> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
>> [mcf-pull-agent.jar:?]
>>
>> WARN 2018-11-28T14:46:21,524 (Worker thread '59') - JCIFS: Possibly
>> transient exception detected on attempt 1 while getting share security: All
>> pipe instances are busy.
>>
>> jcifs.smb.SmbException: All pipe instances are busy.
>>
>> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
>> ~[jcifs-1.3.18.3.jar:?]
>>
>> at jcifs.smb.SmbSession.send(SmbSession.java:238)
>> ~[jcif

[jira] [Resolved] (CONNECTORS-1559) Logging Is Not working as expected

2018-11-28 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1559.
-
Resolution: Not A Problem
  Assignee: Karl Wright

Logging is described in the "how to build and deploy" page, here:

https://manifoldcf.apache.org/release/release-2.11/en_US/how-to-build-and-deploy.html#The+ManifoldCF+configuration+files

There are two places where logging may be configured: system-wide loggers 
controlled by properties.xml, and local loggers controlled by the logging.xml file.
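
For illustration, a minimal logging.xml might look like the following (a hedged sketch assuming the Log4j 2 XML syntax that ManifoldCF 2.x ships with; the appender name and log path are examples, so treat the page above as authoritative):

  <Configuration status="WARN">
    <Appenders>
      <File name="MainLog" fileName="logs/manifoldcf.log">
        <PatternLayout pattern="%d{ISO8601} %-5p (%t) - %m%n"/>
      </File>
    </Appenders>
    <Loggers>
      <Root level="warn">
        <AppenderRef ref="MainLog"/>
      </Root>
    </Loggers>
  </Configuration>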

> Logging Is Not working as expected
> --
>
> Key: CONNECTORS-1559
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1559
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Assignee: Karl Wright
>Priority: Major
>
> We are using the ManifoldCF multiprocess file-based installation, and the normal 
> log4j properties are not working as expected; ManifoldCF is trying to log to the OS 
> log, which we have not configured.
>  
> If you can share a sample logging.xml and explain how logging works in ManifoldCF, 
> that would be helpful.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1558) Action Button is Missing in Status Job

2018-11-27 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700885#comment-16700885
 ] 

Karl Wright commented on CONNECTORS-1558:
-

I'm afraid this report is completely unintelligible, and it doesn't describe a 
bug either.  So I'm closing it.  Please communicate via 
us...@manifoldcf.apache.org for questions like this.


> Action Button is Missing in Status Job
> --
>
> Key: CONNECTORS-1558
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1558
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Priority: Major
>
> We configured the Elastic connector with the ManifoldCF server. We are using 
> ManifoldCF 2.10 and Elastic 5.6. Even though no job is running, the 
> agents process has been running for 2 days, and all it prints in the Simple 
> History is the job end message.
>  
> Would it be possible to release this job so that we can stop the process from 
> running?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1558) Action Button is Missing in Status Job

2018-11-27 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1558.
-
Resolution: Incomplete

> Action Button is Missing in Status Job
> --
>
> Key: CONNECTORS-1558
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1558
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna
>Priority: Major
>
> We configured the Elastic connector with the ManifoldCF server. We are using 
> ManifoldCF 2.10 and Elastic 5.6. Even though no job is running, the 
> agents process has been running for 2 days, and all it prints in the Simple 
> History is the job end message.
>  
> Would it be possible to release this job so that we can stop the process from 
> running?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Error Job stop after repeated interruptions

2018-11-26 Thread Karl Wright
9 (Worker thread '87') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to sengvivv01.local.domain:9998 [sengvivv01.local.domain/
> 172.16.1.135] failed: Connection refused (Connection refused)
>
> WARN 2018-11-26T13:18:26,668 (Worker thread '92') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to sengvivv01.local.domain:9998 [sengvivv01.local.domain/
> 172.16.1.135] failed: Connection refused (Connection refused)
>
> WARN 2018-11-26T13:18:26,722 (Worker thread '99') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to sengvivv01.local.domain:9998 [sengvivv01.local.domain/
> 172.16.1.135] failed: Connection refused (Connection refused)
>
> WARN 2018-11-26T13:18:26,862 (Worker thread '75') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to sengvivv01.local.domain:9998 [sengvivv01.local.domain/
> 172.16.1.135] failed: Connection refused (Connection refused)
>
> WARN 2018-11-26T13:18:26,862 (Worker thread '12') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to sengvivv01.local.domain:9998 [sengvivv01.local.domain/
> 172.16.1.135] failed: Connection refused (Connection refused)
>
>
>
>
>
> So, I don’t understand if the worker tried to reconnect after 10 seconds
> or not
>
>
>
> How could I check it?
>
>
>
> Thanks a lot
>
>
>
> Mario
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Thursday, November 15, 2018 13:00
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Error Job stop after repeated interruptions
>
>
>
> The easiest way to do it is just to check out current trunk:
>
>
>
> svn co https://svn.apache.org/repos/asf/manifoldcf/trunk
>
>
>
> No need to apply a patch then.  Just build:
>
>
>
> ant make-core-deps
>
> ant make-deps
>
> ant build
>
>
>
> Karl
>
>
>
>
>
> On Thu, Nov 15, 2018 at 4:30 AM Bisonti Mario 
> wrote:
>
> Thanks a lot Karl.
>
>
>
> To overwrite the connector with your patch, do I have to download the trunk
> and recompile?
>
>
>
> Excuse me for my questions, but I am not an expert on programming, compiling,
> etc.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright
> *Sent:* Thursday, November 15, 2018 09:48
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Error Job stop after repeated interruptions
>
>
>
> (1) I increased the retries to go at least 10 minutes.
>
> (2) I handled the 503 response explicitly, with the same logic.
>
> See: https://issues.apache.org/jira/browse/CONNECTORS-1556
>
>
>
> Karl
>
>
>
>
>
> On Thu, Nov 15, 2018 at 3:35 AM Bisonti Mario 
> wrote:
>
> Yes, Karl.
>
>
>
> Is it possible to apply that same concept of yours, wait 10 seconds and retry
> three times, to the 503 error too?
>
>
>
> So I would like to check whether, with the modification, the job ends
> correctly instead of failing.
>
>
>
>
>
> Thanks a lot
>
> Mario
>
>
>
> *From:* Karl Wright
> *Sent:* Thursday, November 15, 2018 09:17
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Error Job stop after repeated interruptions
>
>
>
> Hi Mario,
>
>
>
> Here's the code:
>
>
>
> >>>>>>
>
> try {
>
>   //System.out.println("About to do a content PUT");
>
>   response = this.httpClient.execute(tikaHost, httpPut);
>
>   //System.out.println("... content PUT succeeded");
>
> } catch (IOException e) {
>
>   // Retry 3 times, 10000 ms between retries, and abort if
> doesn't work
>
>   final long currentTime = System.currentTimeMillis();
>
>   throw new ServiceInterruption("Tika down, retrying:
> "+e.getMessage(),e,currentTim

Re: ManifoldCF Docker MySQL Connection Error

2018-11-24 Thread Karl Wright
Hi Furkan,

Each database has a superuser name and password, for creating the workarea
and the dbuser that will be active during operation.  So all of these are
needed.

As for why mysql.hostname, mysql.server, and mysql.client are needed --
each database driver has specific needs.  There is no property
"mysql.hostname"; these are the ones used:

>>>>>>
  /** MySQL server property */
  public static final String mysqlServerProperty =
"org.apache.manifoldcf.mysql.server";
  /** Source system name or IP */
  public static final String mysqlClientProperty =
"org.apache.manifoldcf.mysql.client";
  /** MySQL ssl property */
  public static final String mysqlSslProperty =
"org.apache.manifoldcf.mysql.ssl";

<<<<<<
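
Put together, the MySQL section of properties.xml would carry entries along these lines (a sketch: the mysql.* names come from the constants above, the databaseimplementationclass value appears later in this thread, the dbsuperuser* names are the standard framework properties from the build-and-deploy documentation, and all values are examples):

  <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
  <property name="org.apache.manifoldcf.mysql.server" value="custom_hostname"/>
  <property name="org.apache.manifoldcf.mysql.client" value="localhost"/>
  <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
  <property name="org.apache.manifoldcf.dbsuperuserpassword" value="mypass"/>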

Karl


On Sat, Nov 24, 2018 at 9:58 AM Furkan KAMACI 
wrote:

> Hi Karl,
>
> I've found why it didn't work. I thought that these are enough:
>
>   <property name="org.apache.manifoldcf.mysql.hostname" value="custom_hostname"/>
>   <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
>   <property name="org.apache.manifoldcf.dbsuperuserpassword" value="mypass"/>
>
> However, these key/value pairs are read by ManifoldCF too:
>
>   <property name="org.apache.manifoldcf.database.password" value="mypass"/>
>   <property name="org.apache.manifoldcf.database.username" value="root"/>
>   <property name="org.apache.manifoldcf.mysql.server" value="custom_hostname"/>
>   <property name="org.apache.manifoldcf.mysql.client" value="custom_hostname"/>
>
> So I've added those properties to make it work. Shouldn't hostname,
> dbsuperusername, and dbsuperuserpassword be enough?
>
> Kind Regards,
> Furkan KAMACI
>
>
> On Sat, Nov 24, 2018 at 5:40 PM Karl Wright  wrote:
>
>> Hi Furkan,
>>
>> Why do you conclude that MCF is not using the parameters in
>> properties.xml?
>>
>> It's possible that it cannot *find* properties.xml.  Have you verified
>> that?
>>
>> Karl
>>
>>
>> On Sat, Nov 24, 2018 at 8:58 AM Furkan KAMACI 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> When I try to connect to a dockerized MySQL from the host machine I use the IP
>>> address. Symbolic names can be used for inter-docker connections with linked
>>> containers.
>>>
>>> I've tried all combinations and none of them worked. I couldn't figure
>>> out the reason why ManifoldCF does not use the parameters defined in
>>> properties.xml and why it tries to connect via the default username
>>> (manifoldcf) and password (local_pg_passwd).
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>> On Sat, Nov 24, 2018 at 3:56 PM Karl Wright  wrote:
>>>
>>>> Hi Furkan,
>>>>
>>>> This gives me pause:
>>>>
>>>> <property name="org.apache.manifoldcf.mysql.hostname" value="172.17.0.4"/>
>>>>
>>>> A virtualized environment may require use of symbolic names rather than
>>>> hard IP addresses.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Sat, Nov 24, 2018 at 7:53 AM Furkan KAMACI 
>>>> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> Same config with same MySQL version in non-docker environment works. I
>>>>> can successfully connect to mysql docker via:
>>>>>
>>>>> mysql -h 172.17.0.2 -u root -p
>>>>>
>>>>> Here is my config for MySQL:
>>>>>
>>>>>   <property name="org.apache.manifoldcf.database.name" value="amarok"/>
>>>>>   <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
>>>>>   <property name="org.apache.manifoldcf.mysql.hostname" value="172.17.0.4"/>
>>>>>   <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
>>>>>   <property name="org.apache.manifoldcf.dbsuperuserpassword" value="mypass"/>
>>>>>
>>>>> I've put some logging into ConnectionFactory and this is what I get:
>>>>>
>>>>> --
>>>>> Database: mysql
>>>>> jdbcDriver: com.mysql.jdbc.Driver
>>>>> jdbcUrl:
>>>>> jdbc:mysql://localhost/mysql?useUnicode=true&characterEncoding=utf8
>>>>> userName: root
>>>>> password: mypass
>>>>> --
>>>>> --
>>>>> Database: amarok
>>>>> jdbcDriver: com.mysql.jdbc.Driver
>>>>> jdbcUrl:
>>>>> jdbc:mysql://localhost/amarok?useUnicode=true&characterEncoding=utf8
>>>>> userName: manifoldcf
>>>>> password: local_pg_passwd
>>>>> --
>>>>>
>>>>> So it tries to connect to localhost rather than the configured host,
>>>>> without respecting properties.xml?
>>>>>
>>>>> Kind Regards,
>>>>> Furkan KAMACI
>>>>>
>>>>> On Sat, Nov 24, 2018 at 3:42 PM Karl Wright 
>>>>> wrote:
>>>>>
>

Re: ManifoldCF Docker MySQL Connection Error

2018-11-24 Thread Karl Wright
Hi Furkan,

This gives me pause:

  <property name="org.apache.manifoldcf.mysql.hostname" value="172.17.0.4"/>

A virtualized environment may require use of symbolic names rather than
hard IP addresses.
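
Since the containers quoted below are started with --link custom-mysql:mysql, the symbolic alias "mysql" should resolve inside the ManifoldCF container. A quick check, as a sketch (whether getent is present depends on the image):

  docker exec -it manifoldcf getent hosts mysql

The hostname property can then point at that alias instead of the raw 172.17.0.x address.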

Karl


On Sat, Nov 24, 2018 at 7:53 AM Furkan KAMACI 
wrote:

> Hi Karl,
>
> Same config with same MySQL version in non-docker environment works. I can
> successfully connect to mysql docker via:
>
> mysql -h 172.17.0.2 -u root -p
>
> Here is my config for MySQL:
>
>   <property name="org.apache.manifoldcf.database.name" value="amarok"/>
>   <property name="org.apache.manifoldcf.databaseimplementationclass" value="org.apache.manifoldcf.core.database.DBInterfaceMySQL"/>
>   <property name="org.apache.manifoldcf.mysql.hostname" value="172.17.0.4"/>
>   <property name="org.apache.manifoldcf.dbsuperusername" value="root"/>
>   <property name="org.apache.manifoldcf.dbsuperuserpassword" value="mypass"/>
>
> I've put some logging into ConnectionFactory and this is what I get:
>
> --
> Database: mysql
> jdbcDriver: com.mysql.jdbc.Driver
> jdbcUrl:
> jdbc:mysql://localhost/mysql?useUnicode=true&characterEncoding=utf8
> userName: root
> password: mypass
> --
> --
> Database: amarok
> jdbcDriver: com.mysql.jdbc.Driver
> jdbcUrl:
> jdbc:mysql://localhost/amarok?useUnicode=true&characterEncoding=utf8
> userName: manifoldcf
> password: local_pg_passwd
> --
>
> So it tries to connect to localhost rather than the configured host,
> without respecting properties.xml?
>
> Kind Regards,
> Furkan KAMACI
>
> On Sat, Nov 24, 2018 at 3:42 PM Karl Wright  wrote:
>
>> Hi Furkan,
>>
>> The only thing that comes to mind is that maybe your MySQL is running on
>> a different port than you expect, or that the MySQL driver you are using is
>> not compatible with your setup.  Basically it is failing to create a
>> connection between the driver and the database.
>>
>> Karl
>>
>>
>> On Sat, Nov 24, 2018 at 7:28 AM Furkan KAMACI 
>> wrote:
>>
>>> Hi All,
>>>
>>> I try to test ManifoldCF via docker. I've run mysql as follows:
>>>
>>> docker run --name custom-mysql -v
>>> /home/ubuntu/mysql-conf:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=mypass -d
>>> mysql:5.7.16
>>>
>>> I've run my docker container of ManifoldCF as follows:
>>>
>>> docker run --name manifoldcf --link custom-mysql:mysql -p 8345:8345 -it
>>> manifoldcf:2.7.1
>>>
>>> However, I get:
>>>
>>> *org.apache.manifoldcf.core.interfaces.ManifoldCFException: Error
>>> getting connection: Communications link failure*
>>>
>>> *The last packet sent successfully to the server was 0 milliseconds ago.
>>> The driver has not received any packets from the server.*
>>> * at
>>> org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:83)*
>>> * at
>>> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:797)*
>>> * at
>>> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)*
>>> * at
>>> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)*
>>> * at
>>> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)*
>>> * at
>>> org.apache.manifoldcf.core.database.DBInterfaceMySQL.performQuery(DBInterfaceMySQL.java:907)*
>>> * at
>>> org.apache.manifoldcf.core.database.DBInterfaceMySQL.getTableSchema(DBInterfaceMySQL.java:753)*
>>> * at
>>> org.apache.manifoldcf.core.database.BaseTable.getTableSchema(BaseTable.java:185)*
>>> * at
>>> org.apache.manifoldcf.agents.agentmanager.AgentManager.install(AgentManager.java:67)*
>>> * at
>>> org.apache.manifoldcf.agents.system.ManifoldCF.installTables(ManifoldCF.java:112)*
>>> * at
>>> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:235)*
>>> *Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>> Communications link failure*
>>>
>>>
>>> I can connect to MySQL via the command line, I can access it from another docker
>>> container, and I can access it if I create a project which just includes
>>> ConnectionFactory.java of ManifoldCF.
>>>
>>> What may be the reason for this?
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>


Re: Manifold fails with alfresco | pipeline exception

2018-11-24 Thread Karl Wright
Hi Nikhilesh,

Where are you seeing these errors?  They sound like ElasticSearch errors to
me; it is complaining that a null or empty-string pipeline name is being
specified somehow.  Can you tell me what version of ElasticSearch you are
using?

We have outstanding tickets for updating the ElasticSearch connector,
because the public API for ElasticSearch changes very rapidly and therefore
it may be out of date.  Please check the JIRA queue to find the status of
issues like this.
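
As a side note, the server version is easy to confirm from the cluster root endpoint (a sketch; host and port are the ElasticSearch defaults, not taken from this thread):

  curl -s http://localhost:9200/
  # the JSON reply includes a "version" : { "number" : "..." } block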

Thanks,
Karl



On Sat, Nov 24, 2018 at 7:45 AM Sivakoti, Nikhilesh <
nikhilesh.sivak...@capgemini.com> wrote:

> Hi Team,
>
>
>
> We have been trying to use ManifoldCF with our Alfresco. We have
> customized the manifold alfresco connector and manifold elastic search
> connector as per our needs.
>
> We have added the authentication mechanism in the elastic search connector to
> connect to the QA servers. But when we try to execute a job we are getting
> the below errors.
>
>
>
> HTTP code = 400, Response =
> {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"pipeline
> with id [null] does not
> exist"}],"type":"illegal_argument_exception","reason":"pipeline with id
> [null] does not exist"},"status":400}
>
>
>
> Could you please provide any support on this?
>
>
>
>
>
> Thanks,
>
> Nikhilesh
>
> This message contains information that may be privileged or confidential
> and is the property of the Capgemini Group. It is intended only for the
> person to whom it is addressed. If you are not the intended recipient, you
> are not authorized to read, print, retain, copy, disseminate, distribute,
> or use this message or any part thereof. If you receive this message in
> error, please notify the sender immediately and delete all copies of this
> message.
>


[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor

2018-11-21 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694406#comment-16694406
 ] 

Karl Wright commented on CONNECTORS-1557:
-

The best way to deliver the code is as a patch attachment to a ticket like this.

I hope that the transformer you wrote is consistent with the other transformers 
that ManifoldCF provides, e.g. the HTML Extractor and the Metadata Adjuster.  
Generally we are not fond of transformers that take on more than the most basic 
part of what might be structured as a multi-part transformation.  From your 
description it sounds like you've basically extended the HTML extractor and 
added functionality to it similar to what the Metadata Adjuster does.   If 
that's true, it might be good to only provide the extraction functionality 
extension from CSS to the HTML extractor, and let the Metadata Adjuster handle 
the field mappings.

Please let me know how you want to proceed.


> HTML Tag extractor
> --
>
> Key: CONNECTORS-1557
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Donald Van den Driessche
>Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field 
> in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + 
> strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as 
> output (content)
> How can I best deliver the source code?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Language Detection for the data

2018-11-21 Thread Karl Wright
Hi Nikita,

Can you be more specific when you say "OpenNLP is not working"?  All that
this connector does is integrate OpenNLP as a ManifoldCF transformer.  It
uses a specific directory to deliver the models that OpenNLP uses to match
and extract content from documents.  Thus, you can provide any models you
want that are compatible with the OpenNLP version we're including.

Can you describe the steps you are taking and what you are seeing?

On Wed, Nov 21, 2018 at 12:44 AM Nikita Ahuja  wrote:

> Hi,
>
> I have a query about detecting the language of the records/data which are
> going to be ingested into the output connector.
>
> The OpenNLP connector is not working for the detection as per the user
> documentation; it is not working appropriately. Please suggest whether OpenNLP
> has to be used; if yes, how it should be used, or is there any other
> solution for this?
>
> --
> Thanks and Regards,
> Nikita
> Email: nik...@smartshore.nl
> United Sources Service Pvt. Ltd.
> a "Smartshore" Company
> Mobile: +91 99 888 57720
> http://www.smartshore.nl
>


[jira] [Assigned] (CONNECTORS-1557) HTML Tag extractor

2018-11-21 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1557:
---

Assignee: Karl Wright

> HTML Tag extractor
> --
>
> Key: CONNECTORS-1557
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1557
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Donald Van den Driessche
>    Assignee: Karl Wright
>Priority: Major
>
> I wrote an HTML Tag extractor, based on the HTML Extractor.
> I needed to extract specific HTML tags and transfer them to their own field 
> in my output repository.
> Input
>  * Englobing tag (CSS selector)
>  * Blacklist (CSS selector)
>  * Fieldmapping (CSS selector)
>  * Strip HTML
> Process
>  * Retrieve Englobing tag
>  * Remove blacklist
>  * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + 
> strip HTML (if requested)
>  * Englobing tag minus blacklist: strip HTML (if requested) and return as 
> output (content)
> How can I best deliver the source code?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier,

The HTML parser built into MCF is quite resilient against badly formed
HTML, but there are limits.  Characters like "<" and ">" are used to denote
tags and thus they confuse the parser when they are present in unescaped
form.  It may be possible, with a fair bit of work, to handle some such
cases, but generally it is not possible to do this readily without a great
deal of work (and also knowledge that we're parsing HTML specifically, not
general XML).

So, in general, I think you should not expect ManifoldCF to be able to
handle whatever badly formed HTML you throw at it.  It's never going to be
as resilient as (say) Firefox in this regard.  It is much better to format
HTML properly in the first place.  You can verify this by using one of the
many online XML validator tools available.
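
To make the failure mode concrete, here is a sketch of the kind of markup that trips an XML-style parser, plus one common workaround (the variables and the external script path are made-up examples):

  <script type="text/javascript">
  var smaller = a < b;   // the unescaped '<' can read as the start of a tag
  </script>

  <!-- workaround: move the script body into an external file -->
  <script type="text/javascript" src="/js/compare.js"></script>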

Thanks,
Karl


On Thu, Nov 15, 2018 at 7:22 AM Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hi Karl,
>
> Thanks for your answer.
> Could you detail your answer, please? Just to be sure I understand: you mean
> that there is no chance that special characters could be escaped in the MCF
> code in this case, i.e. the website needs to escape the special characters
> itself, otherwise the extraction will not work in MCF. Am I right?
>
> Best regards,
>
> Olivier
>
>
>
> On Nov 15, 2018 at 12:57, Karl Wright wrote:
>
> Hi Olivier,
>
> You can create a ticket but I don't have a good solution for you in any
> case.
>
> Karl
>
>
> On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <
> olivier.tav...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> Do you think that I need to create a Jira issue for this bug, i.e.
>> that the link extraction does not work if, inside Javascript tags, some code
>> contains special characters like '>' or '<'?
>>
>> Thanks,
>> Best regards,
>>
>> Olivier
>>
>>
>>
>> On Oct 30, 2018 at 12:05, Olivier Tavard
>> wrote:
>>
>> Hi Karl,
>>
>> Thanks for your answer.
>> I kept looking into this and I found what the problem was. The Javascript
>> code inside the <script></script> tags contained the character '<'. In that case
>> the link extraction does not work with the web connector.
>>
>> To reproduce it, I created this page hosted in local Apache then I
>> indexed it with MCF 2.11 out of the box.
>>
>> in the first example the page was :
>> <!DOCTYPE html>
>> <html>
>> <head>
>> <title>test</title>
>> <meta charset="utf-8" />
>> <script type="text/javascript">
>> </script>
>> </head>
>> <body>
>> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>> </body>
>> </html>
>>
>> The links extraction was correct, in the debug log :
>> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an
>> HttpClient object
>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For
>> http://localhost:/testjs/test.html, setting virtual host to localhost
>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an
>> HttpClient object after 1 ms.
>> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for
>> '/testjs/test.html'
>>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|
>> http://localhost:/testjs/test.html|1540896372585+75|200|223|
>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 
>> 'http://localhost:/testjs/test.html'
>> is text, with encoding 'UTF-8'; link extraction starting
>> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html
>> document 'http://localhost:/testjs/test.html', found link to
>> 'https://manifoldcf.apache.org/en_US/index.html'
>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content
>> exclusion rule supplied... returning
>> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to
>> ingest 'http://localhost:/testjs/test.html'
>> —
>> In the second example, the code was pretty much the same except that I
>> included the character '<' in the content of the script tags:
>> <!DOCTYPE html>
>> <html>
>> <head>
>> <title>test</title>
>> <meta charset="utf-8" />
>> <script type="text/javascript">
>> a<b
>> </script>
>> </head>
>> <body>
>> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
>> </body>
>> </html>
>>
>> The links extraction was not successful, the debug log indicates :
>> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an
>> HttpClient object
>> DEBUG 2018-10-30T11:48:13,475 (

Re: web connector : links extraction issues

2018-11-15 Thread Karl Wright
Hi Olivier,

You can create a ticket but I don't have a good solution for you in any
case.

Karl


On Thu, Nov 15, 2018 at 6:53 AM Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hi Karl,
>
> Do you think that I need to create a Jira issue for this bug, i.e.
> that the link extraction does not work if, inside Javascript tags, some code
> contains special characters like '>' or '<'?
>
> Thanks,
> Best regards,
>
> Olivier
>
>
>
> On Oct 30, 2018 at 12:05, Olivier Tavard
> wrote:
>
> Hi Karl,
>
> Thanks for your answer.
> I kept looking into this and I found what the problem was. The Javascript
> code inside the <script></script> tags contained the character '<'. In that case
> the link extraction does not work with the web connector.
>
> To reproduce it, I created this page hosted in local Apache then I indexed
> it with MCF 2.11 out of the box.
>
> in the first example the page was :
> <!DOCTYPE html>
> <html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">
> </script>
> </head>
> <body>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
> </html>
>
> The links extraction was correct, in the debug log :
> DEBUG 2018-10-30T11:46:12,584 (Worker thread '33') - WEB: Waiting for an
> HttpClient object
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: For
> http://localhost:/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Got an
> HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:46:12,585 (Worker thread '33') - WEB: Get method for
> '/testjs/test.html'
>  INFO 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: FETCH URL|
> http://localhost:/testjs/test.html|1540896372585+75|200|223|
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: Document 
> 'http://localhost:/testjs/test.html'
> is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:46:12,661 (Worker thread '33') - WEB: In html document
> 'http://localhost:/testjs/test.html', found link to
> 'https://manifoldcf.apache.org/en_US/index.html'
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: no content
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:46:12,662 (Worker thread '33') - WEB: Decided to
> ingest 'http://localhost:/testjs/test.html'
> —
> In the second example, the code was pretty much the same except that I
> included the character '<' in the content of the script tags:
> <!DOCTYPE html>
> <html>
> <head>
> <title>test</title>
> <meta charset="utf-8" />
> <script type="text/javascript">
> a<b
> </script>
> </head>
> <body>
> <a href="https://manifoldcf.apache.org/en_US/index.html">manifoldcf</a>
> </body>
> </html>
>
> The links extraction was not successful, the debug log indicates :
> DEBUG 2018-10-30T11:48:13,474 (Worker thread '36') - WEB: Waiting for an
> HttpClient object
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: For
> http://localhost:/testjs/test.html, setting virtual host to localhost
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Got an
> HttpClient object after 1 ms.
> DEBUG 2018-10-30T11:48:13,475 (Worker thread '36') - WEB: Get method for
> '/testjs/test.html'
>  INFO 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: FETCH URL|
> http://localhost:/testjs/test.html|1540896493475+76|200|226|
> DEBUG 2018-10-30T11:48:13,552 (Worker thread '36') - WEB: Document 
> 'http://localhost:/testjs/test.html'
> is text, with encoding 'UTF-8'; link extraction starting
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: no content
> exclusion rule supplied... returning
> DEBUG 2018-10-30T11:48:13,553 (Worker thread '36') - WEB: Decided to
> ingest 'http://localhost:/testjs/test.html'
> —
> So special characters like the less-than sign should be escaped in the
> code of the web connector to preserve link extraction.
>
> Thanks,
> Best regards,
>
>
> Olivier
>
> On Oct 29, 2018 at 19:39, Karl Wright wrote:
>
> Hi Olivier,
>
> Javascript inclusion in the Web Connector is not evaluated.  In fact, no
> Javascript is executed at all.  Therefore it should not matter what is
> included via javascript.
>
> Thanks,
> Karl
>
>
> On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <
> olivier.tav...@francelabs.com> wrote:
>
>> Hi,
>>
>> Regarding the web connector, I noticed that for specific websites, some
>> Javascript code can prevent the web connector from correctly fetching all the
>> links present on

Re: Error Job stop after repeated interruptions

2018-11-15 Thread Karl Wright
(1) I increased the retries to go at least 10 minutes.
(2) I handled the 503 response explicitly, with the same logic.

See: https://issues.apache.org/jira/browse/CONNECTORS-1556

Karl


On Thu, Nov 15, 2018 at 3:35 AM Bisonti Mario 
wrote:

> Yes, Karl.
>
>
>
> Is it possible to apply that same concept of yours, wait 10 seconds and retry
> three times, to the 503 error too?
>
>
>
> So I would like to check whether, with the modification, the job ends
> correctly instead of failing.
>
>
>
>
>
> Thanks a lot
>
> Mario
>
>
>
> *From:* Karl Wright
> *Sent:* Thursday, November 15, 2018 09:17
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Error Job stop after repeated interruptions
>
>
>
> Hi Mario,
>
>
>
> Here's the code:
>
>
>
> >>>>>>
>
> try {
>
>   //System.out.println("About to do a content PUT");
>
>   response = this.httpClient.execute(tikaHost, httpPut);
>
>   //System.out.println("... content PUT succeeded");
>
> } catch (IOException e) {
>
>   // Retry 3 times, 10000 ms between retries, and abort if
> doesn't work
>
>   final long currentTime = System.currentTimeMillis();
>
>   throw new ServiceInterruption("Tika down, retrying:
> "+e.getMessage(),e,currentTime + 1L,
>
> -1L,3,true);
>
> }
>
>
>
> responseCode = response.getStatusLine().getStatusCode();
>
> if (response.getStatusLine().getStatusCode() == 200 ||
> response.getStatusLine().getStatusCode() == 204) {
>
>   tikaServerIs = response.getEntity().getContent();
>
>   try {
>
> responseDs = new FileDestinationStorage();
>
> final OutputStream os2 = responseDs.getOutputStream();
>
> try {
>
>   IOUtils.copyLarge(tikaServerIs, os2, 0L, sp.writeLimit);
>
> } finally {
>
>   os2.close();
>
> }
>
> length = new Long(responseDs.getBinaryLength());
>
>   } finally {
>
> tikaServerIs.close();
>
>   }
>
> } else {
>
>   activities.noDocument();
>
>   if (responseCode == 422) {
>
> resultCode = "TIKASERVERREJECTS";
>
> description = "Tika Server rejected document with the
> following reason: "
>
> + response.getStatusLine().getReasonPhrase();
>
> return handleTikaServerRejects(description);
>
>   } else {
>
> resultCode = "TIKASERVERERROR";
>
> description = "Tika Server failed to parse document with
> the following error: "
>
> + response.getStatusLine().getReasonPhrase();
>
> return handleTikaServerError(description);
>
>   }
>
> }
>
>
>
>   } catch (IOException | ParseException e) {
>
> resultCode = "TIKASERVERRESPONSEISSUE";
>
> description = e.getMessage();
>
> int rval;
>
> if (e instanceof IOException) {
>
>   rval = handleTikaServerException((IOException) e);
>
> } else {
>
>   rval = handleTikaServerException((ParseException) e);
>
> }
>
> if (rval == DOCUMENTSTATUS_REJECTED) {
>
>   activities.noDocument();
>
> }
>
> return rval;
>
>   }
>
> <<<<<<
>
> and
>
> >>>>>>
>
>   protected static int handleTikaServerError(String description)
>
>   throws IOException, ManifoldCFException, ServiceInterruption {
>
> // MHL - what does Tika throw if it gets an IOException reading the
> stream??
>
> Logging.ingest.warn("Tika Server: Tika Server error: " + description);
>
> return DOCUMENTSTATUS_REJECTED;
>
>   }
>
> <<<<<<
>
>
>
> The summary:
>
> (1) If ManifoldCF cannot connect at all, or gets an IO error, it will wait
> at least 10 seconds and then retry -- up to three times.
>
> (2) When Manifold sees a 503 error it immediately just rejects the
> document.
>
> So you are requesting different handling for 503 errors?
>
>
>
> Karl
>
>
>
>
>
> On Thu, Nov 15, 2018 at 2:42 AM Bisonti Mario 
> wrote:

[jira] [Updated] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1556:

Attachment: CONNECTORS-1556.patch

> Integrate changes in retry handling to address TIKA-2776
> 
>
> Key: CONNECTORS-1556
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika service connector
>    Reporter: Karl Wright
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1556.patch
>
>
> The Tika service extractor currently retries on some conditions but does not 
> handle the case where the external Tika service is restarting itself.  This 
> generates a 503 error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1556.
-
Resolution: Fixed

r1846627


> Integrate changes in retry handling to address TIKA-2776
> 
>
> Key: CONNECTORS-1556
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Tika service connector
>    Reporter: Karl Wright
>    Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1556.patch
>
>
> The Tika service extractor currently retries on some conditions but does not 
> handle the case where the external Tika service is restarting itself.  This 
> generates a 503 error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776

2018-11-15 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1556:
---

 Summary: Integrate changes in retry handling to address TIKA-2776
 Key: CONNECTORS-1556
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1556
 Project: ManifoldCF
  Issue Type: Bug
  Components: Tika service connector
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.12


The Tika service extractor currently retries on some conditions but does not 
handle the case where the external Tika service is restarting itself.  This 
generates a 503 error.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Error Job stop after repeated interruptions

2018-11-15 Thread Karl Wright
Hi Mario,

Here's the code:

>>>>>>
try {
  //System.out.println("About to do a content PUT");
  response = this.httpClient.execute(tikaHost, httpPut);
  //System.out.println("... content PUT succeeded");
} catch (IOException e) {
  // Retry 3 times, 10000 ms between retries, and abort if
doesn't work
  final long currentTime = System.currentTimeMillis();
  throw new ServiceInterruption("Tika down, retrying:
"+e.getMessage(),e,currentTime + 1L,
-1L,3,true);
}

responseCode = response.getStatusLine().getStatusCode();
if (response.getStatusLine().getStatusCode() == 200 ||
response.getStatusLine().getStatusCode() == 204) {
  tikaServerIs = response.getEntity().getContent();
  try {
responseDs = new FileDestinationStorage();
final OutputStream os2 = responseDs.getOutputStream();
try {
  IOUtils.copyLarge(tikaServerIs, os2, 0L, sp.writeLimit);
} finally {
  os2.close();
}
length = new Long(responseDs.getBinaryLength());
  } finally {
tikaServerIs.close();
  }
} else {
  activities.noDocument();
  if (responseCode == 422) {
resultCode = "TIKASERVERREJECTS";
description = "Tika Server rejected document with the
following reason: "
+ response.getStatusLine().getReasonPhrase();
return handleTikaServerRejects(description);
  } else {
resultCode = "TIKASERVERERROR";
description = "Tika Server failed to parse document with
the following error: "
+ response.getStatusLine().getReasonPhrase();
return handleTikaServerError(description);
  }
}

  } catch (IOException | ParseException e) {
resultCode = "TIKASERVERRESPONSEISSUE";
description = e.getMessage();
int rval;
if (e instanceof IOException) {
  rval = handleTikaServerException((IOException) e);
} else {
  rval = handleTikaServerException((ParseException) e);
}
if (rval == DOCUMENTSTATUS_REJECTED) {
  activities.noDocument();
}
return rval;
  }
<<<<<<
and
>>>>>>
  protected static int handleTikaServerError(String description)
  throws IOException, ManifoldCFException, ServiceInterruption {
// MHL - what does Tika throw if it gets an IOException reading the
stream??
Logging.ingest.warn("Tika Server: Tika Server error: " + description);
return DOCUMENTSTATUS_REJECTED;
  }
<<<<<<

The summary:

(1) If ManifoldCF cannot connect at all, or gets an IO error, it will wait
at least 10 seconds and then retry -- up to three times.
(2) When Manifold sees a 503 error it immediately just rejects the document.

So you are requesting different handling for 503 errors?

Karl


On Thu, Nov 15, 2018 at 2:42 AM Bisonti Mario 
wrote:

> Hello Karl.
>
> I opened an issue on Tika here:
>
> https://issues.apache.org/jira/browse/TIKA-2776
>
>
>
> The person who develops Tika suggests that I add a wait on the client
> (in my case ManifoldCF)
>
>
> https://issues.apache.org/jira/browse/TIKA-2776?focusedCommentId=16686620&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16686620
>
>
>
> I am not able to do this…
>
> Is it possible to implement this in the MCF source?
>
>
>
>
> Thanks a lot
>
>
>
> Mario
>
>
>
> *From:* Karl Wright
> *Sent:* Thursday, November 8, 2018 21:03
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Error Job stop after repeated interruptions
>
>
>
> Hi Mario,
>
>
>
> The Tika external connector retries for a while before it gives up and
> aborts the job.  If you can get the Tika server back up within a reasonable
> period of time all should be well.  But if one specific document *always*
> brings down the Tika server, it will be hard to recover from that.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Nov 8, 2018 at 2:56 PM Bisonti Mario 
> wrote:
>
> Hello.
>
>
>
> I am trying to index more than 500 documents in a Windows Share.
>
>
>
> It happens that the job is interrupted due to repeated interruptions.
>
> This is the manifoldcf.log:
>
> .
> .
> WARN 2018-11-07T21:53:25,296 (Worker thread '59') - Service interruption
> reported for job 1533797717712 connec

Re: Job stuck - WorkerThread functions return null

2018-11-14 Thread Karl Wright
Hi Cheng,

Unless you are using carrydown information (that is, information that is
recorded for a parent document that the child document needs access to),
this is the method you want to use:

activities.addDocumentReference(documentIdentifier);

If you DO need to pull data recorded for a parent from the child, the best
connector to look at for an example is the SharePoint connector.
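
In connector code with no carrydown needs, that typically reduces to something like the following (a sketch; discoverChildren is a hypothetical helper standing in for however the repository enumerates child documents):

  for (String child : discoverChildren(documentIdentifier)) {
    // simple case: no parent/carrydown arguments are required
    activities.addDocumentReference(child);
  }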

As far as the stack trace is concerned -- these always get written to the
log.  The reason the framework "hangs" is that the exception is a fatal
one and basically causes the thread to restart itself, and thus nothing
progresses under those conditions.  Very probably the cause of this
exception is that you are including a 'parent identifier' which is not
actually a document identifier that was itself added.

Karl


On Wed, Nov 14, 2018 at 2:16 AM Cheng Zeng  wrote:

>
> Hi Karl,
>
> Thanks a lot for your reply. I didn't change any code in the framework
> except my own repository connector.
>
> I found that there are five methods available to inject document
> identifiers. Could you please tell me how I should choose the right way to
> inject the document identifiers?
>  activities.addDocumentReference(documentIdentifier);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues, originationTime);
>  activities.addDocumentReference(documentIdentifier, parentIdentifier,
> relationshipType, dataNames, dataValues, originationTime, prereqEventNames);
>
> The way I injected document identifiers is as follows.
>
>
> activities.addDocumentReference(docUri,documentIdentifier,RELATIONSHIP_CHILD);
> docUri is the doc url which is supposed to be fetched, e.g.
> http://domino_server:80/path/dep1/database_name.nsf/api/data/documents
> documentIdentifier is the parent url, e.g.
> http://domino_server:80/path/dep1/database_name.nsf/api/data/documents/unid/B0F9484E94DEA3204825813E001034E1
>
> I am afraid that there is no full stack trace thrown. I have only got the
>
> new IllegalArgumentException("Unrecognized document identifier:
> '"+documentIdentifier+"'");
>
> with the following code in the 
> WorkerThread.java(org.apache.manifoldcf.crawler.system).
> I've found the document identifier in the table of "jobqueue" and the
> dochash in the table of "jobqueue" is matched against the hashcode
> generated by the hash method.
>
> For some of the document identifiers,
> previousDocuments.get(documentIdentifierHash) can return the queued
> document, but for several document identifiers,
> previousDocuments.get(documentIdentifierHash) returns null.
>
> Could you please give me some indication?
>
> protected IPipelineSpecificationWithVersions
> computePipelineSpecificationWithVersions(String documentIdentifierHash,
>   String componentIdentifierHash,
>   String documentIdentifier)
> {
>   QueuedDocument qd = previousDocuments.get(documentIdentifierHash);
>  // return null. The problem is here.
>   if (qd == null)
> throw new IllegalArgumentException("Unrecognized document
> identifier: '"+documentIdentifier+"'");
>   return new
> PipelineSpecificationWithVersions(pipelineSpecification,qd,componentIdentifierHash);
> }
>
> Best wishes,
>
> Cheng
>
>
>
>
> --
> *From:* Karl Wright 
> *Sent:* 12 November 2018 18:46
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: Job stuck - WorkerThread functions return null
>
> Hi,
> Have you been modifying the framework code?  If so, I really cannot help
> you.
>
> If you haven't -- it looks like you've got code that is injecting document
> identifiers that are incorrect.  But I will need to see a full stack trace
> to be sure of that.
>
> Thanks,
> Karl
>
>
> On Mon, Nov 12, 2018 at 4:06 AM Cheng Zeng  wrote:
>
> Hi Karl,
>
> I am developing my own repository where I borrowed some code from the file
> repository connector. I use my repository connector to crawl documents
> from the IBM Domino system. I managed to retrieve all the files in Domino,
> however, when I restart my job to recrawl the database in the domino, I've
> got problems with the following code where 
> previousDocuments.get(documentIdentifierHash)
> in the WorkerThread.java(org.apache.manifoldcf.crawler.system) return null
> for some of the document ids. As a result, the job got stuck with the
> specific document id.
>
> Could you please tell me how I could fix the problem?

Re: Valid usecase of CredSSP auth scheme

2018-11-12 Thread Karl Wright
Hi Michael,

I did not contribute this work; I merely helped integrate it, obliquely.
If I recall correctly, there was a reasonable case made for it, but I don't
remember what it was.

Karl


On Mon, Nov 12, 2018 at 5:50 PM Michael Osipov  wrote:

> Guys,
>
> I have just discovered that CredSSP has been added (with NTLM, yuck)
> some time ago. Can someone point me to a valid use case for this over
> HTTP? Karl? As far as I understand CredSSP [1] it is simply not
> compatible with/designed for HTTP and duplicates the transport
> encryption. The main purpose is to securely transport the Kerberos UPN
> and password of the user to the target server, e.g., for RDP to obtain a
> TGT on the remote machine as if someone is physically in front of the
> remote machine.
>
> This makes sense if you work on raw sockets, but on HTTP?
> The CredSspScheme also says that it should work with GSS, but I believe
> that this is impossible because as soon as you have the GSSCredential,
> you don't have access to the UPN and password, you have the TGT only.
> Neither with JGSS, Heimdal, nor MIT Kerberos unless you acquire them
> again, like the RDP login dialog does.
>
> So again, what does it do better than HTTPS + SPNEGO with credential
> delegation or constrained delegation, also given that this works on the
> Windows backend only?!
>
> Michael
>
> [1] https://msdn.microsoft.com/en-us/library/cc226794.aspx
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@hc.apache.org
> For additional commands, e-mail: dev-h...@hc.apache.org
>
>


Re: Job stuck - WorkerThread functions return null

2018-11-12 Thread Karl Wright
Hi,
Have you been modifying the framework code?  If so, I really cannot help
you.

If you haven't -- it looks like you've got code that is injecting document
identifiers that are incorrect.  But I will need to see a full stack trace
to be sure of that.

Thanks,
Karl


On Mon, Nov 12, 2018 at 4:06 AM Cheng Zeng  wrote:

> Hi Karl,
>
> I am developing my own repository where I borrowed some code from the file
> repository connector. I use my repository connector to crawl documents
> from the IBM Domino system. I managed to retrieve all the files in Domino,
> however, when I restart my job to recrawl the database in the domino, I've
> got problems with the following code where 
> previousDocuments.get(documentIdentifierHash)
> in the WorkerThread.java(org.apache.manifoldcf.crawler.system) return null
> for some of the document ids. As a result, the job got stuck with the
> specific document id.
>
> Could you please tell me how I could fix the problem?
>
>  protected IPipelineSpecificationWithVersions
> computePipelineSpecificationWithVersions(String documentIdentifierHash,
>   String componentIdentifierHash,
>   String documentIdentifier)
> {
>   QueuedDocument qd = previousDocuments.get(documentIdentifierHash);
>  // return null. The problem is here.
>   if (qd == null)
> throw new IllegalArgumentException("Unrecognized document
> identifier: '"+documentIdentifier+"'");
>   return new
> PipelineSpecificationWithVersions(pipelineSpecification,qd,componentIdentifierHash);
> }
>
>
> Thanks a lot.
>
> Cheng
>


Re: Error Job stop after repeated interruptions

2018-11-08 Thread Karl Wright
Hi Mario,

The Tika external connector retries for a while before it gives up and
aborts the job.  If you can get the Tika server back up within a reasonable
period of time all should be well.  But if one specific document *always*
brings down the Tika server, it will be hard to recover from that.

Karl


On Thu, Nov 8, 2018 at 2:56 PM Bisonti Mario 
wrote:

> Hallo.
>
>
>
> I am trying to index more than 500 documents in a Windows Share.
>
>
>
> It happens that the job is interrupted due to repeated interruptions.
>
> This is the manifoldcf.log:
>
> .
> .
> WARN 2018-11-07T21:53:25,296 (Worker thread '59') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to localhost:9998 [localhost/127.0.0.1,
> localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused)
>
> WARN 2018-11-07T21:53:25,476 (Worker thread '89') - Service interruption
> reported for job 1533797717712 connection 'WinShare': Tika down, retrying:
> Connect to localhost:9998 [localhost/127.0.0.1,
> localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused)
>
> WARN 2018-11-07T21:53:33,814 (Worker thread '15') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbSession.send(SmbSession.java:238)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTree.send(SmbTree.java:119) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.send(SmbFile.java:776) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open0(SmbFile.java:993)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2951)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2438)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1221)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-11-07T21:53:57,861 (Worker thread '12') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbSession.send(SmbSession.java:238)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTree.send(SmbTree.java:119) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.send(SmbFile.java:776) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open0(SmbFile.java:993)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
> ~[jcifs-1.3.18.3.jar:?]
>
> at 

[jira] [Resolved] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-07 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1554.
-
Resolution: Cannot Reproduce

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-07 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678101#comment-16678101
 ] 

Karl Wright commented on CONNECTORS-1554:
-

[~bisontim], there are several approved models under which you can run 
ManifoldCF.  They are each represented by an example directory in the 
distribution.  But the way you propose running everything under Tomcat is not 
one of these.

If you indeed want to run ManifoldCF as a single process (with the pitfalls 
that entails, including starvation of UI resources during heavy crawling), you 
can simply deploy the combined ManifoldCF war file.  Instructions are on the 
"how to build and deploy" page.


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>    Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677450#comment-16677450
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Note that if you perform the lock-clean procedure *as described*, all the 
documents should be reprioritized in any case, so all crawling should resume.  
After that, if you wind up with stuck documents it should be possible to look 
at the simple history for one of the stuck ones to see what happened to it.

The document retry logic has not changed for years, and was last changed in a 
minor way to address this very problem back in 2015.  Documents that get 
retried wind up being given to a thread that recomputes their priority.  The 
need to do this is signaled by the "needspriority" field being set to "Y", and 
then the reprioritization threads kick in and set the priority eventually.

So if you have jobqueue entries with a docpriority value of 1E9+1, a status 
of "P" or "G", and a needspriority field NOT set to 'Y', then those documents 
are stuck, and I don't know how they got there.  I need to know what happened 
to them to cause this.  
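
Rows matching that signature can be pulled straight from the jobqueue table.  
A sketch (JDBC against PostgreSQL, assuming the PostgreSQL driver is on the 
classpath; the connection details are placeholders, and the dochash column 
name is an assumption -- docpriority, status, and needspriority are the fields 
described above):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StuckDocs {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details; adjust for your PostgreSQL setup.
    try (Connection c = DriverManager.getConnection(
             "jdbc:postgresql://localhost:5432/dbname", "manifoldcf", "password");
         Statement s = c.createStatement();
         ResultSet rs = s.executeQuery(
             "SELECT dochash, status, docpriority, needspriority FROM jobqueue " +
             "WHERE docpriority = 1000000001 AND status IN ('P','G') " +
             "AND (needspriority IS NULL OR needspriority != 'Y')")) {
      // Any rows returned fit the 'stuck' signature described above.
      while (rs.next())
        System.out.println(rs.getString(1) + " " + rs.getString(2));
    }
  }
}
{code}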



> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676928#comment-16676928
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], you are using file synchronization, as I feared.

This is deprecated.  You really want to be using Zookeeper synchronization.

Furthermore, your process of cleaning the locks is wrong.  The Tomcat web apps 
you are using do not include the agents process, and therefore you are cleaning 
the locks out from under a running agents process!  That's never going to work. 
 The proper process is:

(1) shutdown tomcat
(2) shutdown agents process
(3) clean locks
(4) start agents process
(5) start tomcat

You do not need to shut down solr or postgresql for this; in fact, that's 
counterproductive.
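
For reference, moving from file-based to Zookeeper synchronization is a 
properties.xml change along these lines (a sketch patterned on the 
multiprocess-zk example; verify the property names and the connect string 
against your own distribution):

{code}
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
{code}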


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Job stuck without message

2018-11-06 Thread Karl Wright
I added a couple of questions to the ticket.  Please reply.

Thanks,
Karl


On Tue, Nov 6, 2018 at 8:56 AM Bisonti Mario 
wrote:

> Thanks a lot, Karl.
>
> I created a ticket.
>
> https://issues.apache.org/jira/browse/CONNECTORS-1554
>
>
>
>
>
> Thanks
>
>
>
> Mario
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 6 novembre 2018 14:28
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> ok, can you create a ticket?  Also, I'd appreciate it if you can look at
> the simple history for one of these documents; I need to see what happened
> to it last.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Tue, Nov 6, 2018 at 7:32 AM Bisonti Mario 
> wrote:
>
> My version is 2.11
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 6 novembre 2018 13:07
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> Thanks.
>
> What version of ManifoldCF are you using?  We fixed a problem a while back
> having to do with documents that (because of error processing) get put into
> a "ready for processing" state which don't have any document priority set.
> But this should have been addressed, certainly, by the most recent release
> and probably by 2.10 as well.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Nov 6, 2018 at 5:43 AM Bisonti Mario 
> wrote:
>
> Hallo Karl.
>
> When it hangs I see in the Queue status:
>
>
>
> And in the Document Status:
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 19:32
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> What I am interested in now is the Document Status report for any one of
> the documents that is 'stuck'.  The next crawl time value is the critical
> field.  Can you include an example?
>
>
>
> Karl
>
>
>
> On Tue, Oct 30, 2018, 12:36 PM Bisonti Mario 
> wrote:
>
> Thanks a lot, Karl.
>
>
>
> It happens that the job starts, works, and indexes for an hour, and after
> that it freezes.  I have no error or waiting status in the Document Queue or
> Simple History, only “OK” status, so I have no failures.
>
>
>
> I am not able to see any log errors other than those in manifoldcf.log
>
>
>
> Solr server is ok
>
> Tika server is ok
>
> Agent is ok
>
> Tomcat with ManifoldCF is ok
>
>
>
> I could check whether I can put Tika server or Solr into info log mode, for
> example.
>
>
>
> Thanks..
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 16:38
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> Hi Mario,
>
> Please look at the Queue Status report to determine what is waiting and
> why it is waiting.
> You can also look at the Simple History to see what has been happening.
> If you are getting 100% failures in fetching documents then you may need to
> address this because your infrastructure is unhappy.  If the failure is
> something that indicates that the document is never going to be readable,
> that's a different problem and we might need to address that in the
> connector.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 10:33 AM Bisonti Mario 
> wrote:
>
>
>
> Thanks a lot Karl
>
>
>
> Yes, I see many docs in the docs queue but they are inactive.
>
>
>
> In fact I see that no more docs are indexed in Solr, and I see that the job
> stays at the same number of Active docs (35012)
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 13:59
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> The reason the job is "stuck" is because:
>
> ' JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security: All pipe instances are busy.'
>
> This means that ManifoldCF will retry this document for a while before it
> gives up on it.  It appears to be stuck but it is not.  You can verify that
> by looking at the Document Queue report to see what is queued and what
> times the various documents will be retried.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 5:07 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
>
>
> I started a job that works for some minutes, and then it gets stuck.
>
>
>
> In the manifoldcf.log I see:
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
>

[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676820#comment-16676820
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], I note the following in your log:

{code}
ERROR 2018-11-06T14:31:47,730 (Agents thread) - Exception tossed: Service 'A' 
of type 'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service 'A' of type 
'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active
at 
org.apache.manifoldcf.core.lockmanager.BaseLockManager.endServiceActivity(BaseLockManager.java:462)
 ~[mcf-core.jar:?]
at 
org.apache.manifoldcf.core.lockmanager.LockManager.endServiceActivity(LockManager.java:172)
 ~[mcf-core.jar:?]
at 
org.apache.manifoldcf.agents.system.AgentsDaemon.checkAgents(AgentsDaemon.java:289)
 ~[mcf-agents.jar:?]
at 
org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsThread.run(AgentsDaemon.java:209)
 [mcf-agents.jar:?]
{code}

This makes me concerned that you might not be shutting down the agents process 
cleanly.  If you are using file-based synchronization, this could lead to stuck 
locks, which would explain the behavior you are seeing quite well.  Can you 
confirm that you are using zookeeper?  Thanks in advance.

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676817#comment-16676817
 ] 

Karl Wright commented on CONNECTORS-1554:
-

Hi [~bisontim], I need the Simple History of one of the documents that is 
"stuck".  You will need to have it go back far enough to find out what happened 
to that one document last.  Thanks in advance!!


> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>    Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1554) Job stuck during crawl documents on folder

2018-11-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1554:
---

Assignee: Karl Wright

> Job stuck during crawl documents on folder
> --
>
> Key: CONNECTORS-1554
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1554
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Active Directory authority, File system connector, Tika 
> extractor
>Affects Versions: ManifoldCF 2.11
> Environment: Ubuntu Server 18.04
> ManifoldCF 2.11
> Solr 7.5.0
> Tika Server 1.19.1
>Reporter: Mario Bisonti
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.11
>
> Attachments: SimpleHistory.png, manifoldcf.log
>
>
> Hallo.
> When I start a job that indexes a Windows Share, it gets stuck after about
> 15 minutes.
>  
> I see errors in manifoldcf.log, as you can see in the attachment
>  
> I attached "Simple History" with the last documents crawled.
> Thanks a lot.
> Mario
> [^manifoldcf.log]!SimpleHistory.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Job stuck without message

2018-11-06 Thread Karl Wright
ok, can you create a ticket?  Also, I'd appreciate it if you can look at
the simple history for one of these documents; I need to see what happened
to it last.

Thanks,
Karl


On Tue, Nov 6, 2018 at 7:32 AM Bisonti Mario 
wrote:

> My version is 2.11
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 6 novembre 2018 13:07
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> Thanks.
>
> What version of ManifoldCF are you using?  We fixed a problem a while back
> having to do with documents that (because of error processing) get put into
> a "ready for processing" state which don't have any document priority set.
> But this should have been addressed, certainly, by the most recent release
> and probably by 2.10 as well.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Nov 6, 2018 at 5:43 AM Bisonti Mario 
> wrote:
>
> Hallo Karl.
>
> When it hangs I see in the Queue status:
>
>
>
> And in the Document Status:
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 19:32
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> What I am interested in now is the Document Status report for any one of
> the documents that is 'stuck'.  The next crawl time value is the critical
> field.  Can you include an example?
>
>
>
> Karl
>
>
>
> On Tue, Oct 30, 2018, 12:36 PM Bisonti Mario 
> wrote:
>
> Thanks a lot, Karl.
>
>
>
> It happens that the job starts, works, and indexes for an hour, and after
> that it freezes.  I have no error or waiting status in the Document Queue or
> Simple History, only “OK” status, so I have no failures.
>
>
>
> I am not able to see any log errors other than those in manifoldcf.log
>
>
>
> Solr server is ok
>
> Tika server is ok
>
> Agent is ok
>
> Tomcat with ManifoldCF is ok
>
>
>
> I could check whether I can put Tika server or Solr into info log mode, for
> example.
>
>
>
> Thanks..
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 16:38
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> Hi Mario,
>
> Please look at the Queue Status report to determine what is waiting and
> why it is waiting.
> You can also look at the Simple History to see what has been happening.
> If you are getting 100% failures in fetching documents then you may need to
> address this because your infrastructure is unhappy.  If the failure is
> something that indicates that the document is never going to be readable,
> that's a different problem and we might need to address that in the
> connector.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 10:33 AM Bisonti Mario 
> wrote:
>
>
>
> Thanks a lot Karl
>
>
>
> Yes, I see many docs in the docs queue but they are inactive.
>
>
>
> In fact I see that no more docs are indexed in Solr, and I see that the job
> stays at the same number of Active docs (35012)
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 13:59
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> The reason the job is "stuck" is because:
>
> ' JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security: All pipe instances are busy.'
>
> This means that ManifoldCF will retry this document for a while before it
> gives up on it.  It appears to be stuck but it is not.  You can verify that
> by looking at the Document Queue report to see what is queued and what
> times the various documents will be retried.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 5:07 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
>
>
> I started a job that works for some minutes, and then it gets stuck.
>
>
>
> In the manifoldcf.log I see:
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-10-30T09:21:31,440 (Worker thread '2') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:33,502 (Worker thread '14') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:37,725 (Worker thread '30') - Tika Se

[jira] [Resolved] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5

2018-11-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1553.
-
Resolution: Won't Fix

> Upgrade to SolrJ 6.6.5
> --
>
> Key: CONNECTORS-1553
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1553
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5

2018-11-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16676329#comment-16676329
 ] 

Karl Wright commented on CONNECTORS-1553:
-

[~kamaci], we updated to SolrJ 7.4.x for release 2.11.  We should not go back.

> Upgrade to SolrJ 6.6.5
> --
>
> Key: CONNECTORS-1553
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1553
> Project: ManifoldCF
>  Issue Type: Task
>  Components: Lucene/SOLR connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Furkan KAMACI
>Assignee: Furkan KAMACI
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Welcome Tim Allison as a Lucene/Solr committer

2018-11-05 Thread Karl Wright
Welcome!
Karl

On Mon, Nov 5, 2018 at 1:39 PM Christine Poerschke (BLOOMBERG/ LONDON) <
cpoersc...@bloomberg.net> wrote:

> Welcome Tim!
>
> From: dev@lucene.apache.org At: 11/02/18 16:20:52
> To: dev@lucene.apache.org
> Subject: Welcome Tim Allison as a Lucene/Solr committer
>
> Hi all,
>
>
> Please join me in welcoming Tim Allison as the latest Lucene/Solr committer!
>
> Congratulations and Welcome, Tim!
>
> It's traditional for you to introduce yourself with a brief bio.
>
> Erick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
>


Re: Welcome Gus Heck as Lucene/Solr committer

2018-11-02 Thread Karl Wright
Welcome!!
Karl

On Thu, Nov 1, 2018 at 9:53 PM Koji Sekiguchi 
wrote:

> Welcome Gus!
>
> Koji
>
> On 2018/11/01 21:22, David Smiley wrote:
> > Hi all,
> >
> > Please join me in welcoming Gus Heck as the latest Lucene/Solr committer!
> >
> > Congratulations and Welcome, Gus!
> >
> > Gus, it's traditional for you to introduce yourself with a brief bio.
> >
> > ~ David
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-11-02 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672605#comment-16672605
 ] 

Karl Wright commented on CONNECTORS-1546:
-

I didn't see a commit go by.  Were you able to commit?


> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations. On the contrary, performance on 
> the write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.
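
For anyone who does want the merge after a crawl, it can be issued on demand; 
a sketch using the JDK 11 HTTP client against Elasticsearch's _forcemerge 
endpoint (host, port, and index name are placeholders):

{code}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ManualForceMerge {
  public static void main(String[] args) throws Exception {
    // POST /<index>/_forcemerge runs the merge once, outside the crawl cycle.
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/myindex/_forcemerge?max_num_segments=1"))
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.statusCode() + ": " + resp.body());
  }
}
{code}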



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-11-01 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672435#comment-16672435
 ] 

Karl Wright commented on CONNECTORS-1552:
-

Looks good, but I'd suggest making sure the text capitalization style is 
consistent with everything else in the connector.


> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: screenshot-1.png
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch.  Since our 
> Elastic server is behind a protected URL, there is no way we are able to 
> connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access it by passing a username and password.
> Please guide us so that we can complete our setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1529) Add "url" output element to ES Output Connector (required when used with the Web Repository Connector)

2018-11-01 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672425#comment-16672425
 ] 

Karl Wright commented on CONNECTORS-1529:
-

As long as it's a new field, it seems that backwards compatibility is 
preserved, so I'm OK with it.


> Add "url" output element to ES Output Connector (required when used with the 
> Web Repository Connector)
> --
>
> Key: CONNECTORS-1529
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1529
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: elasticsearch.patch, image-2018-09-06-10-28-45-008.png
>
>
> Add "url" (copy of the _id field) to ES Output.
> ES no longer supports copying from _id (copy-to) in the schema.
> As per 
> !image-2018-09-06-10-28-45-008.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16670641#comment-16670641
 ] 

Karl Wright commented on LUCENE-8540:
-

[~ivera] Looks reasonable as far as I can tell.  The question is whether the 
decode scaling factor is 'correct' but I think changing that will cause people 
to need to reindex, so this is a better fix.

> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Attachments: LUCENE-8540.patch
>
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related to https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Job stuck without message

2018-10-30 Thread Karl Wright
What I am interested in now is the Document Status report for any one of
the documents that is 'stuck'.  The next crawl time value is the critical
field.  Can you include an example?

Karl

On Tue, Oct 30, 2018, 12:36 PM Bisonti Mario 
wrote:

> Thanks a lot, Karl.
>
>
>
> It happens that the job starts, works, and indexes for an hour, and after
> that it freezes.  I have no error or waiting status in the Document Queue or
> Simple History, only “OK” status, so I have no failures.
>
>
>
> I am not able to see any log errors other than those in manifoldcf.log
>
>
>
> Solr server is ok
>
> Tika server is ok
>
> Agent is ok
>
> Tomcat with ManifoldCF is ok
>
>
>
> I could check whether I can put Tika server or Solr into info log mode, for
> example.
>
>
>
> Thanks..
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 16:38
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> Hi Mario,
>
> Please look at the Queue Status report to determine what is waiting and
> why it is waiting.
> You can also look at the Simple History to see what has been happening.
> If you are getting 100% failures in fetching documents then you may need to
> address this because your infrastructure is unhappy.  If the failure is
> something that indicates that the document is never going to be readable,
> that's a different problem and we might need to address that in the
> connector.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 10:33 AM Bisonti Mario 
> wrote:
>
>
>
> Thanks a lot Karl
>
>
>
> Yes, I see many docs in the docs queue but they are inactive.
>
>
>
> In fact I see that no more docs are indexed in Solr, and I see that the job
> stays at the same number of Active docs (35012)
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 13:59
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> The reason the job is "stuck" is because:
>
> ' JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security: All pipe instances are busy.'
>
> This means that ManifoldCF will retry this document for a while before it
> gives up on it.  It appears to be stuck but it is not.  You can verify that
> by looking at the Document Queue report to see what is queued and what
> times the various documents will be retried.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 5:07 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
>
>
> I started a job that works for some minutes, and then it gets stuck.
>
>
>
> In the manifoldcf.log I see:
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-10-30T09:21:31,440 (Worker thread '2') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:33,502 (Worker thread '14') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:37,725 (Worker thread '30') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:44,406 (Worker thread '49') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:47,310 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:52,000 (Worker thread '27') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:53,526 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:04,511 (Worker thread '3') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
> ~[jcifs-1.3.18.3.jar:

Re: Job stuck without message

2018-10-30 Thread Karl Wright
Hi Mario,

Please look at the Queue Status report to determine what is waiting and why
it is waiting.
You can also look at the Simple History to see what has been happening.  If
you are getting 100% failures in fetching documents then you may need to
address this because your infrastructure is unhappy.  If the failure is
something that indicates that the document is never going to be readable,
that's a different problem and we might need to address that in the
connector.

Karl


On Tue, Oct 30, 2018 at 10:33 AM Bisonti Mario 
wrote:

>
>
> Thanks a lot Karl
>
>
>
> Yes, I see many docs in the docs queue but they are inactive.
>
>
>
> In fact I see that no more docs are indexed in Solr, and I see that the job
> stays at the same number of Active docs (35012)
>
>
>
>
>
>
>
>
>
> *Da:* Karl Wright 
> *Inviato:* martedì 30 ottobre 2018 13:59
> *A:* user@manifoldcf.apache.org
> *Oggetto:* Re: Job stuck without message
>
>
>
> The reason the job is "stuck" is because:
>
> ' JCIFS: Possibly transient exception detected on attempt 1 while getting
> share security: All pipe instances are busy.'
>
> This means that ManifoldCF will retry this document for a while before it
> gives up on it.  It appears to be stuck but it is not.  You can verify that
> by looking at the Document Queue report to see what is queued and what
> times the various documents will be retried.
>
>
>
> Karl
>
>
>
>
>
> On Tue, Oct 30, 2018 at 5:07 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
>
>
> I started a job that works for some minutes, and then it gets stuck.
>
>
>
> In the manifoldcf.log I see:
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-10-30T09:21:31,440 (Worker thread '2') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:33,502 (Worker thread '14') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:37,725 (Worker thread '30') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:44,406 (Worker thread '49') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:47,310 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:52,000 (Worker thread '27') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:53,526 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:04,511 (Worker thread '3') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbSession.send(SmbSession.java:238)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTree.send(SmbTree.java:119) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.send(SmbFile.java:776) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open0(SmbFile.java:993)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.Dce

Re: Job stuck without message

2018-10-30 Thread Karl Wright
The reason the job is "stuck" is because:

' JCIFS: Possibly transient exception detected on attempt 1 while getting
share security: All pipe instances are busy.'

This means that ManifoldCF will retry this document for a while before it
gives up on it.  It appears to be stuck but it is not.  You can verify that
by looking at the Document Queue report to see what is queued and what
times the various documents will be retried.
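
The underlying pattern is ordinary retry-with-backoff.  As a stand-alone
illustration (not ManifoldCF's actual scheduler, which instead records the
retry times you can see in that report):

import java.util.concurrent.TimeUnit;

public class RetryDemo {
  interface Fetch { void run() throws Exception; }

  // Retry a transiently failing operation a bounded number of times,
  // doubling the delay between attempts; rethrow after maxAttempts.
  static void retry(Fetch f, int maxAttempts, long initialDelayMs) throws Exception {
    long delay = initialDelayMs;
    for (int attempt = 1; ; attempt++) {
      try {
        f.run();
        return;
      } catch (Exception e) {
        if (attempt >= maxAttempts) throw e;
        System.out.println("Attempt " + attempt + " failed (" + e.getMessage()
            + "); retrying in " + delay + " ms");
        TimeUnit.MILLISECONDS.sleep(delay);
        delay *= 2;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    retry(() -> {
      if (++calls[0] < 3) throw new RuntimeException("All pipe instances are busy");
      System.out.println("fetched on attempt " + calls[0]);
    }, 5, 250);
  }
}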

Karl


On Tue, Oct 30, 2018 at 5:07 AM Bisonti Mario 
wrote:

> Hallo.
>
>
>
> I started a job that works for some minutes, and then it gets stuck.
>
>
>
> In the manifoldcf.log I see:
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-10-30T09:21:31,440 (Worker thread '2') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:33,502 (Worker thread '14') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:37,725 (Worker thread '30') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:44,406 (Worker thread '49') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:47,310 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:52,000 (Worker thread '27') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:21:53,526 (Worker thread '15') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:04,511 (Worker thread '3') - JCIFS: Possibly
> transient exception detected on attempt 1 while getting share security: All
> pipe instances are busy.
>
> jcifs.smb.SmbException: All pipe instances are busy.
>
> at jcifs.smb.SmbTransport.checkStatus(SmbTransport.java:569)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTransport.send(SmbTransport.java:669)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbSession.send(SmbSession.java:238)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbTree.send(SmbTree.java:119) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.send(SmbFile.java:776) ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open0(SmbFile.java:993)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.open(SmbFile.java:1010)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbFileOutputStream.<init>(SmbFileOutputStream.java:142)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.TransactNamedPipeOutputStream.<init>(TransactNamedPipeOutputStream.java:32)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.smb.SmbNamedPipe.getNamedPipeOutputStream(SmbNamedPipe.java:187)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> jcifs.dcerpc.DcerpcPipeHandle.doSendFragment(DcerpcPipeHandle.java:68)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:190)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.bind(DcerpcHandle.java:126)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.dcerpc.DcerpcHandle.sendrecv(DcerpcHandle.java:140)
> ~[jcifs-1.3.18.3.jar:?]
>
> at jcifs.smb.SmbFile.getShareSecurity(SmbFile.java:2951)
> ~[jcifs-1.3.18.3.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecurity(SharedDriveConnector.java:2438)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.getFileShareSecuritySet(SharedDriveConnector.java:1221)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.connectors.sharedrive.SharedDriveConnector.processDocuments(SharedDriveConnector.java:627)
> [mcf-jcifs-connector.jar:?]
>
> at
> org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399)
> [mcf-pull-agent.jar:?]
>
> WARN 2018-10-30T09:22:10,359 (Worker thread '27') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:13,932 (Worker thread '12') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:14,274 (Worker thread '23') - Tika Server: Tika
> Server rejects: Tika Server rejected document with the following reason:
> Unprocessable Entity
>
> WARN 2018-10-30T09:22:19,933 (Worker thread '8') - Tika 

[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1552:
---

Assignee: Steph van Schalkwyk  (was: Karl Wright)

> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Steph van Schalkwyk
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch.  Since our 
> Elastic server is behind a protected URL, there is no way we are able to 
> connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access it by passing a username and password.
> Please guide us so that we can complete our setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright
If you have this ready, I can assign to you -- or take it yourself.

Karl


On Mon, Oct 29, 2018 at 3:33 PM Steph van Schalkwyk 
wrote:

> I'm working on that one as well. Bit of a fix with a client right now. Will
> issue patch.
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896  st...@remcam.net  http://remcam.net
> Skype: svanschalkwyk
> <http://linkedin.com/in/vanschalkwyk>
>
>
> On Mon, Oct 29, 2018 at 1:45 PM Karl Wright (JIRA) 
> wrote:
>
> >
> > [
> >
> https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667589#comment-16667589
> > ]
> >
> > Karl Wright commented on CONNECTORS-1552:
> > -
> >
> > The ES connector does not currently support any ES authentication
> > requirements whatsoever.  This is therefore an enhancement to the current
> > connector, not a bug.  Enhancement requests are looked at based on time
> and
> > availability of the volunteers working on the ManifoldCF project.
> >
> > I would suggest that if you have time-critical need for a new feature,
> you
> > consider adding it yourself.  The earliest I could look at this would be
> > next weekend and that is not guaranteed.
> >
> >
> > > Apache ManifoldCF Elastic Connector for Basic Authorisation
> > > ---
> > >
> > > Key: CONNECTORS-1552
> > > URL:
> > https://issues.apache.org/jira/browse/CONNECTORS-1552
> > > Project: ManifoldCF
> > >  Issue Type: Improvement
> > >  Components: Elastic Search connector
> > >Affects Versions: ManifoldCF 2.10
> > >Reporter: Krishna Agrawal
> > >Assignee: Karl Wright
> > >Priority: Major
> > > Fix For: ManifoldCF 2.12
> > >
> > >
> > > We are using Apache ManifoldCF to connect to Elasticsearch.  Since our
> > > Elastic server is behind a protected URL, there is no way we are able to
> > > connect from the Admin console.
> > > If we remove the authentication, the connector works well, but we want to
> > > access it by passing a username and password.
> > > Please guide us so that we can complete our setup.
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v7.6.3#76005)
> >
>


[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667589#comment-16667589
 ] 

Karl Wright commented on CONNECTORS-1552:
-

The ES connector does not currently support any ES authentication requirements 
whatsoever.  This is therefore an enhancement to the current connector, not a 
bug.  Enhancement requests are looked at based on time and availability of the 
volunteers working on the ManifoldCF project.

I would suggest that if you have time-critical need for a new feature, you 
consider adding it yourself.  The earliest I could look at this would be next 
weekend and that is not guaranteed.


> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch.  Since our 
> Elastic server is behind a protected URL, there is no way we are able to 
> connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access it by passing a username and password.
> Please guide us so that we can complete our setup.
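
Until the connector gains this feature, the mechanics being requested are just 
an HTTP Authorization header carrying base64("user:password").  A stand-alone 
sketch (host and credentials are placeholders):

{code}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EsBasicAuth {
  public static void main(String[] args) throws Exception {
    // Basic auth header: base64-encoded "user:password" (placeholders here).
    String credentials = Base64.getEncoder()
        .encodeToString("elastic:changeme".getBytes(StandardCharsets.UTF_8));
    HttpRequest req = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/"))
        .header("Authorization", "Basic " + credentials)
        .GET()
        .build();
    HttpResponse<String> resp = HttpClient.newHttpClient()
        .send(req, HttpResponse.BodyHandlers.ofString());
    System.out.println(resp.statusCode() + ": " + resp.body());
  }
}
{code}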



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation

2018-10-29 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1552:
---

 Assignee: Karl Wright
 Priority: Major  (was: Blocker)
Fix Version/s: ManifoldCF 2.12
  Component/s: Elastic Search connector
   Issue Type: Improvement  (was: Bug)

> Apache ManifoldCF Elastic Connector for Basic Authorisation
> ---
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Krishna Agrawal
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> We are using Apache ManifoldCF to connect to Elasticsearch.  Since our 
> Elastic server is behind a protected URL, there is no way we are able to 
> connect from the Admin console.
> If we remove the authentication, the connector works well, but we want to 
> access it by passing a username and password.
> Please guide us so that we can complete our setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: web connector : links extraction issues

2018-10-29 Thread Karl Wright
Hi Olivier,

Javascript inclusion in the Web Connector is not evaluated.  In fact, no
Javascript is executed at all.  Therefore it should not matter what is
included via javascript.

Thanks,
Karl


On Mon, Oct 29, 2018 at 1:39 PM Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hi,
>
> Regarding the web connector, I noticed that for specific websites, some
> Javascript code can prevent the web connector from correctly fetching all
> the links present on the page.  Specifically, this concerns websites that
> contain a deprecated version of the New Relic web agent, such as
> js-agent.newrelic.com/nr-1071.min.js.
> After downloading the page locally and removing the reference to the New
> Relic browser agent, the links were correctly fetched from the page by the
> web connector.  So it seems the Javascript injected by the New Relic agent
> was the cause of the links not being fetched.
> This case is rare and concerns only old versions of the New Relic agent.
> But, more generically, would it be possible to block Javascript injection
> at the connector level during indexing?
>
> Thanks,
> Best regards,
> Olivier
>
>
>


Re: ManifoldCF database model

2018-10-29 Thread Karl Wright
You can enable repository connector debug logging by adding this to your
properties.xml:

  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>


Having said that, the cleanup phase for all connectors is executed by the
framework.  We know the framework works because we have numerous
integration tests that exercise it.  But it's up to the ES connector to
delete documents and log the fact that it is deleting documents.  So I
suspect that it is the ES connector's delete functionality that is not
working properly.

If you told me that *no* documents show up in the Simple History as being
deleted during the cleanup phase, then there would obviously be a simple ES
connector bug involved.  But if there are multiple documents that *do* get
deleted, it's more complex than that.  Do you ever see *any* documents
deleted during the cleanup phase in the Simple History with the ES
connector?

Another easy check is to set up exactly the same job but with the output
going to the Null Output Connector.  This connector definitely logs
everything it sees.  Compare and contrast vs the ES output connector.  If
you see a difference, it's likely a bug in the ES connector that we'll have
to figure out.

Thanks,
Karl

On Mon, Oct 29, 2018 at 12:39 PM Gustavo Beneitez <
gustavo.benei...@gmail.com> wrote:

> Hi,
>
> we made a new test: the job created several documents that were never removed
> from Elastic Search after job deletion, and the Simple History never showed
> them as deleted.
>
> I also looked for an error in the logs, without luck.
>
> I think it could be case (2); can I increase the log detail for the web
> repository? This, and the Elastic, are both default connectors, no code changes here.
>
> Thanks.
>
> On Mon, Oct 29, 2018 at 16:12, Karl Wright () wrote:
>
> > It is only possible if:
> >
> > (1) You run a job in a "minimal" configuration, or
> > (2) There is a bug in either the repository connector that doesn't
> > properly signal the status of a deleted document to the pipeline, or
> > (3) There is a bug in the output connector so that deletion of a document
> > silently fails but is nevertheless reported as having succeeded.
> >
> > The way to figure this out is to look at the Simple History for one of
> > the documents you expect to have been deleted to see how it was handled.
> >
> > Thanks,
> > Karl
> >
> >
> > On Mon, Oct 29, 2018 at 11:06 AM Gustavo Beneitez <
> > gustavo.benei...@gmail.com> wrote:
> >
> > > Hi Karl,
> > >
> > > after several tests I did manage to create, run and delete a job with
> > > an Elastic output connector, and all its documents were also deleted from
> > > the database while they were not deleted from the repository.
> > >
> > > In which cases is this possible? Maybe if they share a repo?
> > >
> > > Thanks in advance!
> > >
> > >
> > > On Wed, Oct 17, 2018 at 14:40, Gustavo Beneitez (<
> > > gustavo.benei...@gmail.com>) wrote:
> > >
> > > > Ok thanks!
> > > >
> > > > On Wed, Oct 17, 2018 at 14:27, Karl Wright ( >)
> > > > wrote:
> > > >
> > > >> Ok, the schema is described in ManifoldCF In Action.
> > > >>
> > > >> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
> > > >>
> > > >> Karl
> > > >>
> > > >>
> > > >> On Wed, Oct 17, 2018 at 7:41 AM Gustavo Beneitez <
> > > >> gustavo.benei...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Karl,
> > > >> >
> > > >> > as far as I was able to gather information from history records,
> > > >> > I could see MCF is behaving as expected. The "problem" shows when
> > > >> > ElasticSearch is down or performing badly: MCF says the document
> > > >> > was requested to be deleted, but while it has been erased from the
> > > >> > database, it is still alive on the ElasticSearch side, so I need
> > > >> > to find out whether those kinds of inconsistencies exist.
> > > >> >
> > > >> > Please allow us to check those documents and make new tests in
> > > >> > order to see what really happens; we don't modify any database
> > > >> > record by hand.
> > > >> >
> > > >> > Thanks!
> > > >> >
> > > >> >

Re: ManifoldCF database model

2018-10-29 Thread Karl Wright
It is only possible if:

(1) You run a job in a "minimal" configuration, or
(2) There is a bug in either the repository connector that doesn't properly
signal the status of a deleted document to the pipeline, or
(3) There is a bug in the output connector so that deletion of a document
silently fails but is nevertheless reported as having succeeded.

The way to figure this out is to look at the Simple History for one of the
documents you expect to have been deleted to see how it was handled.

Thanks,
Karl


On Mon, Oct 29, 2018 at 11:06 AM Gustavo Beneitez <
gustavo.benei...@gmail.com> wrote:

> Hi Karl,
>
> after several tests I did manage to create, run and delete a job with
> an Elastic output connector, and all its documents were also deleted from
> the database while they were not deleted from the repository.
>
> In which cases is this possible? Maybe if they share a repo?
>
> Thanks in advance!
>
>
> On Wed, Oct 17, 2018 at 14:40, Gustavo Beneitez (<
> gustavo.benei...@gmail.com>) wrote:
>
> > Ok thanks!
> >
> > On Wed, Oct 17, 2018 at 14:27, Karl Wright () wrote:
> >
> >> Ok, the schema is described in ManifoldCF In Action.
> >>
> >> https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
> >>
> >> Karl
> >>
> >>
> >> On Wed, Oct 17, 2018 at 7:41 AM Gustavo Beneitez <
> >> gustavo.benei...@gmail.com>
> >> wrote:
> >>
> >> > Hi Karl,
> >> >
> >> > as far as I was able to gather information from history records,
> >> > I could see MCF is behaving as expected. The "problem" shows when
> >> > ElasticSearch is down or performing badly: MCF says the document was
> >> > requested to be deleted, but while it has been erased from the
> >> > database, it is still alive on the ElasticSearch side, so I need to
> >> > find out whether those kinds of inconsistencies exist.
> >> >
> >> > Please allow us to check those documents and make new tests in order
> >> > to see what really happens; we don't modify any database record by
> >> > hand.
> >> >
> >> > Thanks!
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Oct 16, 2018 at 19:27, Karl Wright () wrote:
> >> >
> >> > > Hi, you can look at ManifoldCF In Action.  There's a link to it on
> >> > > the manifoldcf page.
> >> > >
> >> > > However, you should be aware that we consider it a severe bug if
> >> > > ManifoldCF doesn't clean up after itself.  The only time that is not
> >> > > expected is when people write buggy connectors or mess with database
> >> > > tables themselves.  I would urge you to examine the Simple History
> >> > > report and try to come up with a reproducible test case rather than
> >> > > trying to reverse engineer MCF.  Should you go directly to the
> >> > > database, we will be unable to give you any support.
> >> > >
> >> > > Thanks,
> >> > > Karl
> >> > >
> >> > >
> >> > > On Tue, Oct 16, 2018 at 11:51 AM Gustavo Beneitez <
> >> > > gustavo.benei...@gmail.com> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > how do you do? I was wondering if there is any technical document
> >> > > > about the meaning of each table in the database, and the
> >> > > > relationships between documents, repositories, jobs, and any other
> >> > > > output connector (some kind of database model).
> >> > > >
> >> > > > We are facing some "garbage issues": jobs are created, duplicated,
> >> > > > related to transformations, linked to outputs (Elastic Search),
> >> > > > run, and finally deleted, but in the end some documents that
> >> > > > should also have been deleted from the output connector are still
> >> > > > there. We don't know if they are visible because they point to an
> >> > > > existing job, an unexpected job end, or some other failure.
> >> > > >
> >> > > > We need to understand the database model in order to check when
> >> > > > documents stored in Elastic can be safely removed because they are
> >> > > > no longer referred to by any process; a check that could be
> >> > > > executed periodically, every week for example.
> >> > > >
> >> > > > Thanks in advance!
> >> > > >
> >> > >
> >> >
> >>
> >
>


Re: Contribution help for the Confluence connector patch

2018-10-24 Thread Karl Wright
Never mind, I was able to get it fixed.

Karl


On Wed, Oct 24, 2018 at 10:19 AM Karl Wright  wrote:

> I've created CONNECTORS-1551, and attached the patch.
>
> Unfortunately there seem to be some encoding issues with
> common_ja_JP.properties; can you send that one file via email as an
> attachment?  Thanks!
>
> Karl
>
>
> On Tue, Oct 23, 2018 at 8:54 PM 白井 隆/ Shirai Takashi <
> shi...@nintendo.co.jp> wrote:
>
>> Hi, there.
>>
>> I've just made the patch to extend mcf-confluence-connector.
>> The official site says that I can create a JIRA ticket for improvements.
>> But I cannot access the JIRA through the firewall in our office.
>> Can someone create a ticket on my behalf?
>>
>> The patch is attached to this mail.
>> [Extension]
>> o Support the page type 'blogpost' as well as 'page'. (*1)
>> o Include the Japanese message catalog.
>> [Bug Fix]
>> o Ugly message when the 'Port' value is invalid.
>> o Ugly message of 'Process Attachments' in 'View a Job'.
>> o Some null pointer exceptions.
>>
>> (*1)
>> Confluence has 2 different types of page.
>> The current connector can only find 'page' typed pages.
>> This extension can find both of them selectively.
>>
>> Thanks.
>>
>> 
>> Nintendo, Co., Ltd.
>> Product Technology Dept.
>> Takashi SHIRAI
>> PHONE: +81-75-662-9600
>> mailto:shi...@nintendo.co.jp
>
>


[jira] [Resolved] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1551.
-
Resolution: Fixed

r1844778


> Various confluence connector issues
> ---
>
> Key: CONNECTORS-1551
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Confluence connector
>Affects Versions: ManifoldCF 2.11
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1551.patch
>
>
> I've just made the patch to extend mcf-confluence-connector.
> The official site says that I can create a JIRA ticket for improvements.
> But I cannot access the JIRA through the firewall in our office.
> Can someone create a ticket on my behalf?
> The patch is attached to this mail.
> [Extension]
> o Support the page type 'blogpost' as well as 'page'. (*1)
> o Include the Japanese message catalog.
> [Bug Fix]
> o Ugly message when the 'Port' value is invalid.
> o Ugly message of 'Process Attachments' in 'View a Job'.
> o Some null pointer exceptions.
> (*1)
> Confluence has 2 different types of page.
> The current connector can only find 'page' typed pages.
> This extension can find both of them selectively.
> Thanks.
> Takashi SHIRAI





Re: Contribution help for the Confluence connector patch

2018-10-24 Thread Karl Wright
I've created CONNECTORS-1551, and attached the patch.

Unfortunately there seem to be some encoding issues with
common_ja_JP.properties; can you send that one file via email as an
attachment?  Thanks!

Karl


On Tue, Oct 23, 2018 at 8:54 PM 白井 隆/ Shirai Takashi 
wrote:

> Hi, there.
>
> I've just made the patch to extend mcf-confluence-connector.
> The official site says that I can create a JIRA ticket for improvements.
> But I cannot access the JIRA through the firewall in our office.
> Can someone create a ticket on my behalf?
>
> The patch is attached to this mail.
> [Extension]
> o Support the page type 'blogpost' as well as 'page'. (*1)
> o Include the Japanese message catalog.
> [Bug Fix]
> o Ugly message when the 'Port' value is invalid.
> o Ugly message of 'Process Attachments' in 'View a Job'.
> o Some null pointer exceptions.
>
> (*1)
> Confluence has 2 different types of page.
> The current connector can only find 'page' typed pages.
> This extension can find both of them selectively.
>
> Thanks.
>
> 
> Nintendo, Co., Ltd.
> Product Technology Dept.
> Takashi SHIRAI
> PHONE: +81-75-662-9600
> mailto:shi...@nintendo.co.jp


[jira] [Updated] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1551:

Attachment: CONNECTORS-1551.patch

> Various confluence connector issues
> ---
>
> Key: CONNECTORS-1551
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Confluence connector
>Affects Versions: ManifoldCF 2.11
>    Reporter: Karl Wright
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1551.patch
>
>
> I've just made the patch to extend mcf-confluence-connector.
> The official site says that I can create a JIRA ticket for improvements.
> But I cannot access the JIRA through the firewall in our office.
> Can someone create a ticket on my behalf?
> The patch is attached to this mail.
> [Extension]
> o Support the page type 'blogpost' as well as 'page'. (*1)
> o Include the Japanese message catalog.
> [Bug Fix]
> o Ugly message when the 'Port' value is invalid.
> o Ugly message of 'Process Attachments' in 'View a Job'.
> o Some null pointer exceptions.
> (*1)
> Confluence has 2 different types of page.
> The current connector can only find 'page' typed pages.
> This extension can find both of them selectively.
> Thanks.
> Takashi SHIRAI





[jira] [Created] (CONNECTORS-1551) Various confluence connector issues

2018-10-24 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1551:
---

 Summary: Various confluence connector issues
 Key: CONNECTORS-1551
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1551
 Project: ManifoldCF
  Issue Type: Bug
  Components: Confluence connector
Affects Versions: ManifoldCF 2.11
Reporter: Karl Wright
Assignee: Karl Wright
 Fix For: ManifoldCF 2.12


I've just made the patch to extend mcf-confluence-connector.
The official site says that I can create a JIRA ticket for improvements.
But I cannot access the JIRA through the firewall in our office.
Can someone create a ticket on my behalf?

The patch is attached to this mail.
[Extension]
o Support the page type 'blogpost' as well as 'page'. (*1)
o Include the Japanese message catalog.
[Bug Fix]
o Ugly message when the 'Port' value is invalid.
o Ugly message of 'Process Attachments' in 'View a Job'.
o Some null pointer exceptions.

(*1)
Confluence has 2 different types of page.
The current connector can only find 'page' typed pages.
This extension can find both of them selectively.

Thanks.
Takashi SHIRAI





Re: How documents are deleted

2018-10-24 Thread Karl Wright
Hi Julien,

This is a complex question and the framework behaves differently depending
on the connector model.  Please read:

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

Karl


On Wed, Oct 24, 2018 at 5:26 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Hi Karl,
>
> I am trying to understand the behavior of ManifoldCF during a re-crawl,
> and especially how missing documents are deleted, and by which process.
>
> I am focusing on two repository connectors, the JCIFS one and the JDBC
> one. Here is what I understand so far:
>
> In the JCIFS connector, the addSeedDocuments method lists all the files
> found for each configured path. So it seems clear that any previously
> crawled files that have not been listed during a re-crawl by this method
> should be deleted.
>
> In the JDBC connector, the addSeedDocuments method lists only the new or
> modified documents during a re-crawl (if, of course, the id query is
> correctly using the starttime and endtime variables). So here, there is
> a difference between the two connectors. It means that to delete missing
> documents, the previously crawled ones need to be 'checked' with the
> version query to detect the documents that must be removed.
>
> I am currently unable to tell what ManifoldCF really does to deal with
> documents that must be deleted, or whether any of the assumptions I laid
> out above are correct and/or used. Also, I am really interested to know
> which part of the code performs the delete process.
>
> Thanks for your help.
>
> --
> Julien MASSIERA
> Product Development Director
> France Labs – The Search experts
> Meet us at the Enterprise Search & Discovery Summit in Washington DC
> www.francelabs.com
>
>
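
[A sketch of the JDBC seeding query described in the question above,
assuming the connector's documented $(IDCOLUMN), $(STARTTIME), and
$(ENDTIME) substitution variables and a hypothetical table and column pair:

  SELECT idfield AS $(IDCOLUMN) FROM documenttable
    WHERE modifydate BETWEEN $(STARTTIME) AND $(ENDTIME)

Only ids returned by this query are seeded on a re-crawl; that is why the
previously crawled documents must be checked via the version query before
they can be deleted.]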


Re: error when running jobs

2018-10-24 Thread Karl Wright
Hi Gustavo,

There's a great deal of noise in this log that ManifoldCF has nothing to do
with.  Did you turn on logging for the JDBC driver?  If so, can you turn it
off?

I *do* see signs that the forensics ran:

October 23rd 2018, 18:22:41.662 message:DEBUG 2018-10-23T18:22:57,492
(Worker thread '5') -  Forensics for record 1540306481078, current
status: 12  @version:1 @timestamp:October 23rd 2018, 18:22:41.662
ALCH_TENANT_ID:9c4ba694-9cf4-440c-8b63-d4f46ee61e5b type:syslog
timestamp:October 23rd 2018, 18:22:57.493
application_id_str:74f6767d-fcdd-43e2-8d85-521844f1fef0
message_type_str:OUT source_id_str:APP
space_id_str:9c4ba694-9cf4-440c-8b63-d4f46ee61e5b
org_id_str:666eed00-59f5-4800-8775-b816fa85b915 app_name_str:apwmcf
space_name_str:apps_pre instance_id_str:0
org_name_str:PRE_INTRANET_CD1 origin_str:rep tags_array:log,
unknown_msg_format _id:AWahvHx_nkO5H-t2xNdM _type:syslog
_index:logstash-2018.10.23 _score:


but what you will need to do, when you see the "Forensics for..."
message, is grep the log for the displayed record ID and include the last
chunk of those lines in your log dump.  What we see when the database isn't
working right is a clear indication that the record state was changed
and committed, and that later the record state has a different value,
which should be impossible.  It's as if the database forgets how to do ACID
properly.
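
[For instance, assuming the default manifoldcf.log file, something like

  grep "record 1540306481078" manifoldcf.log

would gather the forensics lines for one record in a single place.]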


Karl



On Wed, Oct 24, 2018 at 5:35 AM Gustavo Beneitez 
wrote:

> Hi Karl,
>
> we reproduced the issue yesterday and managed to get several logs.
> The exact error was:
>
> Error: Unexpected jobqueue status - record id 1540306481078, expecting
> active status, saw 12.
> So we performed a new search on the database and what we got was:
> [image: mail1.png]
>
> status G, that means  STATUS_PENDINGPURGATORY
> [image: mail2.png]
> Please find enclosed a log extraction filtered by this record; on line 53
> there is an ERROR message "Unexpected jobqueue status - record id
> 1540306481078, expecting active status, saw 12". I don't know if that's
> what you need, or whether we also have to increase the general log level.
>
> Thanks in advance.
>
>
> On Tue, Oct 23, 2018 at 14:28, Gustavo Beneitez (<
> gustavo.benei...@gmail.com>) wrote:
>
>> Thanks Karl, we are going to make new crawls with that property enabled
>> and will get back to you.
>>
>> On Tue, Oct 23, 2018 at 10:09, Karl Wright () wrote:
>>
>>> Add this to your properties.xml:
>>>
>>> 
>>>
>>> This keeps stuff in memory and dumps a lot to the log as well.
>>>
>>> I'm afraid that groveling through the logs after a failure to confirm
>>> it's the same kind of thing we've seen before takes many hours.  I can only
>>> promise to do this when I have the time.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 23, 2018 at 2:34 AM Gustavo Beneitez <
>>> gustavo.benei...@gmail.com> wrote:
>>>
>>>> Hi Karl,
>>>>
>>>> MySQL. As per config variables:
>>>> version  5.7.23-log
>>>> version comment MySQL Community Server (GPL)
>>>>
>>>> In which file should I enable logging/debugging?
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Oct 22, 2018 at 21:36, Karl Wright () wrote:
>>>>
>>>>> Hi Gustavo,
>>>>>
>>>>> I have seen this error before; it is apparently due to the database
>>>>> failing to properly gate transactions and behave according to the
>>>>> concurrency model selected for a transaction.  We have a debugging setting
>>>>> you can configure which logs the needed information so that forensics get
>>>>> dumped, and when they do, it's apparent what is happening.
>>>>>
>>>>> Note well that I have never been able to make this problem appear
>>>>> here, so I suspect that the issue is related to network latency or some
>>>>> other external factor I cannot easily reproduce.
>>>>>
>>>>> Just so I know -- what database is this?  The place where we've seen
>>>>> this is postgresql; later versions of MySql do not seem to have an issue.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Oct 22, 2018 at 1:44 PM Gustavo Beneitez <
>>>>> gustavo.benei...@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> lately we are facing job status problems. After a few minutes the job
>>>>>> ends suddenly, always the same way:

[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-23 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660515#comment-16660515
 ] 

Karl Wright commented on LUCENE-8540:
-

Hi [~ivera], can you have a look at this?  I'm quite busy today unfortunately.


> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Priority: Major
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related with https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  






[jira] [Assigned] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values

2018-10-23 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned LUCENE-8540:
---

Assignee: Ignacio Vera

> Geo3d quantization test failure for MAX/MIN encoding values
> ---
>
> Key: LUCENE-8540
> URL: https://issues.apache.org/jira/browse/LUCENE-8540
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
>
> Here is a reproducible error:
> {code:java}
> 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint
> 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig
> 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled 
> (@Nightly())
> 08:45:21[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization 
> -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
> 08:45:21[junit4] ERROR   0.20s J1 | TestGeo3DPoint.testQuantization <<<
> 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: 
> value=-1.0011188543037526 is out-of-bounds (less than than WGS84's 
> -planetMax=-1.0011188539924791)
> 08:45:21[junit4]> at 
> __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56)
> 08:45:21[junit4]> at 
> org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228)
> 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748)
> 08:45:21[junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, 
> docValues:{id=DocValuesFormat(name=Asserting), 
> point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, 
> maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, 
> locale=ga-IE, timezone=America/Bogota
> 08:45:21[junit4]   2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle 
> Corporation 1.8.0_181 
> (64-bit)/cpus=16,threads=1,free=466116320,total=536346624
> 08:45:21[junit4]   2> NOTE: All tests run in this JVM: [GeoPointTest, 
> RandomGeoPolygonTest, TestGeo3DPoint]
> 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 
> error, 1 skipped <<< FAILURES!{code}
>  
> It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or 
> encoding = Geo3DUtil.MAX_ENCODED_VALUE.
> It is related with https://issues.apache.org/jira/browse/LUCENE-7327
>  
>  
>  






Re: error when running jobs

2018-10-23 Thread Karl Wright
Add this to your properties.xml:



This keeps stuff in memory and dumps a lot to the log as well.
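
[The property tag was stripped above; as a sketch, assuming the diagnostics
logging switch is the one meant here, the entry would be:

  <property name="org.apache.manifoldcf.diagnostics" value="DEBUG"/>
]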

I'm afraid that groveling through the logs after a failure to confirm it's
the same kind of thing we've seen before takes many hours.  I can only
promise to do this when I have the time.

Karl


On Tue, Oct 23, 2018 at 2:34 AM Gustavo Beneitez 
wrote:

> Hi Karl,
>
> MySQL. As per config variables:
> version  5.7.23-log
> version comment MySQL Community Server (GPL)
>
> In which file should I enable logging/debugging?
>
> Thanks!
>
> On Mon, Oct 22, 2018 at 21:36, Karl Wright () wrote:
>
>> Hi Gustavo,
>>
>> I have seen this error before; it is apparently due to the database
>> failing to properly gate transactions and behave according to the
>> concurrency model selected for a transaction.  We have a debugging setting
>> you can configure which logs the needed information so that forensics get
>> dumped, and when they do, it's apparent what is happening.
>>
>> Note well that I have never been able to make this problem appear here,
>> so I suspect that the issue is related to network latency or some other
>> external factor I cannot easily reproduce.
>>
>> Just so I know -- what database is this?  The place where we've seen this
>> is postgresql; later versions of MySql do not seem to have an issue.
>>
>> Thanks,
>> Karl
>>
>>
>> On Mon, Oct 22, 2018 at 1:44 PM Gustavo Beneitez <
>> gustavo.benei...@gmail.com> wrote:
>>
>>> Hi Karl,
>>>
>>> lately we are facing job status problems. After a few minutes the job
>>> ends suddenly, always the same way:
>>>
>>> Error: Unexpected jobqueue status - record id 1539339908660, expecting
>>> active status, saw 2
>>> Error: Unexpected jobqueue status - record id 1539291541171, expecting
>>> active status, saw 2
>>> Error: Unexpected jobqueue status - record id 1539294182173, expecting
>>> active status, saw 2
>>> Error: Unexpected jobqueue status - record id 1539338949797, expecting
>>> active status, saw 2
>>>
>>> I did some investigations and a select to the database after the error
>>> appeared and BEFORE rerunning the job:
>>>
>>> SELECT * FROM `jobqueue` WHERE id = 1539336459053 and jobid =
>>> 1539269973731
>>>
>>>
>>> it returned status = 'G'
>>>
>>>
>>> After the run was repeated, it finished OK  and same query returned
>>> status = 'C'.
>>>
>>> I don't understand much about the "active" workers, but it seems the item
>>> is processed twice. Do you have an idea about what we should investigate?
>>>
>>>
>>> Thanks in advance!
>>>
>>


Re: error when running jobs

2018-10-22 Thread Karl Wright
Hi Gustavo,

I have seen this error before; it is apparently due to the database failing
to properly gate transactions and behave according to the concurrency model
selected for a transaction.  We have a debugging setting you can configure
which logs the needed information so that forensics get dumped, and when
they do, it's apparent what is happening.

Note well that I have never been able to make this problem appear here, so
I suspect that the issue is related to network latency or some other
external factor I cannot easily reproduce.

Just so I know -- what database is this?  The place where we've seen this
is postgresql; later versions of MySql do not seem to have an issue.

Thanks,
Karl


On Mon, Oct 22, 2018 at 1:44 PM Gustavo Beneitez 
wrote:

> Hi Karl,
>
> lately we are facing job status problems. After a few minutes the job ends
> suddenly, always the same way:
>
> Error: Unexpected jobqueue status - record id 1539339908660, expecting
> active status, saw 2
> Error: Unexpected jobqueue status - record id 1539291541171, expecting
> active status, saw 2
> Error: Unexpected jobqueue status - record id 1539294182173, expecting
> active status, saw 2
> Error: Unexpected jobqueue status - record id 1539338949797, expecting
> active status, saw 2
>
> I did some investigations and a select to the database after the error
> appeared and BEFORE rerunning the job:
>
> SELECT * FROM `jobqueue` WHERE id = 1539336459053 and jobid = 1539269973731
>
>
> it returned status = 'G'
>
>
> After the run was repeated, it finished OK  and same query returned status
> = 'C'.
>
> I don't understand much about the "active" workers, but it seems the item
> is processed twice. Do you have an idea about what we should investigate?
>
>
> Thanks in advance!
>


[jira] [Resolved] (CONNECTORS-1550) HTML Tag mapping

2018-10-19 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1550.
-
Resolution: Not A Problem

Hi [~DonaldVdD], please post questions like this to the 
us...@manifoldcf.apache.org mailing list.  Jira is meant for bugs and 
enhancement requests.  Thank you!


> HTML Tag mapping
> 
>
> Key: CONNECTORS-1550
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1550
> Project: ManifoldCF
>  Issue Type: Wish
>  Components: Elastic Search connector, Tika extractor, Web connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Donald Van den Driessche
>Priority: Major
>
> I’ll be crawling a website with the standard Web connector. I want to extract 
> just certain html tags like ,  and . 
> I’ve set up an HTML extractor transformation connector and the internal Tika 
> transformation connector. But I can’t find any place to do a mapping to the 
> output for this.
>  
> Do I have to write my own transformation connector to extract the content of 
> these tags? Or is there a built-in solution?





[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1549:

Attachment: CONNECTORS-1549.patch

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1549.patch, 
> image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, 
> image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Resolved] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1549.
-
Resolution: Fixed

r1844293

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: CONNECTORS-1549.patch, 
> image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, 
> image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1549:

Fix Version/s: ManifoldCF 2.12

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Fix For: ManifoldCF 2.12
>
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656073#comment-16656073
 ] 

Karl Wright commented on CONNECTORS-1549:
-

I found the issue and have attached a patch.  Thanks!


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655986#comment-16655986
 ] 

Karl Wright commented on CONNECTORS-1549:
-

Hi [~julienFL]

Sorry for the delay.

First note that you can always use the order-preserving form even if MCF 
outputs the JSON in the other "sugary" form.  So this should unblock you.

Second, I'm looking at the code that generates the output in Configuration.java:

{code}
// The new JSON parser uses hash order for object keys.  So it isn't good enough to just detect that there's an
// intermingling.  Instead we need to detect the existence of more than one key; that implies that we need to do order preservation.
String lastChildType = null;
boolean needAlternate = false;
int i = 0;
while (i < getChildCount())
{
  ConfigurationNode child = findChild(i++);
  String key = child.getType();
  List<ConfigurationNode> list = childMap.get(key);
  if (list == null)
  {
// We found no existing list, so create one
list = new ArrayList<ConfigurationNode>();
childMap.put(key,list);
childList.add(key);
  }
  // Key order comes into play when we have elements of different types within the same child.
  if (lastChildType != null && !lastChildType.equals(key))
  {
needAlternate = true;
break;
  }
  list.add(child);
  lastChildType = key;
}

if (needAlternate)
{
  // Can't use the array representation.  We'll need to start a _children_ object, and enumerate
  // each child.  So, the JSON will look like:
  //  <key>:{_attribute_:xxx,_children_:[{_type_:<type>, ...},{_type_:<type>, ...}, ...]}
...
{code}

The (needAlternate) clause is the one that writes the specification in the 
verbose form.  The logic seems like it would detect any time there's a subtree 
with a different key under a given level and set "needAlternate".  I'll stare 
at it some more but right now I'm having trouble seeing how this fails.
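
[For illustration, a rough sketch of that order-preserving form, with
hypothetical node and attribute names rather than the exact JCIFS schema:

{code}
"documentspecification" : {
  "_children_" : [
    { "_type_" : "include", "_attribute_match" : "*.doc" },
    { "_type_" : "exclude", "_attribute_match" : "~$*" },
    { "_type_" : "include", "_attribute_match" : "*.xls" }
  ]
}
{code}

Because every rule sits in the single _children_ array, each tagged with
_type_, the relative order of include and exclude rules survives a round
trip through the API.]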


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>    Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
> Attachments: image-2018-10-18-18-28-14-547.png, 
> image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png
>
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655223#comment-16655223
 ] 

Karl Wright commented on CONNECTORS-1549:
-

Hi [~julienFL], there was a similar ticket a while back for the file system 
connector.  Let me explain what the solution was and see if you still think 
there is a problem.

(1) The actual internal representation of a Document Specification is XML.
(2) For the API, we convert the XML to JSON and back.
(3) Because a complete and unambiguous conversion between these formats is 
quite ugly, we have multiple ways of doing the conversion, so that we allow 
"syntactic sugar" in the JSON for specific cases where the conversion can be 
done simply.
(4) A while back, there was a bug in the code that determined whether it was 
possible to use syntactic sugar of the specific kind that would lead to two 
independent lists for the File System Connector's document specification, so 
for a while what was *output* when you exported the Job was incorrect, and 
order would be lost if you re-imported it.

The solution was to (a) fix the bug, and (b) get the person using the API to 
use the correct, unambiguous JSON format instead of the "sugary" format.  This 
preserves order.

The way to see if this is what you are up against is to create a JCIFS job with 
a complex rule set that has both inclusions and exclusions.  If it looks 
different than what you are expecting, then try replicating that format when 
you import via the API.


> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>    Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Assigned] (CONNECTORS-1549) Include and exclude rules order lost

2018-10-18 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1549:
---

Assignee: Karl Wright

> Include and exclude rules order lost
> 
>
> Key: CONNECTORS-1549
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1549
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: API, JCIFS connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Critical
>
> The include and exclude rules that can be defined in the job configuration 
> for the JCIFS connector can be combined, and the defined order is really 
> important.
> The problem is that when one retrieves the job configuration as a JSON object 
> through the API, the include and exclude rules are split into two different 
> arrays instead of one (one for each type of rule). So, the order is 
> completely lost when one tries to recreate the job via the API and the 
> JSON object.





[jira] [Updated] (CONNECTORS-1548) CMIS output connector test fails with versioning state error

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1548:

Description: 
While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

Nested exception is:

{code}
[junit] Caused by: 
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514)
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717)
[junit] at 
org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122)
[junit] at 
org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.


  was:
While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.



> CMIS output connector test fails with versioning state error
> 
>
> Key: CONNECTORS-1548
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1548
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: CMIS Output Connector
>    Reporter: Karl Wright
>Assignee: Piergiorgio Lucidi
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
> test failures.  Specifically, here's the trace:
> {code}
> [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
> versioning state flag is imcompatible to the type definition.
> [junit] at 
> org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
> {code}
> Nested exception is:
> {code}
> [junit] Caused by: 
> org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The 
> versioning state flag is imcompatible to the type definition.
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514)
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717)
> [junit] at 
> org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122)
> [junit] at 
> org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158)
> {code}
> This may (or may not) be related to the Tika code now using a different 
> implementation of jaxb.  I've moved all of jaxb and its dependent classes 
> into connector-common-lib accordingly, and have no specific inclusions of 
> jaxb in any connector class that would need it to be in connector-lib.
>

[jira] [Created] (CONNECTORS-1548) CMIS output connector test fails with versioning state error

2018-10-17 Thread Karl Wright (JIRA)
Karl Wright created CONNECTORS-1548:
---

 Summary: CMIS output connector test fails with versioning state 
error
 Key: CONNECTORS-1548
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1548
 Project: ManifoldCF
  Issue Type: Bug
  Components: CMIS Output Connector
Reporter: Karl Wright
Assignee: Piergiorgio Lucidi
 Fix For: ManifoldCF 2.12


While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector 
test failures.  Specifically, here's the trace:

{code}
[junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The 
versioning state flag is imcompatible to the type definition.
[junit] at 
org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994)
{code}

This may (or may not) be related to the Tika code now using a different 
implementation of jaxb.  I've moved all of jaxb and its dependent classes into 
connector-common-lib accordingly, and have no specific inclusions of jaxb in 
any connector class that would need it to be in connector-lib.

It has been committed to trunk; r1844137.  Please verify (or disprove) that the 
problem is the new jaxb implementation.  If it is we'll need to figure out why 
CMIS cares which implementation is used.






[jira] [Resolved] (CONNECTORS-1547) No activity record for for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1547.
-
Resolution: Fixed

r1844120


> No activity record for for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>    Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box:
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the documents excluded (except for html 
> documents). They have fetch activity and that's all (see 
> simple_history_web.jpeg).
> We can only see the excluded documents in the MCF log (with DEBUG verbosity 
> on connectors):
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job but for a Local File system connector with the 
> same Document Filter transformation connector, the simple history mentions 
> all the excluded documents (see 
> simple_history_files.jpeg), and the code uses a specific error code with 
> an activity record logged (class FileConnector l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded, I 
> think.
>  
> Best regards,
> Olivier





[jira] [Updated] (CONNECTORS-1547) No activity record for for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1547:

Fix Version/s: ManifoldCF 2.12

> No activity record for for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>    Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box:
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the documents excluded (except for html 
> documents). They have fetch activity and that's all (see 
> simple_history_web.jpeg).
> We can only see the excluded documents in the MCF log (with DEBUG verbosity 
> on connectors):
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job but for a Local File system connector with the 
> same Document Filter transformation connector, the simple history mentions 
> all the excluded documents (see 
> simple_history_files.jpeg), and the code uses a specific error code with 
> an activity record logged (class FileConnector l. 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded, I 
> think.
>  
> Best regards,
> Olivier





[jira] [Assigned] (CONNECTORS-1547) No activity record for for excluded documents in WebCrawlerConnector

2018-10-17 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1547:
---

Assignee: Karl Wright

> No activity record for for excluded documents in WebCrawlerConnector
> 
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Reporter: Olivier Tavard
>    Assignee: Karl Wright
>Priority: Minor
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
> simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by 
> the Document Filter transformation connector  in the WebCrawler connector.
> To reproduce the issue on MCF out of the box :
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) 
> documents
> The simple history does not mention the excluded documents (except for html 
> documents). They have a fetch activity and that's all (see 
> simple_history_web.jpeg).
> We can only see the documents excluded by the MCF log (with DEBUG verbosity 
> activity on connectors) :
> {code:java}
> Removing url 
> 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png'
>  because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type 
> ('"+contentType+"')";
>  fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
>  activityResultCode = null;{code}
> The activityResultCode is null.
>  
>  
> If we configure the same job but with a Local File system connector and the 
> same Document Filter transformation connector, the simple history mentions 
> all the excluded documents (see simple_history_files.jpeg), and the code 
> uses a specific error code with an activity record logged (class 
> FileConnector l. 415) : 
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
>  {
>  errorCode = activities.EXCLUDED_MIMETYPE;
>  errorDesc = "Excluded because mime type ('"+mimeType+"')";
>  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because 
> mime type ('"+mimeType+"') was excluded by output connector.");
>  activities.noDocument(documentIdentifier,versionString);
>  continue;
>  }{code}
>  
> So the Web Crawler connector should have the same behaviour as the 
> FileConnector and explicitly mention all the documents excluded by the user, I 
> think.
>  
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: ManifoldCF database model

2018-10-17 Thread Karl Wright
Ok, the schema is described in ManifoldCF In Action.

https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

Karl


On Wed, Oct 17, 2018 at 7:41 AM Gustavo Beneitez 
wrote:

> Hi Karl,
>
> as far as I was able to gather from the history records, I could see MCF
> is behaving as expected. The "problem" shows when ElasticSearch is down or
> performing badly: MCF says the document was requested to be deleted, but
> while it has been erased from the database, it is still alive on the
> ElasticSearch side, so I need to find out whether those kinds of
> inconsistencies exist.
>
> Please allow us to check those documents and run new tests in order to see
> what really happens; we don't modify any database record by hand.
>
> Thanks!
>
>
>
>
>
>
>
> On Tue, 16 Oct 2018 at 19:27, Karl Wright ()
> wrote:
>
> > Hi, you can look at ManifoldCF In Action.  There's a link to it on the
> > manifoldcf page.
> >
> > However, you should be aware that we consider it a severe bug if
> ManifoldCF
> > doesn't clean up after itself.  The only time that is not expected is
> when
> > people write buggy connectors or mess with database tables themselves.  I
> > would urge you to examine the Simple History report and try to come up
> with
> > a reproducible test case rather than trying to reverse engineer MCF.
> > Should you go directly to the database, we will be unable to give you any
> > support.
> >
> > Thanks,
> > Karl
> >
> >
> > On Tue, Oct 16, 2018 at 11:51 AM Gustavo Beneitez <
> > gustavo.benei...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > how do you do? I was wondering if there is any technical document about
> > > the meaning of each table in the database and the relationships between
> > > documents, repositories, jobs and any other output connector (some kind
> > > of a database model).
> > >
> > > We are facing some "garbage issues": jobs are created, duplicated,
> > > related to transformations, linked to outputs (Elastic Search), run and
> > > finally deleted, but in the end documents that should also be deleted
> > > from the output connector sometimes are still there; we don't know if
> > > they are visible because they point to an existing job, an unexpected
> > > job end or any other failure.
> > >
> > > We need to understand the database model in order to check when
> > > documents stored in Elastic can be safely removed because they are no
> > > longer referred to by any process; a check that would be executed
> > > periodically, every week for example.
> > >
> > > Thanks in advance!
> > >
> >
>


Re: ManifoldCF database model

2018-10-16 Thread Karl Wright
Hi, you can look at ManifoldCF In Action.  There's a link to it on the
manifoldcf page.

However, you should be aware that we consider it a severe bug if ManifoldCF
doesn't clean up after itself.  The only time that is not expected is when
people write buggy connectors or mess with database tables themselves.  I
would urge you to examine the Simple History report and try to come up with
a reproducible test case rather than trying to reverse engineer MCF.
Should you go directly to the database, we will be unable to give you any
support.

Thanks,
Karl


On Tue, Oct 16, 2018 at 11:51 AM Gustavo Beneitez <
gustavo.benei...@gmail.com> wrote:

> Hi all,
>
> how do you do? I was wondering if there is any technical document about
> the meaning of each table in the database and the relationships between
> documents, repositories, jobs and any other output connector (some kind of
> a database model).
>
> We are facing some "garbage issues": jobs are created, duplicated, related
> to transformations, linked to outputs (Elastic Search), run and finally
> deleted, but in the end documents that should also be deleted from the
> output connector sometimes are still there; we don't know if they are
> visible because they point to an existing job, an unexpected job end or any
> other failure.
>
> We need to understand the database model in order to check when documents
> stored in Elastic can be safely removed because they are no longer referred
> to by any process; a check that would be executed periodically, every week
> for example.
>
> Thanks in advance!
>


Re: Create documents from transformation connector

2018-10-16 Thread Karl Wright
Hi Julien,

That is one thing you cannot do with the MCF pipeline. All documents must
originate in a RepositoryConnector.  The repository connector can create
multiple subdocuments itself, if need be, but the rest of the pipeline does
not allow further splitting.

One way around this: If the second document is intended for a second
output, you can write a transformer that just converts the original
document into the "new" one, and then create your pipeline so that your
transformer is in the path to the second output but not the first.

Your description of the problem argues, however, for adding archive
disassembly to the file system connector, frankly.

Karl
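A rough sketch of that transformer workaround, with method names following the BaseTransformationConnector API as I recall it; treat the exact signatures as assumptions to verify against your MCF version:

{code:java}
import java.io.IOException;
import org.apache.manifoldcf.agents.interfaces.*;
import org.apache.manifoldcf.agents.transformation.BaseTransformationConnector;
import org.apache.manifoldcf.core.interfaces.ManifoldCFException;
import org.apache.manifoldcf.core.interfaces.VersionContext;

// Hypothetical transformer that replaces the incoming document with a
// converted one and forwards it downstream; one document in, one document out.
public class ConvertingTransformer extends BaseTransformationConnector
{
  @Override
  public int addOrReplaceDocumentWithException(String documentURI,
    VersionContext pipelineDescription, RepositoryDocument document,
    String authorityNameString, IOutputAddActivity activities)
    throws ManifoldCFException, ServiceInterruption, IOException
  {
    RepositoryDocument converted = convert(document); // hypothetical helper
    // Exactly one outgoing document per incoming one; no splitting is possible here.
    return activities.sendDocument(documentURI, converted);
  }

  private RepositoryDocument convert(RepositoryDocument original)
  {
    return original; // placeholder for the actual conversion logic
  }
}
{code}

By putting such a transformer only in the path to the second output, the first output still receives the original document.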


On Tue, Oct 16, 2018 at 12:09 PM Julien 
wrote:

> Hi Karl,
>
> I was wondering if there is a simple way to generate multiple documents
> from a transformation connector.
>
> My use case is the following :
> I have some files that are archive files and I would like to create a
> transformation connector that will be able to extract the files within the
> archives and create a new MCF document for each extracted one, so they will
> be processed by the next connectors of my job pipeline.
>
> What would be the best approach in your opinion ?
>
> Regards,
> Julien
>
>
>
> ---
> This email has been checked for viruses by the Avast antivirus
> software.
> https://www.avast.com/antivirus
>


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651950#comment-16651950
 ] 

Karl Wright commented on CONNECTORS-1546:
-

I agree with your decision.


> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations. On the contrary, performance on 
> the write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.
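For anyone who does want to trigger it by hand after a crawl, a minimal sketch of a manual forcemerge call against the Elasticsearch REST API; the host and index name are placeholders, and on older Elasticsearch versions the endpoint was _optimize rather than _forcemerge:

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class ManualForceMerge {
  public static void main(String[] args) throws Exception {
    // POST /<index>/_forcemerge; "myindex" and localhost:9200 are placeholders.
    URL url = new URL("http://localhost:9200/myindex/_forcemerge?max_num_segments=1");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    System.out.println("HTTP " + conn.getResponseCode());
    conn.disconnect();
  }
}
{code}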



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651761#comment-16651761
 ] 

Karl Wright commented on CONNECTORS-1546:
-

Hi [~st...@remcam.net], can you comment on this?

> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations. On the contrary, performance on 
> the write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'

2018-10-16 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1546:
---

Assignee: Steph van Schalkwyk

> Optimize Elasticsearch performance by removing 'forcemerge'
> ---
>
> Key: CONNECTORS-1546
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1546
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Elastic Search connector
>Reporter: Hans Van Goethem
>Assignee: Steph van Schalkwyk
>Priority: Major
>
> After crawling with ManifoldCF, forcemerge is applied to optimize the 
> Elasticsearch index. This optimization makes Elasticsearch faster for 
> read operations but not for write operations. On the contrary, performance on 
> the write operations becomes worse after every forcemerge. 
> Can you remove this forcemerge in ManifoldCF to optimize performance for 
> recurrent crawling to Elasticsearch?
> If someone needs this forcemerge, it can be applied manually against 
> Elasticsearch directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1545) Parentheses in editConfiguration tab labels not supported

2018-10-12 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1545.
-
   Resolution: Fixed
Fix Version/s: ManifoldCF 2.12

> Parentheses in editConfiguration tab labels not supported
> -
>
> Key: CONNECTORS-1545
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1545
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Kishore Kumar
>Priority: Minor
> Fix For: ManifoldCF 2.12
>
>
> In the editConfiguration view of the web connector, the tab 
> 'AccessCredentials' is not "clickable" in the French version of the UI.
> It seems to be due to the presence of parentheses in the string "Informations 
> d'accès (Access Credentials)", as it is used to trigger the javascript 
> function of the tab and so is misinterpreted.
> Need to check if similar cases are present for other connectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
I cannot reproduce your problem.  Perhaps you can download a new instance
and configure it from scratch using the embedded tika?  If that works it
should be possible to figure out what the difference is.

Karl

On Thu, Oct 11, 2018, 12:23 PM Bisonti Mario 
wrote:

> I tried to update Solr, Tika server and ManifoldCF to the last versions.
>
>
>
> I tried to add another transformation before the Tika transformation to
> filter the allowed documents as you suggested in another discussion, but
> nothing changed.
>
> I always have the same Result Code: EXCLUDEDMIMETYPE
>
>
>
>
>
> I read another discussion (
> https://lists.apache.org/thread.html/66a3f9780bbcc98e404e25f5a0e56a8a6c007448642c3bc15a366ed2@%3Cuser.manifoldcf.apache.org%3E)
>  but I don’t understand whether they solved the issue
>
>
>
> ☹
>
>
>
> Thanks a lot.
>
> Mario
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 14:57
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> When the "use extracting update handler" field is
> UNCHECKED, the mime types you list are IGNORED.  Only "text" mime types are
> accepted by the Solr connection in that case.  But that is exactly what the
> Tika extractor sends along, and many other people do this, and I can make
> it work fine here, so I don't know what you are doing wrong.
>
>
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 8:37 AM Bisonti Mario 
> wrote:
>
> This is my solr output connection:
>
>
>
> I tried to put content_type as “Mime type field name:” but the result is
> always the same
>
>
>
> Could it be that, by unchecking the flag, ManifoldCF doesn’t use the mime
> types specified?
>
>
>
> I am using a snapshot version of ManifoldCF from three months ago.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 14:20
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> I confirmed that both the Tika Service transformer and the Tika
> transformer check the same exact mime type:
>
> >>>>>>
>
>   @Override
>
>   public boolean checkMimeTypeIndexable(VersionContext
> pipelineDescription, String mimeType, IOutputCheckActivity checkActivity)
>
> throws ManifoldCFException, ServiceInterruption
>
>   {
>
> // We should see what Tika will transform
>
> // MHL
>
> // Do a downstream check
>
> return
> checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
>
>   }
>
> <<<<<<
>
>
>
> So: please verify that your Solr connection is set up correctly and the
> "use extracting update handler" box is UNCHECKED.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 8:16 AM Karl Wright  wrote:
>
> When you uncheck the "use extracting update handler" checkbox, the Solr
> connection only accepts text/plain, and no binary formats.  The Tika
> extractor, though, should set the mime type always to "text/plain".  Since
> the Simple History says otherwise, I wonder if there's a problem with the
> external Tika extractor.  Perhaps you can try the internal one to get your
> pipeline working first?  If the external one does not send the right mime
> type, then we need to correct that so you should open a ticket.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario 
> wrote:
>
> Now the document isn’t ingested by Solr because I obtain:
>
>
>
> Solr connector rejected document due to mime type restrictions:
> (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
>
>
>
>
>
> But the mime type is on the tab
>
>
>
>
>
> And the settings worked well when I used Tika inside solr.
>
>
>
> Could you help me?
>
> Thanks
>
>
>
> *From:* Bisonti Mario 
> *Sent:* Thursday, 11 October 2018 14:03
> *To:* user@manifoldcf.apache.org
> *Subject:* R: How to set Tika with ManifoldCF and Solr
>
>
>
>
>
> My mistake…
>
> As you wrote me I had to uncheck “use extracting update handler”
>
>
>
> Now I have to understand the field mentioned in schema etc.
>
>
>
> *From:* Bisonti Mario 
> *Sent:* Thursday, 11 October 2018 13:45
> *To:* user@manifoldcf.apache.org
> *Subject:* R: How to set Tika with ManifoldCF and Solr
>
>
>
> I see the job processed but without the document inside.
>
> 10-11-2018 13:32:25.649

Re: Logging and Document filter transformation connector

2018-10-11 Thread Karl Wright
Hi Olivier,

The Repository connector has no knowledge of what the pipeline looks like.
It simply asks the framework whether the mime type, length, etc. is
acceptable to the downstream pipeline.  It's the connector's responsibility
to note the reason for the rejection in the simple history, but it does not
have any knowledge whatsoever of which connector rejected the document, and
therefore cannot say which transformer or output rejected the document.

Transformation and output connectors which respond to checks for document
mime type or length checks likewise do not have any knowledge of the
upstream connector that is doing the checking.

Karl



On Thu, Oct 11, 2018 at 9:31 AM Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hello,
>
> I have a question regarding the Document filter transformation connector
> and the log about it.
> I would like to have a look at all the documents excluded by the rules
> configured in the Document filter transformation connector, by looking at
> the Simple history or at the MCF log, but it is not easy so far.
>
> Let’s say that I want to crawl a website and I want to index html pages
> only. So I configure a web repository connector with a Document filter
> transformation connector and I create the rule with only one allowed mime
> type and one file extension. So far so good, the job works well, but if I
> want to see in the MCF log or in the simple history all the files that
> were excluded by the transformation connector, it quickly gets
> complicated: I have to search manually for all the files that were fetched
> but not processed by the Tika transformation connector or ingested by the
> output connector.
>
> From my understanding of the code, the document filter transformation
> connector can communicate directly with the repository connector to
> indicate the document exclusion rules, so the documents that need to be
> excluded are not processed by the Document filter transformation
> connector but are directly excluded by the web repository connector.
> So in the simple history, I can see that a document that will be excluded
> has a "fetch" activity and that’s it; there is no additional information
> about it.
> Would it be possible to add a log entry with an explicit result code such
> as excluded by "document filter connector" or something similar when the
> document is excluded by the repository connector?
>
> Thank you,
> Best regards,
> Olivier
>
>


Re: Debug logging properties location

2018-10-11 Thread Karl Wright
Hi Olivier, it sounds like you are using Zookeeper.  Certain properties are
global and are imported into Zookeeper.  Other properties are local and
found in each local properties.xml file.  The debug properties for logging
is, I believe, global.

Karl


On Thu, Oct 11, 2018 at 8:39 AM Olivier Tavard <
olivier.tav...@francelabs.com> wrote:

> Hello,
>
> I have a question regarding the debug logging properties and their
> location in the multi process model.
> If I put the properties in the properties.xml file (as
> org.apache.manifoldcf.connectors for example), it seems that the properties
> are not taken into account. On the other hand, if I put them in
> the global-properties.xml file it is OK.
> Is this the normal behaviour? I thought that the global-properties file was
> only used for shared configuration. For example the property
> org.apache.manifoldcf.logconfigfile is located in properties.xml and not in
> global-properties.xml.
>
> Thanks,
> Best regards,
>
> Olivier
>
>
>


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
When the "use extracting update handler" field is
UNCHECKED, the mime types you list are IGNORED.  Only "text" mime types are
accepted by the Solr connection in that case.  But that is exactly what the
Tika extractor sends along, and many other people do this, and I can make
it work fine here, so I don't know what you are doing wrong.

Karl


On Thu, Oct 11, 2018 at 8:37 AM Bisonti Mario 
wrote:

> This is my solr output connection:
>
>
>
> I tried to put content_type as “Mime type field name:” but the result is
> always the same
>
>
>
> Could it be that, by unchecking the flag, ManifoldCF doesn’t use the mime
> types specified?
>
>
>
> I am using a snapshot version of ManifoldCF from three months ago.
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 14:20
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> I confirmed that both the Tika Service transformer and the Tika
> transformer check the same exact mime type:
>
> >>>>>>
>
>   @Override
>
>   public boolean checkMimeTypeIndexable(VersionContext
> pipelineDescription, String mimeType, IOutputCheckActivity checkActivity)
>
> throws ManifoldCFException, ServiceInterruption
>
>   {
>
> // We should see what Tika will transform
>
> // MHL
>
> // Do a downstream check
>
> return
> checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
>
>   }
>
> <<<<<<
>
>
>
> So: please verify that your Solr connection is set up correctly and the
> "use extracting update handler" box is UNCHECKED.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 8:16 AM Karl Wright  wrote:
>
> When you uncheck the "use extracting update handler" checkbox, the Solr
> connection only accepts text/plain, and no binary formats.  The Tika
> extractor, though, should set the mime type always to "text/plain".  Since
> the Simple History says otherwise, I wonder if there's a problem with the
> external Tika extractor.  Perhaps you can try the internal one to get your
> pipeline working first?  If the external one does not send the right mime
> type, then we need to correct that so you should open a ticket.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario 
> wrote:
>
> Now the document isn’t ingested by Solr because I obtain:
>
>
>
> Solr connector rejected document due to mime type restrictions:
> (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
>
>
>
>
>
> But the mime type is on the tab
>
>
>
>
>
> And the settings worked well when I used Tika inside solr.
>
>
>
> Could you help me?
>
> Thanks
>
>
>
> *From:* Bisonti Mario 
> *Sent:* Thursday, 11 October 2018 14:03
> *To:* user@manifoldcf.apache.org
> *Subject:* R: How to set Tika with ManifoldCF and Solr
>
>
>
>
>
> My mistake…
>
> As you wrote me I had to uncheck “use extracting update handler”
>
>
>
> Now I have to understand the field mentioned in schema etc.
>
>
>
> *From:* Bisonti Mario 
> *Sent:* Thursday, 11 October 2018 13:45
> *To:* user@manifoldcf.apache.org
> *Subject:* R: How to set Tika with ManifoldCF and Solr
>
>
>
> I see the job processed but without the document inside.
>
> 10-11-2018 13:32:25.649   job end     1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
> 10-11-2018 13:32:14.211   job start   1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
>
>
>
>
>
>
>
>
>
> Do I have to uncheck the “Use the Extract Update Handler” option on my
> Solr output connection?
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 13:36
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> Please have a look at your "Simple History" report to see why the
> documents aren't getting indexed.
>
>
>
> Thanks,
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario 
> wrote:
>
> Thanks Karl.
>
> I tried, but it doesn’t index documents.
>
> It seems that it doesn’t see them.
>
>
>
> Perhaps the problem is the “Ignore Tika exception” setting, which I don’t
> know where to set in ManifoldCF?
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 12

Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
I confirmed that both the Tika Service transformer and the Tika transformer
check the same exact mime type:

>>>>>>
  @Override
  public boolean checkMimeTypeIndexable(VersionContext pipelineDescription,
String mimeType, IOutputCheckActivity checkActivity)
throws ManifoldCFException, ServiceInterruption
  {
// We should see what Tika will transform
// MHL
// Do a downstream check
return checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }
<<<<<<

So: please verify that your Solr connection is set up correctly and the
"use extracting update handler" box is UNCHECKED.

Thanks,
Karl


On Thu, Oct 11, 2018 at 8:16 AM Karl Wright  wrote:

> When you uncheck the "use extracting update handler" checkbox, the Solr
> connection only accepts text/plain, and no binary formats.  The Tika
> extractor, though, should set the mime type always to "text/plain".  Since
> the Simple History says otherwise, I wonder if there's a problem with the
> external Tika extractor.  Perhaps you can try the internal one to get your
> pipeline working first?  If the external one does not send the right mime
> type, then we need to correct that so you should open a ticket.
>
> Thanks,
> Karl
>
>
> On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario 
> wrote:
>
>> Now the document isn’t ingested by Solr because I obtain:
>>
>>
>>
>> Solr connector rejected document due to mime type restrictions:
>> (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
>>
>>
>>
>>
>>
>> But the mime type is on the tab
>>
>>
>>
>>
>>
>> And the settings worked well when I used Tika inside solr.
>>
>>
>>
>> Could you help me?
>>
>> Thanks
>>
>>
>>
>> *From:* Bisonti Mario 
>> *Sent:* Thursday, 11 October 2018 14:03
>> *To:* user@manifoldcf.apache.org
>> *Subject:* R: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>>
>>
>> My mistake…
>>
>> As you wrote me I had to uncheck “use extracting update handler”
>>
>>
>>
>> Now I have to understand the field mentioned in schema etc.
>>
>>
>>
>> *From:* Bisonti Mario 
>> *Sent:* Thursday, 11 October 2018 13:45
>> *To:* user@manifoldcf.apache.org
>> *Subject:* R: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> I see the job processed but without the document inside.
>>
>> 10-11-2018 13:32:25.649   job end     1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
>> 10-11-2018 13:32:14.211   job start   1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Do I have to uncheck the “Use the Extract Update Handler” option on my
>> Solr output connection?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright 
>> *Sent:* Thursday, 11 October 2018 13:36
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> Please have a look at your "Simple History" report to see why the
>> documents aren't getting indexed.
>>
>>
>>
>> Thanks,
>>
>> Karl
>>
>>
>>
>>
>>
>> On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario 
>> wrote:
>>
>> Thanks Karl.
>>
>> I tried, but it doesn’t index documents.
>>
>> It seems that it doesn’t see them.
>>
>>
>>
>> Perhaps the problem is the “Ignore Tika exception” setting, which I don’t
>> know where to set in ManifoldCF?
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Karl Wright 
>> *Sent:* Thursday, 11 October 2018 12:24
>> *To:* user@manifoldcf.apache.org
>> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>>
>>
>>
>> Hi Mario,
>>
>>
>>
>> (1) When you use the Tika server externally, you do not get the
>> boilerpipe HTML extractor available for configuration and use.  That is
>> because it's external now.
>>
>> (2) In your Solr connection, you want to uncheck the box that says "use
>> extracting update handler", and you want to change the output handler from
>> "/update/extract" to just "/update".
>>
>>
>>
>> Karl
>>

Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
Please have a look at your "Simple History" report to see why the documents
aren't getting indexed.

Thanks,
Karl


On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario 
wrote:

> Thanks Karl.
>
> I tried, but it doesn’t index documents.
>
> It seems that it doesn’t see them.
>
>
>
> Perhaps the problem is the “Ignore Tika exception” setting, which I don’t
> know where to set in ManifoldCF?
>
>
>
>
>
>
>
>
>
>
>
> *From:* Karl Wright 
> *Sent:* Thursday, 11 October 2018 12:24
> *To:* user@manifoldcf.apache.org
> *Subject:* Re: How to set Tika with ManifoldCF and Solr
>
>
>
> Hi Mario,
>
>
>
> (1) When you use the Tika server externally, you do not get the boilerpipe
> HTML extractor available for configuration and use.  That is because it's
> external now.
>
> (2) In your Solr connection, you want to uncheck the box that says "use
> extracting update handler", and you want to change the output handler from
> "/update/extract" to just "/update".
>
>
>
> Karl
>
>
>
>
>
> On Thu, Oct 11, 2018 at 4:45 AM Bisonti Mario 
> wrote:
>
> Hallo.
>
> I would like to use a Tika server started from the command line with
> ManifoldCF, so that ManifoldCF, via a transformation connector, processes
> documents with Tika and indexes to the Solr output connector.
>
>
>
> I started Tika server:
> java -jar /opt/tika/tika-server-1.19.1.jar
>
>
>
> Then I created a transformation connection with Tika server host localhost
> and Tika port 998, and the connection works.
>
>
>
> Then I created a job, and in the Connection tab I inserted the
> transformation I had already created, before the Solr output.
>
>
>
>
>
> Note that I don’t see the “Exception” and “Boilerplate” tabs.
>
> Why is this?
>
>
>
> Furthermore, if I start the job, I see that Solr hangs with an exception:
>
> 2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share]
> o.e.j.s.HttpChannel /solr/core_share/update/extract
>
> java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
>
> at java.lang.Class.forName0(Native Method) ~[?:?]
>
> at java.lang.Class.forName(Class.java:374) ~[?:?]
>
>
>
> In fact, I renamed the Tika .jar in the folder solr/contrib/extraction/lib
> to be sure that Solr doesn’t use Tika, because I would like ManifoldCF to
> use Tika, but it doesn’t work.
>
>
>
> I suppose I have to configure Solr not to use Tika.
>
>
>
> How to do this?
>
>
>
> I see
> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
> but I don’t have Datafari, so, in a standard Solr configuration, how could
> I deactivate Tika?
>
>
>
> Thanks a lot
>
>
>
> Mario
>
>
>
>


[jira] [Commented] (CONNECTORS-1545) Parentheses in editConfiguration tab labels not supported

2018-10-09 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644282#comment-16644282
 ] 

Karl Wright commented on CONNECTORS-1545:
-

Hi [~julienFL], all strings are escaped, and tabs are accessed by a sequence 
number, and not by the translated string displayed on the tab, so I find your 
explanation for the problem unconvincing.  Usually, problems of this kind are 
due to mismatched HTML tags or some such. 

> Parentheses in editConfiguration tab labels not supported
> -
>
> Key: CONNECTORS-1545
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1545
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Minor
>
> In the editConfiguration view of the web connector, the tab 
> 'AccessCredentials' is not "clickable" in the French version of the UI.
> It seems to be due to the presence of parentheses in the string "Informations 
> d'accès (Access Credentials)", as it is used to trigger the javascript 
> function of the tab and so is misinterpreted.
> Need to check if similar cases are present for other connectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (CONNECTORS-1545) Parentheses in editConfiguration tab labels not supported

2018-10-09 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1545:
---

Assignee: Karl Wright

> Parentheses in editConfiguration tab labels not supported
> -
>
> Key: CONNECTORS-1545
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1545
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11
>Reporter: Julien Massiera
>Assignee: Karl Wright
>Priority: Minor
>
> In the editConfiguration view of the web connector, the tab 
> 'AccessCredentials' is not "clickable" in the French version of the UI.
> It seems to be due to the presence of parentheses in the string "Informations 
> d'accès (Access Credentials)", as it is used to trigger the javascript 
> function of the tab and so is misinterpreted.
> Need to check if similar cases are present for other connectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Option to skip documents

2018-10-09 Thread Karl Wright
r1843343 adds this condition to the list of caught conditions.

In the future it would be better to create a ticket.

Karl
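For reference, a retry-then-skip in a connector typically looks something like the sketch below; the ServiceInterruption constructor arguments and the jcifs types are written from memory, so verify them against your MCF version:

{code:java}
// Illustrative only: retry the document every 5 minutes, and after 3 hours
// give up on it without aborting the job (abortOnFail == false).
protected void processOneFile(jcifs.smb.SmbFile file)
  throws ManifoldCFException, ServiceInterruption
{
  try
  {
    // ... fetch the file ...
  }
  catch (jcifs.smb.SmbException e)
  {
    long now = System.currentTimeMillis();
    throw new ServiceInterruption("Transient SMB error: " + e.getMessage(), e,
      now + 5L * 60L * 1000L,        // earliest retry time
      now + 3L * 60L * 60L * 1000L,  // hard expiration for retries
      -1,                            // no fixed limit on retry count within the window
      false);                        // skip the document instead of aborting the job
  }
}
{code}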


On Tue, Oct 9, 2018 at 3:06 PM Karl Wright  wrote:

> I can make it retry then skip if it doesn't succeed in a while.
>
> Karl
>
>
> On Tue, Oct 9, 2018 at 11:38 AM Romaric Pighetti <
> romaric.pighe...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> You're right it might be better to reschedule the file for later in this
>> case.
>>
>> In my case, I was able to crawl the files the first time I tried.
>> When launching another crawl a few days later, the same files were locked.
>> I tried to crawl them several times during the day but never could reach
>> them with always the same error.
>>
>> Currently MCF retries accessing the file several times in a row, gives up
>> after several tries, and stops the job with a message reporting the
>> SmbException encountered.
>>
>> Thanks for your answer,
>> Romaric
>>
>> So it is indeed a temporary lock, but we can't tell how long it will last.
>>
>> On 09/10/2018 at 17:04, Karl Wright wrote:
>>
>> Hi Romaric,
>> If the error is transient, then the right thing to do is *not* to skip
>> the file, but to retry later.  What currently happens?
>>
>> Karl
>>
>>
>> On Tue, Oct 9, 2018 at 10:05 AM Romaric Pighetti <
>> romaric.pighe...@francelabs.com> wrote:
>>
>>> Hi Karl,
>>> Along the lines of this ticket
>>> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1455?filter=allissues
>>> submitted by Julien, I recently stumbled across another smb exception
>>> thrown when dealing with some kind of locked files. The error was
>>> SmbException tossed processing smb://path/to/some/file.pst
>>> jcifs.smb.SmbException: 0xC054
>>> MSDN documentation about this error can be found on this page:
>>> https://msdn.microsoft.com/en-us/library/ee441884.aspx?f=255=-2147217396
>>>
>>> This happens with large pst files (outlook archives) that are in use for
>>> example.
>>> It is a case that would require the file to be skipped rather than
>>> stopping the job in my opinion.
>>> What do you think about it ?
>>>
>>> Thanks,
>>> Romaric
>>>
>>> --
>>> Romaric Pighetti
>>> France Labs – The Search experts
>>> Meet us at the Enterprise Search & Discovery
>>> <http://www.enterprisesearchanddiscovery.com/2018/default.aspx> Summit
>>> in Washington DC
>>>
>>> www.francelabs.com
>>>
>>
>> --
>> Romaric Pighetti
>> France Labs – The Search experts
>> Meet us at the Enterprise Search & Discovery
>> <http://www.enterprisesearchanddiscovery.com/2018/default.aspx> Summit in
>> Washington DC
>>
>> www.francelabs.com
>>
>


Re: Option to skip documents

2018-10-09 Thread Karl Wright
I can make it retry then skip if it doesn't succeed in a while.

Karl


On Tue, Oct 9, 2018 at 11:38 AM Romaric Pighetti <
romaric.pighe...@francelabs.com> wrote:

> Hi Karl,
>
> You're right it might be better to reschedule the file for later in this
> case.
>
> In my case, I was able to crawl the files the first time I tried.
> When launching another crawl a few days later, the same files were locked.
> I tried to crawl them several times during the day but never could reach
> them with always the same error.
>
> Currently MCF retries accessing the file several times in a row, gives up
> after several tries, and stops the job with a message reporting the
> SmbException encountered.
>
> Thanks for your answer,
> Romaric
>
> So it is indeed a temporary lock, but we can't tell how long it will last.
>
> On 09/10/2018 at 17:04, Karl Wright wrote:
>
> Hi Romaric,
> If the error is transient, then the right thing to do is *not* to skip the
> file, but to retry later.  What currently happens?
>
> Karl
>
>
> On Tue, Oct 9, 2018 at 10:05 AM Romaric Pighetti <
> romaric.pighe...@francelabs.com> wrote:
>
>> Hi Karl,
>> Along the lines of this ticket
>> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1455?filter=allissues
>> submitted by Julien, I recently stumbled across another smb exception
>> thrown when dealing with some kind of locked files. The error was
>> SmbException tossed processing smb://path/to/some/file.pst
>> jcifs.smb.SmbException: 0xC054
>> MSDN documentation about this error can be found on this page:
>> https://msdn.microsoft.com/en-us/library/ee441884.aspx?f=255=-2147217396
>>
>> This happens with large pst files (outlook archives) that are in use for
>> example.
>> It is a case that would require the file to be skipped rather than
>> stopping the job in my opinion.
>> What do you think about it ?
>>
>> Thanks,
>> Romaric
>>
>> --
>> Romaric Pighetti
>> France Labs – The Search experts
>> Meet us at the Enterprise Search & Discovery
>> <http://www.enterprisesearchanddiscovery.com/2018/default.aspx> Summit in
>> Washington DC
>>
>> www.francelabs.com
>>
>
> --
> Romaric Pighetti
> France Labs – Les experts du Search
> Retrouvez-nous à l’Enterprise Search & Discovery
> <http://www.enterprisesearchanddiscovery.com/2018/default.aspx> Summit à
> Washington DC
>
> [image: cid:image001.png@01D42F35.80534520]
> <http://www.enterprisesearchanddiscovery.com/2018/default.aspx>
> www.francelabs.com
>


Re: Option to skip documents

2018-10-09 Thread Karl Wright
Hi Romaric,
If the error is transient, then the right thing to do is *not* to skip the
file, but to retry later.  What currently happens?

Karl


On Tue, Oct 9, 2018 at 10:05 AM Romaric Pighetti <
romaric.pighe...@francelabs.com> wrote:

> Hi Karl,
> Along the lines of this ticket
> https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1455?filter=allissues
> submitted by Julien, I recently stumbled across another smb exception
> thrown when dealing with some kind of locked files. The error was
> SmbException tossed processing smb://path/to/some/file.pst
> jcifs.smb.SmbException: 0xC054
> MSDN documentation about this error can be found on this page:
> https://msdn.microsoft.com/en-us/library/ee441884.aspx?f=255=-2147217396
>
> This happens with large pst files (outlook archives) that are in use for
> example.
> It is a case that would require the file to be skipped rather than
> stopping the job in my opinion.
> What do you think about it ?
>
> Thanks,
> Romaric
>
> --
> Romaric Pighetti
> France Labs – The Search experts
> Meet us at the Enterprise Search & Discovery Summit in
> Washington DC
>
> www.francelabs.com
>


Re: Sharepoint connector help : site didn't exist or external

2018-10-08 Thread Karl Wright
Excellent news!
Thanks for the update.

Karl


On Mon, Oct 8, 2018 at 1:54 PM Susheel Kumar  wrote:

> Thank you so much, Karl. I was able to crawl the site and index the documents.
>
> On Wed, Oct 3, 2018 at 3:31 PM Karl Wright  wrote:
>
>> Please read the user documentation for the sharepoint connector very
>> carefully.  You will need a site rule AND a path rule.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Oct 3, 2018 at 3:29 PM Susheel Kumar 
>> wrote:
>>
>>> Hi Karl,
>>>
>>> Please ignore my previous message.  I was just able to crawl it, but for
>>> the files which I wanted to get extracted it is showing me the message
>>> below. I already had the path configured to include /Content with file and
>>> library types included, but it is still not including them. How do I
>>> correctly define the path in order to get them included?
>>>
>>> DEBUG 2018-10-03T15:22:58,724 (Worker thread '40') - SharePoint:
>>> Checking whether to include document
>>> '/Content/Review_Communication_template.docx'
>>> DEBUG 2018-10-03T15:22:58,724 (Worker thread '40') - SharePoint: File
>>> path '/Content/Review_Communication_template.docx' does not match any rules
>>> - excluding
>>> DEBUG 2018-10-03T15:22:58,724 (Worker thread '40') - SharePoint:
>>> Checking whether to include document '/Content/Review_Template.pptx'
>>>
>>> On Wed, Oct 3, 2018 at 2:48 PM Susheel Kumar 
>>> wrote:
>>>
>>>> Thank you so much, Karl, for taking me this far. I am able to see the
>>>> connector logging now.
>>>>
>>>> The next thing I am struggling with: when I run a job that uses the
>>>> SharePoint repository with output to the local file system, I sometimes
>>>> see 401 Unauthorized for usergroup.asmx, or 404 for lists.asmx, or 404 for
>>>> Permission.asmx when running the same job again and again.  I am able to
>>>> access these web services through POSTMAN and they work.
>>>>
>>>> Any hint as to what may be missing? I already have the ManifoldCF
>>>> SharePoint plugin installed.
>>>>
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - Enter:
>>>> SOAPPart::saveChanges
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "POST /sites/mysite/_vti_bin/usergroup.asmx HTTP/1.1[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Content-Type: text/xml; charset=utf-8[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Accept: */*[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "SOAPAction: "
>>>> http://schemas.microsoft.com/sharepoint/soap/directory/GetUserCollectionFromGroup
>>>> "[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "User-Agent: Axis/1.4[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Content-Length: 427[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Host: dit.apps.com[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Connection: Keep-Alive[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Accept-Encoding: gzip,deflate[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "Authorization: NTLM
>>>> TlRMTVNTUAADGAAYAEgAAADOAM4AYAQABAAuAQAADgAOADIBAAAeAB4AQAEAAABeAQAABYKIogUBKAoPMhvCQLxXxh1vp0BTXWvSa7dE6WMG25EYSBDigKSFQ4YXcg/4Gs4bOgEBYHUY7j1b1AFjszvpmEMnhAACAAQARQBTAAEAFgBDAEQATABEAEUAVgBFAFMAUgAwADIABAAaAEUAUwAuAEEARAAuAEEARABQAC4AYwBvAG0AAwAyAEMARABMAEQARQBWAEUAUwBSADAAMgAuAEUAUwAuAEEARAAuAEEARABQAC4AYwBvAG0ABQAUAEEARAAuAEEARABQAC4AYwBvAG0ABwAIAMWOHe89W9QBAABFAFMAawB1AG0AYQByAHMANQBSAE8AUwBFAEwAQwBEAFYAMAAwADAAMQBMAEoAQwA=[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "[\r][\n]"
>>>> DEBUG 2018-10-03T13:24:24,631 (Thread-49328) - http-outgoing-90 >>
>>>> "http://schemas.xmlsoap.org/soap/envelope/; xmlns:xsd="
>>>> http://www.w3.org/2001/XMLSchema; xmlns:xsi="
>>>> http://www.w3.org/2001/XMLSchema-instance;>>>

[jira] [Commented] (LUCENE-8522) Spatial: Polygon touching the negative boundaries of WGS84 fails on Solr

2018-10-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641762#comment-16641762
 ] 

Karl Wright commented on LUCENE-8522:
-

[~ivera], looks good to me.


> Spatial: Polygon touching the negative boundaries of WGS84 fails on Solr
> 
>
> Key: LUCENE-8522
> URL: https://issues.apache.org/jira/browse/LUCENE-8522
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Affects Versions: 7.4, 7.5, master (8.0)
>Reporter: Ema Panz
>Assignee: Ignacio Vera
>Priority: Critical
> Attachments: LUCENE-8522.patch
>
>
> When using the WGS84 coordinate system and querying with a polygon touching 
> one of the "negative" borders, Solr throws a "NullPointerException" error.
> The query is performed with the "intersect" function over a GeoJson polygon 
> specified with the coordinates:
> { "coordinates":[[[-180,90],[-180,-90],[180,-90],[180,90],[-180,90]]] }
>  
> The queried field has been defined as:
> {code:java}
> <fieldType class="solr.SpatialRecursivePrefixTreeFieldType"
>spatialContextFactory="Geo3D"
>geo="true"
>planetModel="WGS84"
>format="GeoJSON"
> />{code}
>  
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.lucene.spatial.spatial4j.Geo3dShape.getBoundingBox(Geo3dShape.java:114)
> at 
> org.apache.lucene.spatial.query.SpatialArgs.calcDistanceFromErrPct(SpatialArgs.java:63)
> at 
> org.apache.lucene.spatial.query.SpatialArgs.resolveDistErr(SpatialArgs.java:84)
> at 
> org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy.makeQuery(RecursivePrefixTreeStrategy.java:182)
> at 
> org.apache.solr.schema.AbstractSpatialFieldType.getQueryFromSpatialArgs(AbstractSpatialFieldType.java:368)
> at 
> org.apache.solr.schema.AbstractSpatialFieldType.getFieldQuery(AbstractSpatialFieldType.java:340)
> at 
> org.apache.solr.search.FieldQParserPlugin$1.parse(FieldQParserPlugin.java:45)
> at org.apache.solr.search.QParser.getQuery(QParser.java:169)
> at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:207)
> at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:272)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
> at 
> org.eclipse.jetty.server.hand

Re: Query to get the number of documents processed from PostgreSQL

2018-10-08 Thread Karl Wright
If you want all the documents for a specific job, the query is:

select count(*) from jobqueue where jobid=<your job id>

Karl
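A minimal JDBC sketch for firing that query directly at the ManifoldCF PostgreSQL database; the connection URL, credentials and job id below are placeholders for your own values:

{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JobQueueCount {
  public static void main(String[] args) throws Exception {
    try (Connection c = DriverManager.getConnection(
           "jdbc:postgresql://localhost:5432/manifoldcf", "user", "password");
         PreparedStatement ps = c.prepareStatement(
           "select count(*) from jobqueue where jobid = ?")) {
      ps.setLong(1, 1234567890123L); // placeholder: the job id shown in the MCF UI
      try (ResultSet rs = ps.executeQuery()) {
        if (rs.next())
          System.out.println("Documents for job: " + rs.getLong(1));
      }
    }
  }
}
{code}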


On Mon, Oct 8, 2018 at 4:23 AM Romaric Pighetti <
romaric.pighe...@francelabs.com> wrote:

> Hi Karl,
>
> I am currently facing the need to get the number of documents
> processed by MCF in a specific job.
> This number is getting bigger than the limit set for the web interface,
> and I don't want to increase this limit because of the stress it will
> put on the database (opening the tab in the UI will issue queries for all
> the jobs, and I know from previous readings that these queries are heavy
> for PostgreSQL to process).
> Thus I would like to know if you can provide me with the query used in
> the interface to display the number of processed documents, so that I can
> run it against PostgreSQL manually and only for the job I am
> interested in, lowering the impact on PostgreSQL.
>
> Thanks for your help.
> Romaric
>
> --
> Romaric Pighetti
> France Labs – The Search experts
>
> The creators of Datafari 4, THE enterprise search solution
>
> www.francelabs.com
>
>


[jira] [Resolved] (CONNECTORS-1541) Documents updated in Google Drive are sent with 0 bytes to CMIS Output Connector

2018-10-07 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1541.
-
   Resolution: Fixed
 Assignee: Karl Wright  (was: Piergiorgio Lucidi)
Fix Version/s: ManifoldCF 2.12

r1843058

> Documents updated in Google Drive are sent with 0 bytes to CMIS Output 
> Connector
> ---
>
> Key: CONNECTORS-1541
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1541
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: CMIS Output Connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Douglas C. R. Paes
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
> Attachments: CmisOutputConnector.java
>
>
> When dealing with a migration process, like when using the CMIS Output 
> Connector to ingest content into an ECM (Alfresco in my case), I noticed that 
> when a document is updated inside Google Drive, the engine is able to detect 
> the change and put it into the queue to be updated in the output.
> When using the CMIS Output Connector, the document is versioned in Alfresco, 
> but this new version is always created as a 0 byte file.
>  
> I configured the issue as a *Framework core* because I am still not sure which 
> is causing the problem, the *GoogleDrive Connector* or the *CMIS Output 
> Connector*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1540) When the folder name contains "/", the export api request should replace it with the valid "%2F" before sending it

2018-10-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1540.
-
Resolution: Cannot Reproduce

> When the folder name contains "/", the export api request should replace it 
> with the valid "%2F" before sending it
> --
>
> Key: CONNECTORS-1540
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1540
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: GoogleDrive connector
>Affects Versions: ManifoldCF 2.10
>    Reporter: Douglas C. R. Paes
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> The Google Drive Connector needs to encode the URL before trying to get the 
> content exported.
> I noticed that when the folder/document name contains, for example, the "/" 
> character, the export fails because of the non-encoded value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1541) Documents updated in Google Drive are sent with 0 bytes to CMIS Output Connector

2018-10-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640919#comment-16640919
 ] 

Karl Wright edited comment on CONNECTORS-1541 at 10/7/18 12:04 AM:
---

If you can't find what commit is missing, just attach the entire connector 
source file to this ticket and I can commit that.  Or, better yet, recompute 
the diff based on the current code in master.




was (Author: kwri...@metacarta.com):
If you can't find what commit is missing, just attach the entire connector 
source file to this ticket and I can commit that.


> Documents updated in Google Drive are sent with 0 bytes to CMIS Output 
> Connector
> ---
>
> Key: CONNECTORS-1541
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1541
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: CMIS Output Connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Douglas C. R. Paes
>Assignee: Piergiorgio Lucidi
>Priority: Major
>
> When dealing with a migration process, like when using the CMIS Output 
> Connector to ingest content into an ECM (Alfresco in my case), I noticed that 
> when a document is updated inside Google Drive, the engine is able to detect 
> the change and put it into the queue to be updated in the output.
> When using the CMIS Output Connector, the document is versioned in Alfresco, 
> but this new version is always created as a 0 byte file.
>  
> I configured the issue as a *Framework core* because I am still not sure which 
> is causing the problem, the *GoogleDrive Connector* or the *CMIS Output 
> Connector*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (CONNECTORS-1543) Folders and documents containing illegal characters in the names break the migration process into CMIS

2018-10-06 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1543.
-
   Resolution: Fixed
 Assignee: Karl Wright
Fix Version/s: ManifoldCF 2.12

r1843037


> Folders and documents containing illegal characters in the names break the 
> migration process into CMIS
> --
>
> Key: CONNECTORS-1543
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1543
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: GoogleDrive connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Douglas C. R. Paes
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.12
>
>
> Google Drive allows users to create documents and folders using whatever 
> characters they want.
> Things like:
> ' Routine Chart g4 Name/Another Final Draft.pptx'
> 'LEARNING EXPECTATION '
>  
> The ideal would be to have the leading and trailing spaces trimmed and the 
> illegal characters replaced with "_", which is the default behaviour of the 
> official Google Drive Backup & Sync desktop client.
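A minimal sketch of that sanitizing rule; the exact set of illegal characters is an assumption:

{code:java}
// Trim leading/trailing spaces and replace characters that are illegal on
// most file systems with "_"; the character set here is an assumption.
public static String sanitizeName(String name) {
  return name.trim().replaceAll("[/\\\\:*?\"<>|]", "_");
}
{code}

With this, " Routine Chart g4 Name/Another Final Draft.pptx" would come out as "Routine Chart g4 Name_Another Final Draft.pptx".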



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1541) Documents updated in Google Drive are sent with 0 bytes to CMIS Output Connector

2018-10-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640919#comment-16640919
 ] 

Karl Wright commented on CONNECTORS-1541:
-

If you can't find what commit is missing, just attach the entire connector 
source file to this ticket and I can commit that.


> Documents updated in Google Drive are sent with 0 bytes to CMIS Output 
> Connector
> ---
>
> Key: CONNECTORS-1541
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1541
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: CMIS Output Connector
>Affects Versions: ManifoldCF 2.10
>Reporter: Douglas C. R. Paes
>Assignee: Piergiorgio Lucidi
>Priority: Major
>
> When dealing with a migration process, like when using the CMIS Output 
> Connector to ingest content into an ECM (Alfresco in my case), I noticed that 
> when a document is updated inside Google Drive, the engine is able to detect 
> the change and put it into the queue to be updated in the output.
> When using the CMIS Output Connector, the document is versioned in Alfresco, 
> but this new version is always created as a 0 byte file.
>  
> I configured the issue as a *Framework core* because I am still not sure which 
> is causing the problem, the *GoogleDrive Connector* or the *CMIS Output 
> Connector*.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

