Is Manifold capable of handling these kinds of files

2022-12-23 Thread Priya Arora
Hi

Is ManifoldCF capable of ingesting this kind of file through the Windows
shares connector, where the path contains special characters like these:

demo/11208500/11208550/I. Proposal/PHASE II/220808
Input/__MACOSX/虎尾/._62A33A6377CF08B472CC2AB562BD8B5D.JPG


Any reply would be appreciated
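
A quick way to check, independent of ManifoldCF, is to point jCIFS (the library
the Windows Shares connector uses underneath) directly at such a path. The
sketch below assumes the legacy jcifs 1.x classes and uses placeholder server,
share, and credentials; depending on the jCIFS version, spaces and non-ASCII
characters in the SMB URL may need URL-encoding.

    import jcifs.smb.NtlmPasswordAuthentication;
    import jcifs.smb.SmbFile;

    public class SmbPathCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder credentials and server name.
            NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("DOMAIN", "user", "password");
            // Note the non-ASCII folder name and the AppleDouble "._" file.
            String url = "smb://fileserver/demo/11208500/11208550/I. Proposal/PHASE II/"
                       + "220808 Input/__MACOSX/虎尾/._62A33A6377CF08B472CC2AB562BD8B5D.JPG";
            SmbFile file = new SmbFile(url, auth);
            System.out.println("exists: " + file.exists());
            System.out.println("length: " + file.length());
        }
    }

If jCIFS can resolve the path, the connector should be able to ingest it; if
not, the problem lies below ManifoldCF.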


Re: sharepoint crawler documents limit

2019-12-20 Thread Priya Arora
Hi All,

Is this issue something to do with the values/parameters below, set in
properties.xml?
[image: image.png]


On Fri, Dec 20, 2019 at 5:21 PM Jorge Alonso Garcia 
wrote:

> And what other SharePoint parameter could I check?
>
> Jorge Alonso Garcia
>
>
>
> On Fri, Dec 20, 2019 at 12:47, Karl Wright ()
> wrote:
>
>> The code seems correct and many people are using it without encountering
>> this problem.  There may be another SharePoint configuration parameter you
>> also need to look at somewhere.
>>
>> Karl
>>
>>
>> On Fri, Dec 20, 2019 at 6:38 AM Jorge Alonso Garcia 
>> wrote:
>>
>>>
>>> Hi Karl,
>>> On SharePoint the list view threshold is 150,000, but we only receive
>>> 20,000 from MCF
>>> [image: image.png]
>>>
>>>
>>> Jorge Alonso Garcia
>>>
>>>
>>>
>>> On Thu, Dec 19, 2019 at 19:19, Karl Wright ()
>>> wrote:
>>>
 If the job finished without error it implies that the number of
 documents returned from this one library was 1 when the service is
 called the first time (starting at doc 0), 1 when it's called the
 second time (starting at doc 1), and zero when it is called the third
 time (starting at doc 2).

 The plugin code is unremarkable and actually gets results in chunks of
 1000 under the covers:

 >>
        SPQuery listQuery = new SPQuery();
        listQuery.Query = "<OrderBy Override=\"TRUE\">";
        listQuery.QueryThrottleMode = SPQueryThrottleOption.Override;
        listQuery.ViewAttributes = "Scope=\"Recursive\"";
        listQuery.ViewFields = "<FieldRef Name='FileRef' />";
        listQuery.RowLimit = 1000;

        XmlDocument doc = new XmlDocument();
        retVal = doc.CreateElement("GetListItems",
            "http://schemas.microsoft.com/sharepoint/soap/directory/");
        XmlNode getListItemsNode = doc.CreateElement("GetListItemsResponse");

        uint counter = 0;
        do
        {
            if (counter >= startRowParam + rowLimitParam)
                break;

            SPListItemCollection collListItems = oList.GetItems(listQuery);

            foreach (SPListItem oListItem in collListItems)
            {
                if (counter >= startRowParam && counter < startRowParam + rowLimitParam)
                {
                    XmlNode resultNode = doc.CreateElement("GetListItemsResult");
                    XmlAttribute idAttribute = doc.CreateAttribute("FileRef");
                    idAttribute.Value = oListItem.Url;
                    resultNode.Attributes.Append(idAttribute);
                    XmlAttribute urlAttribute = doc.CreateAttribute("ListItemURL");
                    //urlAttribute.Value = oListItem.ParentList.DefaultViewUrl;
                    urlAttribute.Value = string.Format("{0}?ID={1}",
                        oListItem.ParentList.Forms[PAGETYPE.PAGE_DISPLAYFORM].ServerRelativeUrl,
                        oListItem.ID);
                    resultNode.Attributes.Append(urlAttribute);
                    getListItemsNode.AppendChild(resultNode);
                }
                counter++;
            }

            listQuery.ListItemCollectionPosition =
                collListItems.ListItemCollectionPosition;

        } while (listQuery.ListItemCollectionPosition != null);

        retVal.AppendChild(getListItemsNode);
 <<

 The code is clearly working if you get 2 results returned, so I
 submit that perhaps there's a configured limit in your SharePoint instance
 that prevents listing more than 2.  That's the only way I can explain
 this.

 Karl


 On Thu, Dec 19, 2019 at 12:51 PM Jorge Alonso Garcia <
 jalon...@gmail.com> wrote:

> Hi,
> The job finishes OK (several times) but always with these 2
> documents; for some reason the loop only executes twice
>
> Jorge Alonso Garcia
>
>
>
> On Thu, Dec 19, 2019 at 18:14, Karl Wright ()
> wrote:
>
>> If they are all in one document, then you'd be running this code:
>>
>> >>
>> int startingIndex = 0;
>> int amtToRequest = 1;
>> while (true)
>> {
>>
>> com.microsoft.sharepoint.webpartpages.GetListItemsResponseGetListItemsResult
>> itemsResult =
>>
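
The quoted loop is cut off by the archive, but the pattern it follows is plain
offset-based paging: ask for a chunk starting at an index, stop when nothing
more comes back. A minimal sketch of that pattern in Java (the ListService
interface and all names are hypothetical placeholders, not the actual
connector classes):

    import java.util.ArrayList;
    import java.util.List;

    public class PagingSketch {
        /** Placeholder for whatever returns one page of item identifiers. */
        interface ListService {
            List<String> getListItems(int startingIndex, int rowLimit) throws Exception;
        }

        static List<String> fetchAll(ListService svc) throws Exception {
            List<String> all = new ArrayList<>();
            int startingIndex = 0;
            final int amtToRequest = 10000;      // chunk size per round trip
            while (true) {
                List<String> chunk = svc.getListItems(startingIndex, amtToRequest);
                if (chunk.isEmpty())
                    break;                       // nothing more: stop
                all.addAll(chunk);
                startingIndex += chunk.size();   // advance past what was received
            }
            return all;
        }
    }

If the service itself caps what it returns (for example at a configured list
view threshold), a loop like this ends early and silently, which would match
seeing only a fixed number of documents.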

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi All,

When I am trying to execute a bash command inside the ManifoldCF container,
I am getting an error.
[image: image.png]
And when checking the logs with sudo docker logs :
2019-12-19 18:09:05,848 Job start thread ERROR Unable to write to stream
logs/manifoldcf.log for appender MyFile
2019-12-19 18:09:05,848 Seeding thread ERROR Unable to write to stream
logs/manifoldcf.log for appender MyFile
2019-12-19 18:09:05,848 Job reset thread ERROR Unable to write to stream
logs/manifoldcf.log for appender MyFile
2019-12-19 18:09:05,848 Job notification thread ERROR Unable to write to
stream logs/manifoldcf.log for appender MyFile
2019-12-19 18:09:05,849 Seeding thread ERROR An exception occurred
processing Appender MyFile
org.apache.logging.log4j.core.appender.AppenderLoggingException: Error
flushing stream logs/manifoldcf.log
at
org.apache.logging.log4j.core.appender.OutputStreamManager.flush(OutputStreamManager.java:159)

Can anybody suggest the reason behind this error?

Thanks
Priya

On Fri, Dec 20, 2019 at 3:37 PM Priya Arora  wrote:

> Hi Markus,
>
> Many thanks for your reply!!.
>
> I tried this approach to reproduce the scenario in a different
> environment, but the case where I hit the error above is when I am
> crawling INTRANET sites that are accessible over a remote server. Also
> I have used the transformation connectors: Allow Documents, Tika Parser,
> Content Limiter (1000), Metadata Adjuster.
>
> When I tried reproducing the error with public sites of the same domain and
> on a different server (DEV), it was successful, with no errors. There was
> also no Postgres-related error.
>
> Can it depend on server-related configurations like firewalls etc., as this
> case includes some firewall/security-related configuration?
>
> Thanks
> Priya
>
>
>
>
> On Fri, Dec 20, 2019 at 3:23 PM Markus Schuch 
> wrote:
>
>> Hi Priya,
>>
>> In my experience, I would focus on the OutOfMemoryError (OOME).
>> 8 GB can be enough, but it doesn't have to be.
>>
>> First I would check whether the JVM is really getting the desired heap
>> size. The dockerized environment makes that a little harder to find out,
>> since you need to get access to the JVM metrics, e.g. via jmxremote.
>> Being able to monitor the JVM metrics helps you correlate the
>> errors with heap and garbage collection activity.
>>
>> The errors you see from the PostgreSQL JDBC driver may well be related to
>> the OOME.
>>
>> Some questions I would ask myself:
>>
>> Do the problems repeatedly occur only when crawling this specific
>> content source, or only with this specific output connection? Can you
>> reproduce it outside of Docker in a controlled dev environment? Or is it
>> a more general problem with your ManifoldCF instance?
>>
>> Maybe there are some huge files being crawled in your content source?
>> Do you have any kind of transformations configured (e.g. a content size
>> limit)? You should check the job's history for patterns, like the error
>> always arising after encountering the same document.
>>
>> Cheers
>> Markus
>>
>>
>>
>> On 20.12.2019 at 09:59, Priya Arora wrote:
>> > Hi  Markus ,
>> >
>> > Heap size defined is 8GB. Manifoldcf start-options-unix file  Xmx etc
>> > parameters is defined to have memory 8192mb.
>> >
>> > It seems to be an issue with memory also, and also when manifoldcf tries
>> > to communicate to Database. Do you explicitly define somewhere
>> > connection timer when to communicate to postgres.
>> > Postgres is installed as a part of docker image pull and then some
>> > changes in properties.xml(of manifoldcf) to connect to database.
>> > On the other hand Elastic search is also holding sufficient memory and
>> > Manifoldcf is also provided with 8 cores CPU.
>> >
>> > Can you suggest some solution.
>> >
>> > Thanks
>> > Priya
>> >
>> > On Fri, Dec 20, 2019 at 2:23 PM Markus Schuch > > <mailto:markus_sch...@web.de>> wrote:
>> >
>> > Hi Priya,
>> >
>> > your manifoldcf JVM suffers from high garbage collection pressure:
>> >
>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>> >
>> > What is your current heap size?
>> > Without knowing that, i suggest to increase the heap size. (java
>> > -Xmx...)
>> >
>> > Cheers,
>> > Markus
>> >
>> > On 20.12.2019 at 09:02, Priya Arora wrote:
>> > > Hi All,
>> > >
>> > > I a

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi Markus,

Many thanks for your reply!

I tried this approach to reproduce the scenario in a different environment,
but the case where I hit the error above is when I am crawling INTRANET
sites that are accessible over a remote server. Also I have used the
transformation connectors: Allow Documents, Tika Parser, Content Limiter
(1000), Metadata Adjuster.

When I tried reproducing the error with public sites of the same domain and
on a different server (DEV), it was successful, with no errors. There was
also no Postgres-related error.

Can it depend on server-related configurations like firewalls etc., as this
case includes some firewall/security-related configuration?

Thanks
Priya




On Fri, Dec 20, 2019 at 3:23 PM Markus Schuch  wrote:

> Hi Priya,
>
> In my experience, I would focus on the OutOfMemoryError (OOME).
> 8 GB can be enough, but it doesn't have to be.
>
> First I would check whether the JVM is really getting the desired heap
> size. The dockerized environment makes that a little harder to find out,
> since you need to get access to the JVM metrics, e.g. via jmxremote.
> Being able to monitor the JVM metrics helps you correlate the
> errors with heap and garbage collection activity.
>
> The errors you see from the PostgreSQL JDBC driver may well be related to
> the OOME.
>
> Some questions I would ask myself:
>
> Do the problems repeatedly occur only when crawling this specific
> content source, or only with this specific output connection? Can you
> reproduce it outside of Docker in a controlled dev environment? Or is it
> a more general problem with your ManifoldCF instance?
>
> Maybe there are some huge files being crawled in your content source?
> Do you have any kind of transformations configured (e.g. a content size
> limit)? You should check the job's history for patterns, like the error
> always arising after encountering the same document.
>
> Cheers
> Markus
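
One low-tech way to do the heap check described above, as an alternative to
jmxremote, is to run a tiny class inside the container with the same -Xmx
value the start-options file sets and see what the JVM reports (a minimal
sketch; the class name is arbitrary):

    public class HeapReport {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            // maxMemory() reflects the -Xmx actually in effect for this JVM.
            System.out.println("max heap   (bytes): " + rt.maxMemory());
            System.out.println("total heap (bytes): " + rt.totalMemory());
            System.out.println("free heap  (bytes): " + rt.freeMemory());
        }
    }

If this reports far less than 8 GB when launched the same way ManifoldCF is
launched, the -Xmx setting is not reaching the JVM inside the container.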
>
>
>
> > On 20.12.2019 at 09:59, Priya Arora wrote:
> > Hi  Markus ,
> >
> > Heap size defined is 8GB. Manifoldcf start-options-unix file  Xmx etc
> > parameters is defined to have memory 8192mb.
> >
> > It seems to be an issue with memory also, and also when manifoldcf tries
> > to communicate to Database. Do you explicitly define somewhere
> > connection timer when to communicate to postgres.
> > Postgres is installed as a part of docker image pull and then some
> > changes in properties.xml(of manifoldcf) to connect to database.
> > On the other hand Elastic search is also holding sufficient memory and
> > Manifoldcf is also provided with 8 cores CPU.
> >
> > Can you suggest some solution.
> >
> > Thanks
> > Priya
> >
> > On Fri, Dec 20, 2019 at 2:23 PM Markus Schuch  > <mailto:markus_sch...@web.de>> wrote:
> >
> > Hi Priya,
> >
> > your manifoldcf JVM suffers from high garbage collection pressure:
> >
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > What is your current heap size?
> > Without knowing that, i suggest to increase the heap size. (java
> > -Xmx...)
> >
> > Cheers,
> > Markus
> >
> > On 20.12.2019 at 09:02, Priya Arora wrote:
> > > Hi All,
> > >
> > > I am facing below error while accessing Manifoldcf. Requirement is
> to
> > > crawl data from a website using Repository as "Web" and Output
> > connector
> > > as "Elastic Search"
> > > Manifoldcf is configured inside a docker container and also
> > postgres is
> > > used a docker container.
> > > When launching manifold getting below error
> > > image.png
> > >
> > > When checked logs:-
> > > *1)sudo docker exec -it 0b872dfafc5c tail -1000
> > > /usr/share/manifoldcf/example/logs/manifoldcf.log*
> > > FATAL 2019-12-20T06:06:13,176 (Stuffer thread) - Error tossed:
> Timer
> > > already cancelled.
> > > java.lang.IllegalStateException: Timer already cancelled.
> > > at java.util.Timer.sched(Timer.java:397) ~[?:1.8.0_232]
> > > at java.util.Timer.schedule(Timer.java:193) ~[?:1.8.0_232]
> > > at
> > >
> org.postgresql.jdbc.PgConnection.addTimerTask(PgConnection.java:1113)
> > > ~[postgresql-42.1.3.jar:42.1.3]
> > > at
> > > org.postgresql.jdbc.PgStatement.startTimer(PgStatement.java:887)
> > > ~[postgresql-42.1.3.jar:42.1.3]
> > > at
> >  

Re: Manifoldcf server Error

2019-12-20 Thread Priya Arora
Hi  Markus ,

The heap size defined is 8 GB. In the ManifoldCF start-options-unix file, the
Xmx etc. parameters are set to 8192 MB.

It seems to be an issue with memory, and it also occurs when ManifoldCF tries
to communicate with the database. Do you explicitly define somewhere a
connection timer for when to communicate with Postgres?
Postgres is installed as part of a Docker image pull, and then some changes
were made in properties.xml (of ManifoldCF) to connect to the database.
On the other hand, Elasticsearch also has sufficient memory, and ManifoldCF
is provided with an 8-core CPU.

Can you suggest a solution?

Thanks
Priya

On Fri, Dec 20, 2019 at 2:23 PM Markus Schuch  wrote:

> Hi Priya,
>
> your manifoldcf JVM suffers from high garbage collection pressure:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> What is your current heap size?
> Without knowing that, I suggest increasing the heap size (java -Xmx...).
>
> Cheers,
> Markus
>
> On 20.12.2019 at 09:02, Priya Arora wrote:
> > Hi All,
> >
> > I am facing the error below while accessing ManifoldCF. The requirement is
> > to crawl data from a website using a "Web" repository connection and an
> > "Elastic Search" output connection.
> > ManifoldCF is configured inside a Docker container, and Postgres is also
> > run as a Docker container.
> > When launching ManifoldCF I get the error below:
> > image.png
> >
> > When checked logs:-
> > *1)sudo docker exec -it 0b872dfafc5c tail -1000
> > /usr/share/manifoldcf/example/logs/manifoldcf.log*
> > FATAL 2019-12-20T06:06:13,176 (Stuffer thread) - Error tossed: Timer
> > already cancelled.
> > java.lang.IllegalStateException: Timer already cancelled.
> > at java.util.Timer.sched(Timer.java:397) ~[?:1.8.0_232]
> > at java.util.Timer.schedule(Timer.java:193) ~[?:1.8.0_232]
> > at
> > org.postgresql.jdbc.PgConnection.addTimerTask(PgConnection.java:1113)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> > org.postgresql.jdbc.PgStatement.startTimer(PgStatement.java:887)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> > org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:427)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> >
> org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:169)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> >
> org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:136)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> > org.postgresql.jdbc.PgConnection.isValid(PgConnection.java:1311)
> > ~[postgresql-42.1.3.jar:42.1.3]
> > at
> >
> org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection(ConnectionPool.java:92)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.ConnectionFactory.getConnectionWithRetries(ConnectionFactory.java:126)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.ConnectionFactory.getConnection(ConnectionFactory.java:75)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.Database.executeUncachedQuery(Database.java:797)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.Database$QueryCacheExecutor.create(Database.java:1457)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.cachemanager.CacheManager.findObjectsAndExecute(CacheManager.java:146)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.Database.executeQuery(Database.java:204)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.DBInterfacePostgreSQL.performQuery(DBInterfacePostgreSQL.java:837)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.core.database.BaseTable.performQuery(BaseTable.java:221)
> > ~[mcf-core.jar:?]
> > at
> >
> org.apache.manifoldcf.crawler.jobs.Jobs.getActiveJobConnections(Jobs.java:736)
> > ~[mcf-pull-agent.jar:?]
> > at
> >
> org.apache.manifoldcf.crawler.jobs.JobManager.getNextDocuments(JobManager.java:2869)
> > ~[mcf-pull-agent.jar:?]
> > at
> >
> org.apache.manifoldcf.crawler.system.StufferThread.run(StufferThread.java:186)
> > [mcf-pull-agent.jar:?]
> > *2)sudo docker logs  --tail 1000*
> > Exception in thread "PostgreSQL-JDBC-SharedTimer-1"
> > java.lang.OutOfMemoryError: GC overhead limit exceeded

Re: Manifoldcf version conflict

2019-11-19 Thread Priya Arora
I am using Docker commands to install ManifoldCF inside a Docker container.
What I understand is that MCF downloads the latest crawler-ui.war file into
the web folder (that is what I checked on the local system). Do I need to
check somewhere else?
[image: image.png]

On Tue, Nov 19, 2019 at 4:40 PM Karl Wright  wrote:

> That version comes directly from the ant build version that was used to
> compile the UI.  What version of crawler-ui.war do you have?
>
> Karl
>
>
> On Tue, Nov 19, 2019 at 5:50 AM Priya Arora  wrote:
>
>> Hi All,
>>
>> I have upgraded the ManifoldCF version on the server to version 2.14, and I
>> re-confirmed via the docker build command that it is downloading the 2.14
>> version only.
>>
>> [image: image.png]
>>
>> But when I am starting up ManifoldCF, the version shows up as 2.10,
>> as shown in the screenshot above. Is there any static value being passed?
>> Or do I have to manually do something? I guess not, because on the
>> local system it points to the correct value/version.
>>
>> Thanks
>> Priya
>>
>


Manifoldcf version conflict

2019-11-19 Thread Priya Arora
Hi All,

I have upgraded the ManifoldCF version on the server to version 2.14, and I
re-confirmed via the docker build command that it is downloading the 2.14
version only.

[image: image.png]

But when I am starting up ManifoldCF, the version shows up as 2.10, as
shown in the screenshot above. Is there any static value being passed?
Or do I have to manually do something? I guess not, because on the
local system it points to the correct value/version.

Thanks
Priya


Re: Windows shares connector-Error

2019-11-08 Thread Priya Arora
This didn't work either. Does that (ManifoldCF version 2.14) have something
to do with the Java version as well? If yes: I am using JAVA_HOME pointing to
Java version 8. Can you suggest something?
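
Since the stack trace quoted below fails while loading jcifs.smb.SmbException,
one quick sanity check is to confirm that the jar dropped into
connector-lib-proprietary really contains that class (a sketch; the jar path
is a placeholder for wherever the jar was copied):

    import java.io.File;
    import java.net.URL;
    import java.net.URLClassLoader;

    public class JcifsJarCheck {
        public static void main(String[] args) throws Exception {
            // Point this at the jar you copied into connector-lib-proprietary.
            File jar = new File("connector-lib-proprietary/jcifs.jar");
            try (URLClassLoader cl =
                     new URLClassLoader(new URL[] { jar.toURI().toURL() }, null)) {
                cl.loadClass("jcifs.smb.SmbException");
                System.out.println(jar + " contains jcifs.smb.SmbException");
            } catch (ClassNotFoundException e) {
                System.out.println(jar + " does NOT contain jcifs.smb.SmbException");
            }
        }
    }

Note that ManifoldCF resolves connector classes through its own resource
loader, so the jar must actually sit in connector-lib-proprietary; this only
verifies the jar itself is readable and contains the class.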

On Fri, Nov 8, 2019 at 4:16 PM Sreejith Variyath <
sreejith.variy...@tarams.com> wrote:

> place the jcifs.jar into the *connector-lib-proprietary* directory
>
> On Fri, Nov 8, 2019 at 2:38 PM Priya Arora  wrote:
>
>> Hi All
>>
>> I installed the 2.14 version of ManifoldCF, then uncommented the Windows
>> Shares line in the connectors.xml file, but when I try to start it with
>> "java -jar start.jar" it gives an error:
>>
>> I also checked that mcf-jcifs-connector.jar is present in
>> connector-lib.
>>
>> Do I need to do something else as well? Here is the error log.
>>
>> Successfully registered repository connector
>> 'org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector'
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> jcifs/smb/SmbException
>> at java.base/java.lang.Class.forName0(Native Method)
>> at java.base/java.lang.Class.forName(Unknown Source)
>> at
>> org.apache.manifoldcf.core.system.ManifoldCFResourceLoader.findClass(ManifoldCFResourceLoader.java:149)
>> at
>> org.apache.manifoldcf.core.system.ManifoldCF.findClass(ManifoldCF.java:1533)
>> at
>> org.apache.manifoldcf.core.interfaces.ConnectorFactory.getThisConnectorRaw(ConnectorFactory.java:144)
>> at
>> org.apache.manifoldcf.core.interfaces.ConnectorFactory.getThisConnectorNoCheck(ConnectorFactory.java:118)
>> at
>> org.apache.manifoldcf.core.interfaces.ConnectorFactory.installThis(ConnectorFactory.java:48)
>> at
>> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.install(RepositoryConnectorFactory.java:100)
>> at
>> org.apache.manifoldcf.crawler.connmgr.ConnectorManager.registerConnector(ConnectorManager.java:180)
>> at
>> org.apache.manifoldcf.crawler.system.ManifoldCF.registerConnectors(ManifoldCF.java:672)
>> at
>> org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160)
>> at
>> org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239)
>> Caused by: java.lang.ClassNotFoundException: jcifs.smb.SmbException
>> at java.base/java.net.URLClassLoader.findClass(Unknown Source)
>> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>> at java.base/java.net.FactoryURLClassLoader.loadClass(Unknown
>> Source)
>> at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
>> ... 12 more
>>
>> Thanks and regards
>> Priya
>>
>
>
> --
> Best Regards,
>
>
> *Sreejith Variyath*
> Lead Software Engineer
> Tarams Software Technologies Pvt. Ltd.
> Venus Buildings, 2nd Floor 1/2,3rd Main,
> Kalyanamantapa Road Jakasandra, 1st Block Kormangala
> Bangalore - 560034
> Tarams <http://www.tarams.com/>
>
>
> www.tarams.com
> =
> DISCLAIMER: The information in this message is confidential and may be
> legally privileged. It is intended solely for the addressee. Access to this
> message by anyone else is unauthorized. If you are not the intended
> recipient, any disclosure, copying, or distribution of the message, or any
> action or omission taken by you in reliance on it, is prohibited and may be
> unlawful. Please immediately contact the sender if you have received this
> message in error. Further, this e-mail may contain viruses and all
> reasonable precaution to minimize the risk arising there from is taken by
> Tarams. Tarams is not liable for any damage sustained by you as a result of
> any virus in this e-mail. All applicable virus checks should be carried out
> by you before opening this e-mail or any attachment thereto.
> Thank you - Tarams Software Technologies Pvt.Ltd.
> =
>


Windows shares connector-Error

2019-11-08 Thread Priya Arora
Hi All

I installed the 2.14 version of ManifoldCF, then uncommented the Windows
Shares line in the connectors.xml file, but when I try to start it with
"java -jar start.jar" it gives an error:

I also checked that mcf-jcifs-connector.jar is present in connector-lib.

Do I need to do something else as well? Here is the error log.

Successfully registered repository connector
'org.apache.manifoldcf.crawler.connectors.jdbc.JDBCConnector'
Exception in thread "main" java.lang.NoClassDefFoundError:
jcifs/smb/SmbException
at java.base/java.lang.Class.forName0(Native Method)
at java.base/java.lang.Class.forName(Unknown Source)
at
org.apache.manifoldcf.core.system.ManifoldCFResourceLoader.findClass(ManifoldCFResourceLoader.java:149)
at
org.apache.manifoldcf.core.system.ManifoldCF.findClass(ManifoldCF.java:1533)
at
org.apache.manifoldcf.core.interfaces.ConnectorFactory.getThisConnectorRaw(ConnectorFactory.java:144)
at
org.apache.manifoldcf.core.interfaces.ConnectorFactory.getThisConnectorNoCheck(ConnectorFactory.java:118)
at
org.apache.manifoldcf.core.interfaces.ConnectorFactory.installThis(ConnectorFactory.java:48)
at
org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.install(RepositoryConnectorFactory.java:100)
at
org.apache.manifoldcf.crawler.connmgr.ConnectorManager.registerConnector(ConnectorManager.java:180)
at
org.apache.manifoldcf.crawler.system.ManifoldCF.registerConnectors(ManifoldCF.java:672)
at
org.apache.manifoldcf.crawler.system.ManifoldCF.reregisterAllConnectors(ManifoldCF.java:160)
at
org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner.main(ManifoldCFJettyRunner.java:239)
Caused by: java.lang.ClassNotFoundException: jcifs.smb.SmbException
at java.base/java.net.URLClassLoader.findClass(Unknown Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
at java.base/java.net.FactoryURLClassLoader.loadClass(Unknown
Source)
at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
... 12 more

Thanks and regards
Priya


Re: Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Many thanks!!

On Wed, Nov 6, 2019 at 12:14 PM SREEJITH va  wrote:

> Yes. Exactly.
>
> On Wed, Nov 6, 2019 at 12:04 PM Priya Arora  wrote:
>
>> Hi Sir,
>>
>> 
>>
>> This is the connector code I am looking for; is the JCIFS connector the same?
>> [image: image.png]
>>
>> Thanks
>> Priya
>>
>> On Wed, Nov 6, 2019 at 11:58 AM SREEJITH va 
>> wrote:
>>
>>> I think you are searching for JCIFS connector. Its the windows
>>> repository connector. Its in manifoldcf\connectors\jcifs
>>>
>>> On Wed, Nov 6, 2019 at 11:19 AM Priya Arora  wrote:
>>>
>>>> Hi,
>>>>
>>>> I need to implement "Window Shares" type Repository connection type and
>>>> needs to access and understand code first.
>>>> But I am unable to find its code at path:-
>>>> ..\ManifoldCF\apache-manifoldcf-2.13-src\apache-manifoldcf-2.13\connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors,
>>>> where i can find all other connectors instead, as Webcrawler etc.
>>>>
>>>> Can anybody let me know where can i find its source code. , as i have
>>>> checked in the 2.8.1 version also.
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>
>>>
>>> --
>>> Regards
>>> -Sreejith
>>>
>>
>
> --
> Regards
> -Sreejith
>


Re: Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Hi Sir,



This is the connector code I am looking for; is the JCIFS connector the same?
[image: image.png]

Thanks
Priya

On Wed, Nov 6, 2019 at 11:58 AM SREEJITH va  wrote:

> I think you are searching for JCIFS connector. Its the windows repository
> connector. Its in manifoldcf\connectors\jcifs
>
> On Wed, Nov 6, 2019 at 11:19 AM Priya Arora  wrote:
>
>> Hi,
>>
>> I need to implement "Window Shares" type Repository connection type and
>> needs to access and understand code first.
>> But I am unable to find its code at path:-
>> ..\ManifoldCF\apache-manifoldcf-2.13-src\apache-manifoldcf-2.13\connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors,
>> where i can find all other connectors instead, as Webcrawler etc.
>>
>> Can anybody let me know where can i find its source code. , as i have
>> checked in the 2.8.1 version also.
>>
>> Thanks
>> Priya
>>
>
>
> --
> Regards
> -Sreejith
>


Re: Manifoldcf - Job Deletion Process

2019-11-05 Thread Priya Arora
When I created a new job and followed the lifecycle/execution of the
identifier, it did not start the deletion process. There was no change in
the job configuration or in the database start-up and configuration.

On Sat, Nov 2, 2019 at 12:41 AM Priya Arora  wrote:

> No, I am not deleting the job after it runs; its status is updated to
> ‘Done’ after all processing.
> The process involves indexing documents, and just before the
> job ends the deletion process executes.
> The sequence is: fetch etc., indexing, other processing, deletion, then
> job done
>
> Sent from my iPhone
>
> On 01-Nov-2019, at 8:42 PM, Karl Wright  wrote:
>
> 
> So Priya, one thing is not clear to me: are you *deleting* the job after
> it runs?
> Because if you are, all documents indexed by that job will be deleted as
> well.
> You need to leave the job around and not delete it unless you want the
> documents to go away that the job indexed.
>
> Karl
>
>
> On Fri, Nov 1, 2019 at 6:51 AM Karl Wright  wrote:
>
>> There is a "Hop filters" tab in the job.  This allows you to specify the
>> maximum number of hops from the seed documents that are allowed.  Or you
>> can turn it off entirely, if you do not want this feature.
>>
>> Bear in mind that documents that are unreachable by *any* means from the
>> seed documents will always be deleted at the end of each job run.  So if
>> you are relying on some special page you generate to point at all the
>> documents you want to crawl, make sure it has a complete list.  If you try
>> to make an incremental list of just the new documents, then all the old
>> ones will get removed.
>>
>> Karl
>>
>>
>> On Fri, Nov 1, 2019 at 6:41 AM Priya Arora  wrote:
>>
>>> Yes, I have set authentication properly, as we have configured this
>>> setting by passing this info in the header.
>>>
>>> (1) They are now unreachable, whereas they were reachable before by the
>>> specified number of hops from the seed documents: but if I compare it
>>> with the previous index, where the data is not much older (like a week
>>> before), the (now deleted) documents were ingested, and when I check
>>> them they do not result in a 404.
>>> Regarding "the specified number of hops from the seed documents": can
>>> you please help me with a little bit of elaboration?
>>>
>>> Thanks
>>> Priya
>>>
>>> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright  wrote:
>>>
>>>> Hi Priya,
>>>>
>>>> ManifoldCF doesn't delete documents unless:
>>>> (1) They are now unreachable, whereas they were reachable before by the
>>>> specified number of hops from the seed documents;
>>>> (2) They cannot be fetched due to a 404 error, or something similar
>>>> which tells ManifoldCF that they are not available.
>>>>
>>>> Your site, I notice, has a "sso" page.  Are you setting up session
>>>> authentication properly?
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Fri, Nov 1, 2019 at 3:59 AM Priya Arora  wrote:
>>>>
>>>>> 
>>>>>
>>>>>
>>>>> Screenshot the Deleted documents other than PDF's
>>>>>
>>>>> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora 
>>>>> wrote:
>>>>>
>>>>>> The jib was started as per below schedule:-
>>>>>> 
>>>>>>
>>>>>>
>>>>>> And just before the completion of the job. It started the Deletion
>>>>>> process. Before starting the job a new index in ES was taken and the
>>>>>> Database was cleaned up before starting the jib.
>>>>>> 
>>>>>>
>>>>>>
>>>>>> Records were processed and indexed successfully. When I am checking
>>>>>> this URL(those are Deleted) on a browser, it seems to be a valid URl and 
>>>>>> is
>>>>>> accessible.
>>>>>> Job is to crawl around 2.25 lakhs of records so the seeded url have
>>>>>> many sub-links within. If we think the URL;s were already present in
>>>>>> Database that why somehow crawler deletes it, it should not be the case, 
>>>>>> as
>>>>>> the database clean up processed has been done before run.
>>>>>>
>>>>>> If we think the crawler is deleting only documents related to PDF
>>>>>> extension, this is not the 

Window shares-Repository Type Source Code

2019-11-05 Thread Priya Arora
Hi,

I need to implement the "Windows Shares" repository connection type and
need to access and understand its code first.
But I am unable to find its code at the path
..\ManifoldCF\apache-manifoldcf-2.13-src\apache-manifoldcf-2.13\connectors\webcrawler\connector\src\main\java\org\apache\manifoldcf\crawler\connectors,
where I can find all the other connectors, such as the web crawler.

Can anybody let me know where I can find its source code? I have also
checked in the 2.8.1 version.

Thanks
Priya


Re: Manifoldcf - Job Deletion Process

2019-11-01 Thread Priya Arora
No, I am not deleting the job after it runs; its status is updated to
‘Done’ after all processing.
The process involves indexing documents, and just before the job
ends the deletion process executes.
The sequence is: fetch etc., indexing, other processing, deletion, then job
done.

Sent from my iPhone

> On 01-Nov-2019, at 8:42 PM, Karl Wright  wrote:
> 
> 
> So Priya, one thing is not clear to me: are you *deleting* the job after it 
> runs?
> Because if you are, all documents indexed by that job will be deleted as well.
> You need to leave the job around and not delete it unless you want the 
> documents to go away that the job indexed.
> 
> Karl
> 
> 
>> On Fri, Nov 1, 2019 at 6:51 AM Karl Wright  wrote:
>> There is a "Hop filters" tab in the job.  This allows you to specify the 
>> maximum number of hops from the seed documents that are allowed.  Or you can 
>> turn it off entirely, if you do not want this feature.
>> 
>> Bear in mind that documents that are unreachable by *any* means from the 
>> seed documents will always be deleted at the end of each job run.  So if you 
>> are relying on some special page you generate to point at all the documents 
>> you want to crawl, make sure it has a complete list.  If you try to make an 
>> incremental list of just the new documents, then all the old ones will get 
>> removed.
>> 
>> Karl
>> 
>> 
>>> On Fri, Nov 1, 2019 at 6:41 AM Priya Arora  wrote:
>>> Yes, I have set Authenticity properly, as we have configured this setting 
>>> by passing this info in header.
>>> 
>>> (1) They are now unreachable, whereas they were reachable before by the 
>>> specified number of hops from the seed documents; -But If I compared it 
>>> with the previous index where data is not much old(like a week before), 
>>> documents(deleted one) were ingested and when i am checking its not 
>>> resulting in 404.
>>> Regarding  the specified number of hops from the seed documents;:- Can you 
>>> please help me with little bit of elaboration
>>> 
>>> Thanks
>>> Priya
>>> 
>>>> On Fri, Nov 1, 2019 at 3:43 PM Karl Wright  wrote:
>>>> Hi Priya,
>>>> 
>>>> ManifoldCF doesn't delete documents unless:
>>>> (1) They are now unreachable, whereas they were reachable before by the 
>>>> specified number of hops from the seed documents;
>>>> (2) They cannot be fetched due to a 404 error, or something similar which 
>>>> tells ManifoldCF that they are not available.
>>>> 
>>>> Your site, I notice, has a "sso" page.  Are you setting up session 
>>>> authentication properly?
>>>> 
>>>> Karl
>>>> 
>>>> 
>>>>> On Fri, Nov 1, 2019 at 3:59 AM Priya Arora  wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Screenshot the Deleted documents other than PDF's
>>>>> 
>>>>>> On Fri, Nov 1, 2019 at 1:28 PM Priya Arora  wrote:
>>>>>> The jib was started as per below schedule:-
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> And just before the completion of the job. It started the Deletion 
>>>>>> process. Before starting the job a new index in ES was taken and the 
>>>>>> Database was cleaned up before starting the jib.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Records were processed and indexed successfully. When I am checking this 
>>>>>> URL(those are Deleted) on a browser, it seems to be a valid URl and is 
>>>>>> accessible.
>>>>>> Job is to crawl around 2.25 lakhs of records so the seeded url have many 
>>>>>> sub-links within. If we think the URL;s were already present in Database 
>>>>>> that why somehow crawler deletes it, it should not be the case, as the 
>>>>>> database clean up processed has been done before run.
>>>>>> 
>>>>>> If we think the crawler is deleting only documents related to PDF 
>>>>>> extension, this is not the case, as other HTML pages are also deleted.
>>>>>> 
>>>>>> Can you please suggest something on this.
>>>>>> 
>>>>>> Thanks
>>>>>> Priya
>>>>>> 
>>>>>>> On Wed, Oct 30, 2019 at 3:39 PM Karl Wright  wrote:
>>>>>>> So it looks like the UR

Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
Indexation screenshot is as below.

[image: image.png]

On Tue, Oct 29, 2019 at 7:57 PM Karl Wright  wrote:

> I need both ingestion and deletion.
> Karl
>
>
> On Tue, Oct 29, 2019 at 8:09 AM Priya Arora  wrote:
>
>> The history is shown below; it does not indicate any error.
>> [image: 12.JPG]
>>
>> Thanks
>> Priya
>>
>> On Tue, Oct 29, 2019 at 5:02 PM Karl Wright  wrote:
>>
>>> What does the history say about these documents?
>>> Karl
>>>
>>> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora  wrote:
>>>
>>>>
>>>>  it may be that (a) they weren't found, or (b) that the document
>>>> specification in the job changed and they are no longer included in the 
>>>> job.
>>>>
>>>> URL's that were deleted are valid URL's(as that does not result in 404
>>>> or page not found error), and it is not being mentioned in Exclusion tab of
>>>> job configuration.
>>>> And the URL's were getting indexed earlier and except for index name in
>>>> Elasticsearch nothing is changed in Job specification and in other
>>>> connectors.
>>>>
>>>> Thanks
>>>> Priya
>>>>
>>>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright  wrote:
>>>>
>>>>> ManifoldCF is an incremental crawler, which means that on every
>>>>> (non-continuous) job run it sees which documents it can find and removes
>>>>> the ones it can't.  The history for the documents being deleted should 
>>>>> tell
>>>>> you why they are being deleted -- it may be that (a) they weren't found, 
>>>>> or
>>>>> (b) that the document specification in the job changed and they are no
>>>>> longer included in the job.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a query regarding ManifoldCF Job process.I have a job to crawl
>>>>>> intranet site
>>>>>> Repository Type:- Web
>>>>>> Output Connector Type:- Elastic search.
>>>>>>
>>>>>> Job have to crawl around4-5 lakhs of total records. I have discarded
>>>>>> the previous index and created a new index(in Elasticsearch) with proper
>>>>>> mappings and settings and started the job again after cleaning Database
>>>>>> even(Database used a PostgreSQL).
>>>>>> But while the job continues its ingests the records properly but just
>>>>>> before finishing (some times in between also), it initiates the process 
>>>>>> of
>>>>>> Deletions and also it does not index the deleted documents again in 
>>>>>> index.
>>>>>>
>>>>>> Can you please something if I am doing anything wrong? or is this a
>>>>>> process of manifoldcf if yes , why its not getting ingested again.
>>>>>>
>>>>>> Thanks and regards
>>>>>> Priya
>>>>>>
>>>>>>


Re: Manifoldcf - Job Deletion Process

2019-10-29 Thread Priya Arora
The history is shown below; it does not indicate any error.
[image: 12.JPG]

Thanks
Priya

On Tue, Oct 29, 2019 at 5:02 PM Karl Wright  wrote:

> What does the history say about these documents?
> Karl
>
> On Tue, Oct 29, 2019 at 6:53 AM Priya Arora  wrote:
>
>>
>>  it may be that (a) they weren't found, or (b) that the document
>> specification in the job changed and they are no longer included in the job.
>>
>> The URLs that were deleted are valid URLs (they do not result in a 404
>> or page-not-found error), and they are not mentioned in the Exclusion tab
>> of the job configuration.
>> The URLs were getting indexed earlier, and except for the index name in
>> Elasticsearch nothing has changed in the job specification or in the other
>> connectors.
>>
>> Thanks
>> Priya
>>
>> On Tue, Oct 29, 2019 at 3:40 PM Karl Wright  wrote:
>>
>>> ManifoldCF is an incremental crawler, which means that on every
>>> (non-continuous) job run it sees which documents it can find and removes
>>> the ones it can't.  The history for the documents being deleted should tell
>>> you why they are being deleted -- it may be that (a) they weren't found, or
>>> (b) that the document specification in the job changed and they are no
>>> longer included in the job.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Oct 29, 2019 at 5:30 AM Priya Arora  wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have a query regarding the ManifoldCF job process. I have a job to
>>>> crawl an intranet site:
>>>> Repository type: Web
>>>> Output connector type: Elasticsearch
>>>>
>>>> The job has to crawl around 4-5 lakh (400,000-500,000) records in total.
>>>> I discarded the previous index, created a new index (in Elasticsearch)
>>>> with proper mappings and settings, and started the job again after even
>>>> cleaning the database (the database used is PostgreSQL).
>>>> While the job runs it ingests the records properly, but just before
>>>> finishing (sometimes in between as well) it initiates the deletion
>>>> process, and it does not index the deleted documents again.
>>>>
>>>> Can you please say whether I am doing anything wrong? Or is this part of
>>>> the ManifoldCF process? If yes, why is it not getting ingested again?
>>>>
>>>> Thanks and regards
>>>> Priya
>>>>
>>>>


Re: Manifold with OpenJDK

2019-10-17 Thread Priya Arora
Hi Sreejith,

Can you please let me know the JAVA_HOME variable value you have set?

Thanks
Priya
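
Independent of what JAVA_HOME is set to, a small way to confirm which JVM
ManifoldCF actually runs on is to print the runtime properties with the same
java binary the start scripts use (a minimal sketch; the class name is
arbitrary):

    public class JvmInfo {
        public static void main(String[] args) {
            // These reflect the JVM that actually executed this class.
            System.out.println("java.version = " + System.getProperty("java.version"));
            System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
            System.out.println("java.home    = " + System.getProperty("java.home"));
        }
    }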

On Thu, Oct 17, 2019 at 12:17 PM SREEJITH va  wrote:

> Hi, we use ManifoldCF with OpenJDK version "1.8.0_222" on CentOS and did not
> face any issues.
>
> On Thu, Oct 17, 2019 at 12:04 PM Bisonti Mario 
> wrote:
>
>> Hallo, I use Ubuntu 18.04.02 LTS with:
>> openjdk version "11.0.4" 2019-07-16
>>
>>
>>
>> And I have no issue with ManifoldCF
>>
>>
>>
>> Mario
>>
>>
>>
>> *From:* Markus Schuch 
>> *Sent:* Thursday, 17 October 2019 07:35
>> *To:* user@manifoldcf.apache.org; Praveen Bejji 
>> *Subject:* Re: Manifold with OpenJDK
>>
>>
>>
>> Hi Praveen,
>>
>> we have used OpenJDK 8 on dockerized Red Hat Linux for 2 years now and
>> haven't had problems with it.
>>
>> We had one minor issue when we migrated: the image-processing
>> capabilities of OpenJDK are somewhat different from the Oracle JDK. One of
>> our connectors creates image thumbnails, and on OpenJDK some results had
>> weird colours.
>>
>> Cheers
>> Markus
>>
>> Am 16. Oktober 2019 17:32:16 MESZ schrieb Praveen Bejji <
>> praveen.b...@gmail.com>:
>>
>> Hi,
>>
>>
>>
>> We are planning on using ManifoldCF with OpenJDK 1.8 on a Linux server.
>> Can you please let us know if there are any known issues/challenges with
>> using ManifoldCF with OpenJDK?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Praveen
>>
>>
>> --
>> This message was sent from my Android mobile phone with K-9 Mail.
>>
>
>
> --
> Regards
> -Sreejith
>


Re: Manifold Crawler Crashes

2019-06-20 Thread Priya Arora
> I would highly recommend moving to Postgresql if you have any really
> sizable crawl.

Yes, we are already using PostgreSQL 9.6.10 for it. Below are the settings
in the postgresql.conf file on our Postgres server:

max_connections = 100
shared_buffers = 128MB
#temp_buffers = 8MB
#max_prepared_transactions = 0
#max_files_per_process = 1000
#autovacuum = on
#deadlock_timeout = 1s
#max_locks_per_transaction = 64
#max_pred_locks_per_transaction = 64

Can you please check whether these parameters are sufficient to handle
multiple jobs ingesting large amounts of data (8 lakh / 800,000 or more
records) into an index? If not, can you please let me know what these
parameters should be, at maximum, to have an optimal run of the jobs?

> Alternatively you could just hand the manifoldCF process more memory.  Your
> choice.

Can you please help me with this, how to achieve it?

Also, do we have to reduce the number of maximum connections in both the
repository and output connections? Can this be a symptom of the heavy memory
load (due to multiple jobs running all together) that causes the heap
out-of-memory error?
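
One way to see how many connections ManifoldCF is actually holding against
Postgres while the jobs run is to ask pg_stat_activity over JDBC (a sketch;
the JDBC URL and credentials are placeholders for the ManifoldCF database):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PgConnectionCount {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://localhost:5432/dbname";
            try (Connection c = DriverManager.getConnection(url, "manifoldcf", "password");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()")) {
                rs.next();
                System.out.println("Open connections to this database: " + rs.getInt(1));
            }
        }
    }

If that number approaches max_connections (100 above), Postgres will start
refusing new connections, so the per-connection maximums configured in
ManifoldCF plus its database handle pool need to stay below that limit.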




On Thu, Jun 20, 2019 at 5:04 PM Karl Wright  wrote:

> If you are running single-process on top of HSQLDB, all database tables
> are kept in memory so you need a lot of memory.
>
> I would highly recommend moving to Postgresql if you have any really
> sizable crawl.
>
> Alternatively you could just hand the manifoldCF process more memory.
> Your choice.
>
> However, if you cannot even use bash to get into the instance, something
> far more serious is happening to your docker world.
>
> Karl
>
>
> On Thu, Jun 20, 2019 at 6:27 AM Priya Arora  wrote:
>
>> Hi Karl,
>> 1) It's single process deployment process.
>> 2) Not  able to access through bash(during crash happens)
>> 3) Server Configuration:-
>>  For Crawler server - 16 GB RAM and 8-Core Intel(R) Xeon(R) CPU E5-2660
>> v3 @ 2.60GHz and
>> For Elasticsearch server - 48GB and 1-Core Intel(R) Xeon(R) CPU E5-2660
>> v3 @ 2.60GHz
>> 4) Manifold configuration:-
>> Repository Max connection:-48
>> Output Max connections:-48
>>
>> This crash happens when we are running more than two parallel jobs with
>> almost same configuration at a time.
>> [image: image.png]
>>
>> Also, facing these warnings in the log file.It seems to be the reason for
>> crash.
>>
>> agents process ran out of memory - shutting down
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:3308)
>> at java.util.BitSet.ensureCapacity(BitSet.java:337)
>> at java.util.BitSet.expandTo(BitSet.java:352)
>> at java.util.BitSet.set(BitSet.java:447)
>> at
>> de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
>> at
>> org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
>> at
>> org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
>> at
>> org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
>> at
>> org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
>> at
>> org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
>> at
>> org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
>> at
>> org.apache.tika.sax.ContentHandlerDecorator.characters(Co

Re: Manifold Crawler Crashes

2019-06-20 Thread Priya Arora
Hi Karl,
1) It's a single-process deployment.
2) Not able to access it through bash (while the crash is happening).
3) Server configuration:
For the crawler server - 16 GB RAM and an 8-core Intel(R) Xeon(R) CPU E5-2660 v3
@ 2.60GHz, and
for the Elasticsearch server - 48 GB and a 1-core Intel(R) Xeon(R) CPU E5-2660 v3
@ 2.60GHz
4) ManifoldCF configuration:
Repository max connections: 48
Output max connections: 48

This crash happens when we are running more than two parallel jobs with
almost the same configuration at a time.
[image: image.png]

Also, we are seeing these warnings in the log file; they seem to be the
reason for the crash.

agents process ran out of memory - shutting down
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at java.util.BitSet.ensureCapacity(BitSet.java:337)
at java.util.BitSet.expandTo(BitSet.java:352)
at java.util.BitSet.set(BitSet.java:447)
at
de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler.characters(BoilerpipeHTMLContentHandler.java:267)
at
org.apache.tika.parser.html.BoilerpipeContentHandler.characters(BoilerpipeContentHandler.java:155)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
at
org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
at
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
at
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
at
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:47)
at
org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:83)
at
org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:141)
at
org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:288)
at
org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:284)
at
org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
at
org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
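
The stack shows the out-of-memory happening while Tika's Boilerpipe handler
buffers extracted text in memory. One way to check whether a particular
document is responsible, outside of ManifoldCF, is to parse it directly with
Tika under a write limit (a sketch; the file path is a placeholder and the
1 MB cap is arbitrary):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaProbe {
        public static void main(String[] args) throws Exception {
            // Cap extracted text at ~1 MB; Tika stops (with an exception) once
            // the limit is reached instead of buffering the whole document.
            BodyContentHandler handler = new BodyContentHandler(1_000_000);
            Metadata metadata = new Metadata();
            try (InputStream in = Files.newInputStream(Paths.get("/tmp/suspect-document.html"))) {
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            } catch (Exception e) {
                System.out.println("Parse stopped: " + e.getMessage());
            }
            System.out.println("Extracted characters: " + handler.toString().length());
        }
    }

Inside ManifoldCF, a content-limiting transformation earlier in the job
pipeline serves a similar purpose by truncating oversized documents before
extraction.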

On Thu, Jun 20, 2019 at 3:36 PM Karl Wright  wrote:

> Hi Priya,
>
> Being unable to reach the web interface sounds like either a network issue
> or a problem with the app server.
>
> Can you describe the configuration you are running in?  Is this a
> multiprocess deployment or a single-process deployment?
>
> When your docker container dies, can you still reach it via the standard
> in-container bash tools?  What is happening there?
>
> Karl
>
>
> On Thu, Jun 20, 2019 at 5:54 AM Priya Arora  wrote:
>
>> Hi Karl,
>>
>> Crash here means, "the site could not be reached" kind of HTML page
>> appears , when accessing http://localhost:3000/mcf-crawler-ui/index.jsp.
>> Explanation:- When running certain job on ManifoldCF server(2.13) after
>> sometime (of successful running state), suddenly browser gives me "the site
>> could not be reached" (this kind of error) and page does not reload until i
>> restart it through docker command.
>> once i will restart the container through docker MCF get to load again.
>>
>> Thanks
>> Priya
>>
>> On Thu, Jun 20, 2019 at 3:08 PM Karl Wright  wrote:
>>
>>> Please describe what you mean by "crash".  What actually happens?
>>>
>>> Karl
>

Re: Manifold Crawler Crashes

2019-06-20 Thread Priya Arora
Hi Karl,

Crash here means that a "the site could not be reached" kind of HTML page
appears when accessing http://localhost:3000/mcf-crawler-ui/index.jsp.
Explanation: when running a certain job on the ManifoldCF server (2.13), after
some time (of successful running), the browser suddenly gives me "the site
could not be reached" (that kind of error) and the page does not reload until I
restart the container through a docker command.
Once I restart the container through Docker, MCF loads again.

Thanks
Priya

On Thu, Jun 20, 2019 at 3:08 PM Karl Wright  wrote:

> Please describe what you mean by "crash".  What actually happens?
>
> Karl
>
> On Thu, Jun 20, 2019, 2:04 AM Priya Arora  wrote:
>
>>
>>
>> Hi,
>>
>> I am running multiple jobs(2,3) simultaneously on Manifold server and the
>> configuration is
>>
>> 1) For Crawler server - 16 GB RAM and 8-Core Intel(R) Xeon(R) CPU
>> E5-2660 v3 @ 2.60GHz and
>>
>> 2) For Elasticsearch server - 48GB and 1-Core Intel(R) Xeon(R) CPU
>> E5-2660 v3 @ 2.60GHz
>> Job working is to fetch data from some public and intranet sites and then
>> ingesting data into Elastic search.
>>
>> Maximum connection on both Repository connections and Output connection
>> is 48(for all 3 jobs).
>>
>> What problem i am facing here is when i am running multiple jobs the
>> manifold crashes after some time and there is nothing inside manifold.log
>> files that hints out me some error.
>> Is the maximum connections increases(48+48+48) while running all three
>> jobs together?
>> So do i need to divide max connections(48) among all three jobs?
>> How many connections maximum we can have to run the jobs individually and
>> simultaneously.
>>
>> what should be the maximum allowed number of max handles in
>> properties.xml file and postgres config file?
>>
>> So the problem is to figure out what is the reason for the crawler crash.
>> Can you please help me on that as soon as possible.
>>
>> Thanks and regards
>> Priya
>> pr...@smartshore.nl
>>
>>
>>


Fwd: Manifold Crawler Crashes

2019-06-20 Thread Priya Arora
Hi,

I am running multiple jobs (2-3) simultaneously on the ManifoldCF server, and
the configuration is:

1) For Crawler server - 16 GB RAM and 8-Core Intel(R) Xeon(R) CPU E5-2660
v3 @ 2.60GHz and

2) For Elasticsearch server - 48GB and 1-Core Intel(R) Xeon(R) CPU E5-2660
v3 @ 2.60GHz
The jobs fetch data from some public and intranet sites and then ingest the
data into Elasticsearch.

The maximum connections on both the repository connections and the output
connection is 48 (for all 3 jobs).

The problem I am facing is that when I run multiple jobs, ManifoldCF crashes
after some time, and there is nothing inside the manifoldcf.log files that
hints at an error.
Do the maximum connections add up (48+48+48) while running all three
jobs together?
So do I need to divide the max connections (48) among all three jobs?
What is the maximum number of connections we can have to run the jobs
individually and simultaneously?

What should be the maximum allowed number of max handles in the properties.xml
file and the Postgres config file?

So the problem is to figure out the reason for the crawler crash.
Can you please help me with that as soon as possible?

Thanks and regards
Priya
pr...@smartshore.nl