Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
re added to the queue. >> Hopcounts are stored for each document in the hopcount table. So if you >> change a hopcount limit, it is quite possible that nothing will change >> unless documents that are at the previous hopcount limit are re-evaluated. >> I believe there is no lo

Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
at the previous hopcount limit are re-evaluated. > I believe there is no logic in ManifoldCF for that at this time, but I'd > have to review the codebase to be certain of that. > > What that means is that you can't increase the hopcount limit and expect > the next crawl to p

Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
are re-evaluated. I believe there is no logic in ManifoldCF for that at this time, but I'd have to review the codebase to be certain of that. What that means is that you can't increase the hopcount limit and expect the next crawl to pick up the documents you excluded before with the hopcount
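
The behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not ManifoldCF code, the class `HopcountSketch`, the link graph, and the `reevaluate` flag are all hypothetical, and it assumes (as Karl suspects, without having confirmed it) that documents already evaluated at the old limit are not looked at again.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT ManifoldCF code; every name here is hypothetical.
final class HopcountSketch {
  // Hypothetical link graph: seed -> a -> b.
  static final Map<String, List<String>> LINKS = Map.of(
      "seed", List.of("a"),
      "a", List.of("b"),
      "b", List.of());

  // Stored hop count per known document (stands in for the hopcount table).
  private final Map<String, Integer> hopcounts = new HashMap<>();
  // Documents whose outgoing links have already been evaluated.
  private final Set<String> evaluated = new HashSet<>();

  // Runs one crawl and returns the documents that are in scope afterwards.
  // Already-evaluated documents are skipped unless reevaluate is true.
  Set<String> crawl(int limit, boolean reevaluate) {
    hopcounts.putIfAbsent("seed", 0);
    Deque<String> queue = new ArrayDeque<>(hopcounts.keySet());
    Set<String> visitedThisRun = new HashSet<>();
    while (!queue.isEmpty()) {
      String doc = queue.poll();
      if (!visitedThisRun.add(doc)) continue;                 // handle each doc once per run
      if (evaluated.contains(doc) && !reevaluate) continue;   // unchanged doc: links not re-read
      evaluated.add(doc);
      int childHop = hopcounts.get(doc) + 1;
      if (childHop > limit) continue;                         // children would be out of scope
      for (String child : LINKS.getOrDefault(doc, List.of())) {
        if (hopcounts.putIfAbsent(child, childHop) == null) {
          queue.add(child);                                   // newly discovered document
        }
      }
    }
    return new TreeSet<>(hopcounts.keySet());
  }
}
```

In this model, a crawl with limit 1 leaves `b` out of scope, a second crawl with limit 2 still does not find `b` because `a` is never re-read, and only a crawl that re-evaluates `a` picks it up.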

Re: Documents Out Of Scope and hop count

2023-09-26 Thread Marisol Redondo
or (Solr >> connector) >> After that, the same pages are still out of scope, as if the limit had been >> set to 1, and they are not indexed. >> >> I have tried "Reset seeding", thinking that maybe the pages need to be >> checked again, but I am still having the sam

Re: Documents Out Of Scope and hop count

2023-09-26 Thread Karl Wright
t maybe the pages need to be > checked again, but I am still having the same problem. I don't think the problem > is with the output, but I have also used the options "Re-index all associated > documents" and "Remove all associated records" with the same result. > I don't want

Re: web crawler https

2023-09-25 Thread Karl Wright
See this article: https://stackoverflow.com/questions/6784463/error-trustanchors-parameter-must-be-non-empty ManifoldCF web crawler configuration allows you to drop certs into a local trust store for the connection. You need to either do that (adding whatever certificate authority cert you

Re: Duplicate key value violates unique constraint "repohistory_pkey"

2023-06-16 Thread Marisol Redondo
Hi, Did you find a solution for that, or do you still have the history disabled? I'm having the same problem, and we are using PostgreSQL as the DB. Regards On Sun, 29 Jan 2023 at 05:48, Artem Abeleshev wrote: > Hi everyone! > > We are using ManifoldCF 2.22.1 with multiple nodes in our

Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
But if those are set, and the connection health check passes, then I can't tell you why Solr is unhappy with your connection. It's clearly working sometimes. I'd look on the Solr end to figure out whether its rejection is coming from just one of your instances. On Wed, Jun 7, 2023 at 7:49 AM

Re: Solr connector authentication issue

2023-06-07 Thread Karl Wright
The Solr output connection configuration contains all credentials that are sent to Solr. If those aren't set Solr won't get them. Karl On Wed, Jun 7, 2023 at 7:23 AM Marisol Redondo < marisol.redondo.gar...@gmail.com> wrote: > Hi, > > We are using Solr 8 with basic authentication, and when

Re: Long Job on Windows Share

2023-05-25 Thread Karl Wright
ares” works for nearly 18 > hours. > > My document count is about 1 million. > > > > If I check the documents scanned by ManifoldCF I see, for example: > > > > It seems that it re-works the documents every day even if they haven’t been > modified. > > So,

Re: Apache Manifold Documentum connector

2023-03-17 Thread Rasťa Šíša
Thanks a lot for your kind and elaborate response! I will do some further investigation of Documentum on my own. Best regards, Rasta On Fri, 17 Mar 2023 at 12:08, Karl Wright wrote: > It was open-sourced back in 2012 at the same time ManifoldCF was > open-sourced. It was written

Re: Apache Manifold Documentum connector

2023-03-17 Thread Karl Wright
It was open-sourced back in 2012 at the same time ManifoldCF was open-sourced. It was written by a contractor paid by MetaCarta, who also paid for the development of ManifoldCF itself (I developed that). It was spun off as open source when MetaCarta was bought by Nokia who had no interest in the

Re: Apache Manifold Documentum connector

2023-03-17 Thread Rasťa Šíša
Hi Karl, thanks for your answer! Would you be able to point me towards the author/git branch of the documentum connector? Best regards, Rasta On Thu, 16 Mar 2023 at 20:58, Karl Wright wrote: > Hi, > > I didn't write the documentum connector initially, so I trust that the > engineer who did

Re: Apache Manifold Documentum connector

2023-03-16 Thread Karl Wright
Hi, I didn't write the documentum connector initially, so I trust that the engineer who did knew how to construct the proper DQL. I've not seen any bugs related to it so it does seem to work. Karl On Thu, Mar 16, 2023 at 8:23 AM Rasťa Šíša wrote: > Hello, > I would like to ask how does

Re: Job stucked with cleaning up status

2023-02-03 Thread Karl Wright
The shutdown procedure for ManifoldCF involves sending interruptions (or socket interruptions) to all worker threads. These then put the threads in the "terminated" state, one by one. So you should only get this if you shut down the agents process, or try to. The handling for this is correct,

Re: Job stucked with cleaning up status

2023-02-02 Thread Artem Abeleshev
Karl, good day! Thank you for the hint! It was very useful! Actually, you were right and the actual problem was with the connection. But I didn't expect it to be so dramatic. Here is what I found using some debugging: First I found the actual code that was responsible for the deletion

Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-02-01 Thread Karl Wright
> 0x7f051c50a000,0x7f051c60a000] [id=2537470] > > > > Stack: [0x7f051c50a000,0x7f051c60a000], sp=0x7f051c608080, > free space=1016k > > Native frames: (J=compiled Java code, A=aot compiled Java code, > j=interpreted, Vv=VM code, C=native code) > >

Re: Job stucked with cleaning up status

2023-01-29 Thread Karl Wright
Hi, 2.22 makes no changes to the way document deletions are processed over probably 10 previous versions of ManifoldCF. What likely is the case is that the connection to the output for the job you are cleaning up is down. When that happens, the documents are queued but the delete worker threads

Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
> > > > > > *From:* Karl Wright > *Sent:* Wednesday, 18 January 2023 12:10 > *To:* user@manifoldcf.apache.org > *Subject:* Re: JCIFS: Possibly transient exception detected on attempt 1 > while getting share security: All pipe instances are busy > > > > Hi, "

Re: JCIFS: Possibly transient exception detected on attempt 1 while getting share security: All pipe instances are busy

2023-01-18 Thread Karl Wright
Hi, "Possibly transient issue" means that the error will be retried anyway, according to a schedule. There should be no need to shut down the agents process and restart. Karl On Wed, Jan 18, 2023 at 5:08 AM Bisonti Mario wrote: > Hi. > > Often, I obtain the error: > > WARN

Re: Help for subscribing the user mailing list of MCF

2023-01-10 Thread Koji Sekiguchi
Hi Karl, I agree. BTW, Artem, my colleague, finally succeeded in subscribing. He tried to subscribe several more times before opening a JIRA ticket in INFRA, and he finally got some responses from the ML system. Maybe they restarted the system or did something else. Thanks! Koji On Tue, 10 Jan 2023 at 20:17

Re: Help for subscribing the user mailing list of MCF

2023-01-10 Thread Karl Wright
Hmm - I haven't heard of difficulties like this before. The mail manager is used apache-wide; if it doesn't work the best thing to do would be to create an infra ticket in JIRA. Karl On Tue, Jan 10, 2023 at 3:50 AM Koji Sekiguchi wrote: > Hi Karl, everyone! > > I'm writing to the moderator

Re: Is Manifold capable of handling these kind of files

2022-12-23 Thread Karl Wright
The internals of ManifoldCF will handle this fine if you are sure to set the encoding of your database to be UTF-8. However, I don't know about the JCIFS library, and whether there might be a restriction on characters in that code base. I think you'd have to just try it and see, frankly. Karl

Re: Manifoldcf -XML parsing error: Character reference "" is an invalid XML character.

2022-12-22 Thread ritika jain
Can anybody provide any clue on this? It would be of great help. On Thu, Dec 22, 2022 at 5:33 PM ritika jain wrote: > Hi all, > > I am using Manifoldcf 2.21 version with Windows shares connector and > Output as Elastic. > I am facing this error while clicking "List all jobs", Manifoldcf, jobs >

Re: Unscribe

2022-10-22 Thread Muhammed Olgun
Hi Ronny, Unsubscribing is self-service. Please follow here, https://manifoldcf.apache.org/en_US/mail.html On 22 Oct 2022 Sat at 08:55 Ronny Heylen wrote: > Hi, > Please unscribe me from these emails, I don't work anymore. > > Regards, > > Ronny >

Re: Frequent error while window shares job

2022-08-22 Thread Karl Wright
You will need to contact the current maintainers of the Jcifs library to get answers to these questions. Karl On Mon, Aug 22, 2022 at 3:27 AM ritika jain wrote: > Hi All, > > I have a Windows shared job to crawl files from samba server, it's a huge > job to crawl documents in millions(about

Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-14 Thread Karl Wright
Remember, there is already a "forget" button on the output connection, which will remove everything associated with the connection. It's meant to be used when the output index has been reset and is empty. I'm not sure what you'd do different functionally. Karl On Tue, Jun 14, 2022 at 2:04 AM

Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-14 Thread Koji Sekiguchi
+1. I respect the design concept of ManifoldCF, but I think force delete options would make MCF more useful for those who use MCF as a crawler. Adding force delete options doesn't change default behaviors and it doesn't break back-compatibility. Koji On 2022/06/14 14:46, Ricardo Ruiz wrote: Hi

Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-13 Thread Ricardo Ruiz
Hi Karl We are using ManifoldCF as a crawler more than a synchronizer. We are thinking of contributing to ManifoldCF by including a force job delete and force output connector delete, considering of course the things that need to be deleted with them (DB, etc.). Do you think this is possible? We

Re: Can't delete a job when solr output connection can't connect to the instance.

2022-06-13 Thread Karl Wright
Because ManifoldCF is not just a crawler, but a synchronizer, a job represents and includes a list of documents that have been indexed. Deleting the job requires deleting the documents that have been indexed also. It's part of the basic model. So if you tear down your target output instance and

Re: Job Service Interruption- and stops

2022-04-29 Thread Karl Wright
" repeated service interruption" means that it happens again and again. For this particular document, the problem is that the error we are seeing is: "The process cannot access the file because it is being used by another process." ManifoldCF assumes that if it retries enough it should be able

Re: Log4j Update Doubt

2022-03-15 Thread Karl Wright
We cannot do back patches of older versions of ManifoldCF. There is a new release which shipped in January that addresses log4j issues. I suggest updating to that. Karl On Tue, Mar 15, 2022 at 8:59 AM ritika jain wrote: > Hi, > > How manifoldcf uses log4j files in bin

Re: Manifoldcf freezes and sit idle

2022-01-31 Thread Karl Wright
As I've mentioned before, the best way to diagnose problems like this is to get a thread dump of the agents process. There are many potential reasons it could occur, ranging from stuck locks to resource starvation. What locking model are you using? Karl On Mon, Jan 31, 2022 at 6:02 AM ritika

Re: Log4j dependency

2021-12-14 Thread Karl Wright
ManifoldCF framework and connectors use log4j 2.x to dump information to the ManifoldCF log file. Please read the following page: https://logging.apache.org/log4j/2.x/security.html Specifically, this part: 'Description: Apache Log4j2 <=2.14.1 JNDI features used in configuration, log messages,

Re: Log4j dependency

2021-12-14 Thread Furkan KAMACI
Hi Ritika, For Maven check here: https://github.com/apache/manifoldcf/blob/trunk/pom.xml#L80 For Ant check here: https://github.com/apache/manifoldcf/blob/trunk/build.xml#L87 Kind Regards, Furkan KAMACI On Tue, Dec 14, 2021 at 12:41 PM ritika jain wrote: > Hi All, > > How does manifold.cf

Re: Manifoldcf background process

2021-11-18 Thread Karl Wright
The degree of parallelism can be controlled in two ways. The first way is to set the number of worker threads to something reasonable. Usually, this is no more than about 2x the number of processors you have. The second way is to control the number of connections in your jcifs connector to keep
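
As a rough illustration of the first rule of thumb, the sketch below computes the suggested ceiling for the local machine. The class and method names are hypothetical. The worker thread count itself is, to the best of my knowledge, configured through the `org.apache.manifoldcf.crawler.threads` property in properties.xml; please verify that against the documentation for your version.

```java
// Hypothetical helper: computes the "no more than about 2x the number of
// processors" ceiling mentioned in the message above.
final class WorkerThreadSketch {
  static int suggestedMaxWorkerThreads(int processors) {
    if (processors < 1) {
      throw new IllegalArgumentException("processors must be >= 1");
    }
    return 2 * processors;
  }

  public static void main(String[] args) {
    // Number of processors visible to this JVM (may be limited inside a container).
    int processors = Runtime.getRuntime().availableProcessors();
    System.out.println("processors=" + processors
        + ", suggested max worker threads=" + suggestedMaxWorkerThreads(processors));
  }
}
```

The figure is only an upper bound for tuning; the second control, the connection limit on the connector, is set separately in the connection configuration.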

Re: Manifold Job process isssue

2021-11-15 Thread Karl Wright
SMB exceptions with jcifs in the trace tell us that JCIFS couldn't talk to your windows share server. That's all we can tell though. Karl On Mon, Nov 15, 2021 at 7:24 AM ritika jain wrote: > Hi, > > Raising the concern above again, to process only 60k of document (when > clock issue is fixed

Re: Manifold Job process isssue

2021-11-15 Thread ritika jain
Hi, Raising the concern above again: even with only 60k documents to process (and with the clock issue fixed too), the job is not progressing; it stays stuck for days. So we have had to restart the Docker container every time for it to proceed. This time we are getting this: Timeout Exception. What

Re: Manifold Job process isssue

2021-11-09 Thread Karl Wright
One hour is quite a lot and will wreak havoc on the document queue. Karl On Tue, Nov 9, 2021 at 7:08 AM ritika jain wrote: > I have checked, there is only one hour time difference between docker > container and docker host > > On Tue, Nov 9, 2021 at 4:41 PM Karl Wright wrote: > >> If your

Re: Manifold Job process isssue

2021-11-09 Thread ritika jain
I have checked, there is only one hour time difference between docker container and docker host On Tue, Nov 9, 2021 at 4:41 PM Karl Wright wrote: > If your docker image's clock is out of sync badly with the real world, > then System.currentTimeMillis() may give bogus values, and ManifoldCF uses

Re: Manifold Job process isssue

2021-11-09 Thread Karl Wright
If your docker image's clock is out of sync badly with the real world, then System.currentTimeMillis() may give bogus values, and ManifoldCF uses that to manage throttling etc. I don't know if that is the correct explanation but it's the only thing I can think of. Karl On Tue, Nov 9, 2021 at
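
One way to check the suspicion above is to compare the JVM's clock inside the container with a timestamp taken on the host. The sketch below is only an illustration: the class name, the tolerance, and the idea of passing the host time as a command-line argument (for example the output of `date +%s` on the host) are assumptions, not anything ManifoldCF provides.

```java
// Hypothetical diagnostic: compares System.currentTimeMillis() in this JVM
// with a reference time supplied from outside (e.g. from the Docker host).
final class ClockSkewSketch {
  // Positive result: local clock is ahead of the reference; negative: behind.
  static long skewMillis(long localMillis, long referenceMillis) {
    return localMillis - referenceMillis;
  }

  static boolean isBadlySkewed(long localMillis, long referenceMillis, long toleranceMillis) {
    return Math.abs(skewMillis(localMillis, referenceMillis)) > toleranceMillis;
  }

  public static void main(String[] args) {
    // args[0]: reference time in seconds since the epoch, taken on the host.
    long referenceMillis = Long.parseLong(args[0]) * 1000L;
    long localMillis = System.currentTimeMillis();
    long toleranceMillis = 5_000L; // arbitrary tolerance chosen for this sketch
    System.out.println("skew (ms): " + skewMillis(localMillis, referenceMillis));
    System.out.println("badly skewed: "
        + isBadlySkewed(localMillis, referenceMillis, toleranceMillis));
  }
}
```

`System.currentTimeMillis()` is timezone-independent, so a constant one-hour difference in displayed wall-clock time may be a timezone setting rather than real skew; this check helps tell the two apart.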

Re: Duplicate key error

2021-10-27 Thread Karl Wright
We see errors like this only because MCF is a highly multithreaded application, and two threads sometimes are able to collide in what they are doing even though they are transactionally separated. That is because of bugs in the database software. So if you restart the job it should not encounter

Re: Duplicate key error

2021-10-27 Thread Karl Wright
Is it repeatable? My guess is it is not repeatable. Karl On Wed, Oct 27, 2021 at 4:43 AM ritika jain wrote: > So , it can be left as it is.. ? because it is preventing job to complete > and its stopping. > > On Tue, Oct 26, 2021 at 8:40 PM Karl Wright wrote: > >> That's a database bug. All

Re: Duplicate key error

2021-10-27 Thread ritika jain
So, can it be left as it is? It is preventing the job from completing, and the job stops. On Tue, Oct 26, 2021 at 8:40 PM Karl Wright wrote: > That's a database bug. All of our underlying databases have some bugs of > this kind. > > Karl > > > On Tue, Oct 26, 2021 at 9:17 AM ritika jain >

Re:

2021-10-26 Thread Karl Wright
That's a database bug. All of our underlying databases have some bugs of this kind. Karl On Tue, Oct 26, 2021 at 9:17 AM ritika jain wrote: > Hi All, > > While using Manifoldcf 2.14 with Web connector and ES connector. After a > certain time of continuing the job (jobs ingest some documents

Re: Windows Shares job-Limit on defining no of paths

2021-10-25 Thread Karl Wright
The only limit is that the more you add, the slower it gets. Karl On Mon, Oct 25, 2021 at 6:06 AM ritika jain wrote: > Hi , > Is there any limit on the number of paths we can define in job using > Repository as Window Shares and ES as Output > > Thanks >

Re: Null Pointer Exception

2021-10-25 Thread Karl Wright
The API should really catch this situation. Basically, you are calling a function that requires an input but you are not providing one. In that case the API sets the input to "null", and the detailed operation is called. The detailed operation is not expecting a null input. This is API piece

Re: Error: Repeated service interruptions - failure processing document: Read timed out

2021-09-30 Thread Karl Wright
Hi, You say this is a "Tika error". Is this Tika as a stand-alone service? I do not recognize any ManifoldCF classes whatsoever in this thread dump. If this is Tika, I suggest contacting the Tika team. Karl On Thu, Sep 30, 2021 at 3:02 AM Bisonti Mario wrote: > Additional info. > > > > I

Re: Tika Parser Issue

2021-09-07 Thread Karl Wright
This is something you should contact the Tika project about. Karl On Tue, Sep 7, 2021 at 8:46 AM ritika jain wrote: > Hi All, > > I am using tika-core 1.21 and tika-parsers 1.21 jar files as tika > dependencies in Manifoldcf 2.14 version. > Getting some issues while parsing *PDF *files. Some

Re: ZooKeeper leaking or does not handle temporary network failures

2021-08-26 Thread Raman Gupta
I'm having issues with ManifoldCF losing connection to ZooKeeper. This is easily repeatable: I just need to leave ManifoldCF running for a few days. The results are not always "No route to host" as I previously reported -- sometimes it's just connect timeouts or other behavior, but the connection

Re: Query:JCIFS connector

2021-08-23 Thread Karl Wright
I have a work day today, with limited time. The UI is what it is; it does not have capabilities beyond what is stated in the UI and in the manual. It's meant to allow construction of paths piece by piece, not by full subdirectory at a time. You can obviously use the API if you want to construct

Re: Query:JCIFS connector

2021-08-23 Thread ritika jain
Does anybody have a clue about this? On Fri, Aug 20, 2021 at 12:33 PM ritika jain wrote: > Hi All, > > I have a query: is there any way to specify > subdirectories' paths in the file spec of the Windows Shares connector? > > My requirement is to mention the topmost hierarchical

Re: Job Deletion query

2021-08-12 Thread Karl Wright
Yes, when you delete a job, the indexed documents associated with that job are removed from the index. ManifoldCF is a synchronizer, not a crawler, so when you remove the synchronization job then if it didn't delete the indexed documents they would be left dangling. Karl On Thu, Aug 12, 2021

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
Seems to be working now!!! Thanks a lot!!! On Wed, Aug 11, 2021 at 6:22 PM ritika jain wrote: > Hi, > > Yes, this works; the only difference is that when a single file is ingested, > it is ingested as C:/Users/Dell/Desktop/abc.txt/, with an UNWANTED > slash at the end. > > *The file spec part

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
Hi, Yes, this works; the only difference is that when a single file is ingested, it is ingested as C:/Users/Dell/Desktop/abc.txt/, with an UNWANTED slash at the end. *The file spec part should include the file name:* I have tried it this way, and I am getting Access denied. Also checked about all the

Re: Window shares dynamic Job issue

2021-08-11 Thread Karl Wright
The "path" attribute is not meant to include terminal file names, only directories. I'm surprised that this works at all. The file spec part should include the file name. Karl On Wed, Aug 11, 2021 at 2:14 AM ritika jain wrote: > *Dynamic Job * > >

Re: Window shares dynamic Job issue

2021-08-11 Thread ritika jain
*Dynamic Job * {"job":{"_children_":[{"_type_":"id","_value_":"1628595470228"},{"_type_":"description","_value_":"DEMo TEMP

Re: Window shares dynamic Job issue

2021-08-10 Thread Karl Wright
I am sorry, but I'm having trouble understanding how exactly you are configuring the JCIFS connector in these two cases. Can you view the job in each case and provide a cut-and-paste of the view? Karl On Tue, Aug 10, 2021 at 9:09 AM ritika jain wrote: > Hi All, > > I am using Window shares

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
I had a quick look at Jira. I think there is already a ticket which covers the requirement of using a sitemap.xml file which is referenced by robots.txt https://issues.apache.org/jira/browse/CONNECTORS-1657 I'll update this ticket with information from the sitemap protocol page

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
If you wish to add a feature request, please create a CONNECTORS ticket that describes the functionality you think the connector should have. Karl On Wed, Jul 7, 2021 at 9:29 AM h0444xk8 wrote: > Hi, > > yes, that seems to be the reason. In: > > >

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread h0444xk8
Hi, yes, that seems to be the reason. In: https://github.com/apache/manifoldcf/blob/030703a7f2bbfbb5a8dcde529b29ead830a7f60c/connectors/rss/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/rss/Robots.java there is the following code sequence: else if

Re: Is the Web connector supporting zipped sitemap.xml.gz referenced by robots.txt?

2021-07-07 Thread Karl Wright
The robots parsing does not recognize the "sitemaps" line, which was likely not in the spec for robots when this connector was written. Karl On Wed, Jul 7, 2021 at 3:31 AM h0444xk8 wrote: > Hi, > > I have a general question. Is the Web connector supporting sitemap files > referenced by the
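
For readers wondering what recognizing such a line would involve, here is a minimal sketch that pulls `Sitemap:` entries out of robots.txt text. It is not the connector's `Robots.java` logic; the class and method names are hypothetical, and it only extracts the URLs without fetching or decompressing anything.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ManifoldCF: extracts Sitemap URLs from robots.txt.
final class RobotsSitemapSketch {
  static List<String> sitemapUrls(String robotsTxt) {
    List<String> urls = new ArrayList<>();
    for (String rawLine : robotsTxt.split("\\R")) {   // split on any line terminator
      String line = rawLine;
      int hash = line.indexOf('#');                   // strip trailing comments
      if (hash >= 0) {
        line = line.substring(0, hash);
      }
      line = line.trim();
      int colon = line.indexOf(':');                  // first colon separates field and value
      if (colon < 0) {
        continue;
      }
      String field = line.substring(0, colon).trim();
      String value = line.substring(colon + 1).trim();
      if (field.equalsIgnoreCase("sitemap") && !value.isEmpty()) {
        urls.add(value);
      }
    }
    return urls;
  }
}
```

A gzipped sitemap such as sitemap.xml.gz would additionally need decompression after download (for instance with `java.util.zip.GZIPInputStream`) before the XML can be parsed. Note that stripping at `#` would also cut a URL that contains a fragment, which this sketch accepts as a simplification.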

Re: Manifoldcf Redirection process

2021-05-28 Thread Karl Wright
302 does get recognized as a redirection, yes On Fri, May 28, 2021 at 5:07 AM ritika jain wrote: > Is the process the same when fetch/process status code returned is 302 ? When running a job with web crawler and ES output connector >>> > can anybody have a clue about this >

Re: Manifoldcf Redirection process

2021-05-28 Thread ritika jain
> > Is the process the same when fetch/process status code returned is 302 ? >>> When running a job with web crawler and ES output connector >>> >> can anybody have a clue about this

Re: Manifoldcf Redirection process

2021-05-20 Thread ritika jain
Is the process the same when fetch/process status code returned is 302 ? When running a job with web crawler and ES output connector On Wed, May 19, 2021 at 10:35 PM Karl Wright wrote: > ManifoldCF reads all the URLs on its queue. > If it's a 301, it detects this and pushes the new URL onto

Re: Manifoldcf Redirection process

2021-05-19 Thread Karl Wright
ManifoldCF reads all the URLs on its queue. If it's a 301, it detects this and pushes the new URL onto the document queue. When it gets to the new URL, it processes it like any other. Karl On Wed, May 19, 2021 at 8:32 AM ritika jain wrote: > Hi > > I want to understand the process of "How
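
The flow described above can be sketched as a small queue simulation. This is a toy model under stated assumptions, not the web connector's implementation: the class name is hypothetical, and redirects are represented by a plain map from a URL to its 301 target instead of real HTTP fetches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT ManifoldCF code; names are hypothetical.
final class RedirectQueueSketch {
  // redirects maps a URL to the target of its 301; URLs not in the map are ordinary pages.
  static List<String> crawl(String seed, Map<String, String> redirects) {
    Deque<String> queue = new ArrayDeque<>();
    Set<String> seen = new HashSet<>();
    List<String> processed = new ArrayList<>();
    queue.add(seed);
    seen.add(seed);
    while (!queue.isEmpty()) {
      String url = queue.poll();
      String target = redirects.get(url);
      if (target != null) {
        // Redirect detected: push the new URL onto the queue instead of indexing this one.
        if (seen.add(target)) {
          queue.add(target);
        }
        continue;
      }
      // Ordinary document: processed like any other.
      processed.add(url);
    }
    return processed;
  }
}
```

The `seen` set is a design choice of this sketch so that a redirect loop cannot keep the queue busy forever.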

Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
"crashing the manifold" is probably running out of memory, and it is probably due to having too many worker threads and insufficient memory, not the error you found. If that error caused a problem, it would simply abort the job, not "crash" Manifold. Karl On Fri, May 14, 2021 at 4:10 AM ritika

Re: Interrupted while acquiring credits

2021-05-14 Thread ritika jain
It retries 3 times and it usually crashes ManifoldCF. I observed a similar ticket, https://issues.apache.org/jira/browse/CONNECTORS-1633. Is ManifoldCF itself capable of skipping the file that causes the issue instead of aborting the job or crashing ManifoldCF? On Fri, May 14, 2021 at 1:34 PM

Re: Interrupted while acquiring credits

2021-05-14 Thread Karl Wright
'JCIFS: Possibly transient exception detected on attempt 1 while getting share security' Yes, it is going to retry. Karl On Fri, May 14, 2021 at 1:45 AM ritika jain wrote: > Hi, > I am using Windows shares connector in manifoldcf 2.14 and ElasticSearch > connector as Output connector and

Re: Notification connector error

2021-05-11 Thread Karl Wright
This used to work fine, but I suspect that when SSL was declared unsafe, it was disabled, and now only TLS will work. Karl On Tue, May 11, 2021 at 12:13 PM wrote: > Hello, > > > > I am trying to use an email notification connector but without success. > When the connector tries to send an

Re: General questions

2021-04-12 Thread Karl Wright
Hi, There was a book written but never published on ManifoldCF and how to write connectors. It's meant to be extended in that way. The PDFs for the book are available for free online, and they are linked through the manifoldcf web site. Karl On Mon, Apr 12, 2021 at 8:49 AM koch wrote: > Hi

Re: Manifoldcf Deletion Process

2021-03-30 Thread Karl Wright
Hi Ritika, There is no deletion process. Deletion takes place when a job is run in a mode where deletion is possible (there are some where it is not). The way it takes place depends on the kind of repository connector (what model it declares itself to use). For the most common kinds of

Re: Another Elasticsearch patch to allow the long URI

2021-03-25 Thread Shirai Takashi/ 白井隆
Hi, Karl. Karl Wright wrote: >I have now updated (I think) everything that this patch actually has, save >for one deprecated field substitution (the "types" field is now the "doc_" I've confirmed the updated sources via git://git.apache.org/manifoldcf.git, to find some problem in the following

Re: Another Elasticsearch patch to allow the long URI

2021-03-21 Thread Shirai Takashi/ 白井隆
Hi, Karl. Karl Wright wrote: >field). I would like to know more about this. Does the "types" field no >longer work? Should we send both, in order to be sure that the connector >works with most versions of ElasticSearch? Please help clarify so that I >can finish this off. The "types" field

Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
I have now updated (I think) everything that this patch actually has, save for one deprecated field substitution (the "types" field is now the "doc_" field). I would like to know more about this. Does the "types" field no longer work? Should we send both, in order to be sure that the connector

Re: Another Elasticsearch patch to allow the long URI

2021-03-20 Thread Karl Wright
Hi, Please see https://issues.apache.org/jira/browse/CONNECTORS-1666 . I did not commit the patches as given because I felt that the fix was a relatively narrow one and it could be implemented with no user involvement. Adding control for the user was therefore beyond the scope of the repair.

Re: Another Elasticsearch patch to allow the long URI

2021-03-19 Thread Karl Wright
Thanks for the information. I'll see what I can do. Karl On Thu, Mar 18, 2021 at 7:23 PM Shirai Takashi/ 白井隆 wrote: > Hi, Karl. > > Karl Wright wrote: > >Hi - I'm still waiting for this patch to be attached to a ticket. That is > >the only way I believe we're allowed to accept it legally. >

Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Shirai Takashi/ 白井隆
Hi, Karl. Karl Wright wrote: >Hi - I'm still waiting for this patch to be attached to a ticket. That is >the only way I believe we're allowed to accept it legally. Are you asking me to attach the patch to the JIRA ticket? I can't access the JIRA because of our firewall. Sorry. What can I do without

Re: Another Elasticsearch patch to allow the long URI

2021-03-18 Thread Karl Wright
Hi - I'm still waiting for this patch to be attached to a ticket. That is the only way I believe we're allowed to accept it legally. Karl On Thu, Mar 4, 2021 at 7:16 PM Shirai Takashi/ 白井隆 wrote: > Hi, Karl. > > Karl Wright wrote: > >I agree it is unlikely that the JDK will lose support

Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Shirai Takashi/ 白井隆
Hi, Karl. Karl Wright wrote: >I agree it is unlikely that the JDK will lose support for SHA-1 because it >is used commonly, as is MD5. So please feel free to use it. I know. I think that SHA-1 is better on the whole. I don't mind if apache-manifoldcf-elastic-id-2.patch.gz is discarded.

Re: Another Elasticsearch patch to allow the long URI

2021-03-04 Thread Karl Wright
I agree it is unlikely that the JDK will lose support for SHA-1 because it is used commonly, as is MD5. So please feel free to use it. Karl On Wed, Mar 3, 2021 at 7:54 PM Shirai Takashi/ 白井隆 wrote: > Hi, Horn. > > Jörn Franke wrote: > >Makes sense > > I don't think that it's easy. > > > >>>

Re: Another Elasticsearch patch to allow the long URI

2021-03-03 Thread Shirai Takashi/ 白井隆
Hi, there. Shirai Takashi/ 白井隆 wrote: >I can use SHA-256 with Elasticsearch connector. I've prepared the patch to support SHA-256. It minimizes changes, to avoid global effects. Including the try-catch clause seems inelegant. I can't decide which is better. Nintendo, Co., Ltd.
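
To make the discussion concrete, the sketch below shows one way a long URI could be replaced by a SHA-256 hex digest when used as a document ID, including the try-catch that `MessageDigest.getInstance` forces on the caller. It is not the patch from this thread: the class name, the method names, and the rule "hash only when the URI exceeds the limit" are assumptions, as is the 512-byte figure used for the Elasticsearch ID limit.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the patch discussed in this thread.
final class DocumentIdSketch {
  // Assumed maximum ID size in bytes; check the limit for your Elasticsearch version.
  static final int MAX_ID_BYTES = 512;

  static String sha256Hex(String input) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b & 0xff)); // two lowercase hex digits per byte
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256, so this is not expected.
      throw new IllegalStateException("SHA-256 not available", e);
    }
  }

  // Keeps short URIs unchanged and substitutes the digest for over-long ones.
  static String toDocumentId(String uri) {
    int length = uri.getBytes(StandardCharsets.UTF_8).length;
    return length <= MAX_ID_BYTES ? uri : sha256Hex(uri);
  }
}
```

Keeping short URIs unchanged is one way to preserve existing IDs, which matters for the backwards-compatibility concern raised earlier in the thread; a SHA-256 hex digest is always 64 characters, so it fits comfortably under the assumed limit.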

Re: Another Elasticsearch patch to allow the long URI

2021-03-03 Thread Shirai Takashi/ 白井隆
Hi, Horn. Jörn Franke wrote: >Makes sense I don't think that it's easy. >>> Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it >>> will be removed from JDK. I also know SHA-1 is dangerous. Someone can generate the string which is hashed into the same SHA-1 to pretend

Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Shirai Takashi/ 白井隆
Hi, Karl. Karl Wright wrote: >Backwards compatibility means that we very likely have to >use the hash approach, and not use the decoding approach. Do you object to the decoding? It may be useless for the users with the alphabetical language. But it's useful for the users with the multibyte

Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Karl Wright
Hi - this is very helpful. I would like you to officially create a ticket in Jira: https://issues.apache.org/jira , project "CONNECTORS", and attach these patches. Backwards compatibility means that we very likely have to use the hash approach, and not use the decoding approach. Thanks, Karl

Re: Another Elasticsearch patch to allow the long URI

2021-03-02 Thread Jörn Franke
current ManifoldCF use SHA-1? > This case may have to use SHA-1 depending on the reason. > If the reason is only compatibility, > I can re-design the method ManifoldCF.hash(), > to add an argument that indicates the algorithm. > > > Nintendo, Co., Ltd. > Product Technology Dept. > Takashi SHIRAI > PHONE: +81-75-662-9600 > mailto:shi...@nintendo.co.jp

Re: Another Elasticsearch patch to allow the long URI

2021-03-01 Thread Shirai Takashi/ 白井隆
lass, the default algorithm is updated entirely. I've just followed the standard of ManifoldCF. I also think SHA-256 or later is better. Why does the current ManifoldCF use SHA-1? This case may have to use SHA-1 depending on the reason. If the reason is only compatibility, I can re-design the method

Re: Another Elasticsearch patch to allow the long URI

2021-03-01 Thread Jörn Franke
Maybe use SHA-256 or later. SHA-1 is obsolete and one never knows when it will be removed from the JDK. > On 02.03.2021 at 04:10, Shirai Takashi/ 白井隆 wrote: > > Hi, there. > > I've found another problem in the Elasticsearch connector. > The Elasticsearch output connector uses the URI string as the ID. >

Re: Patch contribution to support Ingest-Attachment for Elasticsearch

2021-02-25 Thread Shirai Takashi/ 白井隆
Hi, there. Shirai Takashi wrote: >ManifoldCF can use mapping-attachments plugin for Elasticsearch connector. >But it is obsolete, to recommend ingest-attachment plugin instead. >I try to support this plugin with the attached patch. Sorry, I have some mistake with this patch. Please replace it

Re: Multiprocess file installation of manifold

2021-02-17 Thread Karl Wright
File synchronization is still supported but is deprecated. We recommend zookeeper synchronization unless you have a very good reason not to. Karl On Wed, Feb 17, 2021 at 12:26 PM Ananth Peddinti wrote: > Hello Team , > > > I would like to know if someone has already done multi-process model

Re: Job Content Length issue

2021-02-17 Thread Karl Wright
The internal Tika is not memory bounded; some transformations stream, but others put everything into memory. You can try using the external tika, with a tika instance you run separately, and that would likely help. But you may need to give it lots of memory too. Karl On Wed, Feb 17, 2021 at

Re: Job Content Length issue

2021-02-17 Thread ritika jain
Hi Karl, I am using Elasticsearch as the output connector and yes, I am using the internal Tika extractor, not a Solr output connection. Also, the Elasticsearch server is hosted on a different server with a huge memory allocation. On Tue, Feb 16, 2021 at 7:29 PM Karl Wright wrote: > Hi, do you mean

Re: Job Content Length issue

2021-02-16 Thread Karl Wright
Hi, do you mean content limiter length of 100? I assume you are using the internal Tika transformer? Are you combining this with a Solr output connection that is not using the extract handler? By "manifold crashes" I assume you actually mean it runs out of memory. The "long running query"

Re: content length tab

2021-02-15 Thread Karl Wright
This parameter is in bytes. Karl On Mon, Feb 15, 2021 at 9:03 AM ritika jain wrote: > Hi Users, > > Can anybody tell me if this can be filled as bytes or kilobytes here. > > The "Content Length tab looks like this: > > > [image: Windows Share Job, Content Length tab] > > Values are to be

Re: Job status stuck in terminating

2021-01-07 Thread Karl Wright
Hi, Usually the reason a job doesn't complete is because a document is retrying indefinitely. You can see what's going on by looking at the Simple History job report, or, if you prefer, tailing the manifoldcf log. Other times a job won't complete because somebody shut down the agents process.

RE: Job status stuck in terminating

2021-01-07 Thread Isaac Kunz
I have a job that has been stuck in terminating for 12 hrs. It is a small test job and I am wondering if there is a way to fix this? The job ran once and completed 175k documents. I modified the query to the job and reseeded. The job was modified to process a smaller document set. I assume reseeding

Re: Indexation Not OK

2021-01-01 Thread Karl Wright
> > > > > -- > > Michael Cizmar > > > > *From:* ritika jain > *Sent:* Thursday, December 31, 2020 7:33 AM > *To:* user@manifoldcf.apache.org > *Subject:* Re: Indexation Not OK > > > > Elastic search output connector with some custom changes for s

RE: Indexation Not OK

2020-12-31 Thread Michael Cizmar
if that traffic is actually going to Elastic search. Karl – I believe Ritika said Elastic. -- Michael Cizmar From: ritika jain Sent: Thursday, December 31, 2020 7:33 AM To: user@manifoldcf.apache.org Subject: Re: Indexation Not OK Elastic search output connector with some custom changes for some fields

Re: Indexation Not OK

2020-12-31 Thread Karl Wright
Sorry, I couldn't quite understand everything in your email, but it sounds like the problem is in the ES connection. It is possible that ES expires your connection and the indexing fails after that happens. If that is happening, however, I would expect to see a much more detailed error message

Re: Indexation Not OK

2020-12-31 Thread ritika jain
Elastic search output connector with some custom changes for some fields On Thursday, December 31, 2020, Karl Wright wrote: > Hi, > Can you let us know what you are using for the output connector? > Thanks, > Karl > > > On Thu, Dec 31, 2020 at 8:24 AM ritika jain > wrote: > >> Hi, >> >> I am
