Re: How to set Tika with ManifoldCF and Solr

2018-10-12 Thread Bisonti Mario
Hello.
I downloaded ManifoldCF 2.11, compiled it from scratch, and used the internal 
Tika, but I get the same problem.


From: Karl Wright
Sent: Thursday, 11 October 2018 19:29
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I cannot reproduce your problem.  Perhaps you can download a new instance and 
configure it from scratch using the embedded tika?  If that works it should be 
possible to figure out what the difference is.

Karl

On Thu, Oct 11, 2018, 12:23 PM Bisonti Mario <mario.biso...@vimar.com> wrote:
I tried updating Solr, the Tika server and ManifoldCF to the latest versions.

I also tried adding another transformation before the Tika transformation to 
filter the allowed documents, as you suggested in another discussion, but 
nothing changed.
I always get the same Result Code: EXCLUDEDMIMETYPE


I read another discussion ( 
https://lists.apache.org/thread.html/66a3f9780bbcc98e404e25f5a0e56a8a6c007448642c3bc15a366ed2@%3Cuser.manifoldcf.apache.org%3E)
but I don’t understand whether they solved the issue

☹

Thanks a lot.
Mario






From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, 11 October 2018 14:57
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

When the "use extracting update handler" field is UNCHECKED, the mime types 
you list are IGNORED.  Only "text" mime types are accepted by the Solr 
connection in that case.  But that is exactly what the Tika extractor sends 
along, and many other people do this, and I can make it work fine here, so I 
don't know what you are doing wrong.

Karl
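
The acceptance rule described in this reply can be modelled in a few lines. The following is only a sketch under stated assumptions: the class and method names are hypothetical, the set of accepted text types is an assumption, and none of this is ManifoldCF's actual Solr connector code.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical model of the rule described in the reply above.
 * This is NOT the real ManifoldCF Solr connector implementation.
 */
final class SolrMimeGate {
    // Assumption: with the extracting update handler off, only plain text passes.
    private static final Set<String> TEXT_ONLY =
        Set.of("text/plain", "text/plain;charset=utf-8");

    private SolrMimeGate() {}

    /**
     * @param useExtractingHandler state of the "use extracting update handler" box
     * @param configuredTypes      mime types listed on the connection (lower case)
     * @param mimeType             mime type of the document reaching the connection
     */
    static boolean accepts(boolean useExtractingHandler,
                           Set<String> configuredTypes,
                           String mimeType) {
        // Normalise case and drop spaces such as "text/plain; charset=utf-8".
        String normalized = mimeType.toLowerCase(Locale.ROOT).replace(" ", "");
        if (useExtractingHandler) {
            // Box checked: the configured list decides.
            return configuredTypes.contains(normalized);
        }
        // Box unchecked: the configured list is ignored, only text is accepted.
        return TEXT_ONLY.contains(normalized);
    }
}
```

Under this model a spreadsheet mime type is refused whenever the box is unchecked, no matter what the mime type tab lists, which would be consistent with the rejection message quoted further down in the thread.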


On Thu, Oct 11, 2018 at 8:37 AM Bisonti Mario <mario.biso...@vimar.com> wrote:
This is my Solr output connection:

I tried setting content_type as the “Mime type field name:” but the result is 
always the same.

Could it be that, with the flag unchecked, ManifoldCF doesn’t use the mime types 
specified?

I am using a snapshot version of ManifoldCF from three months ago.




From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, 11 October 2018 14:20
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I confirmed that both the Tika Service transformer and the Tika transformer 
check the same exact mime type:

>>
  @Override
  public boolean checkMimeTypeIndexable(VersionContext pipelineDescription,
    String mimeType, IOutputCheckActivity checkActivity)
    throws ManifoldCFException, ServiceInterruption
  {
    // We should see what Tika will transform
    // MHL
    // Do a downstream check
    return checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }
<<

So: please verify that your Solr connection is set up correctly and the "use 
extracting update handler" box is UNCHECKED.

Thanks,
Karl
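
To illustrate how the quoted check is meant to behave in a pipeline, here is a small stand-alone sketch. All names in it are hypothetical stand-ins (the `Downstream` interface only imitates the role of `IOutputCheckActivity`), and the text-only output stage is an assumption based on the description in this thread, not real connector code.

```java
/** Hypothetical stand-in for a ManifoldCF pipeline; names are illustrative only. */
final class PipelineCheckSketch {
    /** Simplified counterpart of IOutputCheckActivity: asks the next stage. */
    interface Downstream {
        boolean checkMimeTypeIndexable(String mimeType);
    }

    /** The type the Tika stage announces downstream, as in the quoted method. */
    static final String EXTRACTED_TYPE = "text/plain;charset=utf-8";

    private PipelineCheckSketch() {}

    /** Assumed output stage that only accepts plain text. */
    static Downstream textOnlyOutput() {
        return mimeType -> mimeType.startsWith("text/plain");
    }

    /**
     * Tika-like stage: it ignores the incoming mime type and asks the next
     * stage whether extracted plain text would be indexable.
     */
    static Downstream tikaStage(Downstream next) {
        return incoming -> next.checkMimeTypeIndexable(EXTRACTED_TYPE);
    }
}
```

In this model, asking the output stage directly about a spreadsheet type returns false, while asking through `tikaStage(...)` returns true. If a job reports a mime type rejection for the original binary type, one thing worth checking is whether the output is really being reached through the Tika stage.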


On Thu, Oct 11, 2018 at 8:16 AM Karl Wright <daddy...@gmail.com> wrote:
When you uncheck the "use extracting update handler" checkbox, the Solr 
connection only accepts text/plain, and no binary formats.  The Tika extractor, 
though, should set the mime type always to "text/plain".  Since the Simple 
History says otherwise, I wonder if there's a problem with the external Tika 
extractor.  Perhaps you can try the internal one to get your pipeline working 
first?  If the external one does not send the right mime type, then we need to 
correct that so you should open a ticket.

Thanks,
Karl


On Thu, Oct 11, 2018 at 8:10 AM Bisonti Mario <mario.biso...@vimar.com> wrote:
Now the document isn’t ingested by Solr because I get:

Solr connector rejected document due to mime type restrictions: 
(application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)


But the mime type is on the tab


And the settings worked well when I used Tika inside solr.

Could you help me?
Thanks

From: Bisonti Mario <mario.biso...@vimar.com>
Sent: Thursday, 11 October 2018 14:03
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr


My mistake…
As you wrote, I had to uncheck “use extracting update handler”.

Now I have to understand the field mentioned in the schema, etc.

From: Bisonti Mario <mario.biso...@vimar.com>
Sent: Thursday, 11 October 2018 13:45
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I see the job processed but without the document inside.
10-11-2018 13:32:25.649   job end     1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
10-11-2018 13:32:14.211   job start


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
I cannot reproduce your problem.  Perhaps you can download a new instance
and configure it from scratch using the embedded tika?  If that works it
should be possible to figure out what the difference is.

Karl

Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
When the "use extracting update handler" field is UNCHECKED, the mime types
you list are IGNORED.  Only "text" mime types are accepted by the Solr
connection in that case.  But that is exactly what the Tika extractor sends
along, and many other people do this, and I can make it work fine here, so I
don't know what you are doing wrong.

Karl


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario
This is my Solr output connection:

I tried setting content_type as the “Mime type field name:” but the result is 
always the same.

Could it be that, with the flag unchecked, ManifoldCF doesn’t use the mime types 
specified?

I am using a snapshot version of ManifoldCF from three months ago.




Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
I confirmed that both the Tika Service transformer and the Tika transformer
check the same exact mime type:

>>
  @Override
  public boolean checkMimeTypeIndexable(VersionContext pipelineDescription,
    String mimeType, IOutputCheckActivity checkActivity)
    throws ManifoldCFException, ServiceInterruption
  {
    // We should see what Tika will transform
    // MHL
    // Do a downstream check
    return checkActivity.checkMimeTypeIndexable("text/plain;charset=utf-8");
  }
<<

So: please verify that your Solr connection is set up correctly and the
"use extracting update handler" box is UNCHECKED.

Thanks,
Karl


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario
Now the document isn’t ingested by solr because I obtain:


Solr connector rejected document due to mime type restrictions: 
(application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)


But the mime type is on the tab


And the settings worked well when I used Tika inside solr.

Could you help me?
Thanks

From: Bisonti Mario
Sent: Thursday, 11 October 2018 14:03
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr


My mistake…
As you wrote, I had to uncheck “use extracting update handler”.

Now I have to understand the field mentioned in the schema, etc.

From: Bisonti Mario <mario.biso...@vimar.com>
Sent: Thursday, 11 October 2018 13:45
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

I see the job processed but without the document inside.
10-11-2018 13:32:25.649   job end     1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
10-11-2018 13:32:14.211   job start   1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1

Do I have to uncheck “Use the Extract Update Handler” on my Solr output 
connection?





From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, 11 October 2018 13:36
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Please have a look at your "Simple History" report to see why the documents 
aren't getting indexed.

Thanks,
Karl


On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario <mario.biso...@vimar.com> wrote:
Thanks Karl.
I tried, but it doesn’t index the documents.
It seems that it doesn’t see them?

Perhaps the problem is the “Ignore Tika exception” setting, which I don’t know 
where to set in ManifoldCF?





From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, 11 October 2018 12:24
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Hi Mario,

(1) When you use the Tika server externally, you do not get the boilerpipe HTML 
extractor available for configuration and use.  That is because it's external 
now.
(2) In your Solr connection, you want to uncheck the box that says "use 
extracting update handler", and you want to change the output handler from 
"/update/extract" to just "/update".

Karl
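
As a rough illustration of point (2), the sketch below builds (without sending) the kind of request the plain "/update" handler expects once text has already been extracted upstream. The base URL, the `content` field name, and the class and method names are assumptions for illustration only; the core name `core_share` is taken from the log line in this thread. ManifoldCF builds its own requests internally, so this is not its implementation.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of ManifoldCF or Solr. */
final class SolrUpdateRequestSketch {
    private SolrUpdateRequestSketch() {}

    /** Targets ".../<core>/update" rather than ".../<core>/update/extract". */
    static URI updateUri(String solrBase, String core) {
        String base = solrBase.endsWith("/")
            ? solrBase.substring(0, solrBase.length() - 1)
            : solrBase;
        return URI.create(base + "/" + core + "/update?commit=true");
    }

    /** Builds a JSON add request carrying already-extracted text. */
    static HttpRequest plainTextDocument(URI uri, String id, String extractedText) {
        // "content" is an assumed field name; the real one depends on the schema.
        String json = "[{\"id\":\"" + escape(id) + "\",\"content\":\""
            + escape(extractedText) + "\"}]";
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
            .build();
    }

    /** Minimal JSON string escaping for quotes, backslashes and control characters. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The point of the contrast is that "/update" receives fields and text that are already extracted, whereas "/update/extract" expects the raw binary and runs Tika inside Solr.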


On Thu, Oct 11, 2018 at 4:45 AM Bisonti Mario <mario.biso...@vimar.com> wrote:
Hello.
I would like to use a Tika server started from the command line with ManifoldCF, 
so that ManifoldCF, through a Transformation connector, processes documents with 
Tika and indexes them to the Solr output connector.

I started the Tika server:
java -jar /opt/tika/tika-server-1.19.1.jar

Then I created a transformation connection with Tika server: localhost and 
Tika port 998, and the connection works.

Then I created a job and, in the Connection tab, inserted the Transformation 
I had just created before the Solr output.



Note that I don’t see the “Exception” and “Boilerplate” tabs.
Why is this?

Furthermore, if I start the job, I see that Solr hangs with an exception:
2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share] o.e.j.s.HttpChannel /solr/core_share/update/extract
java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
    at java.lang.Class.forName0(Native Method) ~[?:?]
    at java.lang.Class.forName(Class.java:374) ~[?:?]

In fact, I renamed the Tika .jar in the folder solr/contrib/extraction/lib to be 
sure that Solr doesn’t use Tika, because I would like ManifoldCF to use Tika, 
but it doesn’t work.

I suppose I have to configure Solr not to use Tika.

How do I do this?

I see 
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
but I don’t have Datafari, so, in a standard Solr configuration, how can I 
deactivate Tika?

Thanks a lot

Mario



Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario

My mistake…
As you wrote, I had to uncheck “use extracting update handler”.

Now I have to understand the field mentioned in the schema, etc.




Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario
I see the job processed but without the document inside.
10-11-2018 13:32:25.649   job end     1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1
10-11-2018 13:32:14.211   job start   1539153700219(G_IT_Area_condivisa_Mario_XLSM)   0   1

Do I have to uncheck “Use the Extract Update Handler” on my Solr output 
connection?

[cid:image002.jpg@01D46168.9A8CAA70]





From: Karl Wright 
Sent: Thursday, 11 October 2018 13:36
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Please have a look at your "Simple History" report to see why the documents 
aren't getting indexed.

Thanks,
Karl





Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Karl Wright
Please have a look at your "Simple History" report to see why the documents
aren't getting indexed.

Thanks,
Karl


On Thu, Oct 11, 2018 at 7:10 AM Bisonti Mario 
wrote:

> Thanks, Karl.
>
> I tried, but it doesn’t index the documents.
>
> It seems that it doesn’t see them.
>
> Perhaps the problem is the “Ignore Tika exception” option, which I don’t
> know where to set in ManifoldCF?


Re: How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario
Thanks, Karl.
I tried, but it doesn’t index the documents.
It seems that it doesn’t see them.

Perhaps the problem is the “Ignore Tika exception” option, which I don’t know 
where to set in ManifoldCF?





From: Karl Wright 
Sent: Thursday, 11 October 2018 12:24
To: user@manifoldcf.apache.org
Subject: Re: How to set Tika with ManifoldCF and Solr

Hi Mario,

(1) When you use the Tika server externally, you do not get the boilerpipe HTML 
extractor available for configuration and use.  That is because it's external 
now.
(2) In your Solr connection, you want to uncheck the box that says "use 
extracting update handler", and you want to change the output handler from 
"/update/extract" to just "/update".

Karl
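
To illustrate point (2), here is a minimal sketch of what an update to the 
plain "/update" handler can look like once the text has already been extracted 
upstream. It is not how ManifoldCF itself talks to Solr. The base URL 
(localhost:8983), the "content" field name, and the helper name 
build_solr_update are assumptions for illustration; the core name core_share 
is taken from the log in the original message. The request is only built, not 
sent.

```python
import json
import urllib.request


def build_solr_update(core: str, doc_id: str, text: str,
                      solr_base: str = "http://localhost:8983/solr") -> urllib.request.Request:
    """Build (but do not send) a JSON update for Solr's plain /update handler.

    The body carries already-extracted text, so Solr does not need Tika
    (the /update/extract handler) to process it.
    """
    # "id" and "content" are hypothetical field names; use the ones in your schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        f"{solr_base}/{core}/update?commit=true",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_solr_update("core_share", "doc1", "text extracted by Tika")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=30)
```

The point of the sketch is the contrast: "/update/extract" expects a raw 
binary document and runs Tika inside Solr, while "/update" expects documents 
whose fields are already text.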





How to set Tika with ManifoldCF and Solr

2018-10-11 Thread Bisonti Mario
Hello.
I would like to use a Tika server, started from the command line, with 
ManifoldCF, so that ManifoldCF processes documents with Tika through a 
transformation connector and then indexes them through the Solr output 
connector.

I started the Tika server:
java -jar /opt/tika/tika-server-1.19.1.jar

Then I created a transformation connection with Tika server host localhost and 
Tika port 998, and the connection works.
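
A quick way to check the Tika server independently of ManifoldCF is to send it 
a file directly. The sketch below is a minimal example under assumptions: it 
uses port 9998, which is the Tika server default when no port option is given 
(the message above says 998), and the helper names build_tika_request and 
extract_text are hypothetical. It relies on the Tika server "/tika" resource, 
which accepts a PUT with the document as the body and returns the extracted 
text.

```python
import urllib.request

# Assumption: the Tika server default port is 9998; change this to whatever
# port the ManifoldCF transformation connection is actually configured with.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(payload: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        tika_url,
        data=payload,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, tika_url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), tika_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

If extract_text returns readable text for a sample document, the Tika server 
side is working and any remaining problem is more likely in the job or the 
Solr output connection.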

Next, I created a job and, in the Connection tab, inserted the transformation 
I had just created before the Solr output.

[attached screenshot: image003.png]

Note that I don’t see the “Exception” and “Boilerplate” tabs.
Why is that?

Furthermore, when I start the job, I see that Solr hangs with this exception:
2018-10-11 10:03:47.268 WARN  (qtp1223240796-17) [   x:core_share] 
o.e.j.s.HttpChannel /solr/core_share/update/extract
java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
at java.lang.Class.forName0(Native Method) ~[?:?]
at java.lang.Class.forName(Class.java:374) ~[?:?]

In fact, I renamed the Tika .jar in the folder solr/contrib/extraction/lib to 
be sure that Solr doesn’t use Tika, because I would like ManifoldCF to use 
Tika, but it doesn’t work.

I suppose I have to configure Solr not to use Tika.

How do I do this?

I found 
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/107708451/Data+Extraction+Tika+Embedded+in+Solr+Deactivation+Configuration
but I don’t have Datafari. So, in a standard Solr configuration, how can I 
deactivate Tika?

Thanks a lot

Mario