solr-injection

2020-02-11 Thread Martin Frank Hansen (MHQ)
Hi,

I was wondering how others are handling Solr injection in their solutions?

After reading this post: 
https://www.waratek.com/apache-solr-injection-vulnerability-customer-alert/ I 
can see how important it is to update to Solr 8.2 or higher.

Has anyone been successful in injecting unintended queries into Solr? I have 
tried to delete the database from the front end, using basic search strings and 
Solr commands, but I have not yet been successful (which is good). Many of you 
know much more about this than I do, so it would be nice to hear from someone 
with more experience.

Which considerations do I need to look at in order to secure my Solr core? 
Currently we have a security layer on top of Solr, but at the same time we do 
not want to restrict the flexibility of the searches too much.

Best regards

Martin
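One common first line of defense is to escape Lucene/Solr query metacharacters in any user-supplied text before it is embedded in the q parameter. Below is a minimal, hypothetical sketch in Python: the collection URL and the field name `title` are placeholders, and the character set roughly mirrors what SolrJ's ClientUtils.escapeQueryChars handles. It is a sketch of one mitigation, not a substitute for authentication or for restricting which request handlers are reachable.

```python
import re
from urllib.parse import urlencode

# Characters with special meaning in the Lucene/Solr query syntax.
# (Roughly the set handled by SolrJ's ClientUtils.escapeQueryChars.)
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/|&;])')

def escape_solr_query(user_input: str) -> str:
    """Backslash-escape query metacharacters in raw user input."""
    return _SPECIAL.sub(r'\\\1', user_input)

def build_select_url(base_url: str, user_input: str) -> str:
    # The escaped text is only ever used as the *value* of the q
    # parameter -- never concatenated into paths or local params.
    params = {'q': 'title:' + escape_solr_query(user_input), 'rows': 10}
    return base_url + '/select?' + urlencode(params)

# A crafted input that tries to break out of the field query:
url = build_select_url('http://localhost:8983/solr/mytest',
                       'rotte) OR id:[* TO *]')
```

Escaping alone does not protect admin-style endpoints; limiting which handlers and parameters the front end can reach (and upgrading, as the linked advisory notes) still matters.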


Internal - KMD A/S

Protection of your personal data is important to us. Here you can read KMD’s 
Privacy Policy outlining how we process your 
personal data.

Please note that this message may contain confidential information. If you have 
received this message by mistake, please inform the sender of the mistake by 
sending a reply, then delete the message from your system without making, 
distributing or retaining any copies of it. Although we believe that the 
message and any attachments are free from viruses and other errors that might 
affect the computer or it-system where it is received and read, the recipient 
opens the message at his or her own risk. We assume no responsibility for any 
loss or damage arising from the receipt or use of this message.


RE: highlighting not working as expected

2019-07-01 Thread Martin Frank Hansen (MHQ)
Hi Edwin,

Thanks for your explanation, makes sense now.

Best regards

Martin



-Original Message-
From: Zheng Lin Edwin Yeo 
Sent: 30 June 2019 01:57
To: solr-user@lucene.apache.org
Subject: Re: highlighting not working as expected

Hi,

If you are using the type "string", it will require an exact match, including 
spaces and upper/lower case.

You can use the type "text" for a start, but further down the road it will be 
good to have your own custom fieldType with your own tokenizer and filters.

Regards,
Edwin
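For reference, a custom fieldType along the lines Edwin suggests could look like the sketch below. The names and the tokenizer/filter chain are hypothetical, a typical starting point rather than the thread's actual schema:

```xml
<!-- Hypothetical: a tokenized, lowercased type, so matching and
     highlighting are no longer exact-match as with "string". -->
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="Sagstitel" type="text_custom" indexed="true" stored="true"/>
```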

On Tue, 25 Jun 2019 at 14:52, Martin Frank Hansen (MHQ)  wrote:

> Hi again,
>
> I have tested a bit and I was wondering if the highlighter requires a
> field to be of type "text"? Whenever I try highlighting on fields
> which are of type "string" nothing gets returned.
>
> Best regards
>
> Martin

RE: highlighting not working as expected

2019-06-25 Thread Martin Frank Hansen (MHQ)
Hi again,

I have tested a bit and I was wondering if the highlighter requires a field to 
be of type "text"? Whenever I try highlighting on fields which are of type 
"string" nothing gets returned.

Best regards

Martin



-Original Message-
From: Jörn Franke 
Sent: 11 June 2019 08:45
To: solr-user@lucene.apache.org
Subject: Re: highlighting not working as expected

Could it be a stop word ? What is the exact type definition of those fields? 
Could this word be omitted or with wrong encoding during loading of the 
documents?



RE: highlighting not working as expected

2019-06-17 Thread Martin Frank Hansen (MHQ)
Hi Edwin,

Yes the field is defined just like the other fields:



BR
Martin



-Original Message-
From: Zheng Lin Edwin Yeo 
Sent: 4 June 2019 10:32
To: solr-user@lucene.apache.org
Subject: Re: highlighting not working as expected

Hi Martin,

What fieldType are you using for the field “Sagstitel”? Is it the same as other 
fields?

Regards,
Edwin



RE: highlighting not working as expected

2019-06-17 Thread Martin Frank Hansen (MHQ)
Hi Jörn,

Thanks for your input!

I do not use stop words, so that should not be the issue. The encoding of the 
documents might be an issue, as they come in many different file formats. I 
will, however, need to test this.

The field is defined as below:



BR

Martin



-Original Message-
From: Jörn Franke 
Sent: 11 June 2019 08:45
To: solr-user@lucene.apache.org
Subject: Re: highlighting not working as expected

Could it be a stop word ? What is the exact type definition of those fields? 
Could this word be omitted or with wrong encoding during loading of the 
documents?



RE: highlighting not working as expected

2019-06-11 Thread Martin Frank Hansen (MHQ)
Hi David,

Thanks for your response, and sorry for my late reply.

Still the same result when using hl.method=unified.

Best regards
Martin



-Original Message-
From: David Smiley 
Sent: 10 June 2019 16:48
To: solr-user 
Subject: Re: highlighting not working as expected

Please try hl.method=unified and tell us if that helps.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
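For anyone trying David's suggestion: the unified highlighter is selected purely by request parameter. A sketch of the thread's request with hl.method added (collection and field names taken from the example URL earlier in the thread):

```python
from urllib.parse import urlencode

# Build the same highlight request as earlier in the thread,
# but selecting the unified highlighter explicitly.
params = urlencode({
    'q': 'rotte',
    'hl': 'on',
    'hl.method': 'unified',
    'hl.fl': 'Sagstitel',
    'fl': 'id,Sagstitel',
})
url = 'http://localhost:8983/solr/mytest/select?' + params
```

If the field is still a plain "string" type, no highlighter choice will help; the analysis chain has to produce matching tokens first.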




highlighting not working as expected

2019-06-03 Thread Martin Frank Hansen (MHQ)
Hi,

I am having some difficulties making highlighting work. For some reason the 
highlighting feature only works on some fields but not on others, even though 
those fields are stored.

An example of a request looks like this: 
http://localhost/solr/mytest/select?fl=id,doc.Type,Journalnummer,Sagstitel&hl.fl=Sagstitel&hl.simple.post=%3C/b%3E&hl.simple.pre=%3Cb%3E&hl=on&q=rotte

It simply returns an empty set for all documents, even though I can see several 
documents whose “Sagstitel” contains the word “rotte” (rotte = rat). What am I 
missing here?

I am using the standard highlighter as below.

<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting>
    <fragmenter name="gap"
                default="true"
                class="solr.highlight.GapFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>

    <fragmenter name="regex"
                class="solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">70</int>
        <float name="hl.regex.slop">0.5</float>
        <str name="hl.regex.pattern">[-\w ,/\n&quot;&apos;]{20,200}</str>
      </lst>
    </fragmenter>

    <formatter name="html"
               default="true"
               class="solr.highlight.HtmlFormatter">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<b>]]></str>
        <str name="hl.simple.post"><![CDATA[</b>]]></str>
      </lst>
    </formatter>

    <encoder name="html"
             class="solr.highlight.HtmlEncoder" />

    <fragListBuilder name="simple"
                     class="solr.highlight.SimpleFragListBuilder"/>

    <fragListBuilder name="single"
                     class="solr.highlight.SingleFragListBuilder"/>

    <fragListBuilder name="weighted"
                     default="true"
                     class="solr.highlight.WeightedFragListBuilder"/>

    <fragmentsBuilder name="default"
                      default="true"
                      class="solr.highlight.ScoreOrderFragmentsBuilder">
    </fragmentsBuilder>

    <fragmentsBuilder name="colored"
                      class="solr.highlight.ScoreOrderFragmentsBuilder">
      <lst name="defaults"/>
    </fragmentsBuilder>

    <boundaryScanner name="default"
                     default="true"
                     class="solr.highlight.SimpleBoundaryScanner">
      <lst name="defaults">
        <str name="hl.bs.maxScan">10</str>
        <str name="hl.bs.chars">.,!? </str>
      </lst>
    </boundaryScanner>

    <boundaryScanner name="breakIterator"
                     class="solr.highlight.BreakIteratorBoundaryScanner">
      <lst name="defaults">
        <str name="hl.bs.type">WORD</str>
        <str name="hl.bs.language">da</str>
      </lst>
    </boundaryScanner>
  </highlighting>
</searchComponent>

Hope that someone can help, thanks in advance.

Best regards
Martin





RE: highlighter, stored documents and performance

2019-03-21 Thread Martin Frank Hansen (MHQ)
Hi Jörn,

Thanks for your answer.

Unfortunately, there is no summary included in the documents, and I would like 
highlighting to work for all documents.

Best regards

Martin



-Original Message-
From: Jörn Franke 
Sent: 21 March 2019 17:11
To: solr-user@lucene.apache.org
Subject: Re: highlighter, stored documents and performance

I don’t think so - to highlight any possible query you need the full document.

You could optimize it by only storing a subset of the document and highlight 
only in this subset.

Alternatively you can store a summary and show only the summary without 
highlighting.



highlighter, stored documents and performance

2019-03-21 Thread Martin Frank Hansen (MHQ)
Hi,

I am wondering how highlighting in Solr performs when the number of documents 
gets large?

Right now we have about 1 TB of data in all sorts of file types, and I was 
wondering how storing these documents within Solr (for highlighting purposes) 
will affect performance?

Is it possible to use highlighting without storing the documents?

Best regards

Martin






RE: Update handler and atomic update

2019-03-19 Thread Martin Frank Hansen (MHQ)
Hi Thierry,

Thanks for your help. I think I will try to make my own handler instead.

Best regards

Martin



-Original Message-
From: THIERRY BOUCHENY 
Sent: 19 March 2019 10:38
To: solr-user@lucene.apache.org
Subject: Re: Update handler and atomic update

Hi Martin,

After answering your email I read that you don't want to use curl; that might 
be a problem. I might be wrong, but I don't think you can make an atomic update 
with a GET request having the params in the URL. I think you need to make a 
POST request and embed [{"id":"docid","clicks":{"inc":"1"}}] in the raw body, 
hence using curl or any other app that allows this, like Postman.

Best regards

Thierry
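Thierry's suggestion can be sketched with only the Python standard library. The collection name mytest is a placeholder, since the thread's URL elides it:

```python
import json
from urllib import request

# Atomic update: increment the "clicks" field of one document by 1.
payload = [{'id': 'docid', 'clicks': {'inc': 1}}]
body = json.dumps(payload).encode('utf-8')

req = request.Request(
    'http://localhost:8983/solr/mytest/update?commit=true',
    data=body,  # supplying a body makes urllib send a POST
    headers={'Content-Type': 'application/json'},
)
# request.urlopen(req)  # run against a live Solr instance
```

Any HTTP client that can send a JSON body works the same way; putting the JSON in the URL of a GET is simply ignored, which matches the behaviour seen in this thread.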

> On 19 Mar 2019, at 08:59, Martin Frank Hansen (MHQ)  wrote:
>
> Hi Thierry,
>
> Do you mean something like this?
>
> http://localhost:8983/solr/.../update? 
> [{"id":"docid","clicks":{"inc":"1"}}]commit=true
>
> I do not get an error, but it does not increase the value of clicks 
> (unfortunately).
>
> Best regards
>
> Martin
>
>
>
>
> -Original Message-
> From: THIERRY BOUCHENY 
> Sent: 19. marts 2019 09:51
> To: solr-user@lucene.apache.org
> Subject: Re: Update handler and atomic update
>
> Hi Martin,
>
> Have you tried doing a POST with some JSON or XML Body.
>
> I would POST some json like the following
>
> [{"id":"docid","clicks":{"inc":"1"}}]
>
> In an /update?commit=true
>
> Best regards
>
> Thierry
>
> See documentation here 
> https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html
>
>> On 19 Mar 2019, at 08:14, Martin Frank Hansen (MHQ)  wrote:
>>
>> Hi,
>>
>> Hope someone can help me, I am trying to make an incremental update for one 
>> document using the API, but cannot make it work. I have tried a lot of 
>> things and all I actually want is to increment the value of the field 
>> “clicks” by one.
>>
>> I have something like this:
>> http://localhost:8983/solr/.../update?id:docid:clicks=1&commit=true
>>
>> in the schema.xml the field looks like this:
>>
>> > multiValued="false" docValues="true"/>
>>
>> Please note that I do not wish to use curl for this operation.
>>
>> Thanks in advance.
>>
>> Best regards
>>
>> Martin
>>
>>
>>
>>


RE: Update handler and atomic update

2019-03-19 Thread Martin Frank Hansen (MHQ)
Hi Thierry,

Do you mean something like this?

http://localhost:8983/solr/.../update?
[{"id":"docid","clicks":{"inc":"1"}}]commit=true

I do not get an error, but it does not increase the value of clicks 
(unfortunately).
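For reference, the atomic-update document has to travel as an HTTP POST body, not as part of the URL, which is why the GET-style attempts in this thread have no effect. A minimal sketch in Python (the core name and document id are placeholders, not taken from the thread):

```python
import json

# Atomic-update body: increment the "clicks" field of document "docid" by 1.
# Solr applies the {"inc": ...} modifier to the stored value; all other
# fields on the document are left intact.
payload = [{"id": "docid", "clicks": {"inc": 1}}]
body = json.dumps(payload)
print(body)

# This body would be POSTed with Content-Type: application/json to
#   http://localhost:8983/solr/<core>/update?commit=true
# rather than being encoded into the URL query string.
```

The `{"inc": 1}` modifier tells Solr to add 1 to the existing value instead of replacing the whole document.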

Best regards

Martin


Internal - KMD A/S

-Original Message-
From: THIERRY BOUCHENY 
Sent: 19. marts 2019 09:51
To: solr-user@lucene.apache.org
Subject: Re: Update handler and atomic update

Hi Martin,

Have you tried doing a POST with a JSON or XML body?

I would POST some JSON like the following

[{"id":"docid","clicks":{"inc":"1"}}]

In an /update?commit=true request.

Best regards

Thierry

See documentation here 
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html

> On 19 Mar 2019, at 08:14, Martin Frank Hansen (MHQ)  wrote:
>
> Hi,
>
> Hope someone can help me. I am trying to make an incremental update for one 
> document using the API, but I cannot make it work. I have tried a lot of 
> things, and all I actually want is to increment the value of the field 
> "clicks" by one.
>
> I have something like this:
> http://localhost:8983/solr/.../update?id:docid:clicks=1&commit=true
>
> in the schema.xml the field looks like this:
>
> <field name="clicks" ... multiValued="false" docValues="true"/>
>
> Please note that I do not wish to use curl for this operation.
>
> Thanks in advance.
>
> Best regards
>
> Martin
>
>
> Internal - KMD A/S
>


Update handler and atomic update

2019-03-19 Thread Martin Frank Hansen (MHQ)
Hi,

Hope someone can help me. I am trying to make an incremental update for one 
document using the API, but I cannot make it work. I have tried a lot of things, 
and all I actually want is to increment the value of the field "clicks" by one.

I have something like this:
http://localhost:8983/solr/.../update?id:docid:clicks=1&commit=true

in the schema.xml the field looks like this:



Please note that I do not wish to use curl for this operation.

Thanks in advance.

Best regards

Martin


Internal - KMD A/S



RE: MLT and facetting

2019-03-01 Thread Martin Frank Hansen (MHQ)
Hi Walter, 

Thanks for your answer, it makes sense. 

Best regards
Martin


Internal - KMD A/S

-Original Message-
From: Walter Underwood  
Sent: 1. marts 2019 03:30
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

The last time I looked, the MLT was a search handler but not a search 
component. It wasn’t able to be combined with other features. The handler is 
based on very old code, like 1.3.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 28, 2019, at 5:47 PM, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Martin,
> 
> I have no idea on this, as the case has not been active for almost 2 years.
> Maybe I can try to follow up.
> 
> Faceting by default will show the list according to the number of 
> occurrences. But I'm not sure how it will affect the MLT score or how 
> it will be output when combine together, as it is not working 
> currently and there is no way to test.
> 
> Regards,
> Edwin
> 
> On Thu, 28 Feb 2019 at 14:51, Martin Frank Hansen (MHQ)  wrote:
> 
>> Hi Edwin,
>> 
>> Ok that is nice to know. Do you know when this bug will get fixed?
>> 
>> By ordering I mean that MLT score the documents according to its 
>> similarity function (believe it is cosine similarity), and I don’t 
>> know how faceting will affect this score? Or ignore it all together?
>> 
>> Best regards
>> 
>> Martin
>> 
>> 
>> Internal - KMD A/S
>> 
>> -Original Message-
>> From: Zheng Lin Edwin Yeo 
>> Sent: 28. februar 2019 06:19
>> To: solr-user@lucene.apache.org
>> Subject: Re: MLT and facetting
>> 
>> Hi Martin,
>> 
>> According to the JIRA, it says it is a bug, as it was working 
>> previously in Solr 4. I have not tried Solr 4 before, so I'm not sure how it 
>> works.
>> 
>> For the ordering of the documents, do you mean to sort them according 
>> to the criteria that you want?
>> 
>> Regards,
>> Edwin
>> 
>> On Wed, 27 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
>> wrote:
>> 
>>> Hi Edwin,
>>> 
>>> Thanks for your response. Are you sure it is a bug? Or is it not 
>>> meant to work together?
>>> After doing some thinking I do see a problem faceting a MLT-result.
>>> MLT-results have a clear ordering of the documents which will be 
>>> hard to maintain with facets. How will faceting MLT-results deal 
>>> with the ordering of the documents? Will the ordering just be ignored?
>>> 
>>> Best regards
>>> 
>>> Martin
>>> 
>>> 
>>> 
>>> Internal - KMD A/S
>>> 
>>> -Original Message-
>>> From: Zheng Lin Edwin Yeo 
>>> Sent: 27. februar 2019 03:38
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: MLT and facetting
>>> 
>>> Hi Martin,
>>> 
>>> I also get the same problem in Solr 7.7 if I turn on faceting in 
>>> /mlt requestHandler.
>>> 
>>> Found this issue in the JIRA:
>>> https://issues.apache.org/jira/browse/SOLR-7883
>>> Seems like it is a bug in Solr and it has not been resolved yet.
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ) 
>>> wrote:
>>> 
>>>> Hi Edwin,
>>>> 
>>>> Here it is:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -
>>>> 
>>>> 
>>>> -
>>>> 
>>>> text
>>>> 
>>>> 1
>>>> 
>>>> 1
>>>> 
>>>> true
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Internal - KMD A/S
>>>> 
>>>> -Original Message-
>>>> From: Zheng Lin Edwin Yeo 
>>>> Sent: 26. februar 2019 08:24
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: MLT and facetting
>>>> 
>>>> Hi Martin,
>>>> 
>>>> What is your setting in your /mlt requestHandler in solrconfig.xml?
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
>>>> 
>>>> wrote:
>>>> 
>>>>> Hi Edwin,
>>>>> 
>>>>> Thanks for your response.
>>>>> 
>>>>> Yes you are right. It was simply the search parameters from Solr.
>>>&g

RE: MLT and facetting

2019-03-01 Thread Martin Frank Hansen (MHQ)
Hi Dave, 

The problem is that we have different levels of metadata and documents. 
The documents are arranged such that we have a case for which there are 
multiple documents (files). When we use the mlt function, we do it at file 
level, but the results need to be displayed at case level, which means that we 
need to group files that are connected to the same case. 

Hope this makes sense. 
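The file-to-case rollup described above can be done client-side once the MLT hits are in hand: group the file-level hits by their case identifier and rank the cases by their best file score. A sketch with hypothetical data (the `Journalnummer` field name appears later in the thread; the ids and scores here are made up):

```python
from collections import defaultdict

# File-level MLT hits; in practice these come from the /mlt response.
hits = [
    {"id": "6512816", "Journalnummer": "00759", "score": 3.2},
    {"id": "6834653", "Journalnummer": "00759", "score": 2.9},
    {"id": "6202373", "Journalnummer": "00739", "score": 2.1},
]

# Group file-level hits by case, then rank cases by their best file score
# so the case-level list keeps a similarity ordering.
cases = defaultdict(list)
for hit in hits:
    cases[hit["Journalnummer"]].append(hit)

ranked_cases = sorted(
    cases.items(), key=lambda kv: max(h["score"] for h in kv[1]), reverse=True
)
for case, files in ranked_cases:
    print(case, [h["id"] for h in files])
```

Ranking by the maximum file score is just one choice; summing or averaging the scores per case would also preserve a similarity ordering.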


Internal - KMD A/S

-Original Message-
From: Dave  
Sent: 1. marts 2019 02:51
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

I’m more curious what you’d expect to see, and what possible benefit you could 
get from it

> On Feb 28, 2019, at 8:48 PM, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Martin,
> 
> I have no idea on this, as the case has not been active for almost 2 years.
> Maybe I can try to follow up.
> 
> Faceting by default will show the list according to the number of 
> occurrences. But I'm not sure how it will affect the MLT score or how 
> it will be output when combine together, as it is not working 
> currently and there is no way to test.
> 
> Regards,
> Edwin
> 
>> On Thu, 28 Feb 2019 at 14:51, Martin Frank Hansen (MHQ)  wrote:
>> 
>> Hi Edwin,
>> 
>> Ok that is nice to know. Do you know when this bug will get fixed?
>> 
>> By ordering I mean that MLT score the documents according to its 
>> similarity function (believe it is cosine similarity), and I don’t 
>> know how faceting will affect this score? Or ignore it all together?
>> 
>> Best regards
>> 
>> Martin
>> 
>> 
>> Internal - KMD A/S
>> 
>> -Original Message-
>> From: Zheng Lin Edwin Yeo 
>> Sent: 28. februar 2019 06:19
>> To: solr-user@lucene.apache.org
>> Subject: Re: MLT and facetting
>> 
>> Hi Martin,
>> 
>> According to the JIRA, it says it is a bug, as it was working 
>> previously in Solr 4. I have not tried Solr 4 before, so I'm not sure how it 
>> works.
>> 
>> For the ordering of the documents, do you mean to sort them according 
>> to the criteria that you want?
>> 
>> Regards,
>> Edwin
>> 
>> On Wed, 27 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
>> wrote:
>> 
>>> Hi Edwin,
>>> 
>>> Thanks for your response. Are you sure it is a bug? Or is it not 
>>> meant to work together?
>>> After doing some thinking I do see a problem faceting a MLT-result.
>>> MLT-results have a clear ordering of the documents which will be 
>>> hard to maintain with facets. How will faceting MLT-results deal 
>>> with the ordering of the documents? Will the ordering just be ignored?
>>> 
>>> Best regards
>>> 
>>> Martin
>>> 
>>> 
>>> 
>>> Internal - KMD A/S
>>> 
>>> -Original Message-
>>> From: Zheng Lin Edwin Yeo 
>>> Sent: 27. februar 2019 03:38
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: MLT and facetting
>>> 
>>> Hi Martin,
>>> 
>>> I also get the same problem in Solr 7.7 if I turn on faceting in 
>>> /mlt requestHandler.
>>> 
>>> Found this issue in the JIRA:
>>> https://issues.apache.org/jira/browse/SOLR-7883
>>> Seems like it is a bug in Solr and it has not been resolved yet.
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ) 
>>> wrote:
>>> 
>>>> Hi Edwin,
>>>> 
>>>> Here it is:
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -
>>>> 
>>>> 
>>>> -
>>>> 
>>>> text
>>>> 
>>>> 1
>>>> 
>>>> 1
>>>> 
>>>> true
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Internal - KMD A/S
>>>> 
>>>> -Original Message-
>>>> From: Zheng Lin Edwin Yeo 
>>>> Sent: 26. februar 2019 08:24
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: MLT and facetting
>>>> 
>>>> Hi Martin,
>>>> 
>>>> What is your setting in your /mlt requestHandler in solrconfig.xml?
>>>> 
>>>> Regards,
>>>> Edwin
>>>> 
>>>> On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
>>>> 
>>>> wrote:
>>>> 
>>>>> Hi Edwin,
>>>>> 
>>>>> Thanks for your resp

RE: MLT and facetting

2019-03-01 Thread Martin Frank Hansen (MHQ)
Hi Edwin, 

Thanks for your time, much appreciated. 

Best regards 
Martin


Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo  
Sent: 1. marts 2019 02:48
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

I have no idea on this, as the case has not been active for almost 2 years.
Maybe I can try to follow up.

Faceting by default will show the list according to the number of occurrences, 
but I'm not sure how it will affect the MLT score or how it will be output when 
combined, as it is not working currently and there is no way to test.

Regards,
Edwin

On Thu, 28 Feb 2019 at 14:51, Martin Frank Hansen (MHQ)  wrote:

> Hi Edwin,
>
> Ok that is nice to know. Do you know when this bug will get fixed?
>
> By ordering I mean that MLT score the documents according to its 
> similarity function (believe it is cosine similarity), and I don’t 
> know how faceting will affect this score? Or ignore it all together?
>
> Best regards
>
> Martin
>
>
> Internal - KMD A/S
>
> -Original Message-
> From: Zheng Lin Edwin Yeo 
> Sent: 28. februar 2019 06:19
> To: solr-user@lucene.apache.org
> Subject: Re: MLT and facetting
>
> Hi Martin,
>
> According to the JIRA, it says it is a bug, as it was working 
> previously in Solr 4. I have not tried Solr 4 before, so I'm not sure how it 
> works.
>
> For the ordering of the documents, do you mean to sort them according 
> to the criteria that you want?
>
> Regards,
> Edwin
>
> On Wed, 27 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Edwin,
> >
> > Thanks for your response. Are you sure it is a bug? Or is it not 
> > meant to work together?
> > After doing some thinking I do see a problem faceting a MLT-result.
> > MLT-results have a clear ordering of the documents which will be 
> > hard to maintain with facets. How will faceting MLT-results deal 
> > with the ordering of the documents? Will the ordering just be ignored?
> >
> > Best regards
> >
> > Martin
> >
> >
> >
> > Internal - KMD A/S
> >
> > -Original Message-
> > From: Zheng Lin Edwin Yeo 
> > Sent: 27. februar 2019 03:38
> > To: solr-user@lucene.apache.org
> > Subject: Re: MLT and facetting
> >
> > Hi Martin,
> >
> > I also get the same problem in Solr 7.7 if I turn on faceting in 
> > /mlt requestHandler.
> >
> > Found this issue in the JIRA:
> > https://issues.apache.org/jira/browse/SOLR-7883
> > Seems like it is a bug in Solr and it has not been resolved yet.
> >
> > Regards,
> > Edwin
> >
> > On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ) 
> > wrote:
> >
> > > Hi Edwin,
> > >
> > > Here it is:
> > >
> > >
> > > 
> > >
> > >
> > > -
> > >
> > >
> > > -
> > >
> > > text
> > >
> > > 1
> > >
> > > 1
> > >
> > > true
> > >
> > > 
> > >
> > > 
> > >
> > >
> > > Internal - KMD A/S
> > >
> > > -Original Message-
> > > From: Zheng Lin Edwin Yeo 
> > > Sent: 26. februar 2019 08:24
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: MLT and facetting
> > >
> > > Hi Martin,
> > >
> > > What is your setting in your /mlt requestHandler in solrconfig.xml?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
> > > 
> > > wrote:
> > >
> > > > Hi Edwin,
> > > >
> > > > Thanks for your response.
> > > >
> > > > Yes you are right. It was simply the search parameters from Solr.
> > > >
> > > > The query looks like this:
> > > >
> > > > http://
> > > > .../solr/.../mlt?df=text=Journalnummer=on=i
> > > > d,
> > > > Jo
> > > > ur
> > > > nalnummer=id:*6512815*
> > > >
> > > > best regards,
> > > >
> > > > Martin
> > > >
> > > >
> > > > Internal - KMD A/S
> > > >
> > > > -Original Message-
> > > > From: Zheng Lin Edwin Yeo 
> > > > Sent: 26. februar 2019 03:54
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: MLT and facetting
> > > >
> > > > Hi Martin,
> > > >
> > > > I 

RE: MLT and facetting

2019-02-27 Thread Martin Frank Hansen (MHQ)
Hi Edwin, 

Ok that is nice to know. Do you know when this bug will get fixed? 

By ordering I mean that MLT scores the documents according to its similarity 
function (I believe it is cosine similarity), and I don't know how faceting will 
affect this score, or whether it will be ignored altogether. 

Best regards

Martin 


Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo  
Sent: 28. februar 2019 06:19
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

According to the JIRA, it says it is a bug, as it was working previously in 
Solr 4. I have not tried Solr 4 before, so I'm not sure how it works.

For the ordering of the documents, do you mean to sort them according to the 
criteria that you want?

Regards,
Edwin

On Wed, 27 Feb 2019 at 14:43, Martin Frank Hansen (MHQ)  wrote:

> Hi Edwin,
>
> Thanks for your response. Are you sure it is a bug? Or is it not meant 
> to work together?
> After doing some thinking I do see a problem faceting a MLT-result.
> MLT-results have a clear ordering of the documents which will be hard 
> to maintain with facets. How will faceting MLT-results deal with the 
> ordering of the documents? Will the ordering just be ignored?
>
> Best regards
>
> Martin
>
>
>
> Internal - KMD A/S
>
> -Original Message-
> From: Zheng Lin Edwin Yeo 
> Sent: 27. februar 2019 03:38
> To: solr-user@lucene.apache.org
> Subject: Re: MLT and facetting
>
> Hi Martin,
>
> I also get the same problem in Solr 7.7 if I turn on faceting in /mlt 
> requestHandler.
>
> Found this issue in the JIRA:
> https://issues.apache.org/jira/browse/SOLR-7883
> Seems like it is a bug in Solr and it has not been resolved yet.
>
> Regards,
> Edwin
>
> On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Edwin,
> >
> > Here it is:
> >
> >
> > 
> >
> >
> > -
> >
> >
> > -
> >
> > text
> >
> > 1
> >
> > 1
> >
> > true
> >
> > 
> >
> > 
> >
> >
> > Internal - KMD A/S
> >
> > -Original Message-
> > From: Zheng Lin Edwin Yeo 
> > Sent: 26. februar 2019 08:24
> > To: solr-user@lucene.apache.org
> > Subject: Re: MLT and facetting
> >
> > Hi Martin,
> >
> > What is your setting in your /mlt requestHandler in solrconfig.xml?
> >
> > Regards,
> > Edwin
> >
> > On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
> > wrote:
> >
> > > Hi Edwin,
> > >
> > > Thanks for your response.
> > >
> > > Yes you are right. It was simply the search parameters from Solr.
> > >
> > > The query looks like this:
> > >
> > > http://
> > > .../solr/.../mlt?df=text=Journalnummer=on=id,
> > > Jo
> > > ur
> > > nalnummer=id:*6512815*
> > >
> > > best regards,
> > >
> > > Martin
> > >
> > >
> > > Internal - KMD A/S
> > >
> > > -Original Message-
> > > From: Zheng Lin Edwin Yeo 
> > > Sent: 26. februar 2019 03:54
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: MLT and facetting
> > >
> > > Hi Martin,
> > >
> > > I think there are some pictures which are not being sent through 
> > > in the email.
> > >
> > > Do send your query that you are using, and which version of Solr 
> > > you are using?
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ) 
> > > 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I am trying to combine the mlt functionality with facets, but 
> > > > Solr throws
> > > > org.apache.solr.common.SolrException: ":"Unable to compute facet 
> > > > ranges, facet context is not set".
> > > >
> > > >
> > > >
> > > > What I am trying to do is quite simple, find similar documents 
> > > > using mlt and group these using the facet parameter. When using 
> > > > mlt and facets separately everything works fine, but not when 
> > > > combining the
> > > functionality.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > {
> > > >
> > > >   "responseHeader":{
> > > >
> > > > "status":500,
> > > >
> > &g

RE: MLT and facetting

2019-02-26 Thread Martin Frank Hansen (MHQ)
Hi Edwin,

Thanks for your response. Are you sure it is a bug, or is it just not meant to 
work together? 
After doing some thinking, I do see a problem with faceting an MLT result: MLT 
results have a clear ordering of the documents, which will be hard to maintain 
with facets. How will faceting MLT results deal with the ordering of the 
documents? Will the ordering just be ignored?

Best regards

Martin 



Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo  
Sent: 27. februar 2019 03:38
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

I also get the same problem in Solr 7.7 if I turn on faceting in /mlt 
requestHandler.

Found this issue in the JIRA:
https://issues.apache.org/jira/browse/SOLR-7883
Seems like it is a bug in Solr and it has not been resolved yet.

Regards,
Edwin

On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ)  wrote:

> Hi Edwin,
>
> Here it is:
>
>
> 
>
>
> -
>
>
> -
>
> text
>
> 1
>
> 1
>
> true
>
> 
>
> 
>
>
> Internal - KMD A/S
>
> -Original Message-
> From: Zheng Lin Edwin Yeo 
> Sent: 26. februar 2019 08:24
> To: solr-user@lucene.apache.org
> Subject: Re: MLT and facetting
>
> Hi Martin,
>
> What is your setting in your /mlt requestHandler in solrconfig.xml?
>
> Regards,
> Edwin
>
> On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Edwin,
> >
> > Thanks for your response.
> >
> > Yes you are right. It was simply the search parameters from Solr.
> >
> > The query looks like this:
> >
> > http://
> > .../solr/.../mlt?df=text=Journalnummer=on=id,Jo
> > ur
> > nalnummer=id:*6512815*
> >
> > best regards,
> >
> > Martin
> >
> >
> > Internal - KMD A/S
> >
> > -Original Message-
> > From: Zheng Lin Edwin Yeo 
> > Sent: 26. februar 2019 03:54
> > To: solr-user@lucene.apache.org
> > Subject: Re: MLT and facetting
> >
> > Hi Martin,
> >
> > I think there are some pictures which are not being sent through in 
> > the email.
> >
> > Do send your query that you are using, and which version of Solr you 
> > are using?
> >
> > Regards,
> > Edwin
> >
> > On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ) 
> > wrote:
> >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to combine the mlt functionality with facets, but Solr 
> > > throws
> > > org.apache.solr.common.SolrException: ":"Unable to compute facet 
> > > ranges, facet context is not set".
> > >
> > >
> > >
> > > What I am trying to do is quite simple, find similar documents 
> > > using mlt and group these using the facet parameter. When using 
> > > mlt and facets separately everything works fine, but not when 
> > > combining the
> > functionality.
> > >
> > >
> > >
> > >
> > >
> > > {
> > >
> > >   "responseHeader":{
> > >
> > > "status":500,
> > >
> > > "QTime":109},
> > >
> > >   "match":{"numFound":1,"start":0,"docs":[
> > >
> > >   {
> > >
> > > "Journalnummer":" 00759",
> > >
> > > "id":"6512815"  },
> > >
> > >   "response":{"numFound":602234,"start":0,"docs":[
> > >
> > >   {
> > >
> > > "Journalnummer":" 00759",
> > >
> > > "id":"6512816",
> > >
> > >   {
> > >
> > > "Journalnummer":" 00759",
> > >
> > > "id":"6834653"
> > >
> > >   {
> > >
> > > "Journalnummer":" 00739",
> > >
> > > "id":"6202373"
> > >
> > >   {
> > >
> > > "Journalnummer":" 00739",
> > >
> > > "id":"6748105"
> > >
> > >
> > >
> > >   {
> > >
> > > "Journalnummer":" 00803",
> > >
> > > "id":"7402155"
> > >
> > >   },
> > >
> > >  

RE: MLT and facetting

2019-02-26 Thread Martin Frank Hansen (MHQ)
Hi Edwin,

Here it is: 





<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <!-- attribute markup stripped by the mail archive; surviving values: text, 1, 1, true -->
</requestHandler>






Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo  
Sent: 26. februar 2019 08:24
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

What is your setting in your /mlt requestHandler in solrconfig.xml?

Regards,
Edwin

On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ)  wrote:

> Hi Edwin,
>
> Thanks for your response.
>
> Yes you are right. It was simply the search parameters from Solr.
>
> The query looks like this:
>
> http://.../solr/.../mlt?df=text&facet.field=Journalnummer&facet=on&fl=id,Journalnummer&q=id:*6512815*
>
> best regards,
>
> Martin
>
>
> Internal - KMD A/S
>
> -Original Message-
> From: Zheng Lin Edwin Yeo 
> Sent: 26. februar 2019 03:54
> To: solr-user@lucene.apache.org
> Subject: Re: MLT and facetting
>
> Hi Martin,
>
> I think there are some pictures which are not being sent through in 
> the email.
>
> Do send your query that you are using, and which version of Solr you 
> are using?
>
> Regards,
> Edwin
>
> On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi,
> >
> >
> >
> > I am trying to combine the mlt functionality with facets, but Solr 
> > throws
> > org.apache.solr.common.SolrException: ":"Unable to compute facet 
> > ranges, facet context is not set".
> >
> >
> >
> > What I am trying to do is quite simple, find similar documents using 
> > mlt and group these using the facet parameter. When using mlt and 
> > facets separately everything works fine, but not when combining the
> functionality.
> >
> >
> >
> >
> >
> > {
> >
> >   "responseHeader":{
> >
> > "status":500,
> >
> > "QTime":109},
> >
> >   "match":{"numFound":1,"start":0,"docs":[
> >
> >   {
> >
> > "Journalnummer":" 00759",
> >
> > "id":"6512815"  },
> >
> >   "response":{"numFound":602234,"start":0,"docs":[
> >
> >   {
> >
> > "Journalnummer":" 00759",
> >
> > "id":"6512816",
> >
> >   {
> >
> > "Journalnummer":" 00759",
> >
> > "id":"6834653"
> >
> >   {
> >
> > "Journalnummer":" 00739",
> >
> > "id":"6202373"
> >
> >   {
> >
> > "Journalnummer":" 00739",
> >
> > "id":"6748105"
> >
> >
> >
> >   {
> >
> > "Journalnummer":" 00803",
> >
> > "id":"7402155"
> >
> >   },
> >
> >   "error":{
> >
> > "metadata":[
> >
> >   "error-class","org.apache.solr.common.SolrException",
> >
> >   "root-error-class","org.apache.solr.common.SolrException"],
> >
> > "msg":"Unable to compute facet ranges, facet context is not 
> > set",
> >
> > "trace":"org.apache.solr.common.SolrException: Unable to compute 
> > facet ranges, facet context is not set\n\tat 
> > org.apache.solr.handler.component.RangeFacetProcessor.getFacetRangeC
> > ou nts(RangeFacetProcessor.java:66)\n\tat
> > org.apache.solr.handler.component.FacetComponent.getFacetCounts(Face
> > tC
> > omponent.java:331)\n\tat
> > org.apache.solr.handler.component.FacetComponent.getFacetCounts(Face
> > tC
> > omponent.java:295)\n\tat
> > org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLi
> > ke
> > ThisHandler.java:240)\n\tat
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHand
> > le
> > rBase.java:199)\n\tat
> > org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)\n\tat
> > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)\
> > n\
> > tat
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)\n\t
> > at 
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilt
> > er
> > .java:377)\n\tat
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilt
> > er
> > .java:323)\n\tat
> > org.

RE: MLT and facetting

2019-02-25 Thread Martin Frank Hansen (MHQ)
Sorry, I forgot to mention that we are using Solr 7.5. 


Internal - KMD A/S

-Original Message-
From: Martin Frank Hansen (MHQ)  
Sent: 26. februar 2019 07:43
To: solr-user@lucene.apache.org
Subject: RE: MLT and facetting

Hi Edwin,

Thanks for your response. 

Yes you are right. It was simply the search parameters from Solr. 

The query looks like this:

http://.../solr/.../mlt?df=text&facet.field=Journalnummer&facet=on&fl=id,Journalnummer&q=id:*6512815*

best regards,

Martin


Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo 
Sent: 26. februar 2019 03:54
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

I think there are some pictures which are not being sent through in the email.

Do send your query that you are using, and which version of Solr you are using?

Regards,
Edwin

On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ)  wrote:

> Hi,
>
>
>
> I am trying to combine the mlt functionality with facets, but Solr 
> throws
> org.apache.solr.common.SolrException: ":"Unable to compute facet 
> ranges, facet context is not set".
>
>
>
> What I am trying to do is quite simple, find similar documents using 
> mlt and group these using the facet parameter. When using mlt and 
> facets separately everything works fine, but not when combining the 
> functionality.
>
>
>
>
>
> {
>
>   "responseHeader":{
>
> "status":500,
>
> [snip: full JSON error response and stack trace]

RE: MLT and facetting

2019-02-25 Thread Martin Frank Hansen (MHQ)
Hi Dave, 

Thanks for your suggestion. I was under the impression that you could do it in 
a one-search approach, but if that's not possible I will try to split it into 
two searches. 

Is the best way to do this through Solrj? 

Best regards

Martin
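For reference, the two-search approach can be sketched with plain request URLs — a minimal illustration only. The host and core name are placeholders, and the fields `text`, `id` and `Journalnummer` are taken from the queries in this thread; with SolrJ the same two requests would simply be issued through a SolrClient instead of raw URLs.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TwoSearchSketch {
    // Placeholder base URL; substitute your own host and core name.
    static final String BASE = "http://localhost:8983/solr/mycore";

    // Step 1: an MLT request for the source document (no faceting here).
    static String mltUrl(String docId) {
        return BASE + "/mlt?q=id:" + docId + "&mlt.fl=text&fl=id&rows=50";
    }

    // Step 2: a plain select that facets over exactly the ids MLT returned.
    static String facetUrl(List<String> mltIds) {
        String idClause = mltIds.stream()
                .collect(Collectors.joining(" OR ", "id:(", ")"));
        return BASE + "/select?q=" + idClause
                + "&rows=0&facet=on&facet.field=Journalnummer";
    }

    public static void main(String[] args) {
        System.out.println(mltUrl("6512815"));
        System.out.println(facetUrl(List.of("6512816", "6834653")));
    }
}
```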


Internal - KMD A/S

-Original Message-
From: Dave  
Sent: 26. februar 2019 05:39
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Use the mlt results to get the queries to use for getting facets, in a 
two-search approach

> On Feb 25, 2019, at 10:18 PM, Zheng Lin Edwin Yeo  
> wrote:
> 
> Hi Martin,
> 
> I think there are some pictures which are not being sent through in 
> the email.
> 
> Do send your query that you are using, and which version of Solr you 
> are using?
> 
> Regards,
> Edwin
> 
>> On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ)  wrote:
>> 
>> Hi,
>> 
>> 
>> 
>> I am trying to combine the mlt functionality with facets, but Solr 
>> throws
>> org.apache.solr.common.SolrException: ":"Unable to compute facet 
>> ranges, facet context is not set".
>> 
>> 
>> 
>> What I am trying to do is quite simple, find similar documents using 
>> mlt and group these using the facet parameter. When using mlt and 
>> facets separately everything works fine, but not when combining the 
>> functionality.
>> 
>> 
>> 
>> 
>> 
>> [snip: full JSON error response and stack trace]

RE: MLT and facetting

2019-02-25 Thread Martin Frank Hansen (MHQ)
Hi Edwin,

Thanks for your response. 

Yes you are right. It was simply the search parameters from Solr. 

The query looks like this:

http://.../solr/.../mlt?df=text&facet.field=Journalnummer&facet=on&fl=id,Journalnummer&q=id:*6512815*

best regards,

Martin


Internal - KMD A/S

-Original Message-
From: Zheng Lin Edwin Yeo  
Sent: 26. februar 2019 03:54
To: solr-user@lucene.apache.org
Subject: Re: MLT and facetting

Hi Martin,

I think there are some pictures which are not being sent through in the email.

Do send the query that you are using, and let us know which version of Solr you are on.

Regards,
Edwin

On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ)  wrote:

> Hi,
>
>
>
> I am trying to combine the mlt functionality with facets, but Solr 
> throws
> org.apache.solr.common.SolrException: ":"Unable to compute facet 
> ranges, facet context is not set".
>
>
>
> What I am trying to do is quite simple, find similar documents using 
> mlt and group these using the facet parameter. When using mlt and 
> facets separately everything works fine, but not when combining the 
> functionality.
>
>
>
>
>
> [snip: full JSON error response and stack trace]

MLT and facetting

2019-02-25 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to combine the mlt functionality with facets, but Solr throws 
org.apache.solr.common.SolrException: "Unable to compute facet ranges, facet 
context is not set".

What I am trying to do is quite simple, find similar documents using mlt and 
group these using the facet parameter. When using mlt and facets separately 
everything works fine, but not when combining the functionality.


{
  "responseHeader":{
"status":500,
"QTime":109},
  "match":{"numFound":1,"start":0,"docs":[
  {
"Journalnummer":" 00759",
"id":"6512815"  },
  "response":{"numFound":602234,"start":0,"docs":[
  {
"Journalnummer":" 00759",
"id":"6512816",
  {
"Journalnummer":" 00759",
"id":"6834653"
  {
"Journalnummer":" 00739",
"id":"6202373"
  {
"Journalnummer":" 00739",
"id":"6748105"

  {
"Journalnummer":" 00803",
"id":"7402155"
  },
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"Unable to compute facet ranges, facet context is not set",
"trace":"org.apache.solr.common.SolrException: Unable to compute facet 
ranges, facet context is not set\n\tat 
org.apache.solr.handler.component.RangeFacetProcessor.getFacetRangeCounts(RangeFacetProcessor.java:66)\n\tat
 
org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:331)\n\tat
 
org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:295)\n\tat
 
org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:240)\n\tat
 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
 org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)\n\tat 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)\n\tat 
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)\n\tat 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
 org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat 
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
 org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat 
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
 

RE: unable to create new threads: out-of-memory issues

2019-02-12 Thread Martin Frank Hansen (MHQ)
Hi Mikhail, 

Thanks for your help. I will try it. 

-Original Message-
From: Mikhail Khludnev  
Sent: 12. februar 2019 15:54
To: solr-user 
Subject: Re: unable to create new threads: out-of-memory issues

1. you can jstack  to find it out.
2. It might create a thread, I don't know.
3. SolrClient is definitely a subject for heavy reuse.
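The reuse point can be illustrated without a live Solr instance. Below, `Client` is a deliberately simplified stand-in for HttpSolrClient, not the real SolrJ class; the point is only the lifecycle — each instance holds resources until close(), so one instance per document leaks, while one reused instance does not.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ClientReuse {
    // Count of "clients" currently holding resources; a stand-in for the
    // threads/connections a real HttpSolrClient keeps until close().
    static final AtomicInteger OPEN = new AtomicInteger();

    static class Client implements AutoCloseable {
        Client() { OPEN.incrementAndGet(); }
        void add(String doc) { /* pretend to send the document */ }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    public static void main(String[] args) {
        // Anti-pattern: a new client per document, never closed.
        for (int i = 0; i < 1000; i++) {
            new Client().add("doc" + i);
        }
        System.out.println("leaked clients: " + OPEN.get()); // 1000

        OPEN.set(0);
        // Recommended pattern: one client reused for every document.
        try (Client solr = new Client()) {
            for (int i = 0; i < 1000; i++) {
                solr.add("doc" + i);
            }
        }
        System.out.println("open clients after reuse: " + OPEN.get()); // 0
    }
}
```

With the real SolrJ client the same shape applies: build one HttpSolrClient, reuse it for all adds, and close it once at the end.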

On Tue, Feb 12, 2019 at 5:16 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Mikhail,
>
> I am using Solrj but think I might have found the problem.
>
> I am doing an atomic update on existing documents, and found out that I 
> create a new SolrClient for each document. I guess this is where all 
> the threads are coming from. Is it correct that when creating a 
> SolrClient, I also create a new thread?
>
> SolrClient solr = new HttpSolrClient.Builder(urlString).build();
>
> Thanks
>
> -Original Message-
> From: Mikhail Khludnev 
> Sent: 12. februar 2019 15:09
> To: solr-user 
> Subject: Re: unable to create new threads: out-of-memory issues
>
> Hello, Martin.
> How do you index? Where did you get this error?
>  Usually it occurs in custom code with many new Thread() calls and 
> usually healed with thread pooling.
>
> On Tue, Feb 12, 2019 at 3:25 PM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi,
> >
> > I am trying to create an index on a small Linux server running 
> > Solr-7.5.0, but keep running into problems.
> >
> > When I try to index a file-folder of roughly 18 GB (18000 files) I 
> > get the following error from the server:
> >
> > java.lang.OutOfMemoryError: unable to create new native thread.
> >
> > From the server I can see the following limits:
> >
> > User$ ulimit -a
> > core file size (blocks, -c) 0
> > data seg size (kbytes, -d) unlimited
> > scheduling priority (-e) 0
> > file size   (blocks, -f)
> unlimited
> > pending signals  (-i) 257568
> > max locked memory (kbytes, -l) 64
> > max memory size  (kbytes, -m) unlimited
> > open files(-n) 1024
> > pipe size   (512 bytes, -p) 8
> > POSIX message queues(bytes, -q) 819200
> > real-time priority  (-r) 0
> > stack size  (kbytes, -s) 8192
> > cpu time   (seconds, -t) unlimited
> > max user processes  (-u) 257568
> > virtual memory  (kbytes, -v) unlimited
> > file locks  (-x) unlimited
> >
> > I do not see any limits on threads only on open files.
> >
> > I have added an autoCommit of a maximum of 1000 documents, but that 
> > did not help. How can I increase the thread limit, or is there 
> > another way of solving this issue? Any help is appreciated.
> >
> > Best regards
> >
> > Martin
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


--
Sincerely yours
Mikhail Khludnev


RE: unable to create new threads: out-of-memory issues

2019-02-12 Thread Martin Frank Hansen (MHQ)
Hi Mikhail, 

I am using Solrj but think I might have found the problem. 

I am doing an atomic update on existing documents, and found out that I create a 
new SolrClient for each document. I guess this is where all the threads are 
coming from. Is it correct that when creating a SolrClient, I also create a new 
thread? 

SolrClient solr = new HttpSolrClient.Builder(urlString).build();

Thanks 

-Original Message-
From: Mikhail Khludnev  
Sent: 12. februar 2019 15:09
To: solr-user 
Subject: Re: unable to create new threads: out-of-memory issues

Hello, Martin.
How do you index? Where did you get this error?
It usually occurs in custom code with many new Thread() calls and is usually 
healed with thread pooling.

On Tue, Feb 12, 2019 at 3:25 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi,
>
> I am trying to create an index on a small Linux server running 
> Solr-7.5.0, but keep running into problems.
>
> When I try to index a file-folder of roughly 18 GB (18000 files) I get 
> the following error from the server:
>
> java.lang.OutOfMemoryError: unable to create new native thread.
>
> From the server I can see the following limits:
>
> User$ ulimit -a
> core file size (blocks, -c) 0
> data seg size (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals  (-i) 257568
> max locked memory (kbytes, -l) 64
> max memory size  (kbytes, -m) unlimited
> open files(-n) 1024
> pipe size   (512 bytes, -p) 8
> POSIX message queues(bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 8192
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 257568
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
> I do not see any limits on threads only on open files.
>
> I have added an autoCommit of a maximum of 1000 documents, but that did 
> not help. How can I increase the thread limit, or is there another way 
> of solving this issue? Any help is appreciated.
>
> Best regards
>
> Martin
>
>


--
Sincerely yours
Mikhail Khludnev


unable to create new threads: out-of-memory issues

2019-02-12 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to create an index on a small Linux server running Solr-7.5.0, but 
keep running into problems.

When I try to index a file-folder of roughly 18 GB (18000 files) I get the 
following error from the server:

java.lang.OutOfMemoryError: unable to create new native thread.

From the server I can see the following limits:

User$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals  (-i) 257568
max locked memory (kbytes, -l) 64
max memory size  (kbytes, -m) unlimited
open files(-n) 1024
pipe size   (512 bytes, -p) 8
POSIX message queues(bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) 257568
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited

I do not see any limits on threads only on open files.

I have added an autoCommit with a maximum of 1000 documents, but that did not 
help. How can I increase the thread limit, or is there another way of solving 
this issue? Any help is appreciated.
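For what it's worth, native Java threads are bounded by "max user processes" (ulimit -u), not "open files"; the commands below inspect the relevant limits. The user name "solr" and the values in the commented limits.conf lines are illustrative, not taken from this setup.

```shell
# Each Java thread consumes one entry of the per-user process/thread limit
# and one thread stack, so these two limits are the ones to check:
ulimit -u    # max user processes (native threads count against this)
ulimit -s    # per-thread stack size in KB

# To raise the limit persistently, lines like these go in
# /etc/security/limits.conf (user name and values are examples):
#   solr  soft  nproc  65536
#   solr  hard  nproc  65536
# A smaller JVM thread stack (e.g. -Xss256k) also lowers per-thread memory cost.
```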

Best regards

Martin



RE: indexing multiple levels of data

2018-11-16 Thread Martin Frank Hansen (MHQ)
Hi Jan,

Thanks for your quick reply!

I was afraid that you would suggest this. I have already moved much of the 
indexing application out of Solr, which gives me the desired flexibility, but I 
am a bit concerned about the time consumption of doing so.

Right now I have about 20,000 xml documents at case level that need to be 
matched to 20,000 xml documents at file level as well as around 400,000 files.

I was able to index the 2 xml documents at file level plus the appropriate 
files, as this can be done directly using Solrj and atomic update. So my next 
idea will be to merge the xml documents (as you suggested), add those to 
Solr, and then add all the files using atomic update. Would this 
be a way to go?

As the xml documents are ordered according to their names it should be easier 
to match the specific documents without going through all files every time.

Right now we have

-Original Message-
From: Jan Høydahl 
Sent: 16. november 2018 15:29
To: solr-user 
Subject: Re: indexing multiple levels of data

Hi Martin,

For a complex use case as this I would recommend you write a separate indexer 
application that crawls the files, looks up the correct metadata XMLs based on 
given business rules, and then constructs the full Solr document to send to 
Solr.
Even parsing full-text from PDF etc I would recommend to do in such an indexer 
application instead of relying on Solr's built-in Tika.

This gives you all the control you need, and the burden of building and running 
a separate app will probably be worth it.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 16. nov. 2018 kl. 12:24 skrev Martin Frank Hansen (MHQ) :
>
> Hi,
>
> I am trying to add metadata and files to Solr, but am experiencing some 
> problems.
>
> Data is divided into two levels, cases and files. For each case the meta-data is 
> given in an xml document, while meta data for the files is given in another 
> xml document, and the actual files are kept in yet another place.
> For each case multiple files might exist.
> There is no unique key between the cases and the files.
> There is however an identifier for each of the cases which is present at file 
> level as well.
>
>
>  1.  I tried using atomic update, but that did not work since a unique key is 
> required.
>  2.  I thought about using a multivalued field for the files within a 
> case-document. The problem is that it is the files that I am interested in, 
> and if I query a specific file, the entire document is returned which is not 
> very helpful. Is there a way to specify which of the files actually match a 
> query within a document (see example below)? I was thinking about the 
> highlight component, but I am not sure if it will work.
>
> {
>
> id:case1
>
> file:{file1, file2, file3…}
>
> }
>
>  1.  Another thing was using a join at query level, but it seems a bit 
> tedious. Is there a way to make a join at index-time?
>
> Any suggestions are much appreciated.
>
> Best regards
>
> Martin
>



indexing multiple levels of data

2018-11-16 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to add metadata and files to Solr, but am experiencing some 
problems.

Data is divided into two levels, cases and files. For each case the meta-data is 
given in an xml document, while meta data for the files is given in another xml 
document, and the actual files are kept in yet another place.
For each case multiple files might exist.
There is no unique key between the cases and the files.
There is however an identifier for each of the cases which is present at file 
level as well.


  1.  I tried using atomic update, but that did not work since a unique key is 
required.
  2.  I thought about using a multivalued field for the files within a 
case-document. The problem is that it is the files that I am interested in, and 
if I query a specific file, the entire document is returned which is not very 
helpful. Is there a way to specify which of the files actually match a query 
within a document (see example below)? I was thinking about the highlight 
component, but I am not sure if it will work.

{

id:case1

file:{file1, file2, file3…}

}

  3.  Another thing was using a join at query level, but it seems a bit 
tedious. Is there a way to make a join at index-time?
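One common way to approximate an index-time join is to denormalize while indexing: copy the case metadata onto every file document that carries the shared identifier, so each file is a self-contained, individually searchable document. A stand-alone sketch — the field names besides Journalnummer and the values are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexTimeDenormalize {
    // Copy case-level metadata onto every file document carrying the same
    // shared identifier, so each file becomes a complete Solr document.
    static void denormalize(List<Map<String, String>> files,
                            Map<String, Map<String, String>> cases,
                            String keyField) {
        for (Map<String, String> f : files) {
            Map<String, String> c = cases.get(f.get(keyField));
            if (c != null) c.forEach(f::putIfAbsent);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> cases =
                Map.of("test-123", Map.of("caseTitle", "Some case"));
        List<Map<String, String>> files = List.of(
                new HashMap<>(Map.of("id", "file1", "Journalnummer", "test-123")),
                new HashMap<>(Map.of("id", "file2", "Journalnummer", "other")));
        denormalize(files, cases, "Journalnummer");
        System.out.println(files.get(0).get("caseTitle")); // Some case
        System.out.println(files.get(1).get("caseTitle")); // null
    }
}
```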

Any suggestions are much appreciated.

Best regards

Martin



RE: Merging data from different sources

2018-10-31 Thread Martin Frank Hansen (MHQ)
Hi Markus,

Thanks for your reply!

I hope I can make it work as well.

-Original Message-
From: Markus Jelsma 
Sent: 30. oktober 2018 22:02
To: solr-user@lucene.apache.org
Subject: RE: Merging data from different sources

Hello Martin,

We also use a URP for this in some cases. We index documents to some 
collection; the URP reads a field from each incoming document that is an ID in 
another collection. So we fetch that remote Solr document on the fly and use 
its fields to enrich the incoming document.

It is very straightforward and works very well.

Regards,
Markus
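For context, an enrichment URP like the one Markus describes is wired into solrconfig.xml as a processor chain; a minimal sketch, where the lookup factory's class name is hypothetical (the actual class would be custom code):

```xml
<updateRequestProcessorChain name="enrich">
  <!-- hypothetical custom URP: reads the ID field, fetches the remote
       document, and copies selected fields onto the incoming document -->
  <processor class="com.example.RemoteLookupProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then selected per request with update.chain=enrich.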



-Original message-
> From: Martin Frank Hansen (MHQ) 
> Sent: Tuesday 30th October 2018 21:55
> To: solr-user@lucene.apache.org
> Subject: RE: Merging data from different sources
>
> Hi Alex,
>
> Thanks for your help. I will take a look at the update-request-processor.
>
> I wonder if there is a way to link documents together, so that they always 
> show up together should one of the documents match a search query?
>
> -Original Message-
> From: Alexandre Rafalovitch 
> Sent: 30. oktober 2018 13:16
> To: solr-user 
> Subject: Re: Merging data from different sources
>
> Maybe
> https://lucene.apache.org/solr/guide/7_5/update-request-processors.html#atomicupdateprocessorfactory
>
> Regards,
> Alex
>
> On Tue, Oct 30, 2018, 7:57 AM Martin Frank Hansen (MHQ),  wrote:
>
> > Hi,
> >
> > I am trying to merge files from different sources with different
> > content (except for one key field); how can this be done in Solr?
> >
> > An example could be:
> >
> > Document 1
> >   id: 001                 (unique id for Document 1)
> >   Journalnumber: test-123
> >   …
> >
> > Document 2
> >   id: abcdefgh            (unique id for Document 2)
> >   Journalnumber: test-123
> >   …
> >
> > In the above case I would like to merge on Journalnumber, thus ending
> > up with something like this:
> >
> > Merged document
> >   id: 001                 (unique id for the merge)
> >   Journalnumber: test-123
> >   reference: abcdefgh     (reference id for Document 2)
> >   …
> >
> > How would I go about this? I was thinking about embedded documents,
> > but since I am not indexing the different data sources at the same
> > time I don’t think it will work. The ideal result would be to have
> > Document 2 embedded in Document 1.
> >
> > I am currently using a schema that contains all fields from Document
> > 1 and Document 2.
> >
> > I really hope that Solr can handle this, and any help/feedback is
> > much appreciated.
> >
> > Best regards
> >
> > Martin
> >
> >
> >
> >
>


RE: Merging data from different sources

2018-10-30 Thread Martin Frank Hansen (MHQ)
Hi Alex,

Thanks for your help. I will take a look at the update-request-processor.

I wonder if there is a way to link documents together, so that they always show 
up together should one of the documents match a search query?

-Original Message-
From: Alexandre Rafalovitch 
Sent: 30. oktober 2018 13:16
To: solr-user 
Subject: Re: Merging data from different sources

Maybe
https://lucene.apache.org/solr/guide/7_5/update-request-processors.html#atomicupdateprocessorfactory

Regards,
Alex
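Per the linked guide, AtomicUpdateProcessorFactory can be enabled per request, converting a plain add into atomic operations; a hedged sketch of the request parameters (the field name is illustrative):

```
/update/json/docs?processor=atomic&atomic.reference=add&commit=true
```

Each atomic.<fieldname> parameter names the operation (add, set, inc, remove) to apply to that field of the incoming document.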

On Tue, Oct 30, 2018, 7:57 AM Martin Frank Hansen (MHQ),  wrote:

> Hi,
>
> I am trying to merge files from different sources with different
> content (except for one key field); how can this be done in Solr?
>
> An example could be:
>
> Document 1
>   id: 001                 (unique id for Document 1)
>   Journalnumber: test-123
>   …
>
> Document 2
>   id: abcdefgh            (unique id for Document 2)
>   Journalnumber: test-123
>   …
>
> In the above case I would like to merge on Journalnumber, thus ending
> up with something like this:
>
> Merged document
>   id: 001                 (unique id for the merge)
>   Journalnumber: test-123
>   reference: abcdefgh     (reference id for Document 2)
>   …
>
> How would I go about this? I was thinking about embedded documents,
> but since I am not indexing the different data sources at the same
> time I don’t think it will work. The ideal result would be to have
> Document 2 embedded in Document 1.
>
> I am currently using a schema that contains all fields from Document 1
> and Document 2.
>
> I really hope that Solr can handle this, and any help/feedback is much
> appreciated.
>
> Best regards
>
> Martin
>
>
>
>
>


Merging data from different sources

2018-10-30 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to merge files from different sources with different content 
(except for one key field); how can this be done in Solr?

An example could be:

Document 1
  id: 001                 (unique id for Document 1)
  Journalnumber: test-123
  …

Document 2
  id: abcdefgh            (unique id for Document 2)
  Journalnumber: test-123
  …

In the above case I would like to merge on Journalnumber, thus ending up with 
something like this:

Merged document
  id: 001                 (unique id for the merge)
  Journalnumber: test-123
  reference: abcdefgh     (reference id for Document 2)
  …

How would I go about this? I was thinking about embedded documents, but since I 
am not indexing the different data sources at the same time, I don't think it 
will work. The ideal result would be to have Document 2 embedded in Document 1.
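One way to get the merged result described above without re-sending Document 1 is an atomic update keyed on Document 1's id; a hedged sketch (the reference field name is illustrative, and atomic updates require the document's other fields to be stored or docValues-backed):

```json
[
  { "id": "001", "reference": { "add": "abcdefgh" } }
]
```

Posted to /update, this adds the Document 2 id onto the existing Document 1 rather than replacing the whole document.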

I am currently using a schema that contains all fields from Document 1 and 
Document 2.

I really hope that Solr can handle this, and any help/feedback is much 
appreciated.

Best regards

Martin






RE: Tesseract language

2018-10-28 Thread Martin Frank Hansen (MHQ)
Hi Tim and Rohan,

Really appreciate your help, and I finally made it work (without tess4j).

It was the path environment variable that had a wrong setting. Instead of 
pointing TESSDATA_PREFIX to 'Tesseract-OCR/tessdata', I changed it to the 
parent folder 'Tesseract-OCR', and now it works for Danish.
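For anyone hitting the same problem, the fix amounts to pointing the variable at the install folder rather than the tessdata subfolder, then re-testing from the command line (Windows CMD shown since the thread mentions CMD; paths are illustrative):

```
set TESSDATA_PREFIX=C:\Program Files\Tesseract-OCR
tesseract testing\eurotext.png testing\eurotext-dan -l dan
```

Tesseract 3 expects TESSDATA_PREFIX to name the directory that *contains* the tessdata folder, which matches the fix described above.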

Thanks again for helping.

Best regards

Martin

-Original Message-
From: Tim Allison 
Sent: 27. oktober 2018 14:37
To: solr-user@lucene.apache.org; u...@tika.apache.org
Subject: Re: Tesseract language

Martin,
  Let’s move this over to user@tika.

Rohan,
  Is there something about Tika’s use of tesseract for image files that can be 
improved?

Best,
   Tim

On Sat, Oct 27, 2018 at 3:40 AM Rohan Kasat  wrote:

> I used tess4j for image formats and Tika for scanned PDFs and images
> within PDFs.
>
> Regards,
> Rohan Kasat
>
> On Sat, Oct 27, 2018 at 12:39 AM Martin Frank Hansen (MHQ)
> 
> wrote:
>
> > Hi Rohan,
> >
> > Thanks for your reply, are you using tess4j with Tika or on its own?
> > I will take a look at tess4j if I can't make it work with Tika alone.
> >
> > Best regards
> > Martin
> >
> >
> > -Original Message-
> > From: Rohan Kasat 
> > Sent: 26. oktober 2018 21:45
> > To: solr-user@lucene.apache.org
> > Subject: Re: Tesseract language
> >
> > Hi Martin,
> >
> > Are you using it for image formats? I think you can try tess4j and
> > set TESSDATA_PREFIX as the home for the Tesseract configs.
> >
> > I have tried it and it works pretty well in my local machine.
> >
> > I have used Java 8 and Tesseract 3 for the same.
> >
> > Regards,
> > Rohan Kasat
> >
> > On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ)
> > 
> > wrote:
> >
> > > Hi Tim,
> > >
> > > You were right.
> > >
> > > When I called `tesseract testing/eurotext.png testing/eurotext-dan
> > > -l dan`, I got an error message so I downloaded "dan.traineddata"
> > > and added it to the Tesseract-OCR/tessdata folder. Furthermore I
> > > added the 'TESSDATA_PREFIX' variable to the path-variables
> > > pointing to "Tesseract-OCR/tessdata".
> > >
> > > Now Tesseract works with Danish language from the CMD, but now I
> > > can't make the code work in Java, not even with default settings
> > > (which I could before). Am I missing something or just mixing some things 
> > > up?
> > >
> > >
> > >
> > > -Original Message-
> > > From: Tim Allison 
> > > Sent: 26. oktober 2018 19:58
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Tesseract language
> > >
> > > Tika relies on you to install tesseract and all the language
> > > libraries you'll need.
> > >
> > > If you can successfully call `tesseract testing/eurotext.png
> > > testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> > > with your code above.
> > > On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi again,
> > > >
> > > > Now I moved the OCR part to Tika, but I still can't make it work
> > > > with
> > > Danish. It works when using default language settings and it seems
> > > like Tika is missing Danish dictionary.
> > > >
> > > > My java code looks like this:
> > > >
> > > > {
> > > > File file = new File(pathfilename);
> > > >
> > > > Metadata meta = new Metadata();
> > > >
> > > > InputStream stream = TikaInputStream.get(file);
> > > >
> > > >     Parser parser = new AutoDetectParser();
> > > > BodyContentHandler handler = new
> > > > BodyContentHandler(Integer.MAX_VALUE);
> > > >
> > > > TesseractOCRConfig config = new TesseractOCRConfig();
> > > > config.setLanguage("dan"); // code works if this
> > > > phrase is
> > > commented out.
> > > >
> > > > ParseContext parseContext = new ParseContext();
> > > >
> > > >  parseContext.set(TesseractOCRConfig.class, config);
> > > >
> > > > parser.parse(stream, handler, meta, parseContext);
> > > > System.out.println(handler.toString());
> > > > }
> > > >
> > > > Hope

RE: Tesseract language

2018-10-27 Thread Martin Frank Hansen (MHQ)
Hi Rohan,

Thanks for your reply, are you using tess4j with Tika or on its own?  I will 
take a look at tess4j if I can't make it work with Tika alone.

Best regards
Martin


-Original Message-
From: Rohan Kasat 
Sent: 26. oktober 2018 21:45
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

Are you using it for image formats? I think you can try tess4j and set 
TESSDATA_PREFIX as the home for the Tesseract configs.

I have tried it and it works pretty well in my local machine.

I have used Java 8 and Tesseract 3 for the same.

Regards,
Rohan Kasat

On Fri, Oct 26, 2018 at 12:31 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Tim,
>
> You were right.
>
> When I called `tesseract testing/eurotext.png testing/eurotext-dan -l
> dan`, I got an error message so I downloaded "dan.traineddata" and
> added it to the Tesseract-OCR/tessdata folder. Furthermore I added the
> 'TESSDATA_PREFIX' variable to the path-variables pointing to
> "Tesseract-OCR/tessdata".
>
> Now Tesseract works with Danish language from the CMD, but now I can't
> make the code work in Java, not even with default settings (which I
> could before). Am I missing something or just mixing some things up?
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 26. oktober 2018 19:58
> To: solr-user@lucene.apache.org
> Subject: Re: Tesseract language
>
> Tika relies on you to install tesseract and all the language libraries
> you'll need.
>
> If you can successfully call `tesseract testing/eurotext.png
> testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
> with your code above.
> On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)
> 
> wrote:
> >
> > Hi again,
> >
> > Now I moved the OCR part to Tika, but I still can't make it work
> > with
> Danish. It works when using default language settings and it seems
> like Tika is missing Danish dictionary.
> >
> > My java code looks like this:
> >
> > {
> > File file = new File(pathfilename);
> >
> > Metadata meta = new Metadata();
> >
> > InputStream stream = TikaInputStream.get(file);
> >
> > Parser parser = new AutoDetectParser();
> > BodyContentHandler handler = new
> > BodyContentHandler(Integer.MAX_VALUE);
> >
> > TesseractOCRConfig config = new TesseractOCRConfig();
> > config.setLanguage("dan"); // code works if this phrase
> > is
> commented out.
> >
> > ParseContext parseContext = new ParseContext();
> >
> >  parseContext.set(TesseractOCRConfig.class, config);
> >
> > parser.parse(stream, handler, meta, parseContext);
> > System.out.println(handler.toString());
> > }
> >
> > Hope that someone can help here.
> >
> > -Original Message-
> > From: Martin Frank Hansen (MHQ) 
> > Sent: 22. oktober 2018 07:58
> > To: solr-user@lucene.apache.org
> > Subject: SV: Tesseract language
> >
> > Hi Erick,
> >
> > Thanks for the help! I will take a look at it.
> >
> >
> > Martin Frank Hansen, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > www.kmd.dk Mobil +4525571418
> >
> > -Original Message-
> > From: Erick Erickson 
> > Sent: 21 October 2018 22:49
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > Here's a skeletal program that uses Tika in a stand-alone client.
> > Rip
> the RDBMS parts out
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch <
> arafa...@gmail.com> wrote:
> > >
> > > Usually, we just say to do a custom solution using SolrJ client to
> > > connect. This gives you maximum flexibility and allows to
> > > integrate Tika either inside your code or as a server. Latest Tika
> > > actually has some off-thread handling I believe, to make it safer to 
> > > embed.
> > >
> > > For DIH alternatives, if you want configuration over custom code,
> > > you could look at something like Apache NiFI. It can push data
> > > into
> Solr.
> > > Obviously it is a bigger solution, but it is correspondingly more
> > > robust too.
> > >
> > > Regards,
> > >Alex.
> > > On Sun, 2

RE: Tesseract language

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

You were right.

When I called `tesseract testing/eurotext.png testing/eurotext-dan -l dan`, I 
got an error message so I downloaded "dan.traineddata" and added it to the 
Tesseract-OCR/tessdata folder. Furthermore I added the 'TESSDATA_PREFIX' 
variable to the path-variables pointing to "Tesseract-OCR/tessdata".

Now Tesseract works with Danish language from the CMD, but now I can't make the 
code work in Java, not even with default settings (which I could before). Am I 
missing something or just mixing some things up?



-Original Message-
From: Tim Allison 
Sent: 26. oktober 2018 19:58
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Tika relies on you to install tesseract and all the language libraries you'll 
need.

If you can successfully call `tesseract testing/eurotext.png 
testing/eurotext-dan -l dan`, Tika _should_ be able to specify "dan"
with your code above.
On Fri, Oct 26, 2018 at 10:49 AM Martin Frank Hansen (MHQ)  wrote:
>
> Hi again,
>
> Now I moved the OCR part to Tika, but I still can't make it work with Danish. 
> It works when using default language settings and it seems like Tika is 
> missing Danish dictionary.
>
> My java code looks like this:
>
> {
> File file = new File(pathfilename);
>
> Metadata meta = new Metadata();
>
> InputStream stream = TikaInputStream.get(file);
>
> Parser parser = new AutoDetectParser();
> BodyContentHandler handler = new
> BodyContentHandler(Integer.MAX_VALUE);
>
> TesseractOCRConfig config = new TesseractOCRConfig();
> config.setLanguage("dan"); // code works if this phrase is 
> commented out.
>
> ParseContext parseContext = new ParseContext();
>
>  parseContext.set(TesseractOCRConfig.class, config);
>
> parser.parse(stream, handler, meta, parseContext);
>         System.out.println(handler.toString());
> }
>
> Hope that someone can help here.
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 22. oktober 2018 07:58
> To: solr-user@lucene.apache.org
> Subject: SV: Tesseract language
>
> Hi Erick,
>
> Thanks for the help! I will take a look at it.
>
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -Original Message-
> From: Erick Erickson 
> Sent: 21 October 2018 22:49
> To: solr-user 
> Subject: Re: Tesseract language
>
> Here's a skeletal program that uses Tika in a stand-alone client. Rip the 
> RDBMS parts out
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
> wrote:
> >
> > Usually, we just say to do a custom solution using SolrJ client to
> > connect. This gives you maximum flexibility and allows to integrate
> > Tika either inside your code or as a server. Latest Tika actually
> > has some off-thread handling I believe, to make it safer to embed.
> >
> > For DIH alternatives, if you want configuration over custom code,
> > you could look at something like Apache NiFI. It can push data into Solr.
> > Obviously it is a bigger solution, but it is correspondingly more
> > robust too.
> >
> > Regards,
> >Alex.
> > On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> > >
> > > Hi Alexandre,
> > >
> > > Thanks for your reply.
> > >
> > > Yes right now it is just for testing the possibilities of Solr and 
> > > Tesseract.
> > >
> > > I will take a look at the Tika documentation to see if I can make it work.
> > >
> > > You said that DIH are not recommended for production usage, what is the 
> > > recommended method(s) to upload data to a Solr instance?
> > >
> > > Best regards
> > >
> > > Martin Frank Hansen
> > >
> > > -Original Message-
> > > From: Alexandre Rafalovitch 
> > > Sent: 21 October 2018 16:26
> > > To: solr-user 
> > > Subject: Re: Tesseract language
> > >
> > > There is a couple of things mixed in here:
> > > 1) Extract handler is not recommended for production usage. It is great 
> > > for a quick test, just like you did it, but going to production, running 
> > > it externally is better. Tika - especially with large files can use up a 
> > > lot of memory and trip up the Solr instance it is running within.
> > > 2) If you are still just 

RE: Tesseract language

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi again,

Now I have moved the OCR part to Tika, but I still can't make it work with 
Danish. It works when using the default language settings, and it seems that 
Tika is missing the Danish dictionary.

My java code looks like this:

{
    File file = new File(pathfilename);

    Metadata meta = new Metadata();

    InputStream stream = TikaInputStream.get(file);

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    config.setLanguage("dan"); // code works if this line is commented out

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);

    parser.parse(stream, handler, meta, parseContext);
    System.out.println(handler.toString());
}

Hope that someone can help here.

-Original Message-
From: Martin Frank Hansen (MHQ) 
Sent: 22. oktober 2018 07:58
To: solr-user@lucene.apache.org
Subject: SV: Tesseract language

Hi Erick,

Thanks for the help! I will take a look at it.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-Original Message-
From: Erick Erickson 
Sent: 21 October 2018 22:49
To: solr-user 
Subject: Re: Tesseract language

Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS 
parts out

https://lucidworks.com/2012/02/14/indexing-with-solrj/
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
wrote:
>
> Usually, we just say to do a custom solution using SolrJ client to
> connect. This gives you maximum flexibility and allows to integrate
> Tika either inside your code or as a server. Latest Tika actually has
> some off-thread handling I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFI. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>    Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes right now it is just for testing the possibilities of Solr and 
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH are not recommended for production usage, what is the 
> > recommended method(s) to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -Original Message-
> > From: Alexandre Rafalovitch 
> > Sent: 21 October 2018 16:26
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > There is a couple of things mixed in here:
> > 1) Extract handler is not recommended for production usage. It is great for 
> > a quick test, just like you did it, but going to production, running it 
> > externally is better. Tika - especially with large files can use up a lot 
> > of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr but 
> > specifying parseContent.config file as shown at the link and described 
> > further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-ce
> > ll-using-apache-tika.html#configuring-the-solr-extractingrequesthand
> > ler You still need to check with Tika documentation with Tesseract
> > can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, Data Import Handler can 
> > iterate through files and then - as a nested entity - feed it to Tika 
> > processor for further extraction. I think one of the examples shows that.
> > However, I am not sure you can pass parseContext that way and DIH is also 
> > not recommended for production.
> >
> > I hope this helps,
> > Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  wrote:
> >
> > > Hi again,
> > >
> > >
> > >
> > > Is there anyone who has some experience of using Tesseract’s OCR
> > > module within Solr? The files I am trying to read into Solr is
> > > Danish Tiff documents.
> > >
> > >
> > >
> > >
> > >
> > > *Martin Frank Hansen*, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> 

RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

Thanks again, I will update Tika and try it again.

-Original Message-
From: Tim Allison 
Sent: 26. oktober 2018 12:53
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

Ha...emails passed in the ether.

As you saw, we added the RecursiveParserWrapper a while back into Tika so no 
need to re-invent that wheel.  That’s my preferred method/format because it 
maintains metadata from attachments and lets you know about exceptions in 
embedded files. The legacy method concatenates contents, throws out attachment 
metadata and silently swallows attachment exceptions.

On Fri, Oct 26, 2018 at 6:25 AM Martin Frank Hansen (MHQ) 
wrote:

> Hi again,
>
> Never mind, I managed to get the content of the msg-files as well
> using the following link as inspiration:
> https://wiki.apache.org/tika/RecursiveMetadata
>
> But thanks again for all your help!
>
> -Original Message-
> From: Martin Frank Hansen (MHQ) 
> Sent: 26. oktober 2018 10:14
> To: solr-user@lucene.apache.org
> Subject: RE: Reading data using Tika to Solr
>
> Hi Tim,
>
> It is msg files and I added tika-app-1.14.jar to the build path - and
> now it works  But how do I get it to read the attachments as well?
>
> -Original Message-
> From: Tim Allison 
> Sent: 25. oktober 2018 21:57
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> If you’re processing actual msg (not eml), you’ll also need poi and
> poi-scratchpad and their dependencies, but then those msgs could have
> attachments, at which point, you may as well just add tika-app. :D
>
> On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
> wrote:
>
> > Hi Erick and Tim,
> >
> > Thanks for your answers, I can see that my mail got messed up on the
> > way through the server. It looked much more readable at my end 
> > The attachment simply included my build-path.
> >
> > @Erick I am compiling the program using Netbeans at the moment.
> >
> > I updated to tika-1.7 but that did not help, and I haven't tried
> > maven yet but will probably have to give that a chance. I just find
> > it a bit odd that I can see the dependencies are included in the jar
> > files I added to the project, but I must be missing something?
> >
> > My buildpath looks as follows:
> >
> > tika-parsers-1.4.jar
> > tika-core-1.4.jar
> > commons-io-2.5.jar
> > httpclient-4.5.3.jar
> > httpcore-4.4.6.jar
> > httpmime-4.5.3.jar
> > slf4j-api-1.7.24.jar
> > jcl-over-slf4j-1.7.24.jar
> > solr-cell-7.5.0.jar
> > solr-core-7.5.0.jar
> > solr-solrj-7.5.0.jar
> > noggit-0.8.jar
> >
> >
> >
> > -Original Message-
> > From: Tim Allison 
> > Sent: 25. oktober 2018 20:21
> > To: solr-user@lucene.apache.org
> > Subject: Re: Reading data using Tika to Solr
> >
> > To follow up w Erick’s point, there are a bunch of transitive
> > dependencies from tika-parsers. If you aren’t using maven or similar
> > build system to grab the dependencies, it can be tricky to get it
> > right. If you aren’t using maven, and you can afford the risks of
> > jar hell, consider using tika-app or, better perhaps, tika-server.
> >
> > Stay tuned for SOLR-11721...
> >
> > On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> > 
> > wrote:
> >
> > > Martin:
> > >
> > > The mail server is pretty aggressive about stripping attachments,
> > > your png didn't come though. You might also get a more informed
> > > answer on the Tika mailing list.
> > >
> > > That said (and remember I can't see your png so this may be a
> > > silly question), how are you executing the program .vs. compiling
> > > it? You mentioned the "build path". I'm usually lazy and just
> > > execute it in IntelliJ for development and have forgotten to set
> > > my classpath on _numerous_ occasions when running it from a
> > > command line ;)
> > >
> > > Best,
> > > Erick
> > >
> > > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > > 
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > >
> > > >
> > > > I am trying to read content of msg-files using Tika and index
> > > > these in
> > > Solr, however I am having some problems with the OfficeParser(). I
> > > keep getting the error java.lang.NoClassDefFoundError for the
> > > OfficeParser, even though both

RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi again,

Never mind, I managed to get the content of the msg-files as well, using the 
following link as inspiration: https://wiki.apache.org/tika/RecursiveMetadata

But thanks again for all your help!

-Original Message-
From: Martin Frank Hansen (MHQ) 
Sent: 26. oktober 2018 10:14
To: solr-user@lucene.apache.org
Subject: RE: Reading data using Tika to Solr

Hi Tim,

It is msg files, and I added tika-app-1.14.jar to the build path, and now it 
works! But how do I get it to read the attachments as well?

-Original Message-
From: Tim Allison 
Sent: 25. oktober 2018 21:57
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

If you’re processing actual msg (not eml), you’ll also need poi and 
poi-scratchpad and their dependencies, but then those msgs could have 
attachments, at which point, you may as well just add tika-app. :D

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the
> way through the server. It looked much more readable at my end. The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven
> yet but will probably have to give that a chance. I just find it a bit
> odd that I can see the dependencies are included in the jar files I
> added to the project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api-1.7.24.jar
> Jcl-over-slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 25 October 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive
> dependencies from tika-parsers. If you aren’t using maven or similar
> build system to grab the dependencies, it can be tricky to get it
> right. If you aren’t using maven, and you can afford the risks of jar
> hell, consider using tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> 
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments,
> > your png didn't come though. You might also get a more informed
> > answer on the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program .vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > 
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read content of msg-files using Tika and index
> > > these in
> > Solr, however I am having some problems with the OfficeParser(). I
> > keep getting the error java.lang.NoClassDefFoundError for the
> > OfficeParser, even though both tika-core and tika-parsers are
> > included
> in the build path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws
> > IOException,SAXException, TikaException {
> > >
> > >
> > >
> > > processDocument(pathtofile)
> > >
> > >
> > >
> > >  }
> > >
> > >
> > >
> > > private static void
> > > processDocument(String
> > pathfilename)  {
> > >
> > >
> > >
> > >
> > >
> > >  try {
> > >
> > >
> > >
> > > File file
> > > = new
> > File(pathfilename);
> > >
> > >
> > >
> > > Metadata
> > > meta =
> > new Metadata();
> > >
> > >
> > >
>

RE: Reading data using Tika to Solr

2018-10-26 Thread Martin Frank Hansen (MHQ)
Hi Tim,

It is msg files and I added tika-app-1.14.jar to the build path - and now it 
works  But how do I get it to read the attachments as well?

-Original Message-
From: Tim Allison 
Sent: 25 October 2018 21:57
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

If you’re processing actual msg (not eml), you’ll also need poi and 
poi-scratchpad and their dependencies, but then those msgs could have 
attachments, at which point you may as well just add tika-app. :D

On Thu, Oct 25, 2018 at 2:46 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi Erick and Tim,
>
> Thanks for your answers, I can see that my mail got messed up on the
> way through the server. It looked much more readable at my end. The
> attachment simply included my build-path.
>
> @Erick I am compiling the program using Netbeans at the moment.
>
> I updated to tika-1.7 but that did not help, and I haven't tried maven
> yet but will probably have to give that a chance. I just find it a bit
> odd that I can see the dependencies are included in the jar files I
> added to the project, but I must be missing something?
>
> My buildpath looks as follows:
>
> Tika-parsers-1.4.jar
> Tika-core-1.4.jar
> Commons-io-2.5.jar
> Httpclient-4.5.3
> Httpcore-4.4.6.jar
> Httpmime-4.5.3.jar
> Slf4j-api-1.7.24.jar
> Jcl-over-slf4j-1.7.24.jar
> Solr-cell-7.5.0.jar
> Solr-core-7.5.0.jar
> Solr-solrj-7.5.0.jar
> Noggit-0.8.jar
>
>
>
> -Original Message-
> From: Tim Allison 
> Sent: 25 October 2018 20:21
> To: solr-user@lucene.apache.org
> Subject: Re: Reading data using Tika to Solr
>
> To follow up w Erick’s point, there are a bunch of transitive
> dependencies from tika-parsers. If you aren’t using maven or similar
> build system to grab the dependencies, it can be tricky to get it
> right. If you aren’t using maven, and you can afford the risks of jar
> hell, consider using tika-app or, better perhaps, tika-server.
>
> Stay tuned for SOLR-11721...
>
> On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson
> 
> wrote:
>
> > Martin:
> >
> > The mail server is pretty aggressive about stripping attachments,
> > your png didn't come though. You might also get a more informed
> > answer on the Tika mailing list.
> >
> > That said (and remember I can't see your png so this may be a silly
> > question), how are you executing the program .vs. compiling it? You
> > mentioned the "build path". I'm usually lazy and just execute it in
> > IntelliJ for development and have forgotten to set my classpath on
> > _numerous_ occasions when running it from a command line ;)
> >
> > Best,
> > Erick
> >
> > On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ)
> > 
> > wrote:
> > >
> > > Hi,
> > >
> > >
> > >
> > > I am trying to read content of msg-files using Tika and index
> > > these in
> > Solr, however I am having some problems with the OfficeParser(). I
> > keep getting the error java.lang.NoClassDefFoundError for the
> > OfficeParser, even though both tika-core and tika-parsers are
> > included
> in the build path.
> > >
> > >
> > >
> > >
> > >
> > > I am using Java with the following code:
> > >
> > >
> > >
> > >
> > >
> > > public static void main(final String[] args) throws
> > IOException,SAXException, TikaException {
> > >
> > >
> > >
> > > processDocument(pathtofile)
> > >
> > >
> > >
> > >  }
> > >
> > >
> > >
> > > private static void
> > > processDocument(String
> > pathfilename)  {
> > >
> > >
> > >
> > >
> > >
> > >  try {
> > >
> > >
> > >
> > > File file
> > > = new
> > File(pathfilename);
> > >
> > >
> > >
> > > Metadata
> > > meta =
> > new Metadata();
> > >
> > >
> > >
> > >
> > > InputStream
> > input = TikaInputStream.get(file);
> > >
> > >
> > >
> > >
> > BodyContentHandler handler = new BodyContentHandler();
> > >
> > >
> > >
> > > Parser
> > > parser =
> > new OfficePa

RE: Reading data using Tika to Solr

2018-10-25 Thread Martin Frank Hansen (MHQ)
Hi Erick and Tim,

Thanks for your answers, I can see that my mail got messed up on the way 
through the server. It looked much more readable at my end. The attachment
simply included my build-path.

@Erick I am compiling the program using Netbeans at the moment.

I updated to tika-1.7 but that did not help, and I haven't tried maven yet but 
will probably have to give that a chance. I just find it a bit odd that I can 
see the dependencies are included in the jar files I added to the project, but 
I must be missing something?

My buildpath looks as follows:

Tika-parsers-1.4.jar
Tika-core-1.4.jar
Commons-io-2.5.jar
Httpclient-4.5.3
Httpcore-4.4.6.jar
Httpmime-4.5.3.jar
Slf4j-api-1.7.24.jar
Jcl-over-slf4j-1.7.24.jar
Solr-cell-7.5.0.jar
Solr-core-7.5.0.jar
Solr-solrj-7.5.0.jar
Noggit-0.8.jar
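When a build-path list like this is in doubt, a quick way to see which classes actually made it onto the run-time classpath is a small, dependency-free probe (a sketch; the two Tika/POI class names below are just examples of classes worth checking, not anything prescribed by this thread):

```java
// ClasspathCheck.java - probe whether named classes are visible at run time.
// Class.forName() performs the same lookup that fails with
// ClassNotFoundException / NoClassDefFoundError when a jar is missing.
public class ClasspathCheck {

    static boolean isOnClasspath(String fqcn) {
        try {
            Class.forName(fqcn);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical probe targets: one class from tika-parsers, one from poi.
        String[] probes = {
            "org.apache.tika.parser.microsoft.OfficeParser",
            "org.apache.poi.poifs.filesystem.POIFSFileSystem"
        };
        for (String fqcn : probes) {
            System.out.println(fqcn + (isOnClasspath(fqcn) ? " -> found" : " -> MISSING"));
        }
    }
}
```

Run it with the same -cp you use for the real program; any MISSING line points at the jar the NoClassDefFoundError is complaining about.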



-Original Message-
From: Tim Allison 
Sent: 25 October 2018 20:21
To: solr-user@lucene.apache.org
Subject: Re: Reading data using Tika to Solr

To follow up w Erick’s point, there are a bunch of transitive dependencies from 
tika-parsers. If you aren’t using maven or similar build system to grab the 
dependencies, it can be tricky to get it right. If you aren’t using maven, and 
you can afford the risks of jar hell, consider using tika-app or, better 
perhaps, tika-server.

Stay tuned for SOLR-11721...
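If you do go the Maven route, a single tika-parsers dependency pulls in the POI/poi-scratchpad chain transitively; a minimal sketch (the version number here is illustrative, not one recommended in this thread):

```xml
<!-- pom.xml fragment: tika-parsers brings in POI, poi-scratchpad, etc. transitively -->
<dependencies>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.19.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.19.1</version>
  </dependency>
</dependencies>
```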

On Thu, Oct 25, 2018 at 1:08 PM Erick Erickson 
wrote:

> Martin:
>
> The mail server is pretty aggressive about stripping attachments, your
> png didn't come though. You might also get a more informed answer on
> the Tika mailing list.
>
> That said (and remember I can't see your png so this may be a silly
> question), how are you executing the program .vs. compiling it? You
> mentioned the "build path". I'm usually lazy and just execute it in
> IntelliJ for development and have forgotten to set my classpath on
> _numerous_ occasions when running it from a command line ;)
>
> Best,
> Erick
>
> On Thu, Oct 25, 2018 at 2:55 AM Martin Frank Hansen (MHQ) 
> wrote:
> >
> > Hi,
> >
> >
> >
> > I am trying to read content of msg-files using Tika and index these
> > in
> Solr, however I am having some problems with the OfficeParser(). I
> keep getting the error java.lang.NoClassDefFoundError for the
> OfficeParser, even though both tika-core and tika-parsers are included in the
> build path.
> >
> >
> >
> >
> >
> > I am using Java with the following code:
> >
> >
> >
> >
> >
> > public static void main(final String[] args) throws
> IOException,SAXException, TikaException {
> >
> >
> >
> > processDocument(pathtofile)
> >
> >
> >
> >  }
> >
> >
> >
> > private static void
> > processDocument(String
> pathfilename)  {
> >
> >
> >
> >
> >
> >  try {
> >
> >
> >
> > File file =
> > new
> File(pathfilename);
> >
> >
> >
> > Metadata
> > meta =
> new Metadata();
> >
> >
> >
> >  InputStream
> input = TikaInputStream.get(file);
> >
> >
> >
> >
> BodyContentHandler handler = new BodyContentHandler();
> >
> >
> >
> > Parser
> > parser =
> new OfficeParser();
> >
> >
> > ParseContext
> context = new ParseContext();
> >
> >
> parser.parse(input, handler, meta, context);
> >
> >
> >
> >  String
> doccontent = handler.toString();
> >
> >
> >
> >
> >
> >
>  System.out.println(doccontent);
> >
> >
>  System.out.println(meta);
> >
> >
> >
> >  }
> >
> >  }
> >
> > In the buildpath I have the following dependencies:
> >
> >
> >
> >
> >
> > Any help is appreciated.
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > Best regards,
> >
> >
> >
> > Martin Hansen
> >
> >
> >

Reading data using Tika to Solr

2018-10-25 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to read the content of msg files using Tika and index them in Solr;
however, I am having some problems with the OfficeParser(). I keep getting the
error java.lang.NoClassDefFoundError for the OfficeParser, even though both
tika-core and tika-parsers are included in the build path.


I am using Java with the following code:


public static void main(final String[] args) throws IOException, SAXException, TikaException {

    processDocument(pathtofile);
}

private static void processDocument(String pathfilename) {

    try {
        File file = new File(pathfilename);
        Metadata meta = new Metadata();
        InputStream input = TikaInputStream.get(file);
        BodyContentHandler handler = new BodyContentHandler();
        Parser parser = new OfficeParser();
        ParseContext context = new ParseContext();
        parser.parse(input, handler, meta, context);

        String doccontent = handler.toString();

        System.out.println(doccontent);
        System.out.println(meta);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
In the buildpath I have the following dependencies:

[cid:image001.png@01D46C59.8AECF060]

Any help is appreciated.

Thanks in advance.

Best regards,

Martin Hansen




SV: Tesseract language

2018-10-22 Thread Martin Frank Hansen (MHQ)
Hi Erick,

Thanks for the help! I will take a look at it.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-Original Message-
From: Erick Erickson 
Sent: 21 October 2018 22:49
To: solr-user 
Subject: Re: Tesseract language

Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS 
parts out

https://lucidworks.com/2012/02/14/indexing-with-solrj/
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch  
wrote:
>
> Usually, we just say to do a custom solution using SolrJ client to
> connect. This gives you maximum flexibility and allows to integrate
> Tika either inside your code or as a server. Latest Tika actually has
> some off-thread handling I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFI. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes right now it is just for testing the possibilities of Solr and 
> > Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it work.
> >
> > You said that DIH is not recommended for production usage; what are the
> > recommended methods to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -Original Message-
> > From: Alexandre Rafalovitch 
> > Sent: 21 October 2018 16:26
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) Extract handler is not recommended for production usage. It is great for 
> > a quick test, just like you did it, but going to production, running it 
> > externally is better. Tika - especially with large files can use up a lot 
> > of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr but 
> > specifying the parseContext.config file as shown at the link and described
> > further down in the same document:
> > https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-ce
> > ll-using-apache-tika.html#configuring-the-solr-extractingrequesthand
> > ler You still need to check in the Tika documentation whether Tesseract
> > can take its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, Data Import Handler can 
> > iterate through files and then - as a nested entity - feed it to Tika 
> > processor for further extraction. I think one of the examples shows that.
> > However, I am not sure you can pass parseContext that way and DIH is also 
> > not recommended for production.
> >
> > I hope this helps,
> > Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  wrote:
> >
> > > Hi again,
> > >
> > >
> > >
> > > Is there anyone who has some experience of using Tesseract’s OCR
> > > module within Solr? The files I am trying to read into Solr are
> > > Danish TIFF documents.
> > >
> > >
> > >
> > >
> > >
> > > *Martin Frank Hansen*, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> > >
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > > www.kmd.dk Mobil +4525571418
> > >
> > >
> > >
> > > *From:* Martin Frank Hansen (MHQ) 
> > > *Sent:* 18 October 2018 13:30
> > > *To:* solr-user@lucene.apache.org
> > > *Subject:* Tesseract language
> > >
> > >
> > >
> > > Hi,
> > >
> > > I have been trying to use Tesseract through the
> > > data-import-handler in Solr and it actually works very well – with
> > > English. As the documents are in Danish, I need to change the
> > > language setting in Tesseract to Danish as well, is that possible from 
> > > Solr?
> > >
> > >
> > >
> > > I was using the update/extract-handler to import single files into
> > > Solr, and it worked for a single file, how would I implement
> > > several files from a file-system?
> > >
> > >
> > >
> > > Here is the r

SV: Tesseract language

2018-10-22 Thread Martin Frank Hansen (MHQ)
Hi Gus,

Thank you so much! I will definitely take a look at it during the day.


Martin Frank Hansen,

-Original Message-
From: Gus Heck 
Sent: 22 October 2018 00:06
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language

Hi Martin,

I wrote a framework (https://github.com/nsoft/jesterj) that is meant to help 
with small to medium custom solutions. It's not (yet) ready for cases where you
need multiple machines feeding data, but so long as a single box can do the 
work it should be useful. It has a basic Tika stage which is ripe for 
enhancement. The example in the project uses Tika to extract text from 
Shakespeare's plays, though I'll admit that the Tika processor class has not
yet been given the full set of configuration options.  Fleshing that out is on 
the list of things to do and would be easy and welcome as a contribution 
(https://github.com/nsoft/jesterj/issues/74).

-Gus


On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch 
wrote:

> Usually, we just say to do a custom solution using SolrJ client to
> connect. This gives you maximum flexibility and allows to integrate
> Tika either inside your code or as a server. Latest Tika actually has
> some off-thread handling I believe, to make it safer to embed.
>
> For DIH alternatives, if you want configuration over custom code, you
> could look at something like Apache NiFI. It can push data into Solr.
> Obviously it is a bigger solution, but it is correspondingly more
> robust too.
>
> Regards,
>Alex.
> On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ) 
> wrote:
> >
> > Hi Alexandre,
> >
> > Thanks for your reply.
> >
> > Yes right now it is just for testing the possibilities of Solr and
> Tesseract.
> >
> > I will take a look at the Tika documentation to see if I can make it
> work.
> >
> > You said that DIH is not recommended for production usage; what are
> > the
> recommended methods to upload data to a Solr instance?
> >
> > Best regards
> >
> > Martin Frank Hansen
> >
> > -Original Message-
> > From: Alexandre Rafalovitch 
> > Sent: 21 October 2018 16:26
> > To: solr-user 
> > Subject: Re: Tesseract language
> >
> > There are a couple of things mixed in here:
> > 1) Extract handler is not recommended for production usage. It is
> > great
> for a quick test, just like you did it, but going to production,
> running it externally is better. Tika - especially with large files
> can use up a lot of memory and trip up the Solr instance it is running within.
> > 2) If you are still just testing, you can configure Tika within Solr
> > but
> specifying the parseContext.config file as shown at the link and described
> further down in the same document:
> >
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell
> -using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> > You still need to check in the Tika documentation whether Tesseract can
> > take
> its configuration from the parseContext file.
> > 3) If you are still testing with multiple files, Data Import Handler
> > can
> iterate through files and then - as a nested entity - feed it to Tika
> processor for further extraction. I think one of the examples shows that.
> > However, I am not sure you can pass parseContext that way and DIH is
> also not recommended for production.
> >
> > I hope this helps,
> > Alex.
> >
> > On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ) 
> wrote:
> >
> > > Hi again,
> > >
> > >
> > >
> > > Is there anyone who has some experience of using Tesseract’s OCR
> > > module within Solr? The files I am trying to read into Solr are
> > > Danish TIFF documents.
> > >
> > >
> > >
> > >
> > >
> > > *Martin Frank Hansen*, Senior Data Analytiker
> > >
> > > Data, IM & Analytics
> > >
> > >
> > >
> > > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > > www.kmd.dk Mobil +4525571418
> > >
> > >
> > >
> > > *From:* Martin Frank Hansen (MHQ) 
> > > *Sent:* 18 October 2018 13:30
> > > *To:* solr-user@lucene.apache.org
> > > *Subject:* Tesseract language
> > >
> > >
> > >
> > > Hi,
> > >
> > > I have been trying to use Tesseract through the
> > > data-import-handler in Solr and it actually works very well – with
> > > English. As the documents are in Danish, I need to change the
> > > langu

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi Alex,

Thanks again for your reply, much appreciated.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-Original Message-
From: Alexandre Rafalovitch 
Sent: 21 October 2018 19:13
To: solr-user 
Subject: Re: Tesseract language

Usually, we just say to do a custom solution using SolrJ client to connect. 
This gives you maximum flexibility and allows to integrate Tika either inside 
your code or as a server. Latest Tika actually has some off-thread handling I 
believe, to make it safer to embed.

For DIH alternatives, if you want configuration over custom code, you could 
look at something like Apache NiFI. It can push data into Solr.
Obviously it is a bigger solution, but it is correspondingly more robust too.

Regards,
   Alex.
On Sun, 21 Oct 2018 at 11:07, Martin Frank Hansen (MHQ)  wrote:
>
> Hi Alexandre,
>
> Thanks for your reply.
>
> Yes right now it is just for testing the possibilities of Solr and Tesseract.
>
> I will take a look at the Tika documentation to see if I can make it work.
>
> You said that DIH is not recommended for production usage; what are the
> recommended methods to upload data to a Solr instance?
>
> Best regards
>
> Martin Frank Hansen
>
> -Original Message-
> From: Alexandre Rafalovitch 
> Sent: 21 October 2018 16:26
> To: solr-user 
> Subject: Re: Tesseract language
>
> There are a couple of things mixed in here:
> 1) Extract handler is not recommended for production usage. It is great for a 
> quick test, just like you did it, but going to production, running it 
> externally is better. Tika - especially with large files can use up a lot of 
> memory and trip up the Solr instance it is running within.
> 2) If you are still just testing, you can configure Tika within Solr but 
> specifying the parseContext.config file as shown at the link and described
> further down in the same document:
> https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell
> -using-apache-tika.html#configuring-the-solr-extractingrequesthandler
> You still need to check in the Tika documentation whether Tesseract can take its
> configuration from the parseContext file.
> 3) If you are still testing with multiple files, Data Import Handler can 
> iterate through files and then - as a nested entity - feed it to Tika 
> processor for further extraction. I think one of the examples shows that.
> However, I am not sure you can pass parseContext that way and DIH is also not 
> recommended for production.
>
> I hope this helps,
> Alex.
>
> On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  wrote:
>
> > Hi again,
> >
> >
> >
> > Is there anyone who has some experience of using Tesseract’s OCR
> > module within Solr? The files I am trying to read into Solr are Danish
> > TIFF documents.
> >
> >
> >
> >
> >
> > *Martin Frank Hansen*, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > www.kmd.dk Mobil +4525571418
> >
> >
> >
> > *From:* Martin Frank Hansen (MHQ) 
> > *Sent:* 18 October 2018 13:30
> > *To:* solr-user@lucene.apache.org
> > *Subject:* Tesseract language
> >
> >
> >
> > Hi,
> >
> > I have been trying to use Tesseract through the data-import-handler
> > in Solr and it actually works very well – with English. As the
> > documents are in Danish, I need to change the language setting in
> > Tesseract to Danish as well, is that possible from Solr?
> >
> >
> >
> > I was using the update/extract-handler to import single files into
> > Solr, and it worked for a single file, how would I implement several
> > files from a file-system?
> >
> >
> >
> > Here is the request-handler I used:
> >
> >
> >
> >  >
> >   startup="lazy"
> >
> >   class="solr.extraction.ExtractingRequestHandler" >
> >
> > 
> >
> >   false
> >
> >   ignored_
> >
> >   true
> >
> > 
> >
> >   
> >
> >
> >
> >
> >
> > *Martin Frank Hansen*, Senior Data Analytiker
> >
> > Data, IM & Analytics
> >
> >
> >
> > Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
> > www.kmd.dk Mobil +4525571418
> >
> >
&g

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi Alexandre,

Thanks for your reply.

Yes right now it is just for testing the possibilities of Solr and Tesseract.

I will take a look at the Tika documentation to see if I can make it work.

You said that DIH is not recommended for production usage; what are the
recommended methods to upload data to a Solr instance?

Best regards

Martin Frank Hansen

-Original Message-
From: Alexandre Rafalovitch 
Sent: 21 October 2018 16:26
To: solr-user 
Subject: Re: Tesseract language

There are a couple of things mixed in here:
1) Extract handler is not recommended for production usage. It is great for a 
quick test, just like you did it, but going to production, running it 
externally is better. Tika - especially with large files can use up a lot of 
memory and trip up the Solr instance it is running within.
2) If you are still just testing, you can configure Tika within Solr but 
specifying the parseContext.config file as shown at the link and described further
down in the same document:
https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler
You still need to check in the Tika documentation whether Tesseract can take its
configuration from the parseContext file.
3) If you are still testing with multiple files, Data Import Handler can 
iterate through files and then - as a nested entity - feed it to Tika processor 
for further extraction. I think one of the examples shows that.
However, I am not sure you can pass parseContext that way and DIH is also not 
recommended for production.

I hope this helps,
Alex.
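
For the Danish/Tesseract case specifically, the parseContext.config mechanism mentioned above would look roughly like this. This is a sketch to verify against the Tika docs: it assumes Tika's TesseractOCRConfig accepts its language setting as a bean-style property ("dan" being Tesseract's code for Danish).

```xml
<!-- parseContext.config, referenced from the extract handler via
     the parseContext.config parameter -->
<entries>
  <entry class="org.apache.tika.parser.ocr.TesseractOCRConfig"
         impl="org.apache.tika.parser.ocr.TesseractOCRConfig">
    <!-- assumed to map to TesseractOCRConfig.setLanguage(); "dan" = Danish -->
    <property name="language" value="dan"/>
  </entry>
</entries>
```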

On Sun, 21 Oct 2018 at 09:24, Martin Frank Hansen (MHQ)  wrote:

> Hi again,
>
>
>
> Is there anyone who has some experience of using Tesseract’s OCR
> module within Solr? The files I am trying to read into Solr are Danish
> TIFF documents.
>
>
>
>
>
> *Martin Frank Hansen*, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
>
>
> *From:* Martin Frank Hansen (MHQ) 
> *Sent:* 18 October 2018 13:30
> *To:* solr-user@lucene.apache.org
> *Subject:* Tesseract language
>
>
>
> Hi,
>
> I have been trying to use Tesseract through the data-import-handler in
> Solr and it actually works very well – with English. As the documents
> are in Danish, I need to change the language setting in Tesseract to
> Danish as well, is that possible from Solr?
>
>
>
> I was using the update/extract-handler to import single files into
> Solr, and it worked for a single file, how would I implement several
> files from a file-system?
>
>
>
> Here is the request-handler I used:
>
>
>
> 
>   startup="lazy"
>
>   class="solr.extraction.ExtractingRequestHandler" >
>
> 
>
>   false
>
>   ignored_
>
>   true
>
> 
>
>   
>
>
>
>
>
> *Martin Frank Hansen*, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
>
>


SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)
Hi again,

Is there anyone who has some experience of using Tesseract’s OCR module within 
Solr? The files I am trying to read into Solr are Danish TIFF documents.


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics


Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk<mailto:m...@kmd.dk>  Web www.kmd.dk<http://www.kmd.dk/>
Mobil +4525571418

From: Martin Frank Hansen (MHQ) 
Sent: 18 October 2018 13:30
To: solr-user@lucene.apache.org
Subject: Tesseract language

Hi,

I have been trying to use Tesseract through the data-import-handler in Solr and 
it actually works very well – with English. As the documents are in Danish, I
need to change the language setting in Tesseract to Danish as well, is that 
possible from Solr?

I was using the update/extract-handler to import single files into Solr, and it 
worked for a single file, how would I implement several files from a 
file-system?
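For the several-files question: one common approach (a sketch under assumed host, core, and field names, none of which come from this thread) is to walk the file system and post each file to Solr's /update/extract handler. The request-building part can be kept separate from the actual HTTP call:

```python
import os
import urllib.parse

# Hypothetical Solr core name; adjust to your own setup.
SOLR_EXTRACT = "http://localhost:8983/solr/mycore/update/extract"

def extract_requests(base_dir, extensions=(".tif", ".tiff")):
    """Yield (file_path, request_url) for every matching file under base_dir.

    Each URL targets the /update/extract handler; literal.id sets the
    document's unique key to the file path. Actually sending the request
    (with urllib.request, curl, etc.) is left out of this sketch.
    """
    for root, _dirs, names in os.walk(base_dir):
        for name in names:
            if name.lower().endswith(extensions):
                path = os.path.join(root, name)
                query = urllib.parse.urlencode(
                    {"literal.id": path, "commit": "false"})
                yield path, "%s?%s" % (SOLR_EXTRACT, query)
```

Posting the file bodies could then be done in a loop, with a single commit at the end rather than one per file.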

Here is the request-handler I used:



[request-handler XML stripped by the archive; only the parameter values "false", 
"ignored_" and "true" survive]


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics


Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk<mailto:m...@kmd.dk>  Web www.kmd.dk<http://www.kmd.dk/>
Mobil +4525571418




Tesseract language

2018-10-18 Thread Martin Frank Hansen (MHQ)
Hi,

I have been trying to use Tesseract through the data-import-handler in Solr, and 
it actually works very well – with English. As the documents are in Danish, I 
need to change the Tesseract language setting to Danish as well; is that 
possible from Solr?

I was using the update/extract-handler to import single files into Solr, and it 
worked for a single file, how would I implement several files from a 
file-system?
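On the language question: Tika's Tesseract integration reads its defaults from a TesseractOCRConfig.properties file found on the classpath, so one approach (untested here, and dependent on the Tika version Solr bundles) is to ship one that selects the Danish traineddata:

```properties
# org/apache/tika/parser/ocr/TesseractOCRConfig.properties
# packaged on Solr's classpath, e.g. in a jar under the core's lib/ directory
language=dan
```

This assumes the dan.traineddata language pack is installed where the Tesseract binary can find it.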

Here is the request-handler I used:



[request-handler XML stripped by the archive; only the parameter values "false", 
"ignored_" and "true" survive]


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics


Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk<mailto:m...@kmd.dk>  Web www.kmd.dk<http://www.kmd.dk/>
Mobil +4525571418




SV: DIH for TikaEntityProcessor

2018-10-12 Thread Martin Frank Hansen (MHQ)
You sir just made my day!!!

It worked!!! Thanks a million!


Martin Frank Hansen,

-----Original Message-----
From: Kamuela Lau 
Sent: 12 October 2018 11:41
To: solr-user@lucene.apache.org
Subject: Re: DIH for TikaEntityProcessor

Also, just wondering, have you tried to specify dataSource="bin" for 
read_file?
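For readers hitting the same ClassCastException: the fix the thread converges on is giving the Tika entity a binary data source, so Tika receives an InputStream rather than a character Reader. A minimal sketch of such a data-config (entity and field names are illustrative, not copied from the original config):

```xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*doc"
            recursive="true" rootEntity="false">
      <!-- dataSource="bin" is the crucial part: it hands Tika a raw
           InputStream instead of a Reader -->
      <entity name="read_file" processor="TikaEntityProcessor"
              dataSource="bin" url="${files.fileAbsolutePath}" format="text">
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```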

On Fri, Oct 12, 2018 at 6:38 PM Kamuela Lau  wrote:

> Hi,
>
> I was unable to reproduce the error that you got with the information
> provided.
> Below are the data-config.xml and managed-schema fields I used; the
> data-config is mostly the same (I think that BinFileDataSource doesn't
> actually require a dataSource, so I think it's safe to put
> dataSource="null"):
>
> [data-config.xml markup stripped by the archive; surviving attributes:
> baseDir="/path/to/sampleData", fileName=".*doc", recursive="true",
> rootEntity="false", dataSource="bin", onError="skip",
> url="${files.fileAbsolutePath}"]
>
> And from the managed schema:
> [field definitions stripped by the archive; surviving attributes:
> required="true", multiValued="false", docValues="false", multiValued="true"]
>
> When I had field column="text" name="content", the documents were
> still indexed, but the text/content was not (as I had no content field
> in the schema).
> I used the default config, and Solr version 7.5.0; I was able to
> import the data just fine (I also tested with .*DOC). Is there any
> other information you can provide that can help me reproduce this error?
>
>
>
>
> On Fri, Oct 12, 2018 at 4:11 PM Martin Frank Hansen (MHQ) 
> wrote:
>
>> Hi again,
>>
>>
>>
>> Can anybody help me? Any suggestions as to why I am getting the error below?
>>
>>
>>
>>
>>
>> *Martin Frank Hansen*, Senior Data Analytiker
>>
>> Data, IM & Analytics
>>
>>
>>
>> Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk  Web
>> www.kmd.dk Mobil +4525571418
>>
>>
>>
>> *From:* Martin Frank Hansen (MHQ)
>> *Sent:* 10 October 2018 10:15
>> *To:* solr-user 
>> *Subject:* DIH for TikaEntityProcessor
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am trying to read documents from a file system into Solr, using
>> dataimporthandler but keep getting the following errors:
>>
>>
>>
>> Exception while processing: files document :
>> null:org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.ClassCastException: java.io.InputStreamReader cannot be
>> cast to java.io.InputStream
>>
>>  at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAnd
>> Throw(DataImportHandlerException.java:61)
>>
>>  at
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
>> ityProcessorWrapper.java:270)
>>
>>  at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>> r.java:476)
>>
>>  at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>> r.java:517)
>>
>>  at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>> r.java:415)
>>
>>  at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
>> ava:330)
>>
>>  at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
>> :233)
>>
>>  at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
>> rter.java:424)
>>
>>  at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
>> ava:483)
>>
>>  at
>> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Dat
>> aImporter.java:466)
>>
>>  at java.lang.Thread.run(Thread.java:748)
>>
>> Caused by: java.lang.ClassCastException: java.io.InputStreamReader
>> cannot be cast to java.io.InputStream
>>
>>  at
>> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEn
>> tityProcessor.java:132)
>>
>>  at
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
>> ityProcessorWrapper.java:267)
>>
>>  ... 9 more
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Full Import failed:java.lang.RuntimeException:
>> java.lang.RuntimeExcepti

SV: DIH for TikaEntityProcessor

2018-10-12 Thread Martin Frank Hansen (MHQ)
Hi Kamuela,

Thanks for your answer.

I still get the same error, so I think I will try with the tech-products 
example to see if it works there, as Alexandre suggests in the mail above.

Martin Frank Hansen,

-----Original Message-----
From: Kamuela Lau 
Sent: 12 October 2018 11:38
To: solr-user@lucene.apache.org
Subject: Re: DIH for TikaEntityProcessor

Hi,

I was unable to reproduce the error that you got with the information provided.
Below are the data-config.xml and managed-schema fields I used; the data-config 
is mostly the same (I think that BinFileDataSource doesn't actually require a 
dataSource, so I think it's safe to put dataSource="null"):

[data-config.xml stripped by the archive]

And from the managed schema:

[field definitions stripped by the archive]

When I had field column="text" name="content", the documents were still 
indexed, but the text/content was not (as I had no content field in the schema).
I used the default config, and Solr version 7.5.0; I was able to import the 
data just fine (I also tested with .*DOC). Is there any other information you 
can provide that can help me reproduce this error?




On Fri, Oct 12, 2018 at 4:11 PM Martin Frank Hansen (MHQ) 
wrote:

> Hi again,
>
>
>
> Can anybody help me? Any suggestions as to why I am getting the error below?
>
>
>
>
>
> *Martin Frank Hansen*, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
>
>
> *From:* Martin Frank Hansen (MHQ)
> *Sent:* 10 October 2018 10:15
> *To:* solr-user 
> *Subject:* DIH for TikaEntityProcessor
>
>
>
> Hi,
>
>
>
> I am trying to read documents from a file system into Solr, using
> dataimporthandler but keep getting the following errors:
>
>
>
> Exception while processing: files document :
> null:org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.ClassCastException: java.io.InputStreamReader cannot be cast
> to java.io.InputStream
>
>  at
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndT
> hrow(DataImportHandlerException.java:61)
>
>  at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Enti
> tyProcessorWrapper.java:270)
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder
> .java:476)
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder
> .java:517)
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder
> .java:415)
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.ja
> va:330)
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:
> 233)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpor
> ter.java:424)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.ja
> va:483)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Data
> Importer.java:466)
>
>  at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.ClassCastException: java.io.InputStreamReader
> cannot be cast to java.io.InputStream
>
>  at
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEnt
> ityProcessor.java:132)
>
>  at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Enti
> tyProcessorWrapper.java:267)
>
>  ... 9 more
>
>
>
>
>
>
>
>
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.ClassCastException: java.io.InputStreamReader cannot be cast
> to java.io.InputStream
>
>  at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:
> 271)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpor
> ter.java:424)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.ja
> va:483)
>
>  at
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Data
> Importer.java:466)
>
>  at java.lang.Thread.run(Thread.java:748)
>
> Caused by: java.lang.RuntimeException:
> org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.ClassCastException: java.io.InputStreamReader cannot be cast
> to java.io.InputStream
>
>  at
> org.apache.solr.handler.dataimpor

SV: DIH for TikaEntityProcessor

2018-10-12 Thread Martin Frank Hansen (MHQ)
Hi again,

Can anybody help me? Any suggestions as to why I am getting the error below?


Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics


Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk<mailto:m...@kmd.dk>  Web www.kmd.dk<http://www.kmd.dk/>
Mobil +4525571418

From: Martin Frank Hansen (MHQ)
Sent: 10 October 2018 10:15
To: solr-user 
Subject: DIH for TikaEntityProcessor

Hi,

I am trying to read documents from a file system into Solr, using 
dataimporthandler but keep getting the following errors:


Exception while processing: files document : 
null:org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream

 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)

 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)

 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)

 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)

 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)

 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)

 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)

 at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be 
cast to java.io.InputStream

 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)

 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)

 ... 9 more




Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
 ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
 ... 6 more
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be 
cast to java.io.InputStream
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
 ... 9 more


My data-config file looks as follows:

[data-config.xml stripped by the archive]

And in the Schema I basically have two fields:

[field definitions stripped by the archive]

Any help is appreciated.


Martin Frank Hansen



DIH for TikaEntityProcessor

2018-10-10 Thread Martin Frank Hansen (MHQ)
Hi,

I am trying to read documents from a file system into Solr, using 
dataimporthandler but keep getting the following errors:


Exception while processing: files document : 
null:org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream

 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)

 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)

 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)

 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)

 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)

 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)

 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)

 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)

 at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be 
cast to java.io.InputStream

 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)

 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)

 ... 9 more




Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
 ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to 
java.io.InputStream
 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
 ... 6 more
Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be 
cast to java.io.InputStream
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
 ... 9 more


My data-config file looks as follows:

[data-config.xml stripped by the archive]

And in the Schema I basically have two fields:

[field definitions stripped by the archive]

Any help is appreciated.


Martin Frank Hansen



SV: DIH for different levels of XML

2018-10-07 Thread Martin Frank Hansen (MHQ)
Hi Alex,

Thanks for your answer.

I think I made it work. The problem was actually in the schema.xml, where the 
field "Journalnummer" should have multiValued="true".
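The fix described above, sketched as schema.xml field definitions (the field types are illustrative; only multiValued="true" on Journalnummer is the point):

```xml
<!-- schema.xml: each Case carries several Journalnummer values,
     so the field must be multiValued -->
<field name="Id" type="string" indexed="true" stored="true" required="true"/>
<field name="Journalnummer" type="string" indexed="true" stored="true"
       multiValued="true"/>
```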


Martin Frank Hansen



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-----Original Message-----
From: Alexandre Rafalovitch 
Sent: 7 October 2018 20:18
To: solr-user 
Subject: Re: DIH for different levels of XML
Emne: Re: DIH for different levels of XML

If your ID field comes from one XML level and your record details from another, 
they are processed as two separate records. Have a look at the atom example that 
ships with the DIH examples. Specifically, the commonField parameter may be 
useful for you:
https://lucene.apache.org/solr/guide/7_4/uploading-structured-data-store-data-with-the-data-import-handler.html

Regards,
   Alex.
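Sketched against the test.xml in the quoted message below (untested; commonField copies the Id captured on the /Export/Case row onto the following /Export/Case/item rows):

```xml
<entity name="xml" processor="XPathEntityProcessor"
        stream="true"
        url="C:/Users/z6mhq/Desktop/data_import/test.xml"
        forEach="/Export/Case | /Export/Case/item">
  <!-- commonField="true" carries Id over into the item rows -->
  <field column="Id" xpath="/Export/Case/Id" commonField="true"/>
  <field column="Journalnummer" xpath="/Export/Case/item/Journalnummer"/>
</entity>
```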
On Sun, 7 Oct 2018 at 13:23, Martin Frank Hansen (MHQ)  wrote:
>
> Hi,
>
> I am having some difficulties adding data from different levels of an XML
> document.
>
> The xml can be as simple as this:
>
> 
>   
> 2165432
> 
>   5
>   10
> 
>   
> 
>
> The data-config-file looks like this.
> 
>   
> 
>name="xml"
> pk="Id"
> stream="true"
> processor="XPathEntityProcessor"
> url="C:/Users/z6mhq/Desktop/data_import/test.xml"
> forEach="/Export/Case/ | /Export/Case/item/"
> transformer="DateFormatTransformer" >
>
> 
>  xpath="/Export/Case/item/Journalnummer" />
>
>   
>   
> 
>
> The result is the following:
> {
>   "responseHeader":{
> "status":0,
> "QTime":0,
> "params":{
>   "q":"*:*",
>   "_":"1538931455588"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "Id":"2165432",
> "_version_":1613686828885344256}]
>   }}
>
> While expecting something like this:
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":0,
> "params":{
>   "q":"*:*",
>   "_":"1538931455588"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "Id":"2165432",
> "Journalnummer":[5,10]}]
>   }}
>
>
> I have tried a lot of things to import the data correctly but to no avail, I 
> really hope that someone can help me.
>
> Thanks in advance, any help is much appreciated.
>
> Martin Hansen
>
>


DIH for different levels of XML

2018-10-07 Thread Martin Frank Hansen (MHQ)
Hi,

I am having some difficulties adding data from different levels of an XML 
document.

The xml can be as simple as this:


  
<Export>
  <Case>
    <Id>2165432</Id>
    <item>
      <Journalnummer>5</Journalnummer>
      <Journalnummer>10</Journalnummer>
    </item>
  </Case>
</Export>

  


The data-config-file looks like this.

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="xml"
            pk="Id"
            stream="true"
            processor="XPathEntityProcessor"
            url="C:/Users/z6mhq/Desktop/data_import/test.xml"
            forEach="/Export/Case/ | /Export/Case/item/"
            transformer="DateFormatTransformer">
      <field column="Id" xpath="/Export/Case/Id" />
      <field column="Journalnummer" xpath="/Export/Case/item/Journalnummer" />
    </entity>
  </document>
</dataConfig>

The result is the following:
{
  "responseHeader":{
"status":0,
"QTime":0,
"params":{
  "q":"*:*",
  "_":"1538931455588"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"Id":"2165432",
"_version_":1613686828885344256}]
  }}

While expecting something like this:

{
  "responseHeader":{
"status":0,
"QTime":0,
"params":{
  "q":"*:*",
  "_":"1538931455588"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"Id":"2165432",
"Journalnummer":[5,10]}]
  }}


I have tried a lot of things to import the data correctly but to no avail, I 
really hope that someone can help me.

Thanks in advance, any help is much appreciated.

Martin Hansen




SV: data-import-handler for solr-7.5.0

2018-10-02 Thread Martin Frank Hansen (MHQ)
I made it work with the simplest of XML files, with some inspiration from 
https://opensolr.com/blog/2011/09/how-to-import-data-from-xml-files-into-your-solr-collection
 .

Data-config is now:

[data-config.xml stripped by the archive]

And the document is simply:

[sample XML stripped by the archive; surviving values: 2165432 / 5 and 
28548113 / 89]


Now I guess I just have to add to this solution.

Thanks for your help Alex, and also thanks to Jan answering the first mail.

Best regards
Martin Frank Hansen

-----Original Message-----
From: Alexandre Rafalovitch 
Sent: 2 October 2018 19:52
To: solr-user 
Subject: Re: data-import-handler for solr-7.5.0
Emne: Re: data-import-handler for solr-7.5.0

Ok, so then you can switch to debug mode and keep trying to figure it out. Also 
try BinFileDataSource or URLDataSource, maybe it will have an easier way.

Or using relative path (example:
https://github.com/arafalov/solr-apachecon2018-presentation/blob/master/configsets/pets-final/pets-data-config.xml).

Regards,
   Alex.
On Tue, 2 Oct 2018 at 12:46, Martin Frank Hansen (MHQ)  wrote:
>
> Thanks for the info, the UI looks interesting... It does read the data-config 
> correctly, so the problem is probably in this file.
>
> Martin Frank Hansen, Senior Data Analytiker
>
> Data, IM & Analytics
>
>
>
> Lautrupparken 40-42, DK-2750 Ballerup
> E-mail m...@kmd.dk  Web www.kmd.dk
> Mobil +4525571418
>
> -----Original Message-----
> From: Alexandre Rafalovitch 
> Sent: 2 October 2018 18:18
> To: solr-user 
> Subject: Re: data-import-handler for solr-7.5.0
>
> Admin UI for DIH will show you the config file read. So, if nothing is
> there, the path is most likely the issue
>
> You can also provide or update the configuration right in UI if you enable 
> debug.
>
> Finally, the config file is reread on every invocation, so you don't need to 
> restart the core after changing it.
>
> Hope this helps,
>Alex.
> On Tue, 2 Oct 2018 at 11:45, Jan Høydahl  wrote:
> >
> > > url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> >
> > Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > On 2 Oct 2018, at 17:15, Martin Frank Hansen (MHQ) wrote:
> > >
> > > Hi,
> > >
> > > I am having some problems getting the data-import-handler in Solr to 
> > > work. I have tried a lot of things but I simply get no response from 
> > > Solr, not even an error.
> > >
> > > When calling the API:
> > > http://localhost:8983/solr/nh/dataimport?command=full-import
> > > {
> > >  "responseHeader":{
> > >"status":0,
> > >"QTime":38},
> > >  "initArgs":[
> > >"defaults",[
> > >
> > > "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
> > >  "command":"full-import",
> > >  "status":"idle",
> > >  "importResponse":"",
> > >  "statusMessages":{}}
> > >
> > > The data looks like this:
> > >
> > > [sample XML stripped by the archive; surviving values: 2165432 / 5 and
> > > 28548113 / 89]
> > >
> > >
> > > The data-config file looks like this:
> > >
> > > 
> > >  
> > >
> > >   > >name="xml"
> > >pk="id"
> > >processor="XPathEntityProcessor"
> > >stream="true"
> > >forEach="/journal/doc"
> > >url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
> > >transformer="RegexTransformer,TemplateTransformer"
> > >>
> > >    
> > >
> > >
> > >  
> > >  
> > > 
> > >
> > > And I referenced the jar files in the solr-config.xml as well as adding 
> > > the request-handler by adding the following lines:
> > >
> > >  > > regex="solr-dataimporthandler-\d.*\.jar" />  > > dir="${solr.install.dir:../../../..}/dist/"
> > > regex="solr-dataimporthandler-extras-\d.*\.jar" />
> > >
> > >
> > >  > > class="org.apache.solr.handler.dataimport.DataImportHandler">
> > >
> > >   > > name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml
> > >

RE: data-import-handler for solr-7.5.0

2018-10-02 Thread Martin Frank Hansen (MHQ)
Thanks for the info; the UI looks interesting. It does read the data-config 
correctly, so the problem is probably in that file.

Martin Frank Hansen, Senior Data Analytiker

Data, IM & Analytics



Lautrupparken 40-42, DK-2750 Ballerup
E-mail m...@kmd.dk  Web www.kmd.dk
Mobil +4525571418

-----Original Message-----
From: Alexandre Rafalovitch 
Sent: 2 October 2018 18:18
To: solr-user 
Subject: Re: data-import-handler for solr-7.5.0

The Admin UI for DIH will show you the config file as it was read. So if 
nothing is there, the path is most likely the issue.

You can also provide or update the configuration right in the UI if you enable 
debug.

Finally, the config file is reread on every invocation, so you don't need to 
restart the core after changing it.

Hope this helps,
   Alex.
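The commands discussed in this thread (full-import, checking status, reloading 
the config) are all plain HTTP calls against the DIH handler. A minimal sketch 
of building those URLs, assuming the core name "nh" and the host from the URLs 
quoted in this thread:

```python
from urllib.parse import urlencode

# Assumed base URL: core "nh" on localhost:8983, as in this thread.
BASE = "http://localhost:8983/solr/nh/dataimport"

def dih_url(command, **params):
    """Build a DataImportHandler request URL for commands such as
    full-import, status, reload-config or show-config."""
    query = {"command": command, **params}
    return BASE + "?" + urlencode(query)

print(dih_url("full-import", clean="false", commit="true"))
# http://localhost:8983/solr/nh/dataimport?command=full-import&clean=false&commit=true
```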


RE: data-import-handler for solr-7.5.0

2018-10-02 Thread Martin Frank Hansen (MHQ)
Unfortunately, still no luck.

{
  "responseHeader":{
"status":0,
"QTime":8},
  "initArgs":[
"defaults",[
  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
  "command":"full-import",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{
"Total Requests made to DataSource":"0",
"Total Rows Fetched":"0",
"Total Documents Processed":"0",
"Total Documents Skipped":"0",
"Full Dump Started":"2018-10-02 16:15:21",
"":"Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.",
"Committed":"2018-10-02 16:15:22",
"Time taken":"0:0:0.136"}}

Seems like it is not even trying to read the data.

Martin Frank Hansen
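The status JSON above reports "Total Rows Fetched":"0", which is the telltale 
sign that the entity produced nothing. A hedged sketch (plain Python, field 
names taken from the status output above) of checking a DIH status response for 
this zero-rows condition:

```python
import json

# Response body shaped like the zero-row status output quoted above.
response = """{
  "status": "idle",
  "statusMessages": {
    "Total Requests made to DataSource": "0",
    "Total Rows Fetched": "0",
    "Total Documents Processed": "0",
    "Total Documents Skipped": "0"
  }
}"""

msgs = json.loads(response)["statusMessages"]
rows_fetched = int(msgs["Total Rows Fetched"])

# Zero rows fetched means the entity produced nothing: typically the url
# could not be read, or the forEach XPath matched no nodes.
if rows_fetched == 0:
    print("no rows fetched - check the entity url and forEach XPath")
```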

-----Original Message-----
From: Jan Høydahl 
Sent: 2 October 2018 17:46
To: solr-user@lucene.apache.org
Subject: Re: data-import-handler for solr-7.5.0

> url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"

Have you tried url="C:\\Users\\z6mhq/Desktop\\data_import\\nh_test.xml" ?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com




data-import-handler for solr-7.5.0

2018-10-02 Thread Martin Frank Hansen (MHQ)
Hi,

I am having some problems getting the data-import-handler in Solr to work. I 
have tried a lot of things but I simply get no response from Solr, not even an 
error.

When calling the API: 
http://localhost:8983/solr/nh/dataimport?command=full-import
{
  "responseHeader":{
"status":0,
"QTime":38},
  "initArgs":[
"defaults",[
  "config","C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml"]],
  "command":"full-import",
  "status":"idle",
  "importResponse":"",
  "statusMessages":{}}

The data looks like this (the archive stripped the XML tags; the structure 
below follows forEach="/journal/doc", and the element names inside each doc 
were lost):

<journal>
  <doc>
    2165432
    5
  </doc>
  <doc>
    28548113
    89
  </doc>
</journal>


The data-config file looks like this (partially reconstructed; the dataSource 
attributes and the field definitions were lost in the archive):

<dataConfig>
  <dataSource ... />
  <document>
    <entity name="xml"
            pk="id"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/journal/doc"
            url="C:/Users/z6mhq/Desktop/data_import/nh_test.xml"
            transformer="RegexTransformer,TemplateTransformer">
      <field ... />
    </entity>
  </document>
</dataConfig>
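Since the import elsewhere in this thread reports "Total Rows Fetched":"0", one 
quick sanity check is whether the forEach XPath actually matches the source 
document. A sketch using hypothetical element names (the real ones were lost in 
the archive):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real element names inside each <doc> are unknown.
sample = """<journal>
  <doc><id>2165432</id><value>5</value></doc>
  <doc><id>28548113</id><value>89</value></doc>
</journal>"""

root = ET.fromstring(sample)
# Mirrors forEach="/journal/doc": each matching node becomes one DIH row.
rows = root.findall("doc")
print(len(rows))  # 2; a count of 0 would match "Total Rows Fetched: 0"
```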

And I referenced the jar files in solrconfig.xml as well as adding the 
request-handler by adding the following lines (reconstructed; the archive 
stripped the XML tags):

<lib dir="${solr.install.dir:../../../..}/dist/"
     regex="solr-dataimporthandler-\d.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/"
     regex="solr-dataimporthandler-extras-\d.*\.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">C:/Users/z6mhq/Desktop/nh/nh/conf/data-config.xml</str>
  </lst>
</requestHandler>

I am running a core residing in the folder “C:/Users/z6mhq/Desktop/nh/nh/conf” 
while the Solr installation is in “C:/Users/z6mhq/Documents/solr-7.5.0”.

I really hope that someone can spot my mistake…

Thanks in advance.

Martin Frank Hansen



Protection of your personal data is important to us. Here you can read KMD’s 
Privacy Policy<http://www.kmd.net/Privacy-Policy> outlining how we process your 
personal data.


Please note that this message may contain confidential information. If you have 
received this message by mistake, please inform the sender of the mistake by 
sending a reply, then delete the message from your system without making, 
distributing or retaining any copies of it. Although we believe that the 
message and any attachments are free from viruses and other errors that might 
affect the computer or it-system where it is received and read, the recipient 
opens the message at his or her own risk. We assume no responsibility for any 
loss or damage arising from the receipt or use of this message.


Re: Decompound German Words

2012-05-06 Thread Martin Frank
Dear Satish,

did you find a decompounding dictionary for German?

Best Regards
Martin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Decompound-German-Words-tp3708194p3966013.html
Sent from the Solr - User mailing list archive at Nabble.com.