RE: Odp.: solr issue with pdf forms

Allison, Timothy B. Wed, 29 Apr 2015 05:17:24 -0700

I completely agree with Erick about the utility of the TermsComponent to see 
what is actually being indexed.  If you find problems there and if you haven't 
done so already, you might also investigate further down the stack.  It might 
make sense to run the tika-app.jar (whichever version you are using in DIH or 
other mechanism?) or even the pdfbox-app.jar (ExtractText option) on your files 
outside of Solr to see what text/noise you're getting for the files that are 
causing problems.
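
A minimal sketch of that check in Python, assuming java is on the PATH and a tika-app.jar sits in the working directory (the jar path and the helper names are placeholders, not anything from this thread). It runs tika-app in plain-text mode and counts characters that are neither ordinary text nor a plain ASCII space, which is the kind of noise that tends to show up as a question mark in a diamond. The --encoding option is assumed to be available in the Tika version you use.

```python
import subprocess
import unicodedata
from collections import Counter


def extract_text(pdf_path, tika_jar="tika-app.jar"):
    """Run tika-app on one file and return the extracted plain text.

    tika_jar is a placeholder path; point it at the same Tika version
    that your Solr installation uses.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--encoding=UTF-8", "--text", pdf_path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def suspicious_chars(text):
    """Count characters that tend to render as noise instead of a plain space."""
    counts = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        is_replacement = ch == "\ufffd"
        # Control, format, private-use and unassigned characters, except
        # ordinary line breaks and tabs.
        is_odd_control = category in ("Cc", "Cf", "Co", "Cn") and ch not in "\n\r\t"
        # Space separators other than the plain ASCII space (e.g. U+00A0).
        is_odd_space = category == "Zs" and ch != " "
        if is_replacement or is_odd_control or is_odd_space:
            label = "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            counts[label] += 1
    return counts
```

Calling suspicious_chars(extract_text("form.pdf")) on one good and one problematic file (form.pdf is a made-up name) should make it fairly obvious whether the odd characters are already present before Solr ever sees the text.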




-----Original Message-----
From: Erick Erickson [mailto:[email protected]] 
Sent: Tuesday, April 28, 2015 9:07 PM
To: [email protected]
Subject: Re: Odp.: solr issue with pdf forms

There better be.

1> go to the admin UI
2> select a core
3> select "schema browser"
4> select a field from the drop-down

Until you do step 4 the window will be pretty blank.

Here's the info for the TermsComponent. What have you tried?

https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
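
As a rough sketch of what such a request could look like from Python: the code below builds a TermsComponent URL and pairs up the returned terms with their counts. It assumes the /terms request handler is registered as in the stock example solrconfig.xml, that the core is called collection1 (a made-up name; on a single-core setup the core segment may not be needed at all), and that the JSON response uses the flat [term, count, ...] layout. The helper names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def terms_url(base_url, core, field, prefix="", limit=20):
    """Build a TermsComponent request URL for one field."""
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.limit": limit,
        "wt": "json",
    }
    if prefix:
        # Only return indexed terms that start with this prefix.
        params["terms.prefix"] = prefix
    return "%s/%s/terms?%s" % (base_url.rstrip("/"), core, urlencode(params))


def pair_terms(flat):
    """Turn a flat [term, count, term, count, ...] list into tuples."""
    return list(zip(flat[0::2], flat[1::2]))


def fetch_terms(url, field):
    """Fetch the URL and return (term, count) pairs for the field."""
    with urlopen(url) as response:
        body = json.loads(response.read().decode("utf-8"))
    return pair_terms(body["terms"][field])
```

For the case in this thread, something like fetch_terms(terms_url("http://localhost:8983/solr", "collection1", "content", prefix="mein"), "content") would list the indexed terms in the content field that start with "mein".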

Best,
Erick

On Tue, Apr 28, 2015 at 1:04 PM,  <[email protected]> wrote:
> Thanks a lot for being patient with me. Unfortunately there is no button 
> "load term info". :-(
> Could you maybe help me use the TermsComponent instead? I read that it is 
> configured by default.
>
> Thanks a lot
> Best
> Steve
>
> -----Original Message-----
> From: Erick Erickson [mailto:[email protected]]
> Sent: Monday, April 27, 2015 17:23
> To: [email protected]
> Subject: Re: Odp.: solr issue with pdf forms
>
> We're still not quite there. There should be a "load term info" button on 
> that page. Clicking that button will show you the terms in your index (as 
> opposed to the raw stored input which is what you get when you look at 
> results in the browser). My bet is that you'll see perfectly normal tokens in 
> the index that will NOT have the wonky characters you see in the display.
>
> If that's the case, then you have a browser issue, Solr is working perfectly 
> fine. On the other hand, if the individual terms are weird, then you have 
> something more fundamental going on.
>
> Which is why I mentioned the TermsComponent. That will return indexed tokens, 
> and allows you a bit more flexibility than the admin page in terms of what 
> tokens you see, but it's essentially the same information.
>
> Best,
> Erick
>
> On Sun, Apr 26, 2015 at 11:18 PM,  <[email protected]> wrote:
>> Erick,
>>
>> thanks a lot for helping me here. In my case it is the "content" field which 
>> is not displayed correctly. So I went to the schema browser as you pointed 
>> out. Here is the information I found:
>> Field: content
>> Field Type: text
>> Properties:  Indexed, Tokenized, Stored, TermVector Stored
>> Schema:  Indexed, Tokenized, Stored, TermVector Stored
>> Index:  Indexed, Tokenized, Stored, TermVector Stored
>> Copied Into:  spell teaser
>> Position Increment Gap:  100
>>
>> Index Analyzer:  org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>>   catenateWords: 1 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>>   catenateAll: 0 catenateNumbers: 1 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SynonymFilterFactory
>>   args:{synonyms: german/synonyms.txt expand: true ignoreCase: true
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory
>>   args:{maxSubwordSize: 15 onlyLongestMatch: false minSubwordSize: 4
>>   minWordSize: 5 dictionary: german/german-common-nouns.txt
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true
>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>>
>> Query Analyzer:  org.apache.solr.analysis.TokenizerChain Details
>> Tokenizer Class:  org.apache.solr.analysis.WhitespaceTokenizerFactory
>> Filters:
>> org.apache.solr.analysis.WordDelimiterFilterFactory
>>   args:{preserveOriginal: 1 splitOnCaseChange: 0 generateNumberParts: 1
>>   catenateWords: 0 luceneMatchVersion: LUCENE_36 generateWordParts: 1
>>   catenateAll: 0 catenateNumbers: 0 }
>> org.apache.solr.analysis.LowerCaseFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.StopFilterFactory
>>   args:{words: german/stopwords.txt ignoreCase: true
>>   enablePositionIncrements: true luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.GermanNormalizationFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.SnowballPorterFilterFactory
>>   args:{protected: german/protwords.txt language: German2
>>   luceneMatchVersion: LUCENE_36 }
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory
>>   args:{luceneMatchVersion: LUCENE_36 }
>>
>> Distinct:  160403
>>
>> Does this somehow help to figure out the issue?
>> Thanks
>> Best
>> Steve
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:[email protected]]
>> Sent: Friday, April 24, 2015 20:15
>> To: [email protected]
>> Subject: Re: Odp.: solr issue with pdf forms
>>
>> Steve:
>>
>> Right, it's not exactly obvious. Bring up the admin UI, something like 
>> http://localhost:8983/solr. From there you have to select a core in the 
>> 'core selector' drop-down on the left side. If you're using SolrCloud, this 
>> will have a rather strange name, but it should be easy to identify what 
>> collection it belongs to.
>>
>> At that point you'll see a bunch of new options, among them "schema 
>> browser". From there, select your field from the drop-down that will appear, 
>> then a button should pop up "load term info".
>>
>> NOTE: you can get the same information from the TermsComponent, see:
>> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component.
>> This is a little more flexible because you can, among other things, specify 
>> the place to start. In your case you might specify terms.prefix=mein which 
>> will show you the terms that are actually being _searched_ as opposed to 
>> being stored. This latter is what you see in the browser when you search for 
>> docs and is sometimes misleading as you're (probably) seeing.
>>
>> Best,
>> Erick
>>
>> On Fri, Apr 24, 2015 at 1:58 AM,  <[email protected]> wrote:
>>> Hey Erick,
>>>
>>> thanks a lot for your answer. I went to the admin schema browser, but
>>> what should I see there? Sorry, I'm not familiar with the admin schema
>>> browser. :-(
>>>
>>> Best
>>> Steve
>>>
>>>
>>> -----Original Message-----
>>> From: Erick Erickson [mailto:[email protected]]
>>> Sent: Thursday, April 23, 2015 18:00
>>> To: [email protected]
>>> Subject: Re: Odp.: solr issue with pdf forms
>>>
>>> When you say "they're not indexed correctly", what's your evidence? You 
>>> cannot rely on the display in the browser; that's the raw input just as it 
>>> was sent to Solr, _not_ the actual tokens in the index. What do you see when 
>>> you go to the admin schema browser page and load the actual tokens?
>>>
>>> Or use the TermsComponent
>>> (https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
>>> ) to see the actual terms in the index as opposed to the stored data
>>> you see in the browser when you look at search results.
>>>
>>> If the actual terms don't seem right _in the index_ we need to see your 
>>> analysis chain, i.e. your fieldType definition.
>>>
>>> I'm 90% sure you're seeing the stored data and your terms are indexed just 
>>> fine, but I've certainly been wrong before, more times than I want to 
>>> remember.....
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Apr 23, 2015 at 1:18 AM,  <[email protected]> wrote:
>>>> Hey Erick,
>>>>
>>>> thanks for your answer. They are not indexed correctly. Also, through the 
>>>> Solr admin interface I see those typical question marks within a rhombus 
>>>> where a blank space should be.
>>>> I now figured out the following (not sure if it is relevant at all):
>>>> - PDF documents created with "Acrobat PDFMaker 10.0 for Word" are
>>>> indexed correctly, no issues
>>>> - PDF documents (with editable form fields) created with "Adobe
>>>> InDesign CS5 (7.0.1)"  are indexed with the blank space issue
>>>>
>>>> Best
>>>> Steve
>>>>
>>>> -----Original Message-----
>>>> From: Erick Erickson [mailto:[email protected]]
>>>> Sent: Wednesday, April 22, 2015 17:11
>>>> To: [email protected]
>>>> Subject: Re: Odp.: solr issue with pdf forms
>>>>
>>>> Are they not _indexed_ correctly or not being displayed correctly?
>>>> Take a look at admin UI>>schema browser>> your field and press the "load 
>>>> terms" button. That'll show you what is _in_ the index as opposed to what 
>>>> the raw data looked like.
>>>>
>>>> When you return the field in a Solr search, you get a verbatim, 
>>>> un-analyzed copy of your original input. My guess is that your browser 
>>>> isn't using a compatible character encoding for display.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Apr 22, 2015 at 7:08 AM,  <[email protected]> wrote:
>>>>> Thanks for your answer. Maybe my English is not good enough, what are you 
>>>>> trying to say? Sorry I didn't get the point.
>>>>> :-(
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: LAFK [mailto:[email protected]]
>>>>> Sent: Wednesday, April 22, 2015 14:01
>>>>> To: [email protected]; [email protected]
>>>>> Subject: Odp.: solr issue with pdf forms
>>>>>
>>>>> Off the top of my head, I'd look into how writable PDFs are created and encoded.
>>>>>
>>>>> @LAFK_PL
>>>>>   Original Message
>>>>> From: [email protected]
>>>>> Sent: Wednesday, April 22, 2015 12:41
>>>>> To: [email protected]
>>>>> Reply-To: [email protected]
>>>>> Subject: solr issue with pdf forms
>>>>>
>>>>> Hi guys,
>>>>>
>>>>> hopefully you can help me with my issue. We are using a solr setup and 
>>>>> have the following issue:
>>>>> - usual pdf files are indexed just fine
>>>>> - pdf files with writable form-fields look like this:
>>>>> Ich bestätige mit meiner Unterschrift, dass alle Angaben korrekt
>>>>> und v ollständig sind
>>>>>
>>>>> Somehow the blank space character is not indexed correctly.
>>>>>
>>>>> Is this a known issue? Does anybody have an idea?
>>>>>
>>>>> Thanks a lot
>>>>> Best
>>>>> Steve
