Re: ngramfilter minGramSize problem
it works well. now why does the search only find something when the fieldname is added to the query with stopwords? "cug" -> 9 hits, "mit cug" -> 0 hits, "plain_text:mit cug" -> 9 hits. why is this so? could it be a problem that stopwords aren't removed from the query because not all fields that are searched have the stopwordfilter?

On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI wrote:

Correction: My patch is at SOLR-5152

On 7 Apr 2014 01:05, "Andreas Owen" wrote:

i thought i could use a filter with max="2" to index and search words that are only 1 or 2 chars long. it seems to work but i have to test it some more

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen wrote:

i have a fieldtype that uses the ngramfilter while indexing. is there a setting that can force the ngramfilter to index shorter words than the minGramSize? mine is set to 3 and the search won't find words that are only 1 or 2 chars long. i would like to not set minGramSize=1 because the results would be too diverse.

fieldtype:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German"/>

-- Using Opera's mail client: http://www.opera.com/mail/
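The likely cause: with edismax every query term has to match in at least one field, and "mit" is only stripped in the fields that have the stopword filter, so a bare "mit cug" is also run against fields that never had stopwords removed, while the stopped fields can never supply a match for "mit". The usual fix is to give every field in qf the same stop filter in its query analyzer. A minimal sketch (the fieldType name is illustrative, the stopword settings are the ones from this thread):

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball"/>
  </analyzer>
</fieldType>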
Re: ngramfilter minGramSize problem
i thought i could use a filter with max="2" to index and search words that are only 1 or 2 chars long. it seems to work but i have to test it some more

On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen wrote:

i have a fieldtype that uses the ngramfilter while indexing. is there a setting that can force the ngramfilter to index shorter words than the minGramSize? mine is set to 3 and the search won't find words that are only 1 or 2 chars long. i would like to not set minGramSize=1 because the results would be too diverse.

fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>

-- Using Opera's mail client: http://www.opera.com/mail/
ngramfilter minGramSize problem
i have a fieldtype that uses the ngramfilter while indexing. is there a setting that can force the ngramfilter to index shorter words than the minGramSize? mine is set to 3 and the search won't find words that are only 1 or 2 chars long. i would like to not set minGramSize=1 because the results would be too diverse.

fieldtype:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldType>
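There is no setting on NGramFilterFactory itself to keep sub-minimum tokens, but a second field that only keeps the 1-2 char words can cover them. A sketch with hypothetical names (text_short, plain_text_short), using LengthFilterFactory, with the new field added to the queried fields:

<fieldType name="text_short" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="2"/>
  </analyzer>
</fieldType>
<field name="plain_text_short" type="text_short" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_short"/>

This matches the min="1"/max="2" filter the follow-up mail reports as working.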
Re: dih data-config.xml onImportEnd event
sorry, the previous conversation was started with a wrong email address.

On Thu, 27 Mar 2014 14:06:57 +0100, Stefan Matheis wrote:

I would suggest you read the replies to your last mail (containing the very same question) first? -Stefan

On Thursday, March 27, 2014 at 1:56 PM, Andreas Owen wrote:

i would like to call a url after the import is finished with the onImportEnd event. how can i do this?

-- Using Opera's mail client: http://www.opera.com/mail/
dih data-config.xml onImportEnd event
i would like to call a url after the import is finished with the onImportEnd event. how can i do this?
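DIH lets you register an event listener on the <document> element of data-config.xml; onImportEnd takes a class implementing the DIH EventListener interface. A sketch, with the class name and URL as placeholders:

<document onImportEnd="com.example.PingUrlListener">
  <!-- entities -->
</document>

package com.example;

import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EventListener;

// Called by DIH once the import has finished.
public class PingUrlListener implements EventListener {
  @Override
  public void onEvent(Context ctx) {
    try {
      // placeholder URL to notify
      HttpURLConnection con = (HttpURLConnection) new URL("http://localhost/import-done").openConnection();
      con.getResponseCode(); // fire the request
      con.disconnect();
    } catch (Exception e) {
      // a failed ping should not abort the import
    }
  }
}

The jar has to be on Solr's classpath, e.g. via a lib directive in solrconfig.xml.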
facet doesn't display all possibilities after selecting one
when i select a facet in "thema_f" all the others in the group disappear but the other facet groups keep their original findings. it seems like it should work. maybe the underscore is the wrong char for the separator?

example documents in index:
1_Produkte dms:381
1_Beratung 1_Beratung_Beratungsportal PK dms:2679
1_Beratung 1_Beratung_Beratungsportal PK dms:190

solrconfig.xml: explicit 10 synonym_edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 productsegment^5 productgroup^5 contentmanager^5 links^5 last_modified^5 url^5 (expiration:[NOW TO *] OR (*:* -expiration:*))^6 div(clicks,max(displays,1))^8 text *,path,score json AND on plain_text,title 200 on 1 false {!ex=inhaltstyp_s}inhaltstyp_s index {!ex=doctype}doctype index {!ex=thema_f}thema_f index {!ex=productsegment_f}productsegment_f index {!ex=productgroup_f}productgroup_f index {!ex=author_s}author_s index name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s index {!ex=veranstaltung_s}veranstaltung_s index name="facet.field">{!ex=kundensegment_aktive_beratung}kundensegment_aktive_beratung index {!ex=last_modified}last_modified +1MONTH NOW/MONTH+1MONTH NOW/MONTH-36MONTHS after

schema.xml: positionIncrementGap="100">
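The {!ex=thema_f} exclusion on the facet.field only has an effect if the fq applied when a value is selected carries a matching tag; otherwise there is nothing to exclude and the selected group collapses to the chosen value. A sketch of the request parameters for multi-select faceting (the value is illustrative):

fq={!tag=thema_f}thema_f:1_Beratung
facet.field={!ex=thema_f}thema_f

The underscore in the values is most likely not the problem; the tag on the fq is what links the filter to the exclusion.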
dih data-config.xml onImportEnd event
i would like to call a url after the import is finished with the onImportEnd event. how can i do this?
wrong results with wdf & ngtf
Is there a way to tell ngramfilterfactory while indexing that numbers shall never be tokenized? then the query should be able to find numbers. Or do i have to change the ngram-min for numbers (not alpha) to 1, if that is possible? So to speak, put the whole number in as one token and not all the possible tokens. Solr analysis shows only WDF has no underscore in its tokens, the rest have it. can i tell the query to search numbers differently with NGTF, WT, LCF or whatever? I also tried @ => ALPHA _ => ALPHA

I have gotten nearly everything to work. There are two queries where i don't get back what i want. "avaloq frage 1" -> only returns if i set minGramSize=1 while indexing. "yh_cug" -> query parser doesn't remove "_" but the indexer does (WDF) so there is no match. Is there a way to also query the whole term "avaloq frage 1" without tokenizing it?

Fieldtype:

Solrconfig: > class="solr.SynonymExpandingExtendedDismaxQParserPlugin"> > > > > standard > > > shingle > true > true > 2 > 4 > > > synonym > solr.KeywordTokenizerFactory > synonyms.txt > true > true > > > > > > > >explicit >10 >synonym_edismax >true >plain_text^10 editorschoice^200 > title^20 h_*^14 > tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 > contentmanager^5 links^5 > last_modified^5 url^5 > >(expiration:[NOW TO *] OR (*:* > -expiration:*))^6 >div(clicks,max(displays,1))^8 > >text >*,path,score >json >AND > > >on >plain_text,title >200 > > > > > on > 1 > {!ex=inhaltstyp_s}inhaltstyp_s > index > {!ex=doctype}doctype > index > {!ex=thema_f}thema_f > index > {!ex=author_s}author_s > index > name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s > index > {!ex=veranstaltung_s}veranstaltung_s > index > {!ex=last_modified}last_modified > +1MONTH > NOW/MONTH+1MONTH > NOW/MONTH-36MONTHS > after > > >
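For matching the whole term, one common approach is a side field that is not ngrammed and is boosted above the ngram field in qf; whole words like "1" then match without lowering minGramSize. A sketch, all names illustrative:

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="plain_text_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_exact"/>

and in the handler's qf: plain_text^10 plain_text_exact^30 ...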
underscore in query error
If I use the underscore in the query I don't get any results. If I remove the underscore it finds the docs with the underscore. Can I tell Solr to search through the NGTF instead of the WDF or is there any better solution? Query: yh_cug. I attached a doc with the analyzer output.
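The later mails in this thread solve this by reclassifying "_" so WDF stops splitting on it, via the types attribute and a mapping file (at-under-alpha.txt). A sketch of that setup:

at-under-alpha.txt:
@ => ALPHA
_ => ALPHA

<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

With "_" mapped to ALPHA, "yh_cug" stays one token on both the index and query side, so the two analyzers agree again.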
search for single char numbers when ngram min is 3
Is there a way to tell ngramfilterfactory while indexing that numbers shall never be tokenized? then the query should be able to find numbers. Or do i have to change the ngram min for numbers to 1, if that is possible? So to speak, put the whole number in as one token and not all the possible tokens. Or can i tell the query to search numbers differently with WT, LCF or whatever? I attached a doc with screenshots from the solr analyzer.

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Thursday, 13 March 2014 13:44
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

I have gotten nearly everything to work. There are two queries where i don't get back what i want. "avaloq frage 1" -> only returns if i set minGramSize=1 while indexing. "yh_cug" -> query parser doesn't remove "_" but the indexer does (WDF) so there is no match. Is there a way to also query the whole term "avaloq frage 1" without tokenizing it? Fieldtype:

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how i can use local parameters in my solrconfig? The params are visible in the debugquery output but solr doesn't parse them. {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

yes that is exactly what happened in the analyzer. the term i searched for was listed on both sides (index & query). here's the rest:

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
>
> Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> I now have the following:
>
> <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
> <filter class="solr.GermanNormalizationFilterFactory"/>
>
> The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results?
>
> Output:
>
> yh_cug
> yh_cug
> (+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
> +(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0
>
> yh_cug
>
> DidntFindAnySynonyms
> No synonyms found for this query. Check your synonyms file.
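If numbers should bypass the ngramming entirely, one option is a copyField that keeps only whole numeric tokens; StandardTokenizer types numbers as <NUM>, and TypeTokenFilterFactory can whitelist that type. A sketch, all names illustrative and untested here:

numeric-types.txt:
<NUM>

<fieldType name="text_numbers" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="numeric-types.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
<field name="plain_text_num" type="text_numbers" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_num"/>

Searched alongside the ngram field, "1" is then a whole indexed token.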
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards
I have gotten nearly everything to work. There are two queries where i don't get back what i want. "avaloq frage 1" -> only returns if i set minGramSize=1 while indexing. "yh_cug" -> query parser doesn't remove "_" but the indexer does (WDF) so there is no match. Is there a way to also query the whole term "avaloq frage 1" without tokenizing it? Fieldtype:

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

Hi Jack, do you know how i can use local parameters in my solrconfig? The params are visible in the debugquery output but solr doesn't parse them. {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

yes that is exactly what happened in the analyzer. the term i searched for was listed on both sides (index & query). here's the rest:

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
>
> Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> I now have the following:
>
> <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
> <filter class="solr.GermanNormalizationFilterFactory"/>
>
> The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results?
>
> Output:
>
> yh_cug
> yh_cug
> (+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
> +(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0
>
> yh_cug
>
> DidntFindAnySynonyms
> No synonyms found for this query. Check your synonyms file.
>
> ExtendedDismaxQParser
>
> (expiration:[NOW TO *] OR (*:* -expiration:*))^6
>
> (expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
>
> div(clicks,max(displays,1))^8
>
> ExtendedDismaxQParser
>
> div(clicks,max(displays,1))^8
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Tuesday, 11 March 2014 14:25
RE: use local param in solrconfig fq for access-control
I have given up this idea and made a wrapper which adds an fq with the user roles to each request

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Tuesday, 11 March 2014 23:32
To: solr-user@lucene.apache.org
Subject: use local param in solrconfig fq for access-control

i would like to use $r and $org for access control. it has to allow the fq's from my facets to work as well. i'm not sure if i'm doing it right or if i should add it to a qf or the q itself. the debugquery returns a parsed fq string and in it $r and $org are printed instead of their values. how do i get them to be interpreted? the local params are listed in the response so they should be valid. {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])
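A sketch of that wrapper approach in SolrJ (class name, URL and values are illustrative); the ACL filter is appended server-side before the request goes out, so clients cannot override it, and it combines freely with the facet fq's:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AclSearchWrapper {
  private final SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

  // orgs and roles come from the logged-in user's session, e.g. "150 42" and "174 72"
  public QueryResponse search(String userQuery, String orgs, String roles) throws Exception {
    SolrQuery q = new SolrQuery(userQuery);
    // docs with no ACL fields, or docs matching at least one org and one role
    q.addFilterQuery("(*:* -organisations:[* TO *] -roles:[* TO *])"
        + " (+organisations:(" + orgs + ") +roles:(" + roles + "))");
    return solr.query(q);
  }
}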
RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards
Hi Jack, do you know how i can use local parameters in my solrconfig? The params are visible in the debugquery output but solr doesn't parse them. {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Wednesday, 12 March 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards

yes that is exactly what happened in the analyzer. the term i searched for was listed on both sides (index & query). here's the rest:

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
>
> Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> I now have the following:
>
> <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
> <filter class="solr.GermanNormalizationFilterFactory"/>
>
> The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results?
>
> Output:
>
> yh_cug
> yh_cug
> (+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
> +(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0
>
> yh_cug
>
> DidntFindAnySynonyms
> No synonyms found for this query. Check your synonyms file.
>
> ExtendedDismaxQParser
>
> (expiration:[NOW TO *] OR (*:* -expiration:*))^6
>
> (expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
>
> div(clicks,max(displays,1))^8
>
> ExtendedDismaxQParser
>
> div(clicks,max(displays,1))^8
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Tuesday, 11 March 2014 14:25
> To: solr-user@lucene.apache.org
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time?
>
> Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the latter removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF.
>
> Which query parser are you using? What fields are being queried?
>
> Please post the parsed query string from the debug output - it will show the precise generated query.
>
> I think what you are seeing is that the ngram filter is generating tokens like "h_cugtest" and then the WDF is removing the underscore and then "h" gets generated as a separate token.
Re[2]: NOT SOLVED searches for single char tokens instead of from 3 upwards
yes that is exactly what happened in the analyzer. the term i searched for was listed on both sides (index & query). here's the rest:

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> You didn't show the new index analyzer - it's tricky to assure that index and query are compatible, but the Admin UI Analysis page can help.
>
> Generally, using pure defaults for WDF is not what you want, especially for query time. Usually there needs to be a slight asymmetry between index and query for WDF - index generates more terms than query.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> I now have the following:
>
> <filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
> <filter class="solr.GermanNormalizationFilterFactory"/>
>
> The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results?
>
> Output:
>
> yh_cug
> yh_cug
> (+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
> +(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0
>
> yh_cug
>
> DidntFindAnySynonyms
> No synonyms found for this query. Check your synonyms file.
>
> ExtendedDismaxQParser
>
> (expiration:[NOW TO *] OR (*:* -expiration:*))^6
>
> (expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
>
> div(clicks,max(displays,1))^8
>
> ExtendedDismaxQParser
>
> div(clicks,max(displays,1))^8
>
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Tuesday, 11 March 2014 14:25
> To: solr-user@lucene.apache.org
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time?
>
> Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the latter removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF.
>
> Which query parser are you using? What fields are being queried?
>
> Please post the parsed query string from the debug output - it will show the precise generated query.
>
> I think what you are seeing is that the ngram filter is generating tokens like "h_cugtest" and then the WDF is removing the underscore and then "h" gets generated as a separate token.
>
> -- Jack Krupansky
>
> -Original Message-
> From: Andreas Owen
> Sent: Tuesday, March 11, 2014 5:09 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
>
> I got it right the first time and here is my requesthandler. The field "plain_text" is searched correctly and has the same fieldtype as "title" -> "text_de"
>
> class="solr.SynonymExpandingExtendedDismaxQParserPlugin"> > > > > standard > > > shingle > true > true > 2 > 4 > > > synonym > solr.KeywordTokenizerFactory > synonyms.txt > true > true > >
RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
I now have the following:

<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>

The gui analysis shows me that wdf doesn't cut the underscore anymore but it still returns 0 results?

Output:

yh_cug yh_cug (+DisjunctionMaxQuery((tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord +(tags:yh_cug^10.0 | links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) (div(int(clicks),max(int(displays),const(1^8.0 yh_cug DidntFindAnySynonyms No synonyms found for this query. Check your synonyms file. ExtendedDismaxQParser (expiration:[NOW TO *] OR (*:* -expiration:*))^6 (expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) -expiration:*))^6.0 div(clicks,max(displays,1))^8 ExtendedDismaxQParser div(clicks,max(displays,1))^8

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Tuesday, 11 March 2014 14:25
To: solr-user@lucene.apache.org
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 upwards

The usual use of an ngram filter is at index time and not at query time. What exactly are you trying to achieve by using ngram filtering at query time as well as index time?

Generally, it is inappropriate to combine the word delimiter filter with the standard tokenizer - the latter removes the punctuation that normally influences how WDF treats the parts of a token. Use the white space tokenizer if you intend to use WDF.

Which query parser are you using? What fields are being queried?

Please post the parsed query string from the debug output - it will show the precise generated query.

I think what you are seeing is that the ngram filter is generating tokens like "h_cugtest" and then the WDF is removing the underscore and then "h" gets generated as a separate token.

-- Jack Krupansky

-Original Message-
From: Andreas Owen
Sent: Tuesday, March 11, 2014 5:09 AM
To: solr-user@lucene.apache.org
Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 upwards

I got it right the first time and here is my requesthandler. The field "plain_text" is searched correctly and has the same fieldtype as "title" -> "text_de"

standard shingle true true 2 4 synonym solr.KeywordTokenizerFactory synonyms.txt true true explicit 10 synonym_edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *]) (expiration:[NOW TO *] OR (*:* -expiration:*))^6 div(clicks,max(displays,1))^8 text *,path,score json AND on plain_text,title 200 <b> </b> on 1 {!ex=inhaltstyp_s}inhaltstyp_s index {!ex=doctype}doctype index {!ex=thema_f}thema_f index {!ex=author_s}author_s index {!ex=sachverstaendiger_s}sachverstaendiger_s index {!ex=veranstaltung_s}veranstaltung_s index {!ex=last_modified}last_modified +1MONTH NOW/MONTH+1MONTH NOW/MONTH-36MONTHS after

i have a field with the following type: shouldn't this make tokens from 3 to 15 in length and not from 1? here is a query report of 2 results:

> 0 name="QTime">125 name="debugQuery">true name="fl">title,roles,organisations,id name="indent">trueyh_cugtest name="_">1394522589347xml name="fq">organisations:* roles:*name="response" numFound="5" start="0"> >.. > > 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: > 0.14759353 = (MATCH) product of: 0.28596246 = (MATCH) sum of: > 0.01528686 = (MATCH) weight(plain
use local param in solrconfig fq for access-control
i would like to use $r and $org for access control. it has to allow the fq's from my facets to work as well. i'm not sure if i'm doing it right or if i should add it to a qf or the q itself. the debugquery returns a parsed fq string and in it $r and $org are printed instead of their values. how do i get them to be interpreted? the local params are listed in the response so they should be valid. {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *])
use local params in query
Shouldn't the numbers be in the output below (parsed_filter_queries) and not $r and $org? This works great but i would like to use local params "r" and "org" instead of hard-coded values: (*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72)) I would like: (*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r)) I use this in my requesthandler under invariants because i need it to be added to the query without being able to be overridden. Oh and i use facets so fq has to be combinable. This should work, or am i understanding it wrong?

Debug query: 0 109 true true 267 yh_cug 1394533792473 xml ... {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *]) (MatchAllDocsQuery(*:*) -organisations:["" TO *] -roles:["" TO *]) (+organisations:$org +roles:$r) (-organisations:["" TO *] +roles:$r) (+organisations:$org -roles:["" TO *])
query with local params
This works great but i would like to use local params "r" and "org" instead of hard-coded values: (*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72)) I would like: (*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:($org) +roles:($r)) Shouldn't the numbers be in the output below (parsed_filter_queries) and not $r and $org? I use this in my requesthandler and need it to be added as fq or query params without being able to be overridden, does anybody have any ideas? Oh and i use facets so fq has to be combinable.

Debug query: 0 109 true true 267 yh_cug 1394533792473 xml ... {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *]) (MatchAllDocsQuery(*:*) -organisations:["" TO *] -roles:["" TO *]) (+organisations:$org +roles:$r) (-organisations:["" TO *] +roles:$r) (+organisations:$org -roles:["" TO *])
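Parameter dereferencing ($org, $r) only happens for local-param values, not in the middle of a plain query string, which is why the parsed fq still contains the literal $org and $r. One pattern that should dereference them is nesting via _query_ with v=$param; a sketch, untested against this setup:

<str name="fq">(*:* -organisations:[* TO *] -roles:[* TO *]) (+_query_:"{!lucene q.op=OR df=organisations v=$org}" +_query_:"{!lucene q.op=OR df=roles v=$r}")</str>

With org=150 42 and r=174 72 passed on the request, the nested parsers expand the values at query time.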
RE: NOT SOLVED searches for single char tokens instead of from 3 upwards
I got it right the first time and here is my requesthandler. The field "plain_text" is searched correctly and has the same fieldtype as "title" -> "text_de"

standard shingle true true 2 4 synonym solr.KeywordTokenizerFactory synonyms.txt true true explicit 10 synonym_edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) (+organisations:($org) -roles:["" TO *]) (expiration:[NOW TO *] OR (*:* -expiration:*))^6 div(clicks,max(displays,1))^8 text *,path,score json AND on plain_text,title 200 on 1 {!ex=inhaltstyp_s}inhaltstyp_s index {!ex=doctype}doctype index {!ex=thema_f}thema_f index {!ex=author_s}author_s index {!ex=sachverstaendiger_s}sachverstaendiger_s index {!ex=veranstaltung_s}veranstaltung_s index {!ex=last_modified}last_modified +1MONTH NOW/MONTH+1MONTH NOW/MONTH-36MONTHS after

i have a field with the following type: shouldn't this make tokens from 3 to 15 in length and not from 1? here is a query report of 2 results:

> 0 name="QTime">125 name="debugQuery">true name="fl">title,roles,organisations,id name="indent">trueyh_cugtest name="_">1394522589347xml name="fq">organisations:* roles:*name="response" numFound="5" start="0"> >.. > > 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: > 0.14759353 = (MATCH) product of: 0.28596246 = (MATCH) sum of: > 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], > result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 > ), product of: 0.035319194 = queryWeight, product of: > > 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = > queryNorm 0.43282017 = fieldWeight in 0, product of: > > 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 > > 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = > fieldNorm(doc=0) 0.0119499 = (MATCH) weight(plain_text:ugt in > 0) [DefaultSimilarity], result of: 0.0119499 = > score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.031227252 = queryWeight, product of: > 4.8982444 = idf(docFreq=18, maxDocs=937) 0.0063751927 = > queryNorm 0.38267535 = fieldWeight in 0, product of: > > 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 > > 4.8982444 = idf(docFreq=18, maxDocs=937) 0.078125 = > fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhc > in 0) [DefaultSimilarity], result of: 0.019351374 = > score(doc=0,freq=1.0 = termFreq=1.0 ), product of: > 0.03973814 = queryWeight, product of: 6.2332454 = > idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm > > 0.4869723 = fieldWeight in 0, product of: 1.0 = > tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 > 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) > weight(plain_text:hcu in 0) [DefaultSimilarity], result of: > 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: > 0.03973814 = queryWeight, product of: 6.2332454 = > idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm > > 0.4869723 = fieldWeight in 0, product of: 1.0 = > tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
Re: SOLVED searches for single char tokens instead of from 3 upwards
sorry i looked at the wrong fieldtype

-Original Message-
> From: "Andreas Owen"
> To: solr-user@lucene.apache.org
> Date: 11/03/2014 08:45
> Subject: searches for single char tokens instead of from 3 upwards
>
> i have a field with the following type:
>
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
> <filter class="solr.SnowballPorterFilterFactory" language="German"/>
> <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>
> shouldn't this make tokens from 3 to 15 in length and not from 1? here is a query report of 2 results:
> 0 name="QTime">125 name="debugQuery">true name="fl">title,roles,organisations,id true > yh_cugtest 1394522589347 > xml organisations:* roles:* > > > .. > > 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: 0.14759353 = > (MATCH) product of: 0.28596246 = (MATCH) sum of: 0.01528686 = > (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: > 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: > 0.035319194 = queryWeight, product of: 5.540098 = > idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm > 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), > with freq of: 1.0 = termFreq=1.0 5.540098 = > idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) > 0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result > of: 0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.031227252 = queryWeight, product of: > 4.8982444 = idf(docFreq=18, maxDocs=937) 0.0063751927 = > queryNorm 0.38267535 = fieldWeight in 0, product of: > 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 > 4.8982444 = idf(docFreq=18, maxDocs=937) 0.078125 = > fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhc in 0) > [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 > = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product > of: 6.2332454 = idf(docFreq=4, maxDocs=937) > 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product > of: 1.0 = tf(freq=1.0), with freq of: 1.0 = > termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) > weight(plain_text:hcu in 0) [DefaultSimilarity], result of: > 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: > 0.03973814 = queryWeight, product of: 6.2332454 = > idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm > 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), > with freq of: 1.0 = termFreq=1.0 6.2332454 = > idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) > 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result > of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: > 0.035319194 = queryWeight, product of: 5.540098 = > idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = > tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 > 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = > fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:cugt in 0) > [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 > = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product > of: 6.2332454 = idf(docFreq=4, maxDocs=937) > 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product > of: 1.0 = tf(freq=1.0), with freq of: 1.0 = > termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) > 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) > weight(plain_text:yhcu in 0) [DefaultSimilarity], result of: 0.019351374
searches for single char tokens instead of from 3 upwards
i have a field with the following type:

<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
<filter class="solr.SnowballPorterFilterFactory" language="German"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>

shouldn't this make tokens from 3 to 15 in length and not from 1? here is a query report of 2 results:

0 125 truetitle,roles,organisations,idtrue yh_cugtest1394522589347xmlorganisations:* roles:* .. 1.6365329 = (MATCH) sum of: 1.6346203 = (MATCH) max of: 0.14759353 = (MATCH) product of: 0.28596246 = (MATCH) sum of: 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.0119499 = (MATCH) weight(plain_text:ugt in 0) [DefaultSimilarity], result of: 0.0119499 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.031227252 = queryWeight, product of: 4.8982444 = idf(docFreq=18, maxDocs=937) 0.0063751927 = queryNorm 0.38267535 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.8982444 = idf(docFreq=18, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhc in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:hcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:cugt in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:yhcu in 0) [DefaultSimilarity], result of: 0.019351374 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.03973814 = queryWeight, product of: 6.2332454 = idf(docFreq=4, maxDocs=937) 0.0063751927 = queryNorm 0.4869723 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.2332454 = idf(docFreq=4, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.01528686 = (MATCH) weight(plain_text:cug in 0) [DefaultSimilarity], result of: 0.01528686 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.035319194 = queryWeight, product of: 5.540098 = idf(docFreq=9, maxDocs=937) 0.0063751927 = queryNorm 0.43282017 = fieldWeight in 0, product of: 1.0 =
tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.540098 = idf(docFreq=9, maxDocs=937) 0.078125 = fieldNorm(doc=0) 0.019351374 = (MATCH) weight(plain_text:hcug in 0) [DefaultSimila
maxClauseCount is set to 1024
does this maxClauseCount go over each field individually or all put together? is it the date fields? when i execute a query i get this error: 500 93true Ein PDFchen als Dokument roles:* 1394436617394 xml . 0.10604319 390 2 27 1 1 3 3 1 8 10 1 14 37 1 1 4 8 44 4 1 6 57 11 11 3 3 4 1 2 1 2 2 2 2 29 1 1 17 1 1 4 1 3 5 1 5 1 2 1 1 1 35 1 2 26 2 1 2 3 1 1 1 27 3 1 1 3 1 1 3 6 3 1 2 2 2 2 2 28 4 2 1 16 46 1 5 11 58 1 2 29 2 2 1 1 1 9 4 75 2 2 1 2 2 1 4 1 1 2 1 1 1 91 1 11 3 3 20 15 59 11 36 204 18 2 25 7 5 2 7 3 7 10 10 4 1 34 4 35 25 9 +1MONTH 2011-03-01T00:00:00Z 2014-04-01T00:00:00Z 0 maxClauseCount is set to 1024 org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024at org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:72) at org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:152) at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:79) at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:108) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:288) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:217) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:99) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:469) at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:199) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:528) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:415) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:139) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at org.eclipse.jetty.server.AbstractHttpConnection.han
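The clause limit is global, not per field: the stack trace shows the highlighter (WeightedSpanTermExtractor) rewriting the multi-term parts of the query (h_*, expiration:*) into boolean clauses over actual index terms, and with ngram fields that quickly exceeds 1024. If the query is wanted as is, the limit can be raised in solrconfig.xml (value illustrative):

<query>
  <maxBooleanClauses>4096</maxBooleanClauses>
</query>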
set fq operator independently
i want to use the following in fq and i need to set the operator to OR. My q.op is AND but I need OR in fq. I have read about ofq but that is for putting OR between multiple fq. Can I set the operator for fq? (-organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) +roles:(174 72)) The statement should find all docs without organisations and roles or those that have at least one roles and organisations entry. these fields are multivalued.
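Yes - the operator can be set per filter with local params, so q.op can stay AND globally while this one fq is parsed with OR; this is the same {!q.op=OR} prefix the later access-control mails in this archive use. A sketch with the values from this mail, including the *:* guard in front of the purely negative clause (a clause with only negations matches nothing on its own):

fq={!q.op=OR}(*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) +roles:(174 72))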
Re[2]: query parameters
ok i like the logic, you can do much more. i think this should do it for me: (-organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) +roles:(174 72)) i want to use this in fq and i need to set the operator to OR. My q.op is AND but I need OR in fq. I have read about ofq but that is for putting OR between multiple fq. Can I set the operator for fq? The statement should find all docs without organisations and roles or those that have at least one roles and organisations entry. these fields are multivalued.

-Original Message-
> From: "Erick Erickson"
> To: solr-user@lucene.apache.org
> Date: 19/02/2014 04:09
> Subject: Re: query parameters
>
> Solr/Lucene query language is NOT strictly boolean, see Chris's excellent blog here:
> http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/
>
> Best,
> Erick
>
> On Tue, Feb 18, 2014 at 11:54 AM, Andreas Owen wrote:
>
> > I tried it in solr admin query and it showed me all the docs without a value
> > in organisations and roles. It didn't matter if i used a base term, isn't
> > that given through the q parameter?
> >
> > -Original Message-
> > From: Raymond Wiker [mailto:rwi...@gmail.com]
> > Sent: Tuesday, 18 February 2014 13:19
> > To: solr-user@lucene.apache.org
> > Subject: Re: query parameters
> >
> > That could be because the second condition does not do what you think it
> > does... have you tried running the second condition separately?
> >
> > You may have to add a "base term" to the second condition, like what you
> > have for the "bq" parameter in your config file; i.e, something like
> >
> > (*:* -organisations:["" TO *] -roles:["" TO *])
> >
> > On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen wrote:
> >
> > > It seems that fq doesn't accept OR because: (organisations:(150 OR 41) AND
> > > roles:(174)) OR (-organisations:["" TO *] AND -roles:["" TO *]) only
> > > returns docs that match the first conditions. it doesn't return any
> > > docs with the empty fields organisations and roles.
> > >
> > > -Original Message-
> > > From: Andreas Owen [mailto:a...@conx.ch]
> > > Sent: Monday, 17 February 2014 05:08
> > > To: solr-user@lucene.apache.org
> > > Subject: query parameters
> > >
> > > in solrconfig of my solr 4.3 i have a userdefined requestHandler. i
> > > would like to use fq to force the following conditions:
> > > 1: organisations is empty and roles is empty
> > > 2: organisations contains one of the comma-delimited list in variable $org
> > > 3: roles contains one of the comma-delimited list in variable $r
> > > 4: rule 2 and 3
> > >
> > > snippet of what i got (haven't checked out if there is an "in" operator
> > > like in sql for the list value)
> > >
> > > explicit 10 edismax true plain_text^10 editorschoice^200
> > > title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
> > > contentmanager^5 links^5 last_modified^5 url^5
> > > (organisations='' roles='') or (organisations=$org roles=$r) or
> > > (organisations='' roles=$r) or (organisations=$org roles='')
> > > (expiration:[NOW TO *] OR (*:* -expiration:*))^6
> > > div(clicks,max(displays,1))^8
RE: query parameters
I tried it in solr admin query and it showed me all the docs without a value in organisations and roles. It didn't matter if i used a base term, isn't that given through the q parameter?

-Original Message-
From: Raymond Wiker [mailto:rwi...@gmail.com]
Sent: Tuesday, 18 February 2014 13:19
To: solr-user@lucene.apache.org
Subject: Re: query parameters

That could be because the second condition does not do what you think it does... have you tried running the second condition separately?

You may have to add a "base term" to the second condition, like what you have for the "bq" parameter in your config file; i.e, something like

(*:* -organisations:["" TO *] -roles:["" TO *])

On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen wrote:

> It seems that fq doesn't accept OR because: (organisations:(150 OR 41) AND
> roles:(174)) OR (-organisations:["" TO *] AND -roles:["" TO *]) only
> returns docs that match the first conditions. it doesn't return any
> docs with the empty fields organisations and roles.
>
> -Original Message-
> From: Andreas Owen [mailto:a...@conx.ch]
> Sent: Monday, 17 February 2014 05:08
> To: solr-user@lucene.apache.org
> Subject: query parameters
>
> in solrconfig of my solr 4.3 i have a userdefined requestHandler. i
> would like to use fq to force the following conditions:
> 1: organisations is empty and roles is empty
> 2: organisations contains one of the comma-delimited list in variable $org
> 3: roles contains one of the comma-delimited list in variable $r
> 4: rule 2 and 3
>
> snippet of what i got (haven't checked out if there is an "in" operator
> like in sql for the list value)
>
> explicit 10 edismax true plain_text^10 editorschoice^200
> title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
> contentmanager^5 links^5 last_modified^5 url^5
> (organisations='' roles='') or (organisations=$org roles=$r) or
> (organisations='' roles=$r) or (organisations=$org roles='')
> (expiration:[NOW TO *] OR (*:* -expiration:*))^6
> div(clicks,max(displays,1))^8
RE: query parameters
It seems that fq doesn't accept OR because: (organisations:(150 OR 41) AND roles:(174)) OR (-organisations:["" TO *] AND -roles:["" TO *]) only returns docs that match the first conditions. it doesn't return any docs with the empty fields organisations and roles.

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Monday, 17 February 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters

in solrconfig of my solr 4.3 i have a userdefined requestHandler. i would like to use fq to force the following conditions:
1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

snippet of what i got (haven't checked out if there is an "in" operator like in sql for the list value)

explicit 10 edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 (organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='') (expiration:[NOW TO *] OR (*:* -expiration:*))^6 div(clicks,max(displays,1))^8
query parameters
in solrconfig of my solr 4.3 i have a userdefined requestHandler. i would like to use fq to force the following conditions:
1: organisations is empty and roles is empty
2: organisations contains one of the comma-delimited list in variable $org
3: roles contains one of the comma-delimited list in variable $r
4: rule 2 and 3

snippet of what i got (haven't checked out if there is an "in" operator like in sql for the list value)

explicit 10 edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 (organisations='' roles='') or (organisations=$org roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='') (expiration:[NOW TO *] OR (*:* -expiration:*))^6 div(clicks,max(displays,1))^8
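There is no SQL-style "in" operator, but a parenthesized term list does the same for multivalued fields, and the four conditions can be ORed inside one fq as long as each purely negative clause gets a *:* guard. A sketch using the example values from later in this thread (150 42 / 174 72 stand in for $org / $r):

fq=(*:* -organisations:[* TO *] -roles:[* TO *]) (+organisations:(150 42) +roles:(174 72)) ((*:* -organisations:[* TO *]) +roles:(174 72)) (+organisations:(150 42) (*:* -roles:[* TO *]))

With q.op=OR for this fq (e.g. a {!q.op=OR} prefix), the four clauses combine as alternatives.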
admin gui right side not loading
I'm using solr 4.3.1 and have installed it on a win 2008 server. Solr is working, for example import & search. But the admin gui's right side isn't loading and I get a javascript error for several d3 objects. The last error is: Load timeout for modules: lib/order!lib/jquery.autogrow lib/order!lib/jquery.cookie lib/order!lib/jquery.form lib/order!lib/jquery.jstree lib/order!lib/jquery.sammy lib/order!lib/jquery.timeago lib/order!lib/jquery.blockUI lib/order!lib/highlight lib/order!lib/linker lib/order!lib/ZeroClipboard lib/order!lib/d3 lib/order!lib/chosen lib/order!scripts/app lib/order!scripts/analysis lib/order!scripts/cloud lib/order!scripts/cores lib/order!scripts/dataimport lib/order!scripts/dashboard lib/order!scripts/file lib/order!scripts/index lib/order!scripts/java-properties lib/order!scripts/logging lib/order!scripts/ping lib/order!scripts/plugins lib/order!scripts/query lib/order!scripts/replication lib/order!scripts/schema-browser lib/order!scripts/threads lib/jquery.autogrow lib/jquery.cookie lib/jquery.form lib/jquery.jstree lib/jquery.sammy lib/jquery.timeago lib/jquery.blockUI lib/highlight lib/linker lib/ZeroClipboard lib/d3 lib/chosen scripts/app scripts/analysis scripts/cloud scripts/cores scripts/dataimport scripts/dashboard scripts/file scripts/index scripts/java-properties scripts/logging scripts/ping scripts/plugins scripts/query scripts/replication scripts/schema-browser scripts/threads http://requirejs.org/docs/errors.html#timeout I have no apparent errors in the log file and the exact same conf is working on another server. What can I do?
RE: json update moves doc to end
I changed my boost function log(clickrate)^8 to div(clicks,displays)^8 and it works now. I get the following output from debug: 0.0022668892 = (MATCH) FunctionQuery(div(const(2),const(5))), product of: 0.4 = div(const(2),const(5)) 8.0 = boost 7.0840283E-4 = queryNorm Am I understanding this right, that 0.4 and 8.0 result in 7.084? I'm having trouble understanding how much I boosted it. As I use NGramFilterFactory I get a lot of hits because of the tokens. Can I make the boost higher if the whole search term is found and not just part of it? -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, 4 December 2013 15:07 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end Well, both have a score of -Infinity. So they're "equal" and the tiebreaker is the internal Lucene doc ID. Now this is not helpful since the question now is where -Infinity comes from, this looks suspicious: -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0) not much help I know, but Erick On Wed, Dec 4, 2013 at 7:24 AM, Andreas Owen wrote: > Hi Erick > > Here are the last 2 results from a search and I am not understanding > why the last one with the boost editorschoice^200 isn't at the top. By > the way, can I also give a substantial boost to results that contain > the whole search request and not just 3 or 4 letters (tokens)? > > > -Infinity = (MATCH) sum of: > 0.013719446 = (MATCH) max of: > 0.013719446 = (MATCH) sum of: > 2.090396E-4 = (MATCH) weight(plain_text:ber in 841) > [DefaultSimilarity], result of: > 2.090396E-4 = score(doc=841,freq=8.0 = termFreq=8.0 ), product > of: > 0.009452709 = queryWeight, product of: > 1.3343692 = idf(docFreq=611, maxDocs=855) > 0.0070840283 = queryNorm > 0.022114253 = fieldWeight in 841, product of: > 2.828427 = tf(freq=8.0), with freq of: > 8.0 = termFreq=8.0 > 1.3343692 = idf(docFreq=611, maxDocs=855) > 0.005859375 = fieldNorm(doc=841) > 0.0012402858 = (MATCH) weight(plain_text:eri in 841) > [DefaultSimilarity], result of: > 0.0012402858 = score(doc=841,freq=9.0 = termFreq=9.0 ), > product of: > 0.022357063 = queryWeight, product of: > 3.1559815 = idf(docFreq=98, maxDocs=855) > 0.0070840283 = queryNorm > 0.05547624 = fieldWeight in 841, product of: > 3.0 = tf(freq=9.0), with freq of: > 9.0 = termFreq=9.0 > 3.1559815 = idf(docFreq=98, maxDocs=855) > 0.005859375 = fieldNorm(doc=841) > 5.0511415E-4 = (MATCH) weight(plain_text:ric in 841) > [DefaultSimilarity], result of: > 5.0511415E-4 = score(doc=841,freq=1.0 = termFreq=1.0 ), > product of: > 0.024712078 = queryWeight, product of: > 3.4884217 = idf(docFreq=70, maxDocs=855) > 0.0070840283 = queryNorm > 0.020439971 = fieldWeight in 841, product of: > 1.0 = tf(freq=1.0), with freq of: > 1.0 = termFreq=1.0 > 3.4884217 = idf(docFreq=70, maxDocs=855) > 0.005859375 = fieldNorm(doc=841) > 8.721528E-4 = (MATCH) weight(plain_text:ich in 841) > [DefaultSimilarity], result of: > 8.721528E-4 = score(doc=841,freq=12.0 = termFreq=12.0 ), > product of: > 0.017446788 = queryWeight, product of: > 2.4628344 = idf(docFreq=197, maxDocs=855) > 0.0070840283 = queryNorm > 0.049989305 = fieldWeight in 841, product of: > 3.4641016 = tf(freq=12.0), with freq of: > 12.0 = termFreq=12.0 > 2.4628344 = idf(docFreq=197, maxDocs=855) > 0.005859375 = fieldNorm(doc=841) > 7.725705E-4 = (MATCH) weight(plain_text:cht in 841) > [DefaultSimilarity], result of: > 7.725705E-4 = score(doc=841,freq=4.0 = termFreq=4.0 ), product > of: > 0.021610687 = queryWeight, product of: > 3.050621 = 
idf(docFreq=109, maxDocs=855) > 0.0070840283 = queryNorm > 0.035749465 = fieldWeight in 841, product of: > 2.0 = tf(freq=4.0), with freq of: > 4.0 = termFreq=4.0 > 3.050621 = idf(docFreq=109, maxDocs=855) > 0.005859375 = fieldNorm(doc=841) > 0.0010287998 = (MATCH) weight(plain_text:beri in 841) > [DefaultSimilarity], result of: > 0.0010287998 = score(doc=841,freq=1.0 = termFreq=1.0 ), > product of: > 0.035267927 = queryWeight, product of: > 4.978513 = idf(docFreq=15, maxDocs=855) > 0.0070840283 = queryNorm >
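To the multiplication question above: the three factors in the explain tree multiply, and the E-4 exponent is what makes the result small; 7.084 is not the product:

    0.4 (function value) * 8.0 (boost) * 7.0840283E-4 (queryNorm) ≈ 0.0022668892

So the boost of 8 is applied in full, but queryNorm scales every clause of the query by the same small factor; what matters for ranking is how these products compare across documents, not their absolute size.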
RE: json update moves doc to end
s=855) 0.0070840283 = queryNorm 0.1359345 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 4.349904 = idf(docFreq=29, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:berich in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.0059541636 = (MATCH) weight(plain_text:ericht in 0) [DefaultSimilarity], result of: 0.0059541636 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.036738846 = queryWeight, product of: 5.186152 = idf(docFreq=12, maxDocs=855) 0.0070840283 = queryNorm 0.16206725 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.186152 = idf(docFreq=12, maxDocs=855) 0.03125 = fieldNorm(doc=0) 0.006139375 = (MATCH) weight(plain_text:bericht in 0) [DefaultSimilarity], result of: 0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.037305873 = queryWeight, product of: 5.266195 = idf(docFreq=11, maxDocs=855) 0.0070840283 = queryNorm 0.16456859 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.266195 = idf(docFreq=11, maxDocs=855) 0.03125 = fieldNorm(doc=0) 7.054 = (MATCH) weight(editorschoice:bericht^200.0 in 0) [DefaultSimilarity], result of: 7.054 = score(doc=0,freq=1.0 = termFreq=1.0 ), product of: 0.749 = queryWeight, product of: 200.0 = boost 7.0579543 = idf(docFreq=1, maxDocs=855) 7.0840283E-4 = queryNorm 7.0579543 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 7.0579543 = idf(docFreq=1, maxDocs=855) 1.0 = fieldNorm(doc=0) 0.0021252085 = (MATCH) product of: 0.004250417 = (MATCH) sum of: 0.004250417 = (MATCH) sum of: 0.004250417 = (MATCH) MatchAllDocsQuery, product of: 0.004250417 = queryNorm 0.5 = coord(1/2) -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of: -Infinity = log(int(clicks)=0) 8.0 = boost 7.0840283E-4 = queryNorm -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Dienstag, 3. Dezember 2013 20:30 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end Try adding &debug=all and you'll see exactly how docs are scored. Also, it'll show you exactly how your query is parsed. Paste that if it's confused, it'll help figure out what's going wrong. On Tue, Dec 3, 2013 at 1:37 PM, Andreas Owen wrote: > So isn't it sorted automaticly by relevance (boost value)? If not do > should i set it in solrconfig? > > -Original Message- > From: Jonathan Rochkind [mailto:rochk...@jhu.edu] > Sent: Dienstag, 3. Dezember 2013 19:07 > To: solr-user@lucene.apache.org > Subject: Re: json update moves doc to end > > What order, the order if you supply no explicit sort at all? > > Solr does not make any guarantees about what order documents will come > back in if you do not ask for a sort. > > In general in Solr/lucene, the only way to update a document is to > re-add it as a new document, so that's probably what's going on behind > the scenes, and it probably effects the 'default' sort order -- which > Solr makes no agreement about anyway, you probably shouldn't even > count on it being consistent at all. > > If you want a consistent sort order, maybe add a field with a > timestamp, and ask for results sorted by the timestamp field? 
And then > make sure not to change the timestamp when you do an update that you > don't want to change the order? > > Apologies if I've misunderstood the situation. > > On 12/3/13 1:00 PM, Andreas Owen wrote: > > When I search for "agenda" I get a lot of hits. Now if I update the 2. > > Result by json-update the doc is moved to the end of the index when > > I search for it again. The field I change is "editorschoice" and it > > never contains the search term "agenda" so I don't see why it > > changes the order. Why does it? > > > > > > > > Part of Solrconfig requesthandler I use: > > > > > >
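The -Infinity traced in this thread is simply log(0) for documents with clicks=0. A common guard, using only stock Solr function queries, shifts the argument so it can never reach zero:

    log(sum(clicks,1))^8

or, as in the fix adopted later in the thread, div(clicks,max(displays,1))^8, which avoids both log(0) and division by zero.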
RE: json update moves doc to end
So isn't it sorted automatically by relevance (boost value)? If not, should I set it in solrconfig? -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Tuesday, 3 December 2013 19:07 To: solr-user@lucene.apache.org Subject: Re: json update moves doc to end What order, the order if you supply no explicit sort at all? Solr does not make any guarantees about what order documents will come back in if you do not ask for a sort. In general in Solr/lucene, the only way to update a document is to re-add it as a new document, so that's probably what's going on behind the scenes, and it probably affects the 'default' sort order -- which Solr makes no agreement about anyway, you probably shouldn't even count on it being consistent at all. If you want a consistent sort order, maybe add a field with a timestamp, and ask for results sorted by the timestamp field? And then make sure not to change the timestamp when you do an update that you don't want to change the order? Apologies if I've misunderstood the situation. On 12/3/13 1:00 PM, Andreas Owen wrote: > When I search for "agenda" I get a lot of hits. Now if I update the 2nd > result by json-update the doc is moved to the end of the index when > I search for it again. The field I change is "editorschoice" and it > never contains the search term "agenda" so I don't see why it > changes the order. Why does it? > > > > Part of the solrconfig requestHandler I use: > > > > > > explicit > > 10 > > synonym_edismax > > true > > plain_text^10 editorschoice^200 > > title^20 h_*^14 > > tags^10 thema^15 inhaltstyp^6 > breadcrumb^6 > doctype^10 > > contentmanager^5 links^5 > > last_modified^5 url^5 > > > > (expiration:[NOW TO *] OR (*:* > -expiration:*))^6 > > log(clicks)^8 > > > > text > > *,path,score > > json > > AND > > > > > > on > > plain_text,title > > <b> > > </b> > > > > > > on > > 1 > > <str name="facet.field">{!ex=inhaltstyp}inhaltstyp</str> > > <str name="f.inhaltstyp.facet.sort">index</str> > > <str name="facet.field">{!ex=doctype}doctype</str> > > <str name="f.doctype.facet.sort">index</str> > > <str name="facet.field">{!ex=thema_f}thema_f</str> > > <str name="f.thema_f.facet.sort">index</str> > > <str name="facet.field">{!ex=author_s}author_s</str> > > <str name="f.author_s.facet.sort">index</str> > > <str name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s</str> > > <str name="f.sachverstaendiger_s.facet.sort">index</str> > > <str name="facet.field">{!ex=veranstaltung}veranstaltung</str> > > <str name="f.veranstaltung.facet.sort">index</str> > > <str name="facet.date">{!ex=last_modified}last_modified</str> > > <str name="facet.date.gap">+1MONTH</str> > > <str name="facet.date.end">NOW/MONTH+1MONTH</str> > > <str name="facet.date.start">NOW/MONTH-36MONTHS</str> > > <str name="facet.date.other">after</str> > > > >
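For the record: with no sort parameter Solr does sort by score descending, but ties (and broken scores like the -Infinity in this thread) fall back to the internal Lucene doc ID, which changes on every re-add. An explicit sort with a deterministic tiebreaker looks like:

    sort=score desc, last_modified desc

last_modified is simply the date field this config already facets on; any stable indexed field works as the second key.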
json update moves doc to end
When I search for "agenda" I get a lot of hits. Now if I update the 2nd result by json-update, the doc is moved to the end of the index when I search for it again. The field I change is "editorschoice" and it never contains the search term "agenda", so I don't see why it changes the order. Why does it? Part of the solrconfig requestHandler I use: explicit 10 synonym_edismax true plain_text^10 editorschoice^200 title^20 h_*^14 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10 contentmanager^5 links^5 last_modified^5 url^5 (expiration:[NOW TO *] OR (*:* -expiration:*))^6 log(clicks)^8 text *,path,score json AND on plain_text,title on 1 {!ex=inhaltstyp}inhaltstyp index {!ex=doctype}doctype index {!ex=thema_f}thema_f index {!ex=author_s}author_s index {!ex=sachverstaendiger_s}sachverstaendiger_s index {!ex=veranstaltung}veranstaltung index {!ex=last_modified}last_modified +1MONTH NOW/MONTH+1MONTH NOW/MONTH-36MONTHS after
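What the update in question would look like as a Solr 4.x atomic update (doc-123 and the value are hypothetical placeholders):

    curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' \
      -d '[{"id":"doc-123","editorschoice":{"set":"some value"}}]'

Even an atomic update re-indexes the whole document under a new internal Lucene ID, so a document whose score is tied (or -Infinity, as it turned out here) drops to the end: the default tiebreak is that internal ID.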
search with wildcard
I am querying "test" in solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like "Supertestplan" it isn't found unless I use wildcards, "*test*". This is to be expected with my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.
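One way around it, sketched here with assumed gram sizes and a hypothetical type name, is an index-time NGramFilterFactory: substrings get indexed as tokens of their own, so a plain query for "test" matches inside "Supertestplan" without wildcards:

    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- index all substrings of 3..15 chars of each token -->
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The query-side analyzer deliberately has no ngram filter, so the whole query term must match one indexed gram.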
RE: search with wildcard
I suppose I have to create another field with different tokenizers and set its boost very low, so it doesn't really mess with my ranking now that the word is in 2 fields. What kind of tokenizer can do the job? From: Andreas Owen [mailto:a...@conx.ch] Sent: Thursday, 21 November 2013 16:13 To: solr-user@lucene.apache.org Subject: search with wildcard I am querying "test" in solr 4.3.1 over the field below and it's not finding all occurrences. It seems that if it is a substring of a word like "Supertestplan" it isn't found unless I use wildcards, "*test*". This is to be expected with my tokenizer, but does someone know a way around it? I don't want to add wildcards because that messes up queries with multiple words.
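A sketch of that two-field setup, assuming the ngram type from the note above and a hypothetical field name plain_text_ngram: copy the original field and weight the ngram copy far below the exact field in qf, so substring-only hits rank low:

    <field name="plain_text_ngram" type="text_ngram" indexed="true" stored="false"/>
    <copyField source="plain_text" dest="plain_text_ngram"/>

and in the request handler: qf=plain_text^10 plain_text_ngram^0.5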
RE: date range tree
I solved it by adding a loop for the years and one for the quarters, in which I count the month facets. -Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Monday, 11 November 2013 17:52 To: solr-user@lucene.apache.org Subject: RE: date range tree Has someone at least got an idea how I could do a year/month date tree? In the Solr wiki it is mentioned that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work. -Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Thursday, 7 November 2013 18:23 To: solr-user@lucene.apache.org Subject: date range tree I would like to make a facet on a date field with the following tree: 2013 - 4th quarter: December, November, October; 3rd quarter: September, August, July; 2nd quarter: June, May, April; 1st quarter: March, February, January; 2012 - same as above. So far I have this in solrconfig.xml: <str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str> <str name="facet.date.gap">+1MONTH</str> <str name="facet.date.end">NOW/MONTH</str> <str name="facet.date.start">NOW/MONTH-36MONTHS</str> <str name="facet.date.other">after</str> Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
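A sketch of how the month buckets and quarter rollups can come back in a single request: keep the +1MONTH facet.date and add one facet.query per quarter (2013 shown as an example; repeat per year, or sum client-side as in the loop solution above):

    facet=true
    facet.date={!ex=last_modified}last_modified
    facet.date.gap=+1MONTH
    facet.date.start=NOW/MONTH-36MONTHS
    facet.date.end=NOW/MONTH
    facet.query=last_modified:[2013-01-01T00:00:00Z TO 2013-03-31T23:59:59Z]
    facet.query=last_modified:[2013-04-01T00:00:00Z TO 2013-06-30T23:59:59Z]
    facet.query=last_modified:[2013-07-01T00:00:00Z TO 2013-09-30T23:59:59Z]
    facet.query=last_modified:[2013-10-01T00:00:00Z TO 2013-12-31T23:59:59Z]

All facet.query and facet.date counts arrive in the same response, so one query suffices; the comma-separated multi-gap syntax quoted from the wiki evidently isn't honored in this version, which matches the observation above.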
RE: date range tree
Has someone at least got an idea how I could do a year/month date tree? In the Solr wiki it is mentioned that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY should create 4 buckets, but it doesn't work. -Original Message- From: Andreas Owen [mailto:a...@conx.ch] Sent: Thursday, 7 November 2013 18:23 To: solr-user@lucene.apache.org Subject: date range tree I would like to make a facet on a date field with the following tree: 2013 - 4th quarter: December, November, October; 3rd quarter: September, August, July; 2nd quarter: June, May, April; 1st quarter: March, February, January; 2012 - same as above. So far I have this in solrconfig.xml: <str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str> <str name="facet.date.gap">+1MONTH</str> <str name="facet.date.end">NOW/MONTH</str> <str name="facet.date.start">NOW/MONTH-36MONTHS</str> <str name="facet.date.other">after</str> Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
count links pointing to id
I have a multivalued field with links pointing to ids of solr documents. I would like to calculate how many links are pointing to each document and put that number into the field links2me. How can I do this? I would prefer to do it with a query and the updater, so solr can do it internally if possible.
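There is no built-in aggregation for this, but a field facet over the multivalued links field yields the inbound count per target id, which a client can then write back (links and links2me as named above; assumes each document lists a given target at most once):

    select?q=*:*&rows=0&facet=true&facet.field=links&facet.limit=-1&facet.mincount=1

Each facet bucket is a target id with the number of documents linking to it; those pairs can be fed back as atomic updates of the form {"id":"<target>","links2me":{"set":<count>}}.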
date range tree
I would like to make a facet on a date field with the following tree: 2013 - 4th quarter: December, November, October; 3rd quarter: September, August, July; 2nd quarter: June, May, April; 1st quarter: March, February, January; 2012 - same as above. So far I have this in solrconfig.xml: <str name="facet.date">{!ex=last_modified,thema,inhaltstyp,doctype}last_modified</str> <str name="facet.date.gap">+1MONTH</str> <str name="facet.date.end">NOW/MONTH</str> <str name="facet.date.start">NOW/MONTH-36MONTHS</str> <str name="facet.date.other">after</str> Can I do this in one query or do I need multiple queries? If yes, how would I do the second and keep all the facet queries in the count?
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
i'm already using URLDataSource On 30. Sep 2013, at 5:41 PM, P Williams wrote: > Hi Andreas, > > When using > XPathEntityProcessor<http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor>your > DataSource > must be of type DataSource. You shouldn't be using > BinURLDataSource, it's giving you the cast exception. Use > URLDataSource<https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html> > or > FileDataSource<https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html>instead. > > I don't think you need to specify namespaces, at least you didn't used to. > The other thing that I've noticed is that the anywhere xpath expression // > doesn't always work in DIH. You might have to be more specific. > > Cheers, > Tricia > > > > > > On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen wrote: > >> how dum can you get. obviously quite dum... i would have to analyze the >> html-pages with a nested instance like this: >> >> > url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml" >> forEach="/docs/doc" dataSource="main"> >> >>> url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl"> >> >> >> >> >> >> >> >> but i'm pretty sure the foreach is wrong and the xpath expressions. in the >> moment i getting the following error: >> >>Caused by: java.lang.RuntimeException: >> org.apache.solr.handler.dataimport.DataImportHandlerException: >> java.lang.ClassCastException: >> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast >> to java.io.Reader >> >> >> >> >> >> On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote: >> >>> ok i see what your getting at but why doesn't the following work: >>> >>> >>> >>> >>> i removed the tiki-processor. what am i missing, i haven't found >> anything in the wiki? >>> >>> >>> On 28. Sep 2013, at 12:28 AM, P Williams wrote: >>> >>>> I spent some more time thinking about this. Do you really need to use >> the >>>> TikaEntityProcessor? It doesn't offer anything new to the document you >> are >>>> building that couldn't be accomplished by the XPathEntityProcessor alone >>>> from what I can tell. >>>> >>>> I also tried to get the Advanced >>>> Parsing<http://wiki.apache.org/solr/TikaEntityProcessor>example to >>>> work without success. There are some obvious typos ( >>>> instead of ) and an odd order to the pieces ( is >>>> enclosed by ). It also looks like >>>> FieldStreamDataSource< >> http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html >>> is >>>> the one that is meant to work in this context. If Koji is still around >>>> maybe he could offer some help? Otherwise this bit of erroneous >>>> instruction should probably be removed from the wiki. 
>>>> >>>> Cheers, >>>> Tricia >>>> >>>> $ svn diff >>>> Index: >>>> >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >>>> === >>>> --- >>>> >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >>>> (revision 1526990) >>>> +++ >>>> >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >>>> (working copy) >>>> @@ -99,13 +99,13 @@ >>>> runFullImport(getConfigHTML("identity")); >>>> assertQ(req("*:*"), testsHTMLIdentity); >>>> } >>>> - >>>> + >>>> private String getConfigHTML(String htmlMapper) { >>>> return >>>> "" + >&
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
how dum can you get. obviously quite dum... i would have to analyze the html-pages with a nested instance like this: but i'm pretty sure the foreach is wrong and the xpath expressions. in the moment i getting the following error: Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote: > ok i see what your getting at but why doesn't the following work: > > > > > i removed the tiki-processor. what am i missing, i haven't found anything in > the wiki? > > > On 28. Sep 2013, at 12:28 AM, P Williams wrote: > >> I spent some more time thinking about this. Do you really need to use the >> TikaEntityProcessor? It doesn't offer anything new to the document you are >> building that couldn't be accomplished by the XPathEntityProcessor alone >> from what I can tell. >> >> I also tried to get the Advanced >> Parsing<http://wiki.apache.org/solr/TikaEntityProcessor>example to >> work without success. There are some obvious typos ( >> instead of ) and an odd order to the pieces ( is >> enclosed by ). It also looks like >> FieldStreamDataSource<http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html>is >> the one that is meant to work in this context. If Koji is still around >> maybe he could offer some help? Otherwise this bit of erroneous >> instruction should probably be removed from the wiki. >> >> Cheers, >> Tricia >> >> $ svn diff >> Index: >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >> === >> --- >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >>(revision 1526990) >> +++ >> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java >>(working copy) >> @@ -99,13 +99,13 @@ >>runFullImport(getConfigHTML("identity")); >>assertQ(req("*:*"), testsHTMLIdentity); >> } >> - >> + >> private String getConfigHTML(String htmlMapper) { >>return >>"" + >>" " + >>" " + >> -"> processor='TikaEntityProcessor' " + >> +"> processor='TikaEntityProcessor' " + >>" url='" + >> getFile("dihextras/structured.html").getAbsolutePath() + "' " + >>((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + >> "'")) + ">" + >>" " + >> @@ -114,4 +114,36 @@ >>""; >> >> } >> + private String[] testsHTMLH1 = { >> + "//*[@numFound='1']" >> + , "//str[@name='h1'][contains(.,'H1 Header')]" >> + }; >> + >> + @Test >> + public void testTikaHTMLMapperSubEntity() throws Exception { >> +runFullImport(getConfigSubEntity("identity")); >> +assertQ(req("*:*"), testsHTMLH1); >> + } >> + >> + private String getConfigSubEntity(String htmlMapper) { >> +return >> +"" + >> +"" + >> +"" + >> +"" + >> +"> dataSource='bin' format='html' rootEntity='false'>" + >> +"" + >> +"" + >> +"" + >> +"" + >> +"" + >> +"> dataSource='fld' dataField='tika.text' rootEntity='true' >" + >> +"" + >> +"" + >> +"" + >> +"" + >> +"" + >> +""; >> + } >> + >> } >> Index: >> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimp
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
Thanks, but the first suggestion is already implemented and the second didn't work. I have also tried htmlMapper="identity" but nothing worked. I also tried this, but the html was stripped in both fields. In the end I think it's best to cut Tika out because I'm not getting any benefits from it. I would just need to get this to work: the fields are empty and I'm not getting any errors in the logs. On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote: > This is a rather complicated example to chew through, but try the following > two things: > *) dataField="${tika.text}" => dataField="text" (or less likely htmlMapper > tika.text) > You might be trying to read content of the field rather than passing > reference to the field that seems to be expected. This might explain the > exception. > > *) It may help to be aware of > https://issues.apache.org/jira/browse/SOLR-4530 . There is a new > htmlMapper="identity" flag on Tika entries to ensure more of HTML structure > passing through. By default, Tika strips out most of the HTML tags. > > Regards, > Alex. > > On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen wrote: > >>> url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html"> >> >> >>> forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true" >> onError="skip"> >> >> >> >> > > > > Personal website: http://www.outerthoughts.com/ > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch > - Time is the quality of nature that keeps events from happening all at > once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
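A hedged sketch of the Tika-free setup aimed at above: an outer XPathEntityProcessor iterates the local URL list, an inner one fetches each page over a URLDataSource (the file path and the urlParse variable are from this thread; the field xpaths are assumptions). Note that XPathEntityProcessor needs well-formed XML, so real-world HTML may have to be tidied first, and XHTML pages need the namespace prefix in forEach (e.g. /xhtml:html), as mentioned earlier in the thread:

    <dataConfig>
      <dataSource name="main" type="URLDataSource"/>
      <dataSource name="dataUrl" type="URLDataSource"/>
      <document>
        <entity name="rec" processor="XPathEntityProcessor" dataSource="main"
                url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
                forEach="/docs/doc">
          <field column="urlParse" xpath="/docs/doc/urlParse"/>
          <entity name="htm" processor="XPathEntityProcessor" dataSource="dataUrl"
                  url="${rec.urlParse}" forEach="/html" onError="skip">
            <field column="text" xpath="/html/body"/>
          </entity>
        </entity>
      </document>
    </dataConfig>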
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) > at > org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) > at > org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) > at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469) > at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495) > at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408) > at > org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323) > at > org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231) > at > org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411) > at > org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476) > at > org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859) > at org.apache.solr.util.TestHarness.query(TestHarness.java:291) > at > org.apache.solr.handler.dataimport.AbstractDataImportHandlerTestCase.runFullImport(AbstractDataImportHandlerTestCase.java:96) > at > org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperSubEntity(TestTikaEntityProcessor.java:124) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:601) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559) > at > com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787) > at > com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53) > at > org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50) > at > org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46) > at > com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55) > at > org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49) > at > org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70) > at > org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782) > at > com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442) > at > 
com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:682) > at > com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693) > at > com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) > at > com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53) > at > org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46) > at > org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42) > at > com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39) > at > com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39) > at > com.carrotsear
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
i removed the FieldReaderDataSource and dataSource="fld" but it didn't help. i get the following for each document: DataImportHandlerException: Exception in invoking url null Processing Document # 9 nullpointerexception On 26. Sep 2013, at 8:39 PM, P Williams wrote: > Hi, > > Haven't tried this myself but maybe try leaving out the > FieldReaderDataSource entirely. From my quick searching looks like it's > tied to SQL. Did you try copying the > http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example > exactly? What happens when you leave out FieldReaderDataSource? > > Cheers, > Tricia > > > On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen wrote: > >> i'm using solr 4.3.1 and the dataimporter. i am trying to use >> XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages >> but i'm getting this error for each document. i have also tried >> dataField="tika.text" and dataField="text" to no avail. the nested >> XPathEntityProcessor "detail" creates the error, the rest works fine. what >> am i doing wrong? >> >> error: >> >> ERROR - 2013-09-26 12:08:49.006; >> org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed >> 'null' >> java.lang.ClassCastException: java.io.StringReader cannot be cast to >> java.util.Iterator >>at >> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) >>at >> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) >>at >> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404) >>at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319) >>at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227) >>at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422) >>at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487) >>at >> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179) >>at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) >>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) >>at >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) >>at >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) >>at >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) >>at >> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) >>at >> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) >>at >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) >>at >> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) >>at >> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) >>at >> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) >>at >> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) >>at >> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) >>at >> 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) >>at >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) >>at >> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) >>at >> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) >>at >> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) >>at org.eclipse.jetty.server.Server.handle(Server.java:365) >>at >> org.eclipse.jetty.server.Abstract
XPathEntityProcessor nested in TikaEntityProcessor query null exception
i'm using solr 4.3.1 and the dataimporter. i am trying to use XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages but i'm getting this error for each document. i have also tried dataField="tika.text" and dataField="text" to no avail. the nested XPathEntityProcessor "detail" creates the error, the rest works fine. what am i doing wrong? error: ERROR - 2013-09-26 12:08:49.006; org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null' java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487) at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:365) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937) at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) ERROR - 2013-09-26 12:08:49.022; org.apache.solr.common.SolrException; Exception in entity : detail:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.StringReader cannot be cast to java.util.Iterator at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65) at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) at org.apache.solr.handler.dataim
dih HTMLStripTransformer
Why does stripHTML="false" have no effect in DIH? The html is stripped in text and text_nohtml when I display the index with select?q=*. I'm trying to get one field without html and one with it, so I can also index the links on the page. data-config.xml:
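For reference: stripHTML is a per-field switch of HTMLStripTransformer, and only stripHTML="true" does anything, so stripHTML="false" is the same as omitting it; if both fields come back stripped anyway, the stripping is happening upstream (e.g. in Tika's default html mapper), not in the transformer. A sketch of the intended two-field shape, with assumed entity name and xpaths, copying the raw field first and stripping only the copy:

    <entity name="page" processor="XPathEntityProcessor" dataSource="dataUrl"
            url="${rec.urlParse}" forEach="/html"
            transformer="TemplateTransformer,HTMLStripTransformer">
      <!-- raw html, keeps the links -->
      <field column="text" xpath="/html/body"/>
      <!-- copy of the raw field, then stripped by HTMLStripTransformer -->
      <field column="text_nohtml" template="${page.text}" stripHTML="true"/>
    </entity>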
Re: dih delete doc per $deleteDocById
Sorry, it works like this; I had a typo in my conf :-( On 17. Sep 2013, at 2:44 PM, Andreas Owen wrote: > I would like to know how to get it to work and delete documents per xml and > dih. > > On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote: > >> What is your question? >> >> On Tue, Sep 17, 2013 at 12:17 AM, andreas owen wrote: >>> I am using dih and want to delete indexed documents by an xml file of ids. I >>> have seen $deleteDocById used in >>> >>> data-config.xml: >>> >> url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" >>> forEach="/docs/doc" dataSource="main" > >>> >>> >>> >>> xml-file: >>> >>> >>> 2345 >>> >>> >> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar.
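The working shape of that config, for reference ($deleteDocById is DIH's special command field; the path is the one from the thread, the xpath an assumption):

    <entity name="del" processor="XPathEntityProcessor" dataSource="main"
            url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml"
            forEach="/docs/doc">
      <field column="$deleteDocById" xpath="/docs/doc/id"/>
    </entity>

with an xml file of the form:

    <docs>
      <doc><id>2345</id></doc>
    </docs>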
Re: dih delete doc per $deleteDocById
I would like to know how to get it to work and delete documents per xml and dih. On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote: > What is your question? > > On Tue, Sep 17, 2013 at 12:17 AM, andreas owen wrote: >> I am using dih and want to delete indexed documents by an xml file of ids. I >> have seen $deleteDocById used in >> >> data-config.xml: >> > url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml" >> forEach="/docs/doc" dataSource="main" > >> >> >> >> xml-file: >> >> >>2345 >> >> > > > > -- > Regards, > Shalin Shekhar Mangar.
dih delete doc per $deleteDocById
I am using dih and want to delete indexed documents by an xml file of ids. I have seen $deleteDocById used in data-config.xml: xml-file: 2345
Re: charset encoding
It was the http header: as soon as I forced an iso-8859-1 header, it worked. On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote: > could it have something to do with the meta encoding tag being iso-8859-1 while > the http header says utf-8, and firefox interprets it as utf-8? > > On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote: > >> no, jetty. and yes, for tomcat i've seen a couple of answers >> >> On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote: >> >>> Using tomcat by any chance? The ML archive has the solution. May be on >>> Wiki, too. >>> >>> Otis >>> Solr & ElasticSearch Support >>> http://sematext.com/ >>> On Sep 11, 2013 8:56 AM, "Andreas Owen" wrote: >>> >>>> i'm using solr 4.3.1 with tika to index html pages. the html files are >>>> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" says so as well. the >>>> server http header says it's utf-8 and the firefox web developer tools agree. >>>> >>>> when i index a page with special chars like ä,ö,ü solr outputs >>>> completely foreign characters, not the usual wrong chars with ¼ or the flag in >>>> them. so it seems that it's not simply the normal utf-8/iso-8859-1 discrepancy. >>>> has anyone got an idea what's wrong? >>>> >>>>
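The underlying rule: when the HTTP header and the meta tag disagree, consumers (Tika included) trust the header. So the fix is to make the header state the page's real encoding:

    Content-Type: text/html; charset=ISO-8859-1

With header and meta tag both saying ISO-8859-1, the bytes get decoded correctly and the umlauts survive indexing.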
Re: charset encoding
Could it have something to do with the meta encoding tag being iso-8859-1 while the http header says utf-8, and firefox interprets it as utf-8? On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote: > no, jetty. and yes, for tomcat i've seen a couple of answers > > On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote: > >> Using tomcat by any chance? The ML archive has the solution. May be on >> Wiki, too. >> >> Otis >> Solr & ElasticSearch Support >> http://sematext.com/ >> On Sep 11, 2013 8:56 AM, "Andreas Owen" wrote: >> >>> i'm using solr 4.3.1 with tika to index html pages. the html files are >>> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" says so as well. the >>> server http header says it's utf-8 and the firefox web developer tools agree. >>> >>> when i index a page with special chars like ä,ö,ü solr outputs >>> completely foreign characters, not the usual wrong chars with ¼ or the flag in >>> them. so it seems that it's not simply the normal utf-8/iso-8859-1 discrepancy. >>> has anyone got an idea what's wrong? >>> >>>
Re: charset encoding
No, Jetty. And yes, for Tomcat I've seen a couple of answers. On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote: > Using tomcat by any chance? The ML archive has the solution. May be on > Wiki, too. > > Otis > Solr & ElasticSearch Support > http://sematext.com/ > On Sep 11, 2013 8:56 AM, "Andreas Owen" wrote: > >> i'm using solr 4.3.1 with tika to index html pages. the html files are >> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" says so as well. the >> server http header says it's utf-8 and the firefox web developer tools agree. >> >> when i index a page with special chars like ä,ö,ü solr outputs >> completely foreign characters, not the usual wrong chars with ¼ or the flag in >> them. so it seems that it's not simply the normal utf-8/iso-8859-1 discrepancy. >> has anyone got an idea what's wrong? >> >>
charset encoding
I'm using solr 4.3.1 with Tika to index html pages. The html files are iso-8859-1 (ANSI) encoded, and the meta tag "content-encoding" says so as well. The server http header says it's utf-8, and the Firefox web developer tools agree. When I index a page with special chars like ä,ö,ü, solr outputs completely foreign characters, not the usual wrong chars with ¼ or the flag in them. So it seems it's not simply the normal utf-8/iso-8859-1 discrepancy. Has anyone got an idea what's wrong?
Re: charfilter doesn't do anything
perfect, i tried it before but always at the tail of the expression with no effect. thanks a lot. a last question, do you know how to keep the html comments from being filtered before the transformer has done its work? On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote: > Okay, I can repro the problem. Yes, in appears that the pattern replace char > filter does not default to multiline mode for pattern matching, so on > one line and on another line cannot be matched. > > Now, whether that is by design or a bug or an option for enhancement is a > matter for some committer to comment on. > > But, the good news is that you can in fact set multiline mode in your pattern > my starting it with "(?s)", which means that dot accepts line break > characters as well. > > So, here are my revised field types: > > positionIncrementGap="100" > > >pattern="(?s)^.*<body>(.*)</body>.*$" replacement="$1" /> > > > > > > positionIncrementGap="100" > > >pattern="(?s)^.*<body>(.*)</body>.*$" replacement="$1" /> > > > > > > > The first type accepts everything within , including nested HTML > formatting, while the latter strips nested HTML formatting as well. > > The tokenizer will in fact strip out white space, but that happens after all > character filters have completed. > > -- Jack Krupansky > > -Original Message- From: Andreas Owen > Sent: Tuesday, September 10, 2013 7:07 AM > To: solr-user@lucene.apache.org > Subject: Re: charfilter doesn't do anything > > ok i am getting there now but if there are newlines involved the regex stops > as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex. I have > to get rid of the newlines. why isn't whitespaceTokenizerFactory the right > element for this? > > > On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote: > >> Use XML then. Although you will need to escape the XML special characters as >> I did in the pattern. >> >> The point is simply: Quickly and simply try to find the simple test scenario >> that illustrates the problem. >> >> -- Jack Krupansky >> >> -Original Message- From: Andreas Owen >> Sent: Monday, September 09, 2013 7:05 PM >> To: solr-user@lucene.apache.org >> Subject: Re: charfilter doesn't do anything >> >> i tried but that isn't working either, it want a data-stream, i'll have to >> check how to post json instead of xml >> >> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: >> >>> Did you at least try the pattern I gave you? >>> >>> The point of the curl was the data, not how you send the data. You can just >>> use the standard Solr simple post tool. >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Andreas Owen >>> Sent: Monday, September 09, 2013 6:40 PM >>> To: solr-user@lucene.apache.org >>> Subject: Re: charfilter doesn't do anything >>> >>> i've downloaded curl and tried it in the comman prompt and power shell on >>> my win 2008r2 server, thats why i used my dataimporter with a single line >>> html file and copy/pastet the lines into schema.xml >>> >>> >>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: >>> >>>> Did you in fact try my suggested example? If not, please do so. >>>> >>>> -- Jack Krupansky >>>> >>>> -Original Message- From: Andreas Owen >>>> Sent: Monday, September 09, 2013 4:42 PM >>>> To: solr-user@lucene.apache.org >>>> Subject: Re: charfilter doesn't do anything >>>> >>>> i index html pages with a lot of lines and not just a string with the >>>> body-tag. >>>> it doesn't work with proper html files, even though i took all the new >>>> lines out. 
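Jack's two field types arrive tag-stripped in the archive; their recoverable shape, with assumed type names and tokenizer (pattern and replacement are verbatim, note the XML-escaped angle brackets), is roughly:

    <fieldType name="text_body" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_body_stripped" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>

The leading (?s) is the whole trick: it puts the pattern into dotall mode so .* crosses line breaks; the second type additionally strips the nested markup, matching Jack's description.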
>>>> >>>> html-file: >>>> nav-content nur das will ich sehenfooter-content >>>> >>>> solr update debug output: >>>> "text_html": ["\r\n\r\n>>> content=\"ISO-8859-1\">\r\n>>> content=\"text/html; >>>> charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das >>>> will ich sehenfooter-content"] >>>> >>>> >>>> >>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: >>>> >>>>> I tried this and it se
Re: charfilter doesn't do anything
ok i am getting there now but if there are newlines involved the regex stops as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex. I have to get rid of the newlines. why isn't whitespaceTokenizerFactory the right element for this? On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote: > Use XML then. Although you will need to escape the XML special characters as > I did in the pattern. > > The point is simply: Quickly and simply try to find the simple test scenario > that illustrates the problem. > > -- Jack Krupansky > > -----Original Message- From: Andreas Owen > Sent: Monday, September 09, 2013 7:05 PM > To: solr-user@lucene.apache.org > Subject: Re: charfilter doesn't do anything > > i tried but that isn't working either, it want a data-stream, i'll have to > check how to post json instead of xml > > On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: > >> Did you at least try the pattern I gave you? >> >> The point of the curl was the data, not how you send the data. You can just >> use the standard Solr simple post tool. >> >> -- Jack Krupansky >> >> -Original Message- From: Andreas Owen >> Sent: Monday, September 09, 2013 6:40 PM >> To: solr-user@lucene.apache.org >> Subject: Re: charfilter doesn't do anything >> >> i've downloaded curl and tried it in the comman prompt and power shell on my >> win 2008r2 server, thats why i used my dataimporter with a single line html >> file and copy/pastet the lines into schema.xml >> >> >> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: >> >>> Did you in fact try my suggested example? If not, please do so. >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Andreas Owen >>> Sent: Monday, September 09, 2013 4:42 PM >>> To: solr-user@lucene.apache.org >>> Subject: Re: charfilter doesn't do anything >>> >>> i index html pages with a lot of lines and not just a string with the >>> body-tag. >>> it doesn't work with proper html files, even though i took all the new >>> lines out. >>> >>> html-file: >>> nav-content nur das will ich sehenfooter-content >>> >>> solr update debug output: >>> "text_html": ["\r\n\r\n>> content=\"ISO-8859-1\">\r\n>> charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das >>> will ich sehenfooter-content"] >>> >>> >>> >>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: >>> >>>> I tried this and it seems to work when added to the standard Solr example >>>> in 4.4: >>>> >>>> >>>> >>>> >>> positionIncrementGap="100" > >>>> >>>> >>> pattern="^.*<body>(.*)</body>.*$" replacement="$1" /> >>>> >>>> >>>> >>>> >>>> >>>> That char filter retains only text between and . Is that >>>> what you wanted? >>>> >>>> Indexing this data: >>>> >>>> curl 'localhost:8983/solr/update?commit=true' -H >>>> 'Content-type:application/json' -d ' >>>> [{"id":"doc-1","body":"abc A test. def"}]' >>>> >>>> And querying with these commands: >>>> >>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"; >>>> Shows all data >>>> >>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"; >>>> shows the body text >>>> >>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"; >>>> shows nothing (outside of body) >>>> >>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"; >>>> shows nothing (outside of body) >>>> >>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"; >>>> Shows nothing, HTML tag stripped >>>> >>>> In your original query, you didn't show us what your default field, df >>>> parameter, was. 
>>>> >>>> -- Jack Krupansky >>>> >>>> -Original Message- From: Andreas Owen >>>> Sent: Sunday, September 08,
Re: charfilter doesn't do anything
i've downloaded curl and tried it in the comman prompt and power shell on my win 2008r2 server, thats why i used my dataimporter with a single line html file and copy/pastet the lines into schema.xml On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: > Did you in fact try my suggested example? If not, please do so. > > -- Jack Krupansky > > -Original Message- From: Andreas Owen > Sent: Monday, September 09, 2013 4:42 PM > To: solr-user@lucene.apache.org > Subject: Re: charfilter doesn't do anything > > i index html pages with a lot of lines and not just a string with the > body-tag. > it doesn't work with proper html files, even though i took all the new lines > out. > > html-file: > nav-content nur das will ich sehenfooter-content > > solr update debug output: > "text_html": ["\r\n\r\n content=\"ISO-8859-1\">\r\n charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das > will ich sehenfooter-content"] > > > > On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: > >> I tried this and it seems to work when added to the standard Solr example in >> 4.4: >> >> >> >> > positionIncrementGap="100" > >> >> > pattern="^.*<body>(.*)</body>.*$" replacement="$1" /> >> >> >> >> >> >> That char filter retains only text between and . Is that what >> you wanted? >> >> Indexing this data: >> >> curl 'localhost:8983/solr/update?commit=true' -H >> 'Content-type:application/json' -d ' >> [{"id":"doc-1","body":"abc A test. def"}]' >> >> And querying with these commands: >> >> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"; >> Shows all data >> >> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"; >> shows the body text >> >> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"; >> shows nothing (outside of body) >> >> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"; >> shows nothing (outside of body) >> >> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"; >> Shows nothing, HTML tag stripped >> >> In your original query, you didn't show us what your default field, df >> parameter, was. >> >> -- Jack Krupansky >> >> -Original Message- From: Andreas Owen >> Sent: Sunday, September 08, 2013 5:21 AM >> To: solr-user@lucene.apache.org >> Subject: Re: charfilter doesn't do anything >> >> yes but that filter html and not the specific tag i want. >> >> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote: >> >>> Hmmm, have you looked at: >>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory >>> >>> Not quite the , perhaps, but might it help? >>> >>> >>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen wrote: >>> >>>> ok i have html pages with .content i >>>> want.. i want to extract (index, store) only >>>> that between the body-comments. i thought regexTransformer would be the >>>> best because xpath doesn't work in tika and i cant nest a >>>> xpathEntetyProcessor to use xpath. what i have also found out is that the >>>> htmlparser from tika cuts my body-comments out and tries to make well >>>> formed html, which i would like to switch off. >>>> >>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote: >>>> >>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote: >>>>>> i've managed to get it working if i use the regexTransformer and string >>>> is on the same line in my tika entity. but when the string is multilined it >>>> isn't working even though i tried ?s to set the flag dotall. 
>>>>>> >>>>>> >>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >>>> transformer="RegexTransformer"> >>>>>> >>> replaceWith="QQQ" sourceColName="text" /> >>>>>> >>>>>> >>>>>> then i tried it like this and i get a stackoverflow >>>>>> >>>>>
Re: charfilter doesn't do anything
I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
>
> The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.
>
> -- Jack Krupansky
>
> -Original Message- From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server; that's why I used my dataimporter with a single-line HTML file and copy/pasted the lines into schema.xml.
>
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>
>> Did you in fact try my suggested example? If not, please do so.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> I index HTML pages with a lot of lines, not just a string with the body tag. It doesn't work with proper HTML files, even though I took all the newlines out.
>>
>> html-file:
>> nav-content nur das will ich sehenfooter-content
>>
>> solr update debug output:
>> "text_html": ["\r\n\r\n content=\"ISO-8859-1\">\r\n charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das will ich sehenfooter-content"]
>>
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>
>>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>>>
>>> <fieldType ... positionIncrementGap="100">
>>> <charFilter ... pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>> ...
>>> </fieldType>
>>>
>>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>>>
>>> Indexing this data:
>>>
>>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc A test. def"}]'
>>>
>>> And querying with these commands:
>>>
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>> Shows all data
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>> shows the body text
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>> shows nothing (outside of body)
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>> shows nothing (outside of body)
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>> Shows nothing, HTML tag stripped
>>>
>>> In your original query, you didn't show us what your default field, df parameter, was.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>>
>>> Yes, but that filters HTML in general and not the specific tag I want.
>>>
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Not quite the , perhaps, but might it help?
>>>>
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen wrote:
>>>>
>>>>> OK, I have HTML pages with .content I want.. I want to extract (index, store) only that between the body-comments. I thought RegexTransformer would be the best because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body-comments out and tries to make well-formed HTML, which I would like to switch off.
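The "standard Solr simple post tool" Jack mentions is example/exampledocs/post.jar. A minimal sketch of posting his JSON test document with it instead of curl, assuming a stock Solr 4.x example setup (the file name doc-1.json is illustrative, and -D options must come before -jar):

  java -Dtype=application/json -Durl=http://localhost:8983/solr/update -jar post.jar doc-1.json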
Re: charfilter doesn't do anything
I index HTML pages with a lot of lines, not just a string with the body tag. It doesn't work with proper HTML files, even though I took all the newlines out.

html-file:
nav-content nur das will ich sehenfooter-content

solr update debug output:
"text_html": ["\r\n\r\n content=\"ISO-8859-1\">\r\n charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das will ich sehenfooter-content"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example in 4.4:
>
> <fieldType ... positionIncrementGap="100">
> <charFilter ... pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
> ...
> </fieldType>
>
> That char filter retains only text between <body> and </body>. Is that what you wanted?
>
> Indexing this data:
>
> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"doc-1","body":"abc A test. def"}]'
>
> And querying with these commands:
>
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
> Shows all data
>
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
> shows the body text
>
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
> shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
> shows nothing (outside of body)
>
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
> Shows nothing, HTML tag stripped
>
> In your original query, you didn't show us what your default field, df parameter, was.
>
> -- Jack Krupansky
>
> -Original Message- From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> Yes, but that filters HTML in general and not the specific tag I want.
>
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>
>> Not quite the , perhaps, but might it help?
>>
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen wrote:
>>
>>> OK, I have HTML pages with .content I want.. I want to extract (index, store) only that between the body-comments. I thought RegexTransformer would be the best because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body-comments out and tries to make well-formed HTML, which I would like to switch off.
>>>
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> I've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity, but when the string is multi-line it isn't working, even though I tried ?s to set the DOTALL flag.
>>>>>
>>>>> <entity ... dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
>>>>> <field ... replaceWith="QQQ" sourceColName="text" />
>>>>> </entity>
>>>>>
>>>>> Then I tried it like this and I get a stack overflow:
>>>>>
>>>>> <field ... replaceWith="QQQ" sourceColName="text" />
>>>>>
>>>>> In JavaScript this works, but maybe because I only used a small string.
>>>>
>>>> Sounds like we've got an XY problem here.
>>>>
>>>> http://people.apache.org/~hossman/#xyproblem
>>>>
>>>> How about you tell us *exactly* what you'd actually like to have happen and then we can find a solution for you?
>>>>
>>>> It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
>>>>
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>
>>>> Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.
>>>>
>>>> Thanks,
>>>> Shawn
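The archive stripped the XML tags from Jack's field type above. A reconstructed sketch of what such a definition looks like in schema.xml; only the char filter pattern is from the thread, while the type name, tokenizer, and lowercase filter are assumptions (note that < and > must be XML-escaped inside the attribute):

  <fieldType name="text_body" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- keep only the text between <body> and </body> -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>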
Re: charfilter doesn't do anything
Yes, but that filters HTML in general and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>
> Not quite the , perhaps, but might it help?
>
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen wrote:
>
>> OK, I have HTML pages with .content I want.. I want to extract (index, store) only that between the body-comments. I thought RegexTransformer would be the best because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body-comments out and tries to make well-formed HTML, which I would like to switch off.
>>
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> I've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity, but when the string is multi-line it isn't working, even though I tried ?s to set the DOTALL flag.
>>>>
>>>> <entity ... dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
>>>> <field ... replaceWith="QQQ" sourceColName="text" />
>>>> </entity>
>>>>
>>>> Then I tried it like this and I get a stack overflow:
>>>>
>>>> <field ... replaceWith="QQQ" sourceColName="text" />
>>>>
>>>> In JavaScript this works, but maybe because I only used a small string.
>>>
>>> Sounds like we've got an XY problem here.
>>>
>>> http://people.apache.org/~hossman/#xyproblem
>>>
>>> How about you tell us *exactly* what you'd actually like to have happen and then we can find a solution for you?
>>>
>>> It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>
>>> Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.
>>>
>>> Thanks,
>>> Shawn
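For comparison, the filter Erick points at strips all markup rather than selecting one element. A minimal sketch of an analyzer using it (field type name and tokenizer are assumptions):

  <fieldType name="text_nohtml" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- removes every HTML tag; it cannot keep just one div or comment-delimited block -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>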
Re: charfilter doesn't do anything
OK, I have HTML pages with .content I want.. I want to extract (index, store) only that between the body-comments. I thought RegexTransformer would be the best because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body-comments out and tries to make well-formed HTML, which I would like to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>> I've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity, but when the string is multi-line it isn't working, even though I tried ?s to set the DOTALL flag.
>>
>> <entity ... dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
>> <field ... replaceWith="QQQ" sourceColName="text" />
>> </entity>
>>
>> Then I tried it like this and I get a stack overflow:
>>
>> <field ... replaceWith="QQQ" sourceColName="text" />
>>
>> In JavaScript this works, but maybe because I only used a small string.
>
> Sounds like we've got an XY problem here.
>
> http://people.apache.org/~hossman/#xyproblem
>
> How about you tell us *exactly* what you'd actually like to have happen and then we can find a solution for you?
>
> It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>
> Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>
> Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.
>
> Thanks,
> Shawn
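Since the archive also stripped the DIH config discussed in this subthread, here is a hedged sketch of a Tika entity that keeps only the block between two marker comments with RegexTransformer. The (?s) embedded flag is Java's way of enabling DOTALL so '.' matches newlines; the marker comments and column names are assumptions:

  <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}"
          dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
          transformer="RegexTransformer">
    <!-- keep only what lies between <!--body--> and <!--/body--> (assumed markers) -->
    <field column="text" sourceColName="text"
           regex="(?s)^.*&lt;!--body--&gt;(.*)&lt;!--/body--&gt;.*$" replaceWith="$1"/>
  </entity>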
Re: charfilter doesn't do anything
I've managed to get it working if I use the RegexTransformer and the string is on the same line in my Tika entity, but when the string is multi-line it isn't working, even though I tried ?s to set the DOTALL flag.

<entity ... dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" transformer="RegexTransformer">
<field ... replaceWith="QQQ" sourceColName="text" />
</entity>

Then I tried it like this and I get a stack overflow:

<field ... replaceWith="QQQ" sourceColName="text" />

In JavaScript this works, but maybe because I only used a small string.

On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that you changed your schema since you indexed the data? If so, re-index the data.
>
> If a "*" query finds nothing, that implies that the default field is empty. Are you sure the "df" parameter is set to the field containing your data? Show us your request handler definition and a sample of your actual Solr input (Solr XML or JSON?) so that we can see what fields are being populated.
>
> -- Jack Krupansky
>
> -Original Message- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> The input string is a normal HTML page with the word Zahlungsverkehr in it, and my query is ...solr/collection1/select?q=*
>
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
>
>> And show us an input string and a query that fail.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> I would like to filter / replace a word during indexing, but it doesn't do anything and I don't get an error.
>>>
>>> In schema.xml I have the following:
>>>
>>> <field ... multiValued="true"/>
>>>
>>> <charFilter ... pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>>
>>> My 2nd question: where can I say that the expression is multi-line? In JavaScript I can use /m at the end of the pattern.
>>
>> I don't know about your second question. I don't know if that will be possible, but I'll leave that to someone who's more expert than I.
>>
>> As for the first question, here's what I have. Did you reindex? That will be required.
>>
>> http://wiki.apache.org/solr/HowToReindex
>>
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just "Zahlungsverkehr"? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow.
>>
>> Note that both the pattern and replacement are case sensitive. This is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk.
>>
>> Use the analysis tab in the UI on your core to see what Solr does to your field text.
>>
>> Thanks,
>> Shawn
Re: charfilter doesn't do anything
The input string is a normal HTML page with the word Zahlungsverkehr in it, and my query is ...solr/collection1/select?q=*

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

> And show us an input string and a query that fail.
>
> -- Jack Krupansky
>
> -Original Message- From: Shawn Heisey
> Sent: Thursday, September 05, 2013 2:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>> I would like to filter / replace a word during indexing, but it doesn't do anything and I don't get an error.
>>
>> In schema.xml I have the following:
>>
>> <field ... multiValued="true"/>
>>
>> <charFilter ... pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>
>> My 2nd question: where can I say that the expression is multi-line? In JavaScript I can use /m at the end of the pattern.
>
> I don't know about your second question. I don't know if that will be possible, but I'll leave that to someone who's more expert than I.
>
> As for the first question, here's what I have. Did you reindex? That will be required.
>
> http://wiki.apache.org/solr/HowToReindex
>
> Assuming that you did reindex, are you trying to search for ASDFGHJK in a field that contains more than just "Zahlungsverkehr"? The keyword tokenizer might not do what you expect - it tokenizes the entire input string as a single token, which means that you won't be able to search for single words in a multi-word field without wildcards, which are pretty slow.
>
> Note that both the pattern and replacement are case sensitive. This is how regex works. You haven't used a lowercase filter, which means that you won't be able to search for asdfghjk.
>
> Use the analysis tab in the UI on your core to see what Solr does to your field text.
>
> Thanks,
> Shawn
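One thing worth spelling out here: q=* is a wildcard term query against the default field, not a match-all query; match-all is q=*:*. A sketch of both, with an explicit df so the query does not depend on the solrconfig default (the field name "text" is an assumption):

  curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true"
  curl "http://localhost:8983/solr/collection1/select?q=ASDFGHJK&df=text&wt=json&indent=true"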
charfilter doesn't do anything
I would like to filter / replace a word during indexing, but it doesn't do anything and I don't get an error.

In schema.xml I have the following:

<field ... multiValued="true"/>

<charFilter ... pattern="Zahlungsverkehr" replacement="ASDFGHJK" />

My 2nd question: where can I say that the expression is multi-line? In JavaScript I can use /m at the end of the pattern.
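On the second question: Java regular expressions have no trailing /m modifier; flags are embedded at the start of the pattern instead, (?m) for MULTILINE, (?s) for DOTALL, (?i) for case-insensitive. A sketch using the pattern from this thread (the flag only has an effect once the pattern contains ., ^ or $):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?s)Zahlungsverkehr" replacement="ASDFGHJK"/>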
Re: dataimporter tika doesn't extract certain div
Or could I use a filter in schema.xml, where I define a fieldtype and use some filter that understands XPath?

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:

> No, that wouldn't work. It seems that you probably need a custom Transformer to extract the right div content. I do not know if TikaEntityProcessor supports such a thing.
>
> On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen wrote:
>> So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?
>>
>> <entity ... forEach="/div[@id='content']" dataSource="main">
>> <entity ... url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
>> ...
>>
>> But now I don't know how to pass the text to Tika. What do I put in url and dataSource?
>>
>> On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
>>
>>> I don't know much about Tika but in the example data-config.xml that you posted, the "xpath" attribute on the field "text" won't work because the xpath attribute is used only by a XPathEntityProcessor.
>>>
>>> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen wrote:
>>>> I want Tika to only index the content in ... for the field "text". Unfortunately it's indexing the whole page. Can't XPath do this?
>>>>
>>>> data-config.xml:
>>>>
>>>> <entity ... url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>> ...
>>>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
>>>> ...
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: dataimporter tika doesn't extract certain div
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is there something like XPath for Tika?

<entity ... forEach="/div[@id='content']" dataSource="main">
<entity ... url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
...

But now I don't know how to pass the text to Tika. What do I put in url and dataSource?

On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:

> I don't know much about Tika but in the example data-config.xml that you posted, the "xpath" attribute on the field "text" won't work because the xpath attribute is used only by a XPathEntityProcessor.
>
> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen wrote:
>> I want Tika to only index the content in ... for the field "text". Unfortunately it's indexing the whole page. Can't XPath do this?
>>
>> data-config.xml:
>>
>> <entity ... url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>> ...
>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
>> ...
>
> --
> Regards,
> Shalin Shekhar Mangar.
dataimporter tika doesn't extract certain div
I want Tika to only index the content in ... for the field "text". Unfortunately it's indexing the whole page. Can't XPath do this?

data-config.xml:

<entity ... url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
...
<entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" >
...
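As Shalin notes above, the xpath attribute is only honored by XPathEntityProcessor, and that processor reads well-formed XML with a limited XPath subset (absolute paths and simple attribute predicates, no //). A heavily hedged sketch of what it would look like if the pages were valid XHTML; in practice real-world HTML fails to parse, which is why the thread moves on to Tika plus regex:

  <entity name="page" processor="XPathEntityProcessor" url="${rec.path}${rec.file}"
          forEach="/html" dataSource="dataUrl">
    <!-- only works on well-formed XHTML; path and field names are assumptions -->
    <field column="text" xpath="/html/body/div[@id='content']"/>
  </entity>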
Re: dataimporter tika fields empty
I changed the following line (xpath): ...

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because the Tika processor does not support path extraction. You need to nest one more level.
>
> Regards,
> Alex
>
> On 22 Aug 2013 13:34, "Andreas Owen" wrote:
>
>> I can do it like this, but then the content isn't copied to text; it's just in text_test.
>>
>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>> ...
>>
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>>
>>> I put it in the tika-entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
>>>
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>>
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>>
>>>> Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.
>>>>
>>>> Regards,
>>>> Alex.
>>>>
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>>
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen wrote:
>>>>
>>>>> I'm trying to index an HTML page and only use the div with the id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated.
>>>>>
>>>>> Do I have to use copyField for text_test to get the data?
>>>>> Or is there a problem with the entity hierarchy?
>>>>> Or is the xpath wrong, even though I've tried it without and just using text?
>>>>> Or should I use the update extractor?
>>>>>
>>>>> data-config.xml:
>>>>>
>>>>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>>> <entity ... url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>> ...
>>>>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>> ...
>>>>> <field ... xpath="//div[@id='content']" />
>>>>> ...
>>>>>
>>>>> docImporterUrl.xml:
>>>>>
>>>>> <docs>
>>>>> <doc>
>>>>> 5 tkb Startseite blabla ... http://localhost/tkb/internet/index.cfm http://localhost/tkb/internet/index.cfm/url http\specialConf
>>>>> </doc>
>>>>> <doc>
>>>>> 6 tkb Eigenheim Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt. http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>>>> </doc>
>>>>> </docs>
Re: dataimporter tika fields empty
OK, but I'm not doing any path extraction, at least I don't think so. htmlMapper="identity" isn't preserving the HTML; it's reading the content of the pages, but it's not putting it into both "text_test" and "text". It's only in "text_test"; the copyField isn't working.

data-config.xml:

<entity ... url="http://127.0.0.1/tkb/internet/docImportUrl.xml" forEach="/docs/doc" dataSource="main">
...

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because the Tika processor does not support path extraction. You need to nest one more level.
>
> Regards,
> Alex
>
> On 22 Aug 2013 13:34, "Andreas Owen" wrote:
>
>> I can do it like this, but then the content isn't copied to text; it's just in text_test.
>>
>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>> ...
>>
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>>
>>> I put it in the tika-entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
>>>
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>>
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>>
>>>> Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.
>>>>
>>>> Regards,
>>>> Alex.
>>>>
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>>
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen wrote:
>>>>
>>>>> I'm trying to index an HTML page and only use the div with the id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated.
>>>>>
>>>>> Do I have to use copyField for text_test to get the data?
>>>>> Or is there a problem with the entity hierarchy?
>>>>> Or is the xpath wrong, even though I've tried it without and just using text?
>>>>> Or should I use the update extractor?
>>>>>
>>>>> data-config.xml:
>>>>>
>>>>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>>> <entity ... url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>> ...
>>>>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>> ...
>>>>> <field ... xpath="//div[@id='content']" />
>>>>> ...
>>>>>
>>>>> docImporterUrl.xml:
>>>>>
>>>>> <docs>
>>>>> <doc>
>>>>> 5 tkb Startseite blabla ... http://localhost/tkb/internet/index.cfm http://localhost/tkb/internet/index.cfm/url http\specialConf
>>>>> </doc>
>>>>> <doc>
>>>>> 6 tkb Eigenheim Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt. http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>>>> </doc>
>>>>> </docs>
Re: dataimporter tika fields empty
I can do it like this, but then the content isn't copied to text; it's just in text_test.

<entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
...

On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

> I put it in the tika-entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.
>
> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>
>> Can you try SOLR-4530 switch:
>> https://issues.apache.org/jira/browse/SOLR-4530
>>
>> Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.
>>
>> Regards,
>> Alex.
>>
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>
>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen wrote:
>>
>>> I'm trying to index an HTML page and only use the div with the id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated.
>>>
>>> Do I have to use copyField for text_test to get the data?
>>> Or is there a problem with the entity hierarchy?
>>> Or is the xpath wrong, even though I've tried it without and just using text?
>>> Or should I use the update extractor?
>>>
>>> data-config.xml:
>>>
>>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>> <entity ... url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>> ...
>>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>> ...
>>> <field ... xpath="//div[@id='content']" />
>>> ...
>>>
>>> docImporterUrl.xml:
>>>
>>> <docs>
>>> <doc>
>>> 5 tkb Startseite blabla ... http://localhost/tkb/internet/index.cfm http://localhost/tkb/internet/index.cfm/url http\specialConf
>>> </doc>
>>> <doc>
>>> 6 tkb Eigenheim Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt. http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>> </doc>
>>> </docs>
Re: dataimporter tika fields empty
I put it in the tika-entity as an attribute, but it doesn't change anything. My bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

> Can you try SOLR-4530 switch:
> https://issues.apache.org/jira/browse/SOLR-4530
>
> Specifically, setting htmlMapper="identity" on the entity definition. This will tell Tika to send full HTML rather than a seriously stripped one.
>
> Regards,
> Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>
> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen wrote:
>
>> I'm trying to index an HTML page and only use the div with the id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated.
>>
>> Do I have to use copyField for text_test to get the data?
>> Or is there a problem with the entity hierarchy?
>> Or is the xpath wrong, even though I've tried it without and just using text?
>> Or should I use the update extractor?
>>
>> data-config.xml:
>>
>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>> <entity ... url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>> ...
>> <entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
>> ...
>> <field ... xpath="//div[@id='content']" />
>> ...
>>
>> docImporterUrl.xml:
>>
>> <docs>
>> <doc>
>> 5 tkb Startseite blabla ... http://localhost/tkb/internet/index.cfm http://localhost/tkb/internet/index.cfm/url http\specialConf
>> </doc>
>> <doc>
>> 6 tkb Eigenheim Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt. http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>> </doc>
>> </docs>
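For reference, the SOLR-4530 switch Alexandre means goes directly on the Tika entity (available from Solr 4.3 on). A minimal sketch, with the URL and field names taken loosely from this thread:

  <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}"
          dataSource="dataUrl" htmlMapper="identity" format="html">
    <!-- "text" is the content Tika extracts; map it to the Solr field text_test -->
    <field column="text" name="text_test"/>
  </entity>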
dataimporter tika fields empty
I'm trying to index an HTML page and only use the div with the id="content". Unfortunately nothing is working within the tika-entity; only the standard text (content) is populated.

Do I have to use copyField for text_test to get the data?
Or is there a problem with the entity hierarchy?
Or is the xpath wrong, even though I've tried it without and just using text?
Or should I use the update extractor?

data-config.xml:

<dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
<entity ... url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
...
<entity ... url="${rec.path}${rec.file}" dataSource="dataUrl" >
...
<field ... xpath="//div[@id='content']" />
...

docImporterUrl.xml:

<docs>
<doc>
5 tkb Startseite blabla ... http://localhost/tkb/internet/index.cfm http://localhost/tkb/internet/index.cfm/url http\specialConf
</doc>
<doc>
6 tkb Eigenheim Machen Sie sich erste Gedanken über den Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht gelingt. http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
</doc>
</docs>
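Since half of this thread is about content landing in text_test but not in text, a sketch of the schema.xml pieces that have to line up for that to work; copyField duplicates the incoming value at index time, so both fields must exist, and the destination is typically multiValued (field types here are assumptions):

  <field name="text_test" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="text_test" dest="text"/>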
Re: dataimporter, custom fields and parsing error
I have tried post.jar, and it works when I set the literal.id in solrconfig.xml. I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an error: "could not find or load main class .id=abc".

On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote:

> Path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such an output.
>
> On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:
>
>> Are the "path" and "text" fields set to "stored" in the schema.xml?
>>
>> On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen wrote:
>>
>>> They are in my schema; path is typed correctly, and the others are default fields which already exist. All the other fields are populated and I can search for them; just path and text aren't.
>>>
>>> On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:
>>>
>>>> Dumb question: they are in your schema? Spelled right, in the right section, using types also defined? Can you populate them by hand with a CSV file and post.jar?
>>>>
>>>> Regards,
>>>> Alex.
>>>>
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>>
>>>> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen wrote:
>>>>
>>>>> I'm using Solr 4.3, which I just downloaded today, and am using only jars that came with it. I have enabled the dataimporter and it runs without error, but the field "path" (included in schema.xml) and "text" (file content) aren't indexed. What am I doing wrong?
>>>>>
>>>>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>>>>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>>>>> pdf-doc-path: C:\web\development\tkb\internet\public
>>>>>
>>>>> data-config.xml:
>>>>>
>>>>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>>> <entity ... url="docImportUrl.xml" forEach="/albums/album" dataSource="main">
>>>>> ...
>>>>> <entity ... url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data" >
>>>>> ...
>>>>>
>>>>> docImportUrl.xml:
>>>>>
>>>>> <albums>
>>>>> <album>
>>>>> Peter Z. Beratungsseminar kundenbrief wie kommuniziert man 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf download/online
>>>>> </album>
>>>>> <album>
>>>>> Marcel X. kuchen backen torten, kuchen, gebäck ... Kundenbrief.pdf download/online
>>>>> </album>
>>>>> </albums>
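The "could not find or load main class .id=abc" error is what it looks like when the shell splits the -D token at the second '=', so java treats ".id=abc" as the class name; -D options also have to come before -jar. A sketch that should avoid both problems on Windows/PowerShell (URL and file name are illustrative):

  java "-Dparams=literal.id=abc" -Durl=http://localhost:8983/solr/update/extract -Dtype=application/pdf -jar post.jar myTest.pdf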
Re: dataimporter, custom fields and parsing error
Path was set, text wasn't, but it doesn't make a difference. My importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand how it can have 2 docs indexed with such an output.

On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

> Are the "path" and "text" fields set to "stored" in the schema.xml?
>
> On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen wrote:
>
>> They are in my schema; path is typed correctly, and the others are default fields which already exist. All the other fields are populated and I can search for them; just path and text aren't.
>>
>> On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:
>>
>>> Dumb question: they are in your schema? Spelled right, in the right section, using types also defined? Can you populate them by hand with a CSV file and post.jar?
>>>
>>> Regards,
>>> Alex.
>>>
>>> Personal website: http://www.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>
>>> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen wrote:
>>>
>>>> I'm using Solr 4.3, which I just downloaded today, and am using only jars that came with it. I have enabled the dataimporter and it runs without error, but the field "path" (included in schema.xml) and "text" (file content) aren't indexed. What am I doing wrong?
>>>>
>>>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>>>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>>>> pdf-doc-path: C:\web\development\tkb\internet\public
>>>>
>>>> data-config.xml:
>>>>
>>>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>>> <entity ... url="docImportUrl.xml" forEach="/albums/album" dataSource="main">
>>>> ...
>>>> <entity ... url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data" >
>>>> ...
>>>>
>>>> docImportUrl.xml:
>>>>
>>>> <albums>
>>>> <album>
>>>> Peter Z. Beratungsseminar kundenbrief wie kommuniziert man 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf download/online
>>>> </album>
>>>> <album>
>>>> Marcel X. kuchen backen torten, kuchen, gebäck ... Kundenbrief.pdf download/online
>>>> </album>
>>>> </albums>
Re: dataimporter, custom fields and parsing error
They are in my schema; path is typed correctly, and the others are default fields which already exist. All the other fields are populated and I can search for them; just path and text aren't.

On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

> Dumb question: they are in your schema? Spelled right, in the right section, using types also defined? Can you populate them by hand with a CSV file and post.jar?
>
> Regards,
> Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>
> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen wrote:
>
>> I'm using Solr 4.3, which I just downloaded today, and am using only jars that came with it. I have enabled the dataimporter and it runs without error, but the field "path" (included in schema.xml) and "text" (file content) aren't indexed. What am I doing wrong?
>>
>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>> pdf-doc-path: C:\web\development\tkb\internet\public
>>
>> data-config.xml:
>>
>> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>> <entity ... url="docImportUrl.xml" forEach="/albums/album" dataSource="main">
>> ...
>> <entity ... url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data" >
>> ...
>>
>> docImportUrl.xml:
>>
>> <albums>
>> <album>
>> Peter Z. Beratungsseminar kundenbrief wie kommuniziert man 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf download/online
>> </album>
>> <album>
>> Marcel X. kuchen backen torten, kuchen, gebäck ... Kundenbrief.pdf download/online
>> </album>
>> </albums>
dataimporter, custom fields and parsing error
I'm using Solr 4.3, which I just downloaded today, and am using only jars that came with it. I have enabled the dataimporter and it runs without error, but the field "path" (included in schema.xml) and "text" (file content) aren't indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public

data-config.xml:

<dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
<entity ... url="docImportUrl.xml" forEach="/albums/album" dataSource="main">
...
<entity ... url="../../../../../web/development/tkb/internet/public/${rec.path}/${rec.id}" dataSource="data" >
...

docImportUrl.xml:

<albums>
<album>
Peter Z. Beratungsseminar kundenbrief wie kommuniziert man 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf download/online
</album>
<album>
Marcel X. kuchen backen torten, kuchen, gebäck ... Kundenbrief.pdf download/online
</album>
</albums>
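Shalin's question above is the key one: a field only comes back in results if it is stored, and it is only searchable if it is indexed. A sketch of declarations for the two problem fields (type names are assumptions):

  <field name="path" type="string" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>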
Re: solr autodetectparser tikaconfig dataimporter error
I have now changed some things and the import runs without error. In schema.xml I haven't got the field "text" but "contentsExact". Unfortunately the text (from the file) isn't indexed, even though I mapped it to the proper field. What am I doing wrong?

data-config.xml:

<dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
...

I noticed that when I move the field author into the tika-entity, it isn't indexed. Can this have something to do with why the text from the file isn't indexed? Do I have to do something special about the entity levels in the data-config?

PS: how do I import tsstamp? It's a static value.

On 14. Jul 2013, at 10:30 PM, Jack Krupansky wrote:

> "Caused by: java.lang.NoSuchMethodError:"
>
> That means you have some out of date jars or some newer jars mixed in with the old ones.
>
> -- Jack Krupansky
>
> -Original Message- From: Andreas Owen
> Sent: Sunday, July 14, 2013 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr autodetectparser tikaconfig dataimporter error
>
> Hi, is there no one with an idea what this error is, or who can even give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse?
>
> Thanks for any help.
>
> On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:
>
>> I am using Solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index, txt, cfm, pdf, all the same error:
>>
>> SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
>> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
>> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
>> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
>> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
>> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
>> Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>> at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
>> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
>> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
>> ... 6 more
>>
>> Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
>> SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
>> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
>> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
>> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
>> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
>> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
>> at ...
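On the PS about tsstamp: a static value can be injected with TemplateTransformer in the DIH config. A minimal sketch (entity attributes abbreviated, the timestamp value is a placeholder):

  <entity name="rec" transformer="TemplateTransformer" ...>
    <!-- every imported row gets this constant value -->
    <field column="tsstamp" template="2013-07-20T00:00:00Z"/>
  </entity>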
Re: solr autodetectparser tikaconfig dataimporter error
Hi, is there no one with an idea what this error is, or who can even give me a pointer where to look? If not, is there an alternative way to import documents from an XML file with metadata and the filename to parse?

Thanks for any help.

On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

> I am using Solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index, txt, cfm, pdf, all the same error:
>
> SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
> Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
> at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
> ... 6 more
>
> Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
> SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
> at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
> at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
> at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
> at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
> at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
> Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
> at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
> at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
> at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
> ... 6 more
>
> Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback
>
> data-config.xml:
>
> <dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
> <entity ... url="docImport.xml" forEach="/albums/album" dataSource="main">
> ...
> <entity ... url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
> ...
>
> The libs are included and declared in the logs. I have also tried tika-app 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
solr autodetectparser tikaconfig dataimporter error
I am using Solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a file via XML I get this error; it doesn't matter what file format I try to index, txt, cfm, pdf, all the same error:

SEVERE: Exception while processing: rec document : SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:

<dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
<entity ... url="docImport.xml" forEach="/albums/album" dataSource="main">
...
<entity ... url="file:///C:\web\development\tkb\internet\public\download\online\${rec.id}" dataSource="data" onerror="skip">
...

The libs are included and declared in the logs. I have also tried tika-app 1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't know where to start looking for the error.
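As Jack points out in the reply above, a NoSuchMethodError from TikaEntityProcessor almost always means the Tika jars on the classpath do not match the version Solr's DIH extras were compiled against; dropping tika-app-1.4 next to Solr 3.5 does exactly that. The safer setup is to load only the Tika jars shipped with that Solr release, e.g. via solrconfig.xml (directory layout is an assumption):

  <!-- load the Tika jars bundled with this Solr release, not a newer tika-app -->
  <lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar"/>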