Re: ngramfilter minGramSize problem

2014-04-07 Thread Andreas Owen
it works well. now, why does the search only find something when the field name is
added to a query that contains stopwords?


"cug" -> 9 hits
"mit cug" -> 0 hits
"plain_text:mit cug" -> 9 hits

why is this so? could the problem be that stopwords aren't removed from the
query because not all of the fields that are searched have the stopword filter?
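A likely explanation: with q.op=AND and a qf list where only some fields strip German stopwords, "mit" must still match in one of the fields that keep it, so "mit cug" returns nothing, while the fielded form sends "mit" to plain_text, where the stop filter drops it. A sketch of one fix, assuming the same stop filter is added to the query analyzer of every qf field (the tokenizer shown is illustrative):

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="lang/stopwords_de.txt" format="snowball"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>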



On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI   
wrote:



Correction: My patch is at SOLR-5152
On 7 Apr 2014 01:05, "Andreas Owen" wrote:


i thought i could use a filter (the element itself was stripped from this archived
message) to index and search words that are only 1 or 2 chars long. it
seems to work but i have to test it some more


On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen 
wrote:

i have a fieldtype that uses the ngramfilter while indexing. is there a
setting that can force the ngramfilter to index words smaller than the
minGramSize? Mine is set to 3 and the search won't find words that are
only 1 or 2 chars long. i would like to not set minGramSize=1 because the
results would be too diverse.

fieldtype: [XML stripped in the archive; surviving fragments show a stop filter with
ignoreCase="true" words="lang/stopwords_de.txt" format="snowball"
enablePositionIncrements="true" and a solr.SnowballPorterFilterFactory with
language="German"]

--
Using Opera's mail client: http://www.opera.com/mail/




--
Using Opera's mail client: http://www.opera.com/mail/


Re: ngramfilter minGramSize problem

2014-04-06 Thread Andreas Owen
i thought i could use a filter with max="2" (the rest of the element was stripped from
this archived message) to index and search words that are only 1 or 2 chars long. it
seems to work but i have to test it some more
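
For reference, a minimal sketch of such a companion field for 1-2 character tokens, with illustrative names (text_short, plain_text_short) rather than whatever the stripped element above actually was:

<fieldType name="text_short" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep only the 1-2 character tokens that the n-grammed field drops -->
    <filter class="solr.LengthFilterFactory" min="1" max="2"/>
  </analyzer>
</fieldType>

<field name="plain_text_short" type="text_short" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_short"/>

With plain_text_short added to qf, 1 and 2 character words can still match while the main field keeps minGramSize=3.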



On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen   
wrote:


i have a fieldtype that uses the ngramfilter while indexing. is there a
setting that can force the ngramfilter to index words smaller than the
minGramSize? Mine is set to 3 and the search won't find words that are
only 1 or 2 chars long. i would like to not set minGramSize=1 because
the results would be too diverse.


fieldtype: [XML stripped in the archive; surviving fragments show positionIncrementGap="100",
an index analyzer with a stop filter (words="lang/stopwords_de.txt" format="snowball"
enablePositionIncrements="true"), a WordDelimiter filter (generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1") and an
n-gram filter with maxGramSize="50"; the query analyzer repeats the stop and
WordDelimiter filters]


--
Using Opera's mail client: http://www.opera.com/mail/


ngramfilter minGramSize problem

2014-04-06 Thread Andreas Owen
i have a fieldtype that uses the ngramfilter while indexing. is there a
setting that can force the ngramfilter to index words smaller than the
minGramSize? Mine is set to 3 and the search won't find words that are only
1 or 2 chars long. i would like to not set minGramSize=1 because the
results would be too diverse.


fieldtype: [XML stripped in the archive; surviving fragments show positionIncrementGap="100",
an index analyzer with a stop filter (words="lang/stopwords_de.txt" format="snowball"
enablePositionIncrements="true"), a WordDelimiter filter (generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1") and an
n-gram filter with maxGramSize="50"; the query analyzer repeats the stop and
WordDelimiter filters]



Re: dih data-config.xml onImportEnd event

2014-03-27 Thread Andreas Owen

sorry, the previous conversation was started with a wrong email address.

On Thu, 27 Mar 2014 14:06:57 +0100, Stefan Matheis  
 wrote:


I would suggest you read the replies to your last mail (containing the  
very same question) first?


-Stefan


On Thursday, March 27, 2014 at 1:56 PM, Andreas Owen wrote:


i would like to call a url after the import is finished with the onImportEnd
event. how can i do this?








--
Using Opera's mail client: http://www.opera.com/mail/


dih data-config.xml onImportEnd event

2014-03-27 Thread Andreas Owen

i would like to call a url after the import is finished with the onImportEnd
event. how can i do this?
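
One way this is usually wired up, sketched here with a hypothetical listener class name (the class itself would be custom Java implementing org.apache.solr.handler.dataimport.EventListener and calling the URL):

<dataConfig>
  <!-- dataSource stays as already configured -->
  <!-- onImportEnd names the custom EventListener;
       com.example.PingUrlListener is a made-up name -->
  <document onImportEnd="com.example.PingUrlListener">
    <!-- existing entity definitions go here unchanged -->
  </document>
</dataConfig>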


facet doesnt display all possibilities after selecting one

2014-03-27 Thread Andreas Owen
when i select a facet in "thema_f" all the other values in that group disappear,
but the other facets keep their original counts. it seems like it should
work. maybe the underscore is the wrong char for the separator?
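
For what it's worth, the {!ex=thema_f} exclusion on facet.field (see the solrconfig below) only takes effect when the filter query carries a matching tag; a sketch of the pairing, using a value from the example documents:

fq={!tag=thema_f}thema_f:"1_Beratung"
facet.field={!ex=thema_f}thema_f

Without the {!tag=...} on the fq, the thema_f counts are computed against the already filtered result set, so the other values in that group disappear.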


example documents in index [XML stripped in the archive; surviving values]:

doc dms:381  - thema_f: 1_Produkte
doc dms:2679 - thema_f: 1_Beratung, 1_Beratung_Beratungsportal PK
doc dms:190  - thema_f: 1_Beratung, 1_Beratung_Beratungsportal PK




solrconfig.xml


 
   explicit
   10
   synonym_edismax
   true
   plain_text^10 editorschoice^200
title^20 h_*^14
tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
productsegment^5 productgroup^5 contentmanager^5 links^5
last_modified^5 url^5
   
   (expiration:[NOW TO *] OR (*:* -expiration:*))^6
   div(clicks,max(displays,1))^8 

   text
   *,path,score
   json
   AND

   
   on
   plain_text,title
   200
   
   


on
1
false
{!ex=inhaltstyp_s}inhaltstyp_s
index
{!ex=doctype}doctype
index
{!ex=thema_f}thema_f
index
{!ex=productsegment_f}productsegment_f
index
{!ex=productgroup_f}productgroup_f
index
{!ex=author_s}author_s
index
		name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s

index
{!ex=veranstaltung_s}veranstaltung_s
index
		name="facet.field">{!ex=kundensegment_aktive_beratung}kundensegment_aktive_beratung

index
{!ex=last_modified}last_modified
+1MONTH
NOW/MONTH+1MONTH
NOW/MONTH-36MONTHS
after






schema.xml [fieldType XML stripped in the archive; only positionIncrementGap="100" survives]


dih data-config.xml onImportEnd event

2014-03-27 Thread Andreas Owen
i would like to call a url after the import is finished with the onImportEnd
event. how can i do this?


wrong results with wdf & ngtf

2014-03-20 Thread Andreas Owen
Is there a way to tell the ngramfilterfactory while indexing that numbers shall
never be tokenized? Then the query should be able to find numbers.

 

Or do i have to change the ngram-min for numbers (not alpha) to 1, if that
is possible? So to speak, index the whole number as a token and not all possible
tokens.

 

Solr analysis shows only WDF has no underscore in its tokens, the rest have
it. can i tell the query to search numbers differently with NGTF, WT, LCF or
whatever?

 

I also tried 

@ => ALPHA

_ => ALPHA

 

I have gotten nearly everything to work. There are two queries where i don't
get back what i want.

 

"avaloq frage 1"   -> only returns if i set
minGramSize=1 while indexing

"yh_cug"-> query parser doesn't
remove "_" but the indexer does (WDF) so there is no match

 

Is there a way to also query the whole term "avaloq frage 1" without
tokenizing it?
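
One way to match the whole phrase without it being n-grammed is a copyField into a field whose analyzer only splits on whitespace and lowercases, and then querying that field with a phrase query; a sketch with illustrative names (text_plain, plain_text_plain):

<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="plain_text_plain" type="text_plain" indexed="true" stored="false"/>
<copyField source="plain_text" dest="plain_text_plain"/>

A phrase query such as plain_text_plain:"avaloq frage 1" (or adding the field to pf) then matches the term as a whole instead of through its grams.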

 

Fieldtype: [XML stripped in the archive]

 

Solrconfig:

 

>  class="solr.SynonymExpandingExtendedDismaxQParserPlugin">

>   

> 

>   

> standard

>   

>   

> shingle

> true

> true

> 2

> 4

>   

>   

> synonym

> solr.KeywordTokenizerFactory

> synonyms.txt

> true

> true

>   

> 

>   

> 

> 

> 

>  

>explicit

>10

>synonym_edismax

>true

>plain_text^10 editorschoice^200

> title^20 h_*^14

> tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10

> contentmanager^5 links^5

> last_modified^5 url^5

>

>(expiration:[NOW TO *] OR (*:* 

> -expiration:*))^6

>div(clicks,max(displays,1))^8 

> 

>text

>*,path,score

>json

>AND

> 

>

>on

>plain_text,title

>200

>

>

> 

> 

> on

> 1

> {!ex=inhaltstyp_s}inhaltstyp_s

> index

> {!ex=doctype}doctype

> index

> {!ex=thema_f}thema_f

> index

> {!ex=author_s}author_s

> index

>  name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s

> index

> {!ex=veranstaltung_s}veranstaltung_s

> index

> {!ex=last_modified}last_modified

> +1MONTH

> NOW/MONTH+1MONTH

> NOW/MONTH-36MONTHS

> after

> 

>

> 

 



wrong query results with wdf and ngtf

2014-03-20 Thread Andreas Owen
Is there a way to tell the ngramfilterfactory while indexing that numbers shall 
never be tokenized? Then the query should be able to find numbers.

Or do i have to change the ngram-min for numbers (not alpha) to 1, if that is 
possible? So to speak, index the whole number as a token and not all possible tokens.

Solr analysis shows only WDF has no underscore in its tokens, the rest have 
it. can i tell the query to search numbers differently with NGTF, WT, LCF or 
whatever?

I also tried 
@ => ALPHA
_ => ALPHA

I have gotten nearly everything to work. There are two queries where i don't get 
back what i want.

"avaloq frage 1"-> only returns if i set minGramSize=1 while 
indexing
"yh_cug"-> query parser doesn't remove "_" but the 
indexer does (WDF) so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing 
it?

Fieldtype: [XML stripped in the archive]


Solrconfig:

>  class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
>   
> 
>   
> standard
>   
>   
> shingle
> true
> true
> 2
> 4
>   
>   
> synonym
> solr.KeywordTokenizerFactory
> synonyms.txt
> true
> true
>   
> 
>   
> 
> 
> 
>  
>explicit
>10
>synonym_edismax
>true
>plain_text^10 editorschoice^200
> title^20 h_*^14
> tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
> contentmanager^5 links^5
> last_modified^5 url^5
>
>(expiration:[NOW TO *] OR (*:* 
> -expiration:*))^6
>div(clicks,max(displays,1))^8 
> 
>text
>*,path,score
>json
>AND
> 
>
>on
>plain_text,title
>200
>
>
> 
> 
> on
> 1
> {!ex=inhaltstyp_s}inhaltstyp_s
> index
> {!ex=doctype}doctype
> index
> {!ex=thema_f}thema_f
> index
> {!ex=author_s}author_s
> index
>  name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s
> index
> {!ex=veranstaltung_s}veranstaltung_s
> index
> {!ex=last_modified}last_modified
> +1MONTH
> NOW/MONTH+1MONTH
> NOW/MONTH-36MONTHS
> after
> 
>
> 



underscore in query error

2014-03-19 Thread Andreas Owen
If I use the underscore in the query I don't get any results. If I remove
the underscore it finds the docs with underscore.

Can I tell Solr to search through the NGTF output instead of the WDF output, or is there
a better solution?

 

Query: yh_cug

 

I attached a doc with the analyzer output
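
For reference, the approach the follow-up messages on this problem settle on is mapping "_" (and "@") to ALPHA via a WordDelimiterFilter types file on both the index and the query analyzer, so the underscore survives on both sides; a sketch using the file name and attributes that appear later in those messages:

<filter class="solr.WordDelimiterFilterFactory" types="at-under-alpha.txt"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        splitOnCaseChange="1"/>

with at-under-alpha.txt containing:

@ => ALPHA
_ => ALPHA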



searche for single char number when ngram min is 3

2014-03-19 Thread Andreas Owen
Is there a way to tell the ngramfilterfactory while indexing that numbers shall 
never be tokenized? then the query should be able to find numbers.
Or do i have to change the ngram min for numbers to 1, if that is possible? So 
to speak, index the whole number as a token and not all possible tokens.
Or can i tell the query to search numbers differently with WT, LCF or whatever?

I attached a doc with screenshots from solr analyzer


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 13. März 2014 13:44
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 
3 uppwards

I have gotten nearly everything to work. There are two queries where i don't get 
back what i want.

"avaloq frage 1"-> only returns if i set minGramSize=1 while 
indexing
"yh_cug"-> query parser doesn't remove "_" but the 
indexer does (WDF) so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing 
it?

Fieldtype: [XML stripped in the archive]


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Mittwoch, 12. März 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 
3 uppwards

Hi Jack,

do you know how i can use local parameters in my solrconfig? The params are 
visible in the debugquery-output but solr doesn't parse them.


{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO 
*]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *]) 


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Mittwoch, 12. März 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 
uppwards

yes that is exactly what happened in the analyzer. the term i searched for was 
listed on both sides (index & query).

here's the rest: [configuration XML stripped in the archive]

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of
> from 3 uppwards
> 
> You didn't show the new index analyzer - it's tricky to assure that 
> index and query are compatible, but the Admin UI Analysis page can help.
> 
> Generally, using pure defaults for WDF is not what you want, 
> especially for query time. Usually there needs to be a slight 
> asymmetry between index and query for WDF - index generates more terms than 
> query.
> 
> -- Jack Krupansky
> 
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of 
> from 3 uppwards
> 
> I now have the following:
> 
> 
> 
>  types="at-under-alpha.txt"/>  class="solr.LowerCaseFilterFactory"/>
>  words="lang/stopwords_de.txt" format="snowball" 
> enablePositionIncrements="true"/>   class="solr.GermanNormalizationFilterFactory"/>
> 
>   
> 
> The gui analysis shows me that wdf doesn't cut the underscore anymore 
> but it still returns 0 results?
> 
> Output:
> 
> 
>   yh_cug
>   yh_cug
>   (+DisjunctionMaxQuery((tags:yh_cug^10.0 |
> links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 |
> url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 |
> breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0
> |
> editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0))
> ((expiration:[1394619501862 TO *]
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) 
> FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_
> coord
>   +(tags:yh_cug^10.0 |
> links:yh_cug^5.0 |
> thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 |
> h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 |
> contentmanager:yh_cug^5.0 | title:yh_cug^20.0 |
> editorschoice:yh_cug^200.0 |
> doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *]
> (+*:* -expiration:*))^6.0)
> (div(int(clicks),max(int(displays),const(1^8.0
>   
>   
> yh_cug
>   
>   
> DidntFindAnySynonyms
> No sy

RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-13 Thread Andreas Owen
I have gotten nearly everything to work. There are two queries where i don't get 
back what i want.

"avaloq frage 1"-> only returns if i set minGramSize=1 while 
indexing
"yh_cug"-> query parser doesn't remove "_" but the 
indexer does (WDF) so there is no match

Is there a way to also query the whole term "avaloq frage 1" without tokenizing 
it?

Fieldtype: [XML stripped in the archive]


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Mittwoch, 12. März 2014 18:39
To: solr-user@lucene.apache.org
Subject: RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 
3 uppwards

Hi Jack,

do you know how i can use local parameters in my solrconfig? The params are 
visible in the debugquery-output but solr doesn't parse them.


{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO 
*]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *]) 


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Mittwoch, 12. März 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 
uppwards

yes that is exactly what happened in the analyzer. the term i searched for was 
listed on both sides (index & query).

here's the rest: [configuration XML stripped in the archive]

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of
> from 3 uppwards
> 
> You didn't show the new index analyzer - it's tricky to assure that 
> index and query are compatible, but the Admin UI Analysis page can help.
> 
> Generally, using pure defaults for WDF is not what you want, 
> especially for query time. Usually there needs to be a slight 
> asymmetry between index and query for WDF - index generates more terms than 
> query.
> 
> -- Jack Krupansky
> 
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of 
> from 3 uppwards
> 
> I now have the following:
> 
> 
> 
>  types="at-under-alpha.txt"/>  class="solr.LowerCaseFilterFactory"/>
>  words="lang/stopwords_de.txt" format="snowball" 
> enablePositionIncrements="true"/>   class="solr.GermanNormalizationFilterFactory"/>
> 
>   
> 
> The gui analysis shows me that wdf doesn't cut the underscore anymore 
> but it still returns 0 results?
> 
> Output:
> 
> 
>   yh_cug
>   yh_cug
>   (+DisjunctionMaxQuery((tags:yh_cug^10.0 |
> links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 |
> url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 |
> breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0
> |
> editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0))
> ((expiration:[1394619501862 TO *]
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) 
> FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_
> coord
>   +(tags:yh_cug^10.0 |
> links:yh_cug^5.0 |
> thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 |
> h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 |
> contentmanager:yh_cug^5.0 | title:yh_cug^20.0 |
> editorschoice:yh_cug^200.0 |
> doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *]
> (+*:* -expiration:*))^6.0)
> (div(int(clicks),max(int(displays),const(1^8.0
>   
>   
> yh_cug
>   
>   
> DidntFindAnySynonyms
> No synonyms found for this query.  Check 
> your synonyms file.
>   
>   
> ExtendedDismaxQParser
> 
> 
>   (expiration:[NOW TO *] OR (*:* -expiration:*))^6
> 
> 
>   (expiration:[1394619501862 TO *]
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
> 
> 
>   div(clicks,max(displays,1))^8
> 
>   
>   
> ExtendedDismaxQParser
> 
> 
>   div(clicks,max(displays,1))^8
> 
>   
>   
> 
> 
> 
> 
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Dienstag, 11. März 2014 14

RE: use local param in solrconfig fq for access-control

2014-03-13 Thread Andreas Owen
I have given up on this idea and made a wrapper which adds an fq with the user roles 
to each request

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Dienstag, 11. März 2014 23:32
To: solr-user@lucene.apache.org
Subject: use local param in solrconfig fq for access-control

i would like to use $r and $org for access control. it has to allow the fq's 
from my facets to work as well. i'm not sure if i'm doing it right or if i 
should add it to a qf or the q itself. the debugquery output returns a parsed fq 
string, and in it $r and $org are printed instead of their values. how do i 
get them to be interpreted? the local params are listed in the response so they 
should be valid.


  {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) 
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *])
 





RE: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-12 Thread Andreas Owen
Hi Jack,

do you know how i can use local parameters in my solrconfig? The params are 
visible in the debugquery-output but solr doesn't parse them.


{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO 
*]) (+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *])



-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Mittwoch, 12. März 2014 14:44
To: solr-user@lucene.apache.org
Subject: Re[2]: NOT SOLVED searches for single char tokens instead of from 3 
uppwards

yes that is exactly what happened in the analyzer. the term i searched for was 
listed on both sides (index & query).

here's the rest: [configuration XML stripped in the archive]

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of
> from 3 uppwards
> 
> You didn't show the new index analyzer - it's tricky to assure that 
> index and query are compatible, but the Admin UI Analysis page can help.
> 
> Generally, using pure defaults for WDF is not what you want, 
> especially for query time. Usually there needs to be a slight 
> asymmetry between index and query for WDF - index generates more terms than 
> query.
> 
> -- Jack Krupansky
> 
> -Original Message-
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of 
> from 3 uppwards
> 
> I now have the following:
> 
> 
> 
>  types="at-under-alpha.txt"/>  class="solr.LowerCaseFilterFactory"/>
>  words="lang/stopwords_de.txt" format="snowball" 
> enablePositionIncrements="true"/>   class="solr.GermanNormalizationFilterFactory"/>
> 
>   
> 
> The gui analysis shows me that wdf doesn't cut the underscore anymore 
> but it still returns 0 results?
> 
> Output:
> 
> 
>   yh_cug
>   yh_cug
>   (+DisjunctionMaxQuery((tags:yh_cug^10.0 |
> links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 |
> url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 |
> breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 
> |
> editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0))
> ((expiration:[1394619501862 TO *]
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) 
> FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_
> coord
>   +(tags:yh_cug^10.0 | 
> links:yh_cug^5.0 |
> thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 |
> h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 |
> contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | 
> editorschoice:yh_cug^200.0 |
> doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *]
> (+*:* -expiration:*))^6.0)
> (div(int(clicks),max(int(displays),const(1^8.0
>   
>   
> yh_cug
>   
>   
> DidntFindAnySynonyms
> No synonyms found for this query.  Check 
> your synonyms file.
>   
>   
> ExtendedDismaxQParser
> 
> 
>   (expiration:[NOW TO *] OR (*:* -expiration:*))^6
> 
> 
>   (expiration:[1394619501862 TO *]
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
> 
> 
>   div(clicks,max(displays,1))^8
> 
>   
>   
> ExtendedDismaxQParser
> 
> 
>   div(clicks,max(displays,1))^8
> 
>   
>   
> 
> 
> 
> 
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Dienstag, 11. März 2014 14:25
> To: solr-user@lucene.apache.org
> Subject: Re: NOT SOLVED searches for single char tokens instead of 
> from 3 uppwards
> 
> The usual use of an ngram filter is at index time and not at query time.
> What exactly are you trying to achieve by using ngram filtering at 
> query time as well as index time?
> 
> Generally, it is inappropriate to combine the word delimiter filter 
> with the standard tokenizer - the latter removes the punctuation that 
> normally influences how WDF treats the parts of a token. Use the white 
> space tokenizer if you intend to use WDF.
> 
> Which query parser are you using? What fields are being queried?
> 
> Please post the parsed query string from the debug output - it will 
> show the precise generated query.
> 
> I think what you are seeing is that the ngram filter is generating 
> tokens like "h_cugtest" and then the WDF is removing the underscore and then 
> "h"
> gets generated as a separate token.
>

Re[2]: NOT SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-12 Thread Andreas Owen
yes that is exactly what happened in the analyzer. the term i searched for was 
listed on both sides (index & query).

here's the rest: [configuration XML stripped in the archive]

-Original Message-
> From: "Jack Krupansky"
> To: solr-user@lucene.apache.org
> Date: 12/03/2014 13:25
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3
> uppwards
> 
> You didn't show the new index analyzer - it's tricky to assure that index 
> and query are compatible, but the Admin UI Analysis page can help.
> 
> Generally, using pure defaults for WDF is not what you want, especially for 
> query time. Usually there needs to be a slight asymmetry between index and 
> query for WDF - index generates more terms than query.
> 
> -- Jack Krupansky
> 
> -Original Message- 
> From: Andreas Owen
> Sent: Wednesday, March 12, 2014 6:20 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 
> uppwards
> 
> I now have the following:
> 
> 
> 
> 
> 
>  words="lang/stopwords_de.txt" format="snowball" 
> enablePositionIncrements="true"/> 
> 
> 
>       
> 
> The gui analysis shows me that wdf doesn't cut the underscore anymore but it 
> still returns 0 results?
> 
> Output:
> 
> 
>   yh_cug
>   yh_cug
>   (+DisjunctionMaxQuery((tags:yh_cug^10.0 | 
> links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | 
> url:yh_cug^5.0 | h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | 
> breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | 
> editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0)) 
> ((expiration:[1394619501862 TO *] 
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0) 
> FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
>   +(tags:yh_cug^10.0 | links:yh_cug^5.0 | 
> thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | 
> h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | 
> contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | 
> doctype:yh_cug^10.0) ((expiration:[1394619501862 TO *] 
> (+*:* -expiration:*))^6.0) 
> (div(int(clicks),max(int(displays),const(1^8.0
>   
>   
>     yh_cug
>   
>   
>     DidntFindAnySynonyms
>     No synonyms found for this query.  Check your 
> synonyms file.
>   
>   
>     ExtendedDismaxQParser
>     
>     
>       (expiration:[NOW TO *] OR (*:* -expiration:*))^6
>     
>     
>       (expiration:[1394619501862 TO *] 
> (+MatchAllDocsQuery(*:*) -expiration:*))^6.0
>     
>     
>       div(clicks,max(displays,1))^8
>     
>   
>   
>     ExtendedDismaxQParser
>     
>     
>       div(clicks,max(displays,1))^8
>     
>   
>   
> 
> 
> 
> 
> -Original Message-
> From: Jack Krupansky [mailto:j...@basetechnology.com]
> Sent: Dienstag, 11. März 2014 14:25
> To: solr-user@lucene.apache.org
> Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 
> uppwards
> 
> The usual use of an ngram filter is at index time and not at query time.
> What exactly are you trying to achieve by using ngram filtering at query 
> time as well as index time?
> 
> Generally, it is inappropriate to combine the word delimiter filter with the 
> standard tokenizer - the latter removes the punctuation that normally 
> influences how WDF treats the parts of a token. Use the white space 
> tokenizer if you intend to use WDF.
> 
> Which query parser are you using? What fields are being queried?
> 
> Please post the parsed query string from the debug output - it will show the 
> precise generated query.
> 
> I think what you are seeing is that the ngram filter is generating tokens 
> like "h_cugtest" and then the WDF is removing the underscore and then "h"
> gets generated as a separate token.
> 
> -- Jack Krupansky
> 
> -Original Message-
> From: Andreas Owen
> Sent: Tuesday, March 11, 2014 5:09 AM
> To: solr-user@lucene.apache.org
> Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 
> uppwards
> 
> I got it roght the first time and here is my requesthandler. The field 
> "plain_text" is searched correctly and has the sam fieldtype as "title" -> 
> "text_de"
> 
>  class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
>   
> 
>   
> standard
>   
>   
> shingle
> true
> true
> 2
> 4
>   
>   
> synonym
> solr.KeywordTokenizerFactory
> synonyms.txt
> true
> true
>   
> 

RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-12 Thread Andreas Owen
I now have the following: [analyzer XML stripped in the archive; the quoted copies of this
message elsewhere in the thread retain fragments showing a WordDelimiter filter with
types="at-under-alpha.txt", a LowerCaseFilterFactory, the stopwords_de.txt stop filter and a
GermanNormalizationFilterFactory]

The gui analysis shows me that wdf doesn't cut the underscore anymore but it 
still returns 0 results?

Output:


  yh_cug
  yh_cug
  (+DisjunctionMaxQuery((tags:yh_cug^10.0 | 
links:yh_cug^5.0 | thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 
| h_*:yh_cug^14.0 | inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | 
contentmanager:yh_cug^5.0 | title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | 
doctype:yh_cug^10.0)) ((expiration:[1394619501862 TO *] 
(+MatchAllDocsQuery(*:*) -expiration:*))^6.0) 
FunctionQuery((div(int(clicks),max(int(displays),const(1^8.0))/no_coord
  +(tags:yh_cug^10.0 | links:yh_cug^5.0 | 
thema:yh_cug^15.0 | plain_text:yh_cug^10.0 | url:yh_cug^5.0 | h_*:yh_cug^14.0 | 
inhaltstyp:yh_cug^6.0 | breadcrumb:yh_cug^6.0 | contentmanager:yh_cug^5.0 | 
title:yh_cug^20.0 | editorschoice:yh_cug^200.0 | doctype:yh_cug^10.0) 
((expiration:[1394619501862 TO *] (+*:* -expiration:*))^6.0) 
(div(int(clicks),max(int(displays),const(1^8.0
  
  
yh_cug
  
  
DidntFindAnySynonyms
No synonyms found for this query.  Check your 
synonyms file.
  
  
ExtendedDismaxQParser


  (expiration:[NOW TO *] OR (*:* -expiration:*))^6


  (expiration:[1394619501862 TO *] (+MatchAllDocsQuery(*:*) 
-expiration:*))^6.0


  div(clicks,max(displays,1))^8

  
  
ExtendedDismaxQParser


  div(clicks,max(displays,1))^8

  
  




-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Dienstag, 11. März 2014 14:25
To: solr-user@lucene.apache.org
Subject: Re: NOT SOLVED searches for single char tokens instead of from 3 
uppwards

The usual use of an ngram filter is at index time and not at query time. 
What exactly are you trying to achieve by using ngram filtering at query time 
as well as index time?

Generally, it is inappropriate to combine the word delimiter filter with the 
standard tokenizer - the latter removes the punctuation that normally influences 
how WDF treats the parts of a token. Use the white space tokenizer if you 
intend to use WDF.

Which query parser are you using? What fields are being queried?

Please post the parsed query string from the debug output - it will show the 
precise generated query.

I think what you are seeing is that the ngram filter is generating tokens like 
"h_cugtest" and then the WDF is removing the underscore and then "h" 
gets generated as a separate token.

-- Jack Krupansky

-Original Message-
From: Andreas Owen
Sent: Tuesday, March 11, 2014 5:09 AM
To: solr-user@lucene.apache.org
Subject: RE: NOT SOLVED searches for single char tokens instead of from 3 
uppwards

I got it right the first time and here is my requesthandler. The field 
"plain_text" is searched correctly and has the same fieldtype as "title" -> 
"text_de"


  

  
standard
  
  
shingle
true
true
2
4
  
  
synonym
solr.KeywordTokenizerFactory
synonyms.txt
true
true
  

  



 
   explicit
   10
   synonym_edismax
   true
   plain_text^10 editorschoice^200
title^20 h_*^14
tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
contentmanager^5 links^5
last_modified^5 url^5
   

{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *])
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r))
(+organisations:($org) -roles:["" TO *])
   (expiration:[NOW TO *] OR (*:* -expiration:*))^6

   div(clicks,max(displays,1))^8 

   text
   *,path,score
   json
   AND

   
   on
   plain_text,title
   200
   <b>
   </b>


on
1
{!ex=inhaltstyp_s}inhaltstyp_s
index
{!ex=doctype}doctype
index
{!ex=thema_f}thema_f
index
{!ex=author_s}author_s
index
{!ex=sachverstaendiger_s}sachverstaendiger_s
index
{!ex=veranstaltung_s}veranstaltung_s
index
{!ex=last_modified}last_modified
+1MONTH
NOW/MONTH+1MONTH
NOW/MONTH-36MONTHS
after

   




i have a field with the following type: [fieldType XML stripped in the archive]


shouldn't this make tokens from 3 to 15 in length and not from 1? here is a 
query report of 2 results:

>   0   name="QTime">125   name="debugQuery">true name="fl">title,roles,organisations,id name="indent">trueyh_cugtest name="_">1394522589347xml name="fq">organisations:* roles:*name="response" numFound="5" start="0">
>..
> 
> 1.6365329 = (MATCH) sum of:   1.6346203 = (MATCH) max of:
> 0.14759353 = (MATCH) product of:   0.28596246 = (MATCH) sum of:
> 0.01528686 = (MATCH) weight(plain

use local param in solrconfig fq for access-control

2014-03-11 Thread Andreas Owen
i would like to use $r and $org for access control. it has to allow the fq's 
from my facets to work as well. i'm not sure if i'm doing it right or if i 
should add it to a qf or the q itself. the debugquery output returns a parsed fq 
string, and in it $r and $org are printed instead of their values. how do i 
get them to be interpreted? the local params are listed in the response so they 
should be valid.


      {!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) 
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *])
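
For what it's worth, Solr only dereferences parameters when they appear as local-param values (the v=$param form), not when $vars sit inside the query text, which matches what the debug output shows; a sketch of the dereferenced form, assuming the nested-query _query_ hook and the lucene parser's df/q.op/v local params are available in this version (param names org and r as above):

(*:* -organisations:["" TO *] -roles:["" TO *])
(+_query_:"{!lucene q.op=OR df=organisations v=$org}"
 +_query_:"{!lucene q.op=OR df=roles v=$r}")

with org=150 42 and r=174 72 passed as request parameters. The wrapper approach mentioned in the 13 March follow-up (resolving the values before sending the fq) avoids the issue entirely.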
     





use local params in query

2014-03-11 Thread Andreas Owen
Shouldn't the numbers be in the output below (parsed_filter_queries) and not
$r and $org? 

 

This works great but i would like to use local params "r" and "org" instead
of hard-coded

 (*:* -organisations:[* TO *] -roles:[* TO
*]) (+organisations:(150 42) +roles:(174 72))

 

I would like

 (*:* -organisations:[* TO *] -roles:[* TO
*]) (+organisations:($org) +roles:($r))

 

I use this in my requesthandler under invariants because i need it to be
added to the query without being able to be overridden. Oh and i use facets
so fq has to be combinable. This should work, or am i understanding it wrong?

 

Debug query:

 



  0

  109

  

true

true

267

yh_cug

1394533792473

xml

  

...



{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *])
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r))
(+organisations:($org) -roles:["" TO *])

  

  

(MatchAllDocsQuery(*:*) -organisations:["" TO *] -roles:["" TO *])
(+organisations:$org +roles:$r) (-organisations:["" TO *] +roles:$r)
(+organisations:$org -roles:["" TO *])

  

 

 

 

 

 



query with local params

2014-03-11 Thread Andreas Owen
This works great but i would like to use local params "r" and "org" instead of 
hard-coded
 (*:* -organisations:[* TO *] -roles:[* TO *]) 
(+organisations:(150 42) +roles:(174 72))

I would like
 (*:* -organisations:[* TO *] -roles:[* TO *]) 
(+organisations:($org) +roles:($r))

Shouldn't the numbers be in the output below (parsed_filter_queries) and not $r 
and $org? I use this in my requesthandler and need it to be added as fq or 
query params without being able to be overridden; does anybody have any ideas? Oh and 
i use facets so fq has to be combinable.

Debug query:


  0
  109
  
true
true
267
yh_cug
1394533792473
xml
  
...

{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) 
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *])
  
  
(MatchAllDocsQuery(*:*) -organisations:["" TO *] -roles:["" TO *]) 
(+organisations:$org +roles:$r) (-organisations:["" TO *] +roles:$r) 
(+organisations:$org -roles:["" TO *])
  







RE: NOT SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-11 Thread Andreas Owen
I got it right the first time and here is my requesthandler. The field 
"plain_text" is searched correctly and has the same fieldtype as "title" -> 
"text_de"


  

  
standard
  
  
shingle
true
true
2
4
  
  
synonym
solr.KeywordTokenizerFactory
synonyms.txt
true
true
  

  



 
   explicit
   10
   synonym_edismax
   true
   plain_text^10 editorschoice^200
title^20 h_*^14 
tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
contentmanager^5 links^5
last_modified^5 url^5
   

{!q.op=OR} (*:* -organisations:["" TO *] -roles:["" TO *]) 
(+organisations:($org) +roles:($r)) (-organisations:["" TO *] +roles:($r)) 
(+organisations:($org) -roles:["" TO *])
   (expiration:[NOW TO *] OR (*:* 
-expiration:*))^6  
   div(clicks,max(displays,1))^8 

   text
   *,path,score
   json
   AND
   
   
   on
   plain_text,title
   200
   
   
   
 
on
1
{!ex=inhaltstyp_s}inhaltstyp_s
index
{!ex=doctype}doctype
index
{!ex=thema_f}thema_f
index
{!ex=author_s}author_s
index
{!ex=sachverstaendiger_s}sachverstaendiger_s
index
{!ex=veranstaltung_s}veranstaltung_s
index
{!ex=last_modified}last_modified
+1MONTH
NOW/MONTH+1MONTH
NOW/MONTH-36MONTHS
after


   

 


i have a field with the following type: [fieldType XML stripped in the archive]
shouldn't this make tokens from 3 to 15 in length and not from 1? here is a 
query report of 2 results:

> [response and explain output garbled in the archive; recoverable: QTime=125, 
> q=yh_cugtest, fq=organisations:* roles:*, fl=title,roles,organisations,id, 
> numFound=5; the truncated explain sums tf/idf weights for gram terms such as 
> plain_text:cug, plain_text:ugt, plain_text:yhc and plain_text:hcu]

Re: SOLVED searches for single char tokens instead of from 3 uppwards

2014-03-11 Thread Andreas Owen
sorry i looked at the wrong fieldtype

-Original Message-
> From: "Andreas Owen"
> To: solr-user@lucene.apache.org
> Date: 11/03/2014 08:45
> Subject: searches for single char tokens instead of from 3 uppwards
> 
> i have a field with the following type:
> 
> 
>        
>         
>         
>    words="lang/stopwords_de.txt" format="snowball" 
> enablePositionIncrements="true"/> 
>                
>    language="German"/> 
>    maxGramSize="15"/>
>    generateWordParts="1" generateNumberParts="1" catenateWords="1" 
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>       
>     
> 
> 
> shouldn't this make tokens from 3 to 15 in length and not from 1? here is a 
> query report of 2 results:
> [response and explain output garbled in the archive; recoverable: QTime=125, 
> q=yh_cugtest, fq=organisations:* roles:*, fl=title,roles,organisations,id; 
> the truncated explain sums tf/idf weights for gram terms such as plain_text:cug, 
> plain_text:ugt, plain_text:yhc, plain_text:hcu, plain_text:cugt and plain_text:yhcu]

searches for single char tokens instead of from 3 uppwards

2014-03-11 Thread Andreas Owen
i have a field with the following type: [fieldType XML stripped in the archive]


shouldn't this make tokens from 3 to 15 in length and not from 1? here is a 
query report of 2 results:
[response and explain output garbled in the archive; recoverable: QTime=125, 
q=yh_cugtest, fq=organisations:* roles:*, fl=title,roles,organisations,id; 
the truncated explain sums tf/idf weights for gram terms such as plain_text:cug, 
plain_text:ugt, plain_text:yhc, plain_text:hcu, plain_text:cugt, plain_text:yhcu and 
plain_text:hcug]

maxClauseCount is set to 1024

2014-03-10 Thread Andreas Owen

does this maxClauseCount apply to each field individually or to all of them put together? 
is it the date fields?


when i execute a query i get this error:


[response garbled in the archive; recoverable: status=500, QTime=93, 
q=Ein PDFchen als Dokument, fq=roles:*, followed by facet counts and a date-range 
facet (gap=+1MONTH, start=2011-03-01T00:00:00Z, end=2014-04-01T00:00:00Z) and then 
the error below]

maxClauseCount is set to 1024
org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at 
org.apache.lucene.search.ScoringRewrite$1.checkMaxClauseCount(ScoringRewrite.java:72)
at 
org.apache.lucene.search.ScoringRewrite$ParallelArraysTermCollector.collect(ScoringRewrite.java:152)
 at 
org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:79)
   at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:108)  
   at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:288)  
   at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:217)
 at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:99)
  at 
org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:469)
at
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
  at 
org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:199)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:528)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:415)
 at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:139)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) 
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
  at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)   at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)   
 at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)  at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382) 
   at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
   at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)   
 at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) 
 at org.eclipse.jetty.server.Server.handle(Server.java:365)  at 
org.eclipse.jetty.server.AbstractHttpConnection.han
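
For reference, the limit in the error is the single global BooleanQuery clause limit (one limit per generated BooleanQuery, not per field), and the stack trace shows it being hit while the highlighter rewrites a multi-term query; a sketch of where the setting lives in solrconfig.xml (raising it is a workaround rather than a fix for the underlying query expansion):

<query>
  <maxBooleanClauses>1024</maxBooleanClauses>
</query>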

set fq operator independently

2014-03-04 Thread Andreas Owen
i want to use the following in fq and i need to set the operator to OR. My q.op 
is AND but I need OR in fq. I have read about ofq but that is for putting OR 
between multiple fq. Can I set the operator for fq?

     (-organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) 
+roles:(174 72))


The statement should find all docs without organisations and roles or those 
that have at least one roles and organisations entry. these fields are 
multivalued.
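
A sketch of one way to do this, combining the {!q.op=OR} local param that appears in the later access-control messages with the *:* base term suggested in the query parameters thread below:

fq={!q.op=OR}(*:* -organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) +roles:(174 72))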






Re[2]: query parameters

2014-03-03 Thread Andreas Owen
ok i like the logic, you can do much more. i think this should do it for me:

         (-organisations:["" TO *] -roles:["" TO *]) (+organisations:(150 42) 
+roles:(174 72))


i want to use this in fq and i need to set the operator to OR. My q.op is AND 
but I need OR in fq. I have read about ofq but that is for putting OR between 
multiple fq. Can I set the operator for fq?

The statement should find all docs without organisations and roles or those 
that have at least one roles and organisations entry. these fields are 
multivalued.

-Original-Nachricht- 
> Von: "Erick Erickson"  
> An: solr-user@lucene.apache.org 
> Datum: 19/02/2014 04:09 
> Betreff: Re: query parameters 
> 
> Solr/Lucene query language is NOT strictly boolean, see
> Chris's excellent blog here:
> http://searchhub.org/dev/2011/12/28/why-not-and-or-and-not/
> 
> Best,
> Erick
> 
> 
> On Tue, Feb 18, 2014 at 11:54 AM, Andreas Owen  wrote:
> 
> > I tried it in solr admin query and it showed me all the docs without a
> > value
> > in ogranisations and roles. It didn't matter if i used a base term, isn't
> > that give through the q-parameter?
> >
> > -Original Message-
> > From: Raymond Wiker [mailto:rwi...@gmail.com]
> > Sent: Dienstag, 18. Februar 2014 13:19
> > To: solr-user@lucene.apache.org
> > Subject: Re: query parameters
> >
> > That could be because the second condition does not do what you think it
> > does... have you tried running the second condition separately?
> >
> > You may have to add a "base term" to the second condition, like what you
> > have for the "bq" parameter in your config file; i.e, something like
> >
> > (*:* -organisations:["" TO *] -roles:["" TO *])
> >
> >
> >
> >
> > On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen  wrote:
> >
> > > It seams that fq doesn't except OR because: (organisations:(150 OR 41)
> > > AND
> > > roles:(174)) OR  (-organisations:["" TO *] AND -roles:["" TO *]) only
> > > returns docs that match the first conditions. it doesn't return any
> > > docs with the empty fields organisations and roles.
> > >
> > > -Original Message-
> > > From: Andreas Owen [mailto:a...@conx.ch]
> > > Sent: Montag, 17. Februar 2014 05:08
> > > To: solr-user@lucene.apache.org
> > > Subject: query parameters
> > >
> > >
> > > in solrconfig of my solr 4.3 i have a userdefined requestHandler. i
> > > would like to use fq to force the following conditions:
> > >    1: organisations is empty and roles is empty
> > >    2: organisations contains one of the commadelimited list in
> > > variable $org
> > >    3: roles contains one of the commadelimited list in variable $r
> > >    4: rule 2 and 3
> > >
> > > snipet of what i got (havent checked out if the is a "in" operator
> > > like in sql for the list value)
> > >
> > > 
> > >        explicit
> > >        10
> > >        edismax
> > >            true
> > >            plain_text^10 editorschoice^200
> > >                 title^20 h_*^14
> > >                 tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
> > >                 contentmanager^5 links^5
> > >                 last_modified^5 url^5
> > >            
> > >            (organisations='' roles='') or
> > > (organisations=$org roles=$r) or (organisations='' roles=$r) or
> > > (organisations=$org roles='')
> > >            (expiration:[NOW TO *] OR (*:*
> > > -expiration:*))^6  
> > >            div(clicks,max(displays,1))^8 
> > >
> > >
> > >
> > >
> > >
> > >
> >
> >





RE: query parameters

2014-02-18 Thread Andreas Owen
I tried it in the solr admin query and it showed me all the docs without a value
in organisations and roles. It didn't matter if i used a base term; isn't
that given through the q parameter?

-Original Message-
From: Raymond Wiker [mailto:rwi...@gmail.com] 
Sent: Dienstag, 18. Februar 2014 13:19
To: solr-user@lucene.apache.org
Subject: Re: query parameters

That could be because the second condition does not do what you think it
does... have you tried running the second condition separately?

You may have to add a "base term" to the second condition, like what you
have for the "bq" parameter in your config file; i.e, something like

(*:* -organisations:["" TO *] -roles:["" TO *])




On Tue, Feb 18, 2014 at 12:16 PM, Andreas Owen  wrote:

> It seems that fq doesn't accept OR because: (organisations:(150 OR 41) 
> AND
> roles:(174)) OR  (-organisations:["" TO *] AND -roles:["" TO *]) only 
> returns docs that match the first condition. it doesn't return any 
> docs with the empty fields organisations and roles.
>
> -Original Message-
> From: Andreas Owen [mailto:a...@conx.ch]
> Sent: Montag, 17. Februar 2014 05:08
> To: solr-user@lucene.apache.org
> Subject: query parameters
>
>
> in solrconfig of my solr 4.3 i have a userdefined requestHandler. i 
> would like to use fq to force the following conditions:
>1: organisations is empty and roles is empty
>2: organisations contains one of the commadelimited list in 
> variable $org
>3: roles contains one of the commadelimited list in variable $r
>4: rule 2 and 3
>
> snipet of what i got (havent checked out if the is a "in" operator 
> like in sql for the list value)
>
> 
>explicit
>10
>edismax
>true
>plain_text^10 editorschoice^200
> title^20 h_*^14
> tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
> contentmanager^5 links^5
> last_modified^5 url^5
>
>(organisations='' roles='') or 
> (organisations=$org roles=$r) or (organisations='' roles=$r) or 
> (organisations=$org roles='')
>(expiration:[NOW TO *] OR (*:* 
> -expiration:*))^6  
>div(clicks,max(displays,1))^8 
>
>
>
>
>
>



RE: query parameters

2014-02-18 Thread Andreas Owen
It seems that fq doesn't accept OR, because (organisations:(150 OR 41) AND
roles:(174)) OR (-organisations:["" TO *] AND -roles:["" TO *]) only returns
docs that match the first condition. It doesn't return any docs with empty
organisations and roles fields.

-----Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Montag, 17. Februar 2014 05:08
To: solr-user@lucene.apache.org
Subject: query parameters


in solrconfig of my solr 4.3 i have a userdefined requestHandler. i would like 
to use fq to force the following conditions:
   1: organisations is empty and roles is empty
   2: organisations contains one of the commadelimited list in variable $org
   3: roles contains one of the commadelimited list in variable $r
   4: rule 2 and 3

snipet of what i got (havent checked out if the is a "in" operator like in sql 
for the list value)


   explicit
   10
   edismax
   true
   plain_text^10 editorschoice^200
title^20 h_*^14 
tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
contentmanager^5 links^5
last_modified^5 url^5
   
   (organisations='' roles='') or (organisations=$org 
roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')
   (expiration:[NOW TO *] OR (*:* 
-expiration:*))^6  
   div(clicks,max(displays,1))^8 
   






query parameters

2014-02-16 Thread Andreas Owen

in solrconfig of my Solr 4.3 I have a user-defined requestHandler. I would like
to use fq to force the following conditions:
   1: organisations is empty and roles is empty
   2: organisations contains one of the commadelimited list in variable $org
   3: roles contains one of the commadelimited list in variable $r
   4: rule 2 and 3

snippet of what I got (I haven't checked whether there is an "in" operator like in SQL
for the list value)


       explicit
       10
       edismax
   true
   plain_text^10 editorschoice^200
title^20 h_*^14 
tags^10 thema^15 inhaltstyp^6 breadcrumb^6 doctype^10
contentmanager^5 links^5
last_modified^5 url^5
   
   (organisations='' roles='') or (organisations=$org 
roles=$r) or (organisations='' roles=$r) or (organisations=$org roles='')
   (expiration:[NOW TO *] OR (*:* 
-expiration:*))^6  
   div(clicks,max(displays,1))^8 
   






admin gui right side not loading

2014-01-15 Thread Andreas Owen
I'm using Solr 4.3.1 and have installed it on a Win 2008 server. Solr is
working, for example import & search. But the admin GUI's right side isn't
loading and I get a JavaScript error for several d3 objects. The last error
is:

 

Load timeout for modules: lib/order!lib/jquery.autogrow
lib/order!lib/jquery.cookie lib/order!lib/jquery.form
lib/order!lib/jquery.jstree lib/order!lib/jquery.sammy
lib/order!lib/jquery.timeago lib/order!lib/jquery.blockUI
lib/order!lib/highlight lib/order!lib/linker lib/order!lib/ZeroClipboard
lib/order!lib/d3 lib/order!lib/chosen lib/order!scripts/app
lib/order!scripts/analysis lib/order!scripts/cloud lib/order!scripts/cores
lib/order!scripts/dataimport lib/order!scripts/dashboard
lib/order!scripts/file lib/order!scripts/index
lib/order!scripts/java-properties lib/order!scripts/logging
lib/order!scripts/ping lib/order!scripts/plugins lib/order!scripts/query
lib/order!scripts/replication lib/order!scripts/schema-browser
lib/order!scripts/threads lib/jquery.autogrow lib/jquery.cookie
lib/jquery.form lib/jquery.jstree lib/jquery.sammy lib/jquery.timeago
lib/jquery.blockUI lib/highlight lib/linker lib/ZeroClipboard lib/d3
lib/chosen scripts/app scripts/analysis scripts/cloud scripts/cores
scripts/dataimport scripts/dashboard scripts/file scripts/index
scripts/java-properties scripts/logging scripts/ping scripts/plugins
scripts/query scripts/replication scripts/schema-browser scripts/threads 

http://requirejs.org/docs/errors.html#timeout

 

I have no apparent errors in the log file and the exact same conf is working on
another server. What can I do?



RE: json update moves doc to end

2013-12-04 Thread Andreas Owen
I changed my boost function log(clickrate)^8 to div(clicks,displays)^8 and
it works now. I get the following output from debug:

0.0022668892 = (MATCH) FunctionQuery(div(const(2),const(5))), product of:
0.4 = div(const(2),const(5))
8.0 = boost
7.0840283E-4 = queryNorm

Am I understanding this right, that 0.4 and 8.0 result in 7.084? I'm
having trouble understanding how much I boosted it.
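
For the record, the -Infinity in the earlier debug output came from log(int(clicks)=0),
i.e. the log of zero clicks. The guarded function I use now avoids that, e.g. as a
boost function entry:

  div(clicks,max(displays,1))^8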

As I use NGramFilterFactory I get a lot of hits because of the tokens. Can I
make the boost higher if the whole search term is found and not just part of
it?
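
What I am thinking of trying for that (just a sketch, not verified): the edismax pf
parameter is supposed to boost documents where the whole query occurs as a phrase,
something like:

  <str name="pf">plain_text^30 title^40</str>

(the boost values are guesses on my side).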


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Mittwoch, 4. Dezember 2013 15:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

Well, both have a score of -Infinity. So they're "equal" and the tiebreaker
is the internal Lucene doc ID.

Now this is not helpful since the question now is where -Infinity comes
from, this looks suspicious:
 -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of:
-Infinity = log(int(clicks)=0)

not much help I know, but

Erick


On Wed, Dec 4, 2013 at 7:24 AM, Andreas Owen  wrote:

> Hi Erick
>
> Here are the last 2 results from a search and i am not understanding 
> why the last one with the boost editorschoice^200 isn't at the top. By 
> the way can i also give a substantial boost to results that contain 
> the hole search-request and not just 3 or 4 letters (tokens)?
>
> 
> -Infinity = (MATCH) sum of:
>   0.013719446 = (MATCH) max of:
> 0.013719446 = (MATCH) sum of:
>   2.090396E-4 = (MATCH) weight(plain_text:ber in 841) 
> [DefaultSimilarity], result of:
> 2.090396E-4 = score(doc=841,freq=8.0 = termFreq=8.0 ), product 
> of:
>   0.009452709 = queryWeight, product of:
> 1.3343692 = idf(docFreq=611, maxDocs=855)
> 0.0070840283 = queryNorm
>   0.022114253 = fieldWeight in 841, product of:
> 2.828427 = tf(freq=8.0), with freq of:
>   8.0 = termFreq=8.0
> 1.3343692 = idf(docFreq=611, maxDocs=855)
> 0.005859375 = fieldNorm(doc=841)
>   0.0012402858 = (MATCH) weight(plain_text:eri in 841) 
> [DefaultSimilarity], result of:
> 0.0012402858 = score(doc=841,freq=9.0 = termFreq=9.0 ), 
> product of:
>   0.022357063 = queryWeight, product of:
> 3.1559815 = idf(docFreq=98, maxDocs=855)
> 0.0070840283 = queryNorm
>   0.05547624 = fieldWeight in 841, product of:
> 3.0 = tf(freq=9.0), with freq of:
>   9.0 = termFreq=9.0
> 3.1559815 = idf(docFreq=98, maxDocs=855)
> 0.005859375 = fieldNorm(doc=841)
>   5.0511415E-4 = (MATCH) weight(plain_text:ric in 841) 
> [DefaultSimilarity], result of:
> 5.0511415E-4 = score(doc=841,freq=1.0 = termFreq=1.0 ), 
> product of:
>   0.024712078 = queryWeight, product of:
> 3.4884217 = idf(docFreq=70, maxDocs=855)
> 0.0070840283 = queryNorm
>   0.020439971 = fieldWeight in 841, product of:
> 1.0 = tf(freq=1.0), with freq of:
>   1.0 = termFreq=1.0
> 3.4884217 = idf(docFreq=70, maxDocs=855)
> 0.005859375 = fieldNorm(doc=841)
>   8.721528E-4 = (MATCH) weight(plain_text:ich in 841) 
> [DefaultSimilarity], result of:
> 8.721528E-4 = score(doc=841,freq=12.0 = termFreq=12.0 ), 
> product of:
>   0.017446788 = queryWeight, product of:
> 2.4628344 = idf(docFreq=197, maxDocs=855)
> 0.0070840283 = queryNorm
>   0.049989305 = fieldWeight in 841, product of:
> 3.4641016 = tf(freq=12.0), with freq of:
>   12.0 = termFreq=12.0
> 2.4628344 = idf(docFreq=197, maxDocs=855)
> 0.005859375 = fieldNorm(doc=841)
>   7.725705E-4 = (MATCH) weight(plain_text:cht in 841) 
> [DefaultSimilarity], result of:
> 7.725705E-4 = score(doc=841,freq=4.0 = termFreq=4.0 ), product 
> of:
>   0.021610687 = queryWeight, product of:
> 3.050621 = idf(docFreq=109, maxDocs=855)
> 0.0070840283 = queryNorm
>   0.035749465 = fieldWeight in 841, product of:
> 2.0 = tf(freq=4.0), with freq of:
>   4.0 = termFreq=4.0
> 3.050621 = idf(docFreq=109, maxDocs=855)
> 0.005859375 = fieldNorm(doc=841)
>   0.0010287998 = (MATCH) weight(plain_text:beri in 841) 
> [DefaultSimilarity], result of:
> 0.0010287998 = score(doc=841,freq=1.0 = termFreq=1.0 ), 
> product of:
>   0.035267927 = queryWeight, product of:
> 4.978513 = idf(docFreq=15, maxDocs=855)
> 0.0070840283 = queryNorm
>

RE: json update moves doc to end

2013-12-04 Thread Andreas Owen
s=855)
0.0070840283 = queryNorm
  0.1359345 = fieldWeight in 0, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
4.349904 = idf(docFreq=29, maxDocs=855)
0.03125 = fieldNorm(doc=0)
  0.006139375 = (MATCH) weight(plain_text:berich in 0)
[DefaultSimilarity], result of:
0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  0.037305873 = queryWeight, product of:
5.266195 = idf(docFreq=11, maxDocs=855)
0.0070840283 = queryNorm
  0.16456859 = fieldWeight in 0, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.266195 = idf(docFreq=11, maxDocs=855)
0.03125 = fieldNorm(doc=0)
  0.0059541636 = (MATCH) weight(plain_text:ericht in 0)
[DefaultSimilarity], result of:
0.0059541636 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  0.036738846 = queryWeight, product of:
5.186152 = idf(docFreq=12, maxDocs=855)
0.0070840283 = queryNorm
  0.16206725 = fieldWeight in 0, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.186152 = idf(docFreq=12, maxDocs=855)
0.03125 = fieldNorm(doc=0)
  0.006139375 = (MATCH) weight(plain_text:bericht in 0)
[DefaultSimilarity], result of:
0.006139375 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
  0.037305873 = queryWeight, product of:
5.266195 = idf(docFreq=11, maxDocs=855)
0.0070840283 = queryNorm
  0.16456859 = fieldWeight in 0, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
5.266195 = idf(docFreq=11, maxDocs=855)
0.03125 = fieldNorm(doc=0)
7.054 = (MATCH) weight(editorschoice:bericht^200.0 in 0)
[DefaultSimilarity], result of:
  7.054 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
0.749 = queryWeight, product of:
  200.0 = boost
  7.0579543 = idf(docFreq=1, maxDocs=855)
  7.0840283E-4 = queryNorm
7.0579543 = fieldWeight in 0, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  7.0579543 = idf(docFreq=1, maxDocs=855)
  1.0 = fieldNorm(doc=0)
  0.0021252085 = (MATCH) product of:
0.004250417 = (MATCH) sum of:
  0.004250417 = (MATCH) sum of:
0.004250417 = (MATCH) MatchAllDocsQuery, product of:
  0.004250417 = queryNorm
0.5 = coord(1/2)
  -Infinity = (MATCH) FunctionQuery(log(int(clicks))), product of:
-Infinity = log(int(clicks)=0)
8.0 = boost
7.0840283E-4 = queryNorm


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Dienstag, 3. Dezember 2013 20:30
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

Try adding &debug=all and you'll see exactly how docs are scored. Also,
it'll show you exactly how your query is parsed. Paste that if it's
confused, it'll help figure out what's going wrong.


On Tue, Dec 3, 2013 at 1:37 PM, Andreas Owen  wrote:

> So isn't it sorted automaticly by relevance (boost value)? If not do 
> should i set it in solrconfig?
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Dienstag, 3. Dezember 2013 19:07
> To: solr-user@lucene.apache.org
> Subject: Re: json update moves doc to end
>
> What order, the order if you supply no explicit sort at all?
>
> Solr does not make any guarantees about what order documents will come 
> back in if you do not ask for a sort.
>
> In general in Solr/lucene, the only way to update a document is to 
> re-add it as a new document, so that's probably what's going on behind 
> the scenes, and it probably effects the 'default' sort order -- which 
> Solr makes no agreement about anyway, you probably shouldn't even 
> count on it being consistent at all.
>
> If you want a consistent sort order, maybe add a field with a 
> timestamp, and ask for results sorted by the timestamp field? And then 
> make sure not to change the timestamp when you do an update that you 
> don't want to change the order?
>
> Apologies if I've misunderstood the situation.
>
> On 12/3/13 1:00 PM, Andreas Owen wrote:
> > When I search for "agenda" I get a lot of hits. Now if I update the 2.
> > Result by json-update the doc is moved to the end of the index when 
> > I search for it again. The field I change is "editorschoice" and it 
> > never contains the search term "agenda" so I don't see why it 
> > changes the order. Why does it?
> >
> >
> >
> > Part of Solrconfig requesthandler I use:
> >
> > 
> >

RE: json update moves doc to end

2013-12-03 Thread Andreas Owen
So isn't it sorted automatically by relevance (boost value)? If not, should
I set it in solrconfig?
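
What I could do, following Jonathan's suggestion below, is add an explicit sort with a
timestamp tie-breaker to the request handler defaults (a sketch; last_modified is the
date field I already have):

  <str name="sort">score desc, last_modified desc</str>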

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Dienstag, 3. Dezember 2013 19:07
To: solr-user@lucene.apache.org
Subject: Re: json update moves doc to end

What order, the order if you supply no explicit sort at all?

Solr does not make any guarantees about what order documents will come back
in if you do not ask for a sort.

In general in Solr/lucene, the only way to update a document is to re-add it
as a new document, so that's probably what's going on behind the scenes, and
it probably effects the 'default' sort order -- which Solr makes no
agreement about anyway, you probably shouldn't even count on it being
consistent at all.

If you want a consistent sort order, maybe add a field with a timestamp, and
ask for results sorted by the timestamp field? And then make sure not to
change the timestamp when you do an update that you don't want to change the
order?

Apologies if I've misunderstood the situation.

On 12/3/13 1:00 PM, Andreas Owen wrote:
> When I search for "agenda" I get a lot of hits. Now if I update the 2.
> Result by json-update the doc is moved to the end of the index when I 
> search for it again. The field I change is "editorschoice" and it 
> never contains the search term "agenda" so I don't see why it changes 
> the order. Why does it?
>
>
>
> Part of Solrconfig requesthandler I use:
>
> 
>
>   
>
>  explicit
>
>  10
>
>   synonym_edismax
>
> true
>
> plain_text^10 editorschoice^200
>
> title^20 h_*^14
>
> tags^10 thema^15 inhaltstyp^6 
> breadcrumb^6
> doctype^10
>
> contentmanager^5 links^5
>
> last_modified^5  url^5
>
> 
>
> (expiration:[NOW TO *] OR (*:* 
> -expiration:*))^6  
>
> log(clicks)^8 
>
> 
>
>   text
>
> *,path,score
>
> json
>
> AND
>
>
>
> 
>
>  on
>
>   plain_text,title
>
> <b>
>
>  </b>
>
>
>
>   
>
>  on
>
> 1
>
>   name="facet.field">{!ex=inhaltstyp}inhaltstyp
>
>  name="f.inhaltstyp.facet.sort">index
>
>  name="facet.field">{!ex=doctype}doctype
>
>  name="f.doctype.facet.sort">index
>
>  name="facet.field">{!ex=thema_f}thema_f
>
>  name="f.thema_f.facet.sort">index
>
>  name="facet.field">{!ex=author_s}author_s
>
>  name="f.author_s.facet.sort">index
>
>  name="facet.field">{!ex=sachverstaendiger_s}sachverstaendiger_s
>
>  name="f.sachverstaendiger_s.facet.sort">index
>
>  name="facet.field">{!ex=veranstaltung}veranstaltung
>
>  name="f.veranstaltung.facet.sort">index
>
>  name="facet.date">{!ex=last_modified}last_modified
>
>  name="facet.date.gap">+1MONTH
>
>  name="facet.date.end">NOW/MONTH+1MONTH
>
>  name="facet.date.start">NOW/MONTH-36MONTHS
>
>  name="facet.date.other">after
>
> 
>
> 
>
>



json update moves doc to end

2013-12-03 Thread Andreas Owen
When I search for “agenda” I get a lot of hits. Now if I update the second
result by JSON update, the doc is moved to the end of the index when I search
for it again. The field I change is “editorschoice” and it never contains
the search term “agenda”, so I don’t see why it changes the order. Why does
it?

 

Part of Solrconfig requesthandler I use:



 

explicit

10

 synonym_edismax

   true

   plain_text^10 editorschoice^200

   title^20 h_*^14 

   tags^10 thema^15 inhaltstyp^6 breadcrumb^6
doctype^10

   contentmanager^5 links^5

   last_modified^5  url^5

   

   (expiration:[NOW TO *] OR (*:*
-expiration:*))^6  

   log(clicks)^8 

   

 text

   *,path,score

   json

   AND

   

   

on

 plain_text,title

   



   

 

on

   1

{!ex=inhaltstyp}inhaltstyp

   index

   {!ex=doctype}doctype

   index

   {!ex=thema_f}thema_f

   index

   {!ex=author_s}author_s

   index

   {!ex=sachverstaendiger_s}sachverstaendiger_s

   index

   {!ex=veranstaltung}veranstaltung

   index

   {!ex=last_modified}last_modified

   +1MONTH

   NOW/MONTH+1MONTH

   NOW/MONTH-36MONTHS

   after   

   





search with wildcard

2013-11-21 Thread Andreas Owen
I am querying "test" in Solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcard, "*test*". This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

 



   





   

 







  





RE: search with wildcard

2013-11-21 Thread Andreas Owen
I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?
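
What I have in mind is roughly this (a sketch, untested; the names are made up): a
separate field analyzed with NGramFilterFactory at index time only, filled via
copyField and given a very low boost in qf:

  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index-time ngrams so a substring like "test" in "Supertestplan" matches without wildcards -->
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="plain_text_ngram" type="text_ngram" indexed="true" stored="false"/>
  <copyField source="plain_text" dest="plain_text_ngram"/>

and then something like plain_text_ngram^0.5 added to qf.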

 

From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

 

I am querying "test" in solr 4.3.1 over the field below and it's not finding
all occurences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcards "*test*". This is
write because of my tokenizer but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

 



   





   

 







  





RE: date range tree

2013-11-13 Thread Andreas Owen
I solved it by adding a loop for years and one for quarters in which I count
the month facets.
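
An alternative I looked at but did not deploy (just a sketch): let Solr build the
quarter buckets directly with facet.query entries in the request handler, e.g. for 2013:

  <str name="facet.query">{!ex=last_modified key="2013 Q4"}last_modified:[2013-10-01T00:00:00Z TO 2013-12-31T23:59:59Z]</str>
  <str name="facet.query">{!ex=last_modified key="2013 Q3"}last_modified:[2013-07-01T00:00:00Z TO 2013-09-30T23:59:59Z]</str>

one entry per quarter and year; the month counts still come from the facet.date config.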

-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Montag, 11. November 2013 17:52
To: solr-user@lucene.apache.org
Subject: RE: date range tree

Has someone at least got a idee how i could do a year/month-date-tree? 

In Solr-Wiki it is mentioned that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY
should create 4 buckets but it doesn't work


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch]
Sent: Donnerstag, 7. November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

 

2013

4.Quartal

December

November

Oktober

3.Quartal

September

August

Juli

2.Quartal

June

Mai

April

1.   Quartal

March

February

January

2012 .

Same as above

 

 

So far I have this in solrconfig.xml:

 

{!ex=last_modified,thema,inhaltstyp,doctype}last_modified<
/str>

   +1MONTH

   NOW/MONTH

   NOW/MONTH-36MONTHS

   after

 

Can I do this in one query or do I need multiple queries? If yes how would I
do the second and keep all the facet queries in the count?




RE: date range tree

2013-11-11 Thread Andreas Owen
Has someone at least got an idea how I could do a year/month date tree?

In the Solr wiki it is mentioned that facet.date.gap=+1DAY,+2DAY,+3DAY,+10DAY
should create 4 buckets, but it doesn't work.


-Original Message-
From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 7. November 2013 18:23
To: solr-user@lucene.apache.org
Subject: date range tree

I would like to make a facet on a date field with the following tree:

 

2013

4.Quartal

December

November

Oktober

3.Quartal

September

August

Juli

2.Quartal

June

Mai

April

1.   Quartal

March

February

January

2012 .

Same as above

 

 

So far I have this in solrconfig.xml:

 

{!ex=last_modified,thema,inhaltstyp,doctype}last_modified<
/str>

   +1MONTH

   NOW/MONTH

   NOW/MONTH-36MONTHS

   after

 

Can I do this in one query or do I need multiple queries? If yes how would I
do the second and keep all the facet queries in the count?




count links pointing to id

2013-11-09 Thread Andreas Owen
I have a multivalued field with links pointing to ids of Solr documents. I
would like to calculate how many links point to each document and put
that number into the field links2me. How can I do this? I would prefer to do
it with a query and the updater, so Solr can do it internally if possible.



date range tree

2013-11-07 Thread Andreas Owen
I would like to make a facet on a date field with the following tree:

 

2013
  4. Quarter: December, November, October
  3. Quarter: September, August, July
  2. Quarter: June, May, April
  1. Quarter: March, February, January
2012 ...
  same as above

 

 

So far I have this in solrconfig.xml:

 

{!ex=last_modified,thema,inhaltstyp,doctype}last_modified<
/str>

   +1MONTH

   NOW/MONTH

   NOW/MONTH-36MONTHS

   after

 

Can I do this in one query or do I need multiple queries? If yes, how would I
do the second one and still keep all the facet queries in the count?



Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-10-01 Thread Andreas Owen
i'm already using URLDataSource

On 30. Sep 2013, at 5:41 PM, P Williams wrote:

> Hi Andreas,
> 
> When using 
> XPathEntityProcessor<http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor>your
> DataSource
> must be of type DataSource.  You shouldn't be using
> BinURLDataSource, it's giving you the cast exception.  Use
> URLDataSource<https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/URLDataSource.html>
> or
> FileDataSource<https://builds.apache.org/job/Solr-Artifacts-4.x/javadoc/solr-dataimporthandler/org/apache/solr/handler/dataimport/FileDataSource.html>instead.
> 
> I don't think you need to specify namespaces, at least you didn't used to.
> The other thing that I've noticed is that the anywhere xpath expression //
> doesn't always work in DIH.  You might have to be more specific.
> 
> Cheers,
> Tricia
> 
> 
> 
> 
> 
> On Sun, Sep 29, 2013 at 9:47 AM, Andreas Owen  wrote:
> 
>> how dum can you get. obviously quite dum... i would have to analyze the
>> html-pages with a nested instance like this:
>> 
>> > url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
>> forEach="/docs/doc" dataSource="main">
>> 
>>> url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
>>
>>
>>
>>
>>
>> 
>> 
>> but i'm pretty sure the foreach is wrong and the xpath expressions. in the
>> moment i getting the following error:
>> 
>>Caused by: java.lang.RuntimeException:
>> org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.ClassCastException:
>> sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast
>> to java.io.Reader
>> 
>> 
>> 
>> 
>> 
>> On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:
>> 
>>> ok i see what your getting at but why doesn't the following work:
>>> 
>>>  
>>>  
>>> 
>>> i removed the tiki-processor. what am i missing, i haven't found
>> anything in the wiki?
>>> 
>>> 
>>> On 28. Sep 2013, at 12:28 AM, P Williams wrote:
>>> 
>>>> I spent some more time thinking about this.  Do you really need to use
>> the
>>>> TikaEntityProcessor?  It doesn't offer anything new to the document you
>> are
>>>> building that couldn't be accomplished by the XPathEntityProcessor alone
>>>> from what I can tell.
>>>> 
>>>> I also tried to get the Advanced
>>>> Parsing<http://wiki.apache.org/solr/TikaEntityProcessor>example to
>>>> work without success.  There are some obvious typos (
>>>> instead of ) and an odd order to the pieces ( is
>>>> enclosed by ).  It also looks like
>>>> FieldStreamDataSource<
>> http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html
>>> is
>>>> the one that is meant to work in this context. If Koji is still around
>>>> maybe he could offer some help?  Otherwise this bit of erroneous
>>>> instruction should probably be removed from the wiki.
>>>> 
>>>> Cheers,
>>>> Tricia
>>>> 
>>>> $ svn diff
>>>> Index:
>>>> 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>>> ===
>>>> ---
>>>> 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>>>   (revision 1526990)
>>>> +++
>>>> 
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>>>   (working copy)
>>>> @@ -99,13 +99,13 @@
>>>>   runFullImport(getConfigHTML("identity"));
>>>>   assertQ(req("*:*"), testsHTMLIdentity);
>>>> }
>>>> -
>>>> +
>>>> private String getConfigHTML(String htmlMapper) {
>>>>   return
>>>>   "" +
>&

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-29 Thread Andreas Owen
How dumb can you get. Obviously quite dumb... I would have to analyze the
html pages with a nested instance like this:

 









But I'm pretty sure the forEach and the xpath expressions are wrong. At the
moment I'm getting the following error:

Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: 
sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to 
java.io.Reader





On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:

> ok i see what your getting at but why doesn't the following work:
>   
>   
>   
> 
> i removed the tiki-processor. what am i missing, i haven't found anything in 
> the wiki?
> 
> 
> On 28. Sep 2013, at 12:28 AM, P Williams wrote:
> 
>> I spent some more time thinking about this.  Do you really need to use the
>> TikaEntityProcessor?  It doesn't offer anything new to the document you are
>> building that couldn't be accomplished by the XPathEntityProcessor alone
>> from what I can tell.
>> 
>> I also tried to get the Advanced
>> Parsing<http://wiki.apache.org/solr/TikaEntityProcessor>example to
>> work without success.  There are some obvious typos (
>> instead of ) and an odd order to the pieces ( is
>> enclosed by ).  It also looks like
>> FieldStreamDataSource<http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html>is
>> the one that is meant to work in this context. If Koji is still around
>> maybe he could offer some help?  Otherwise this bit of erroneous
>> instruction should probably be removed from the wiki.
>> 
>> Cheers,
>> Tricia
>> 
>> $ svn diff
>> Index:
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>> ===
>> ---
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>(revision 1526990)
>> +++
>> solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
>>(working copy)
>> @@ -99,13 +99,13 @@
>>runFullImport(getConfigHTML("identity"));
>>assertQ(req("*:*"), testsHTMLIdentity);
>>  }
>> -
>> +
>>  private String getConfigHTML(String htmlMapper) {
>>return
>>"" +
>>"  " +
>>"  " +
>> -"> processor='TikaEntityProcessor' " +
>> +"> processor='TikaEntityProcessor' " +
>>"   url='" +
>> getFile("dihextras/structured.html").getAbsolutePath() + "' " +
>>((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper +
>> "'")) + ">" +
>>"  " +
>> @@ -114,4 +114,36 @@
>>"";
>> 
>>  }
>> +  private String[] testsHTMLH1 = {
>> +  "//*[@numFound='1']"
>> +  , "//str[@name='h1'][contains(.,'H1 Header')]"
>> +  };
>> +
>> +  @Test
>> +  public void testTikaHTMLMapperSubEntity() throws Exception {
>> +runFullImport(getConfigSubEntity("identity"));
>> +assertQ(req("*:*"), testsHTMLH1);
>> +  }
>> +
>> +  private String getConfigSubEntity(String htmlMapper) {
>> +return
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"> dataSource='bin' format='html' rootEntity='false'>" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"> dataSource='fld' dataField='tika.text' rootEntity='true' >" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"" +
>> +"";
>> +  }
>> +
>> }
>> Index:
>> solr/contrib/dataimporthandler-extras/src/test-files/dihextras/solr/collection1/conf/dataimp

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-28 Thread Andreas Owen
Thanks, but the first suggestion is already implemented and the second didn't work.
I have also tried htmlMapper="identity" but nothing worked.

I also tried this, but the html was stripped in both fields:





But in the end I think it's best to cut Tika out because I'm not getting any
benefit from it. I would just need to get this to work:
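
(roughly, reconstructed here since the snippet above got mangled, so the names and
xpaths may be off:)

  <entity name="detail" processor="XPathEntityProcessor" dataSource="dataUrl"
          url="${rec.urlParse}" forEach="/xhtml:html" onError="skip">
    <field column="text_html" xpath="/xhtml:html/xhtml:body"/>
    <field column="h_1" xpath="/xhtml:html/xhtml:body/xhtml:h1"/>
  </entity>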




the fields are empty and i'm not getting any errors in the logs.


On 28. Sep 2013, at 2:43 AM, Alexandre Rafalovitch wrote:

> This is a rather complicated example to chew through, but try the following
> two things:
> *) dataField="${tika.text}"  => dataField="text" (or less likely htmlMapper
> tika.text)
> You might be trying to read content of the field rather than passing
> reference to the field that seems to be expected. This might explain the
> exception.
> 
> *) It may help to be aware of
> https://issues.apache.org/jira/browse/SOLR-4530 . There is a new
> htmlMapper="identity" flag on Tika entries to ensure more of HTML structure
> passing through. By default, Tika strips out most of the HTML tags.
> 
> Regards,
>   Alex.
> 
> On Thu, Sep 26, 2013 at 5:17 PM, Andreas Owen  wrote:
> 
>>> url="${rec.urlParse}" dataSource="dataUrl" onError="skip" format="html">
>>
>> 
>>> forEach="/html" dataSource="fld" dataField="${tika.text}" rootEntity="true"
>> onError="skip">
>>
>>
>>
>> 
> 
> 
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread Andreas Owen
solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
> at
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:469)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
> at
> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> at org.apache.solr.util.TestHarness.query(TestHarness.java:291)
> at
> org.apache.solr.handler.dataimport.AbstractDataImportHandlerTestCase.runFullImport(AbstractDataImportHandlerTestCase.java:96)
> at
> org.apache.solr.handler.dataimport.TestTikaEntityProcessor.testTikaHTMLMapperSubEntity(TestTikaEntityProcessor.java:124)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:601)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1559)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.access$600(RandomizedRunner.java:79)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:737)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:773)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:787)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
> at
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
> at
> org.apache.lucene.util.TestRuleFieldCacheSanity$1.evaluate(TestRuleFieldCacheSanity.java:51)
> at
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
> at
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
> at
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:70)
> at
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:358)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:782)
> at
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:442)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:746)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:648)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:682)
> at
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:693)
> at
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesRestoreRule$1.evaluate(SystemPropertiesRestoreRule.java:53)
> at
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
> at
> org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
> at
> com.carrotsearch.randomizedtesting.rules.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:55)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
> at
> com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
> at
> com.carrotsear

Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-27 Thread Andreas Owen
I removed the FieldReaderDataSource and dataSource="fld" but it didn't help. I
get the following for each document:
DataImportHandlerException: Exception in invoking url null Processing 
Document # 9
nullpointerexception


On 26. Sep 2013, at 8:39 PM, P Williams wrote:

> Hi,
> 
> Haven't tried this myself but maybe try leaving out the
> FieldReaderDataSource entirely.  From my quick searching looks like it's
> tied to SQL.  Did you try copying the
> http://wiki.apache.org/solr/TikaEntityProcessor Advanced Parsing example
> exactly?  What happens when you leave out FieldReaderDataSource?
> 
> Cheers,
> Tricia
> 
> 
> On Thu, Sep 26, 2013 at 4:17 AM, Andreas Owen  wrote:
> 
>> i'm using solr 4.3.1 and the dataimporter. i am trying to use
>> XPathEntityProcessor within the TikaEntityProcessor for indexing html-pages
>> but i'm getting this error for each document. i have also tried
>> dataField="tika.text" and dataField="text" to no avail. the nested
>> XPathEntityProcessor "detail" creates the error, the rest works fine. what
>> am i doing wrong?
>> 
>> error:
>> 
>> ERROR - 2013-09-26 12:08:49.006;
>> org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed
>> 'null'
>> java.lang.ClassCastException: java.io.StringReader cannot be cast to
>> java.util.Iterator
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
>>at
>> org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
>>at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>at
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
>>at
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
>>at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>>at
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
>>at
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
>>at
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
>>at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
>>at
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
>>at
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
>>at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
>>at
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
>>at
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
>>at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
>>at org.eclipse.jetty.server.Server.handle(Server.java:365)
>>at
>> org.eclipse.jetty.server.Abstract

XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-26 Thread Andreas Owen
I'm using Solr 4.3.1 and the dataimporter. I am trying to use 
XPathEntityProcessor within the TikaEntityProcessor for indexing html pages, but 
I'm getting this error for each document. I have also tried 
dataField="tika.text" and dataField="text" to no avail. The nested 
XPathEntityProcessor "detail" creates the error; the rest works fine. What am I 
doing wrong?

error:

ERROR - 2013-09-26 12:08:49.006; 
org.apache.solr.handler.dataimport.SqlEntityProcessor; The query failed 'null'
java.lang.ClassCastException: java.io.StringReader cannot be cast to 
java.util.Iterator
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:465)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:491)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)
at 
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:365)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
ERROR - 2013-09-26 12:08:49.022; org.apache.solr.common.SolrException; 
Exception in entity : 
detail:org.apache.solr.handler.dataimport.DataImportHandlerException: 
java.lang.ClassCastException: java.io.StringReader cannot be cast to 
java.util.Iterator
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65)
at 
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
at 
org.apache.solr.handler.dataim

dih HTMLStripTransformer

2013-09-24 Thread Andreas Owen
Why does stripHTML="false" have no effect in DIH? The html is stripped in both text
and text_nohtml when I display the index with select?q=*.

I'm trying to get a field without html and one with it, so I can also index the
links on the page.

data-config.xml
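(roughly, from memory, with simplified names and xpaths:)

  <entity name="rec" processor="XPathEntityProcessor" dataSource="dataUrl"
          url="${...}" forEach="/xhtml:html" transformer="HTMLStripTransformer">
    <field column="text" xpath="/xhtml:html/xhtml:body" stripHTML="false"/>
    <field column="text_nohtml" xpath="/xhtml:html/xhtml:body" stripHTML="true"/>
  </entity>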
 















Re: dih delete doc per $deleteDocById

2013-09-22 Thread Andreas Owen
sorry, it works like this, i had a typo in my conf :-(
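
For anyone searching the archive later, the working setup is roughly this (reconstructed
from memory; the exact element names in the delete file may differ):

  data-config.xml:
  <entity name="deleter" processor="XPathEntityProcessor" dataSource="main"
          url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml"
          forEach="/docs/doc">
    <field column="$deleteDocById" xpath="/docs/doc/id"/>
  </entity>

  docImportDelete.xml:
  <docs>
    <doc>
      <id>2345</id>
    </doc>
  </docs>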

On 17. Sep 2013, at 2:44 PM, Andreas Owen wrote:

> i would like to know how to get it to work and delete documents per xml and 
> dih.
> 
> On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:
> 
>> What is your question?
>> 
>> On Tue, Sep 17, 2013 at 12:17 AM, andreas owen  wrote:
>>> i am using dih and want to delete indexed documents by xml-file with ids. i 
>>> have seen $deleteDocById used in 
>>> 
>>> data-config.xml:
>>> >> url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml"
>>>  forEach="/docs/doc" dataSource="main" >
>>>   
>>> 
>>> 
>>> xml-file:
>>> 
>>>   
>>>   2345
>>>   
>>> 
>> 
>> 
>> 
>> -- 
>> Regards,
>> Shalin Shekhar Mangar.



Re: dih delete doc per $deleteDocById

2013-09-17 Thread Andreas Owen
I would like to know how to get it to work and delete documents via xml and DIH.

On 17. Sep 2013, at 1:47 PM, Shalin Shekhar Mangar wrote:

> What is your question?
> 
> On Tue, Sep 17, 2013 at 12:17 AM, andreas owen  wrote:
>> i am using dih and want to delete indexed documents by xml-file with ids. i 
>> have seen $deleteDocById used in 
>> 
>> data-config.xml:
>> > url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportDelete.xml"
>>  forEach="/docs/doc" dataSource="main" >
>>
>> 
>> 
>> xml-file:
>> 
>>
>>2345
>>
>> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



dih delete doc per $deleteDocById

2013-09-16 Thread andreas owen
I am using DIH and want to delete indexed documents via an xml file with ids. I 
have seen $deleteDocById used in 

data-config.xml:

  


xml-file:


2345



Re: charset encoding

2013-09-12 Thread Andreas Owen
It was the http header; as soon as I forced an iso-8859-1 header it worked.
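
(For completeness: if forcing the response header were not possible, I think the
encoding could also be set on the DIH data source, assuming the pages are fetched
with URLDataSource, e.g.:

  <dataSource type="URLDataSource" name="dataUrl" encoding="ISO-8859-1"/>

but I have not tried that.)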

On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote:

> could it have something to do with the meta encoding tag is iso-8859-1 but 
> the http-header tag is utf8 and firefox inteprets it as utf8?
> 
> On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:
> 
>> no jetty, and yes for tomcat i've seen a couple of answers
>> 
>> On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:
>> 
>>> Using tomcat by any chance? The ML archive has the solution. May be on
>>> Wiki, too.
>>> 
>>> Otis
>>> Solr & ElasticSearch Support
>>> http://sematext.com/
>>> On Sep 11, 2013 8:56 AM, "Andreas Owen"  wrote:
>>> 
>>>> i'm using solr 4.3.1 with tika to index html-pages. the html files are
>>>> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" as well. the
>>>> server-http-header says it's utf8 and firefox-webdeveloper agrees.
>>>> 
>>>> when i index a page with special chars like ä,ö,ü solr outputs it
>>>> completly foreign signs, not the normal wrong chars with 1/4 or the Flag in
>>>> it. so it seams that its not simply the normal utf8/iso-8859-1 discrepancy.
>>>> has anyone got a idea whats wrong?
>>>> 
>>>> 



Re: charset encoding

2013-09-12 Thread Andreas Owen
Could it have something to do with the meta encoding tag being iso-8859-1 but the
http header saying utf8, so that Firefox interprets it as utf8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

> no jetty, and yes for tomcat i've seen a couple of answers
> 
> On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:
> 
>> Using tomcat by any chance? The ML archive has the solution. May be on
>> Wiki, too.
>> 
>> Otis
>> Solr & ElasticSearch Support
>> http://sematext.com/
>> On Sep 11, 2013 8:56 AM, "Andreas Owen"  wrote:
>> 
>>> i'm using solr 4.3.1 with tika to index html-pages. the html files are
>>> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" as well. the
>>> server-http-header says it's utf8 and firefox-webdeveloper agrees.
>>> 
>>> when i index a page with special chars like ä,ö,ü solr outputs it
>>> completly foreign signs, not the normal wrong chars with 1/4 or the Flag in
>>> it. so it seams that its not simply the normal utf8/iso-8859-1 discrepancy.
>>> has anyone got a idea whats wrong?
>>> 
>>> 



Re: charset encoding

2013-09-11 Thread Andreas Owen
No, Jetty; and yes, for Tomcat I've seen a couple of answers.

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

> Using tomcat by any chance? The ML archive has the solution. May be on
> Wiki, too.
> 
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/
> On Sep 11, 2013 8:56 AM, "Andreas Owen"  wrote:
> 
>> i'm using solr 4.3.1 with tika to index html-pages. the html files are
>> iso-8859-1 (ansi) encoded and the meta tag "content-encoding" as well. the
>> server-http-header says it's utf8 and firefox-webdeveloper agrees.
>> 
>> when i index a page with special chars like ä,ö,ü solr outputs it
>> completly foreign signs, not the normal wrong chars with 1/4 or the Flag in
>> it. so it seams that its not simply the normal utf8/iso-8859-1 discrepancy.
>> has anyone got a idea whats wrong?
>> 
>> 



charset encoding

2013-09-11 Thread Andreas Owen
I'm using Solr 4.3.1 with Tika to index html pages. The html files are 
iso-8859-1 (ANSI) encoded and the meta tag "content-encoding" says so as well. The 
server http header says it's utf8 and the Firefox web developer toolbar agrees. 

When I index a page with special chars like ä,ö,ü, Solr outputs completely 
foreign characters, not the usual wrong chars with 1/4 or the flag in them. So it 
seems that it's not simply the normal utf8/iso-8859-1 discrepancy. Has anyone 
got an idea what's wrong?



Re: charfilter doesn't do anything

2013-09-11 Thread Andreas Owen
Perfect. I tried it before but always at the tail of the expression, with no
effect. Thanks a lot. One last question: do you know how to keep the html
comments from being filtered before the transformer has done its work?
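
So the working charFilter line now looks like this (with the angle brackets
XML-escaped inside schema.xml):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1"/>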


On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

> Okay, I can repro the problem. Yes, in appears that the pattern replace char 
> filter does not default to multiline mode for pattern matching, so  on 
> one line and  on another line cannot be matched.
> 
> Now, whether that is by design or a bug or an option for enhancement is a 
> matter for some committer to comment on.
> 
> But, the good news is that you can in fact set multiline mode in your pattern 
> my starting it with "(?s)", which means that dot accepts line break 
> characters as well.
> 
> So, here are my revised field types:
> 
>  positionIncrementGap="100" >
> 
>pattern="(?s)^.*<body>(.*)</body>.*$" replacement="$1" />
>   
>   
> 
> 
> 
>  positionIncrementGap="100" >
> 
>pattern="(?s)^.*<body>(.*)</body>.*$" replacement="$1" />
>   
>   
>   
> 
> 
> 
> The first type accepts everything within , including nested HTML 
> formatting, while the latter strips nested HTML formatting as well.
> 
> The tokenizer will in fact strip out white space, but that happens after all 
> character filters have completed.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Tuesday, September 10, 2013 7:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> ok i am getting there now but if there are newlines involved the regex stops 
> as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex. I have 
> to get rid of the newlines. why isn't whitespaceTokenizerFactory the right 
> element for this?
> 
> 
> On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:
> 
>> Use XML then. Although you will need to escape the XML special characters as 
>> I did in the pattern.
>> 
>> The point is simply: Quickly and simply try to find the simple test scenario 
>> that illustrates the problem.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Andreas Owen
>> Sent: Monday, September 09, 2013 7:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> i tried but that isn't working either, it want a data-stream, i'll have to 
>> check how to post json instead of xml
>> 
>> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
>> 
>>> Did you at least try the pattern I gave you?
>>> 
>>> The point of the curl was the data, not how you send the data. You can just 
>>> use the standard Solr simple post tool.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 6:40 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> i've downloaded curl and tried it in the comman prompt and power shell on 
>>> my win 2008r2 server, thats why i used my dataimporter with a single line 
>>> html file and copy/pastet the lines into schema.xml
>>> 
>>> 
>>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>>> 
>>>> Did you in fact try my suggested example? If not, please do so.
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -Original Message- From: Andreas Owen
>>>> Sent: Monday, September 09, 2013 4:42 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: charfilter doesn't do anything
>>>> 
>>>> i index html pages with a lot of lines and not just a string with the 
>>>> body-tag.
>>>> it doesn't work with proper html files, even though i took all the new 
>>>> lines out.
>>>> 
>>>> html-file:
>>>> nav-content nur das will ich sehenfooter-content
>>>> 
>>>> solr update debug output:
>>>> "text_html": ["\r\n\r\n>>> content=\"ISO-8859-1\">\r\n>>> content=\"text/html; 
>>>> charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das 
>>>> will ich sehenfooter-content"]
>>>> 
>>>> 
>>>> 
>>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>>> 
>>>>> I tried this and it se

Re: charfilter doesn't do anything

2013-09-10 Thread Andreas Owen
OK, I am getting there now, but if there are newlines involved the regex stops as 
soon as it reaches a "\r\n", even if I try [\t\r\n.]* in the regex. I have to 
get rid of the newlines. Why isn't WhitespaceTokenizerFactory the right element 
for this?


On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:

> Use XML then. Although you will need to escape the XML special characters as 
> I did in the pattern.
> 
> The point is simply: Quickly and simply try to find the simple test scenario 
> that illustrates the problem.
> 
> -- Jack Krupansky
> 
> -----Original Message- From: Andreas Owen
> Sent: Monday, September 09, 2013 7:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> i tried but that isn't working either, it want a data-stream, i'll have to 
> check how to post json instead of xml
> 
> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
> 
>> Did you at least try the pattern I gave you?
>> 
>> The point of the curl was the data, not how you send the data. You can just 
>> use the standard Solr simple post tool.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Andreas Owen
>> Sent: Monday, September 09, 2013 6:40 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> i've downloaded curl and tried it in the comman prompt and power shell on my 
>> win 2008r2 server, thats why i used my dataimporter with a single line html 
>> file and copy/pastet the lines into schema.xml
>> 
>> 
>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>> 
>>> Did you in fact try my suggested example? If not, please do so.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Andreas Owen
>>> Sent: Monday, September 09, 2013 4:42 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> i index html pages with a lot of lines and not just a string with the 
>>> body-tag.
>>> it doesn't work with proper html files, even though i took all the new 
>>> lines out.
>>> 
>>> html-file:
>>> nav-content nur das will ich sehenfooter-content
>>> 
>>> solr update debug output:
>>> "text_html": ["\r\n\r\n>> content=\"ISO-8859-1\">\r\n>> charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das 
>>> will ich sehenfooter-content"]
>>> 
>>> 
>>> 
>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>> 
>>>> I tried this and it seems to work when added to the standard Solr example 
>>>> in 4.4:
>>>> 
>>>> 
>>>> 
>>>> >>> positionIncrementGap="100" >
>>>> 
>>>> >>> pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> That char filter retains only text between  and . Is that 
>>>> what you wanted?
>>>> 
>>>> Indexing this data:
>>>> 
>>>> curl 'localhost:8983/solr/update?commit=true' -H 
>>>> 'Content-type:application/json' -d '
>>>> [{"id":"doc-1","body":"abc A test. def"}]'
>>>> 
>>>> And querying with these commands:
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
>>>> Shows all data
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
>>>> shows the body text
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
>>>> shows nothing (outside of body)
>>>> 
>>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
>>>> Shows nothing, HTML tag stripped
>>>> 
>>>> In your original query, you didn't show us what your default field, df 
>>>> parameter, was.
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -Original Message- From: Andreas Owen
>>>> Sent: Sunday, September 08, 

Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I've downloaded curl and tried it in the command prompt and PowerShell on my 
Win 2008 R2 server; that's why I used my dataimporter with a single-line HTML 
file and copy/pasted the lines into schema.xml.


On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

> Did you in fact try my suggested example? If not, please do so.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Monday, September 09, 2013 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> i index html pages with a lot of lines and not just a string with the 
> body-tag.
> it doesn't work with proper html files, even though i took all the new lines 
> out.
> 
> html-file:
> nav-content nur das will ich sehenfooter-content
> 
> solr update debug output:
> "text_html": ["\r\n\r\n content=\"ISO-8859-1\">\r\n charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das 
> will ich sehenfooter-content"]
> 
> 
> 
> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
> 
>> I tried this and it seems to work when added to the standard Solr example in 
>> 4.4:
>> 
>> 
>> 
>> > positionIncrementGap="100" >
>> 
>>  > pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>  
>>  
>> 
>> 
>> 
>> That char filter retains only text between  and . Is that what 
>> you wanted?
>> 
>> Indexing this data:
>> 
>> curl 'localhost:8983/solr/update?commit=true' -H 
>> 'Content-type:application/json' -d '
>> [{"id":"doc-1","body":"abc A test. def"}]'
>> 
>> And querying with these commands:
>> 
>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
>> Shows all data
>> 
>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
>> shows the body text
>> 
>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
>> shows nothing (outside of body)
>> 
>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
>> shows nothing (outside of body)
>> 
>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
>> Shows nothing, HTML tag stripped
>> 
>> In your original query, you didn't show us what your default field, df 
>> parameter, was.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Andreas Owen
>> Sent: Sunday, September 08, 2013 5:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> yes but that filter html and not the specific tag i want.
>> 
>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>> 
>>> Hmmm, have you looked at:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>> 
>>> Not quite the , perhaps, but might it help?
>>> 
>>> 
>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen  wrote:
>>> 
>>>> ok i have html pages with .content i
>>>> want.. i want to extract (index, store) only
>>>> that between the body-comments. i thought regexTransformer would be the
>>>> best because xpath doesn't work in tika and i cant nest a
>>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>> formed html, which i would like to switch off.
>>>> 
>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>> 
>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>> i've managed to get it working if i use the regexTransformer and string
>>>> is on the same line in my tika entity. but when the string is multilined it
>>>> isn't working even though i tried ?s to set the flag dotall.
>>>>>> 
>>>>>> >>> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>> transformer="RegexTransformer">
>>>>>>   >>> replaceWith="QQQ" sourceColName="text"  />
>>>>>> 
>>>>>> 
>>>>>> then i tried it like this and i get a stackoverflow
>>>>>> 
>>>>>

Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I tried, but that isn't working either; it wants a data stream. I'll have to 
check how to post JSON instead of XML.

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
> 
> The point of the curl was the data, not how you send the data. You can just 
> use the standard Solr simple post tool.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> i've downloaded curl and tried it in the comman prompt and power shell on my 
> win 2008r2 server, thats why i used my dataimporter with a single line html 
> file and copy/pastet the lines into schema.xml
> 
> 
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
> 
>> Did you in fact try my suggested example? If not, please do so.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> i index html pages with a lot of lines and not just a string with the 
>> body-tag.
>> it doesn't work with proper html files, even though i took all the new lines 
>> out.
>> 
>> html-file:
>> nav-content nur das will ich sehenfooter-content
>> 
>> solr update debug output:
>> "text_html": ["\r\n\r\n> content=\"ISO-8859-1\">\r\n> charset=ISO-8859-1\">\r\n\r\n\r\nnav-content nur das 
>> will ich sehenfooter-content"]
>> 
>> 
>> 
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>> 
>>> I tried this and it seems to work when added to the standard Solr example 
>>> in 4.4:
>>> 
>>> 
>>> 
>>> >> positionIncrementGap="100" >
>>> 
>>> >> pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>> 
>>> 
>>> 
>>> 
>>> 
>>> That char filter retains only text between  and . Is that what 
>>> you wanted?
>>> 
>>> Indexing this data:
>>> 
>>> curl 'localhost:8983/solr/update?commit=true' -H 
>>> 'Content-type:application/json' -d '
>>> [{"id":"doc-1","body":"abc A test. def"}]'
>>> 
>>> And querying with these commands:
>>> 
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
>>> Shows all data
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
>>> shows the body text
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
>>> shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
>>> shows nothing (outside of body)
>>> 
>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
>>> Shows nothing, HTML tag stripped
>>> 
>>> In your original query, you didn't show us what your default field, df 
>>> parameter, was.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>> 
>>> yes but that filter html and not the specific tag i want.
>>> 
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>> 
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Not quite the , perhaps, but might it help?
>>>> 
>>>> 
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen  wrote:
>>>> 
>>>>> ok i have html pages with .content i
>>>>> want.. i want to extract (index, store) only
>>>>> that between the body-comments. i thought regexTransformer would be the
>>>>> best because xpath doesn't work in tika and i cant nest a
>>>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>>> formed htm

Re: charfilter doesn't do anything

2013-09-09 Thread Andreas Owen
I index HTML pages with a lot of lines, not just a string with the body tag. 
It doesn't work with proper HTML files, even though I took all the new lines 
out.

html-file:
nav-content nur das will ich sehenfooter-content

solr update debug output:
"text_html": ["\r\n\r\n\r\n\r\n\r\n\r\nnav-content nur das will 
ich sehenfooter-content"]



On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

> I tried this and it seems to work when added to the standard Solr example in 
> 4.4:
> 
> 
> 
>  positionIncrementGap="100" >
> 
>pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>   
>   
> 
> 
> 
> That char filter retains only text between  and . Is that what 
> you wanted?
> 
> Indexing this data:
> 
> curl 'localhost:8983/solr/update?commit=true' -H 
> 'Content-type:application/json' -d '
> [{"id":"doc-1","body":"abc A test. def"}]'
> 
> And querying with these commands:
> 
> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
> Shows all data
> 
> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
> shows the body text
> 
> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
> shows nothing (outside of body)
> 
> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
> shows nothing (outside of body)
> 
> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
> Shows nothing, HTML tag stripped
> 
> In your original query, you didn't show us what your default field, df 
> parameter, was.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Sunday, September 08, 2013 5:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> yes but that filter html and not the specific tag i want.
> 
> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
> 
>> Hmmm, have you looked at:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>> 
>> Not quite the , perhaps, but might it help?
>> 
>> 
>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen  wrote:
>> 
>>> ok i have html pages with .content i
>>> want.. i want to extract (index, store) only
>>> that between the body-comments. i thought regexTransformer would be the
>>> best because xpath doesn't work in tika and i cant nest a
>>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>>> htmlparser from tika cuts my body-comments out and tries to make well
>>> formed html, which i would like to switch off.
>>> 
>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>> 
>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>> i've managed to get it working if i use the regexTransformer and string
>>> is on the same line in my tika entity. but when the string is multilined it
>>> isn't working even though i tried ?s to set the flag dotall.
>>>>> 
>>>>> >> dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>> transformer="RegexTransformer">
>>>>>>> replaceWith="QQQ" sourceColName="text"  />
>>>>> 
>>>>> 
>>>>> then i tried it like this and i get a stackoverflow
>>>>> 
>>>>> >> replaceWith="QQQ" sourceColName="text"  />
>>>>> 
>>>>> in javascript this works but maybe because i only used a small string.
>>>> 
>>>> Sounds like we've got an XY problem here.
>>>> 
>>>> http://people.apache.org/~hossman/#xyproblem
>>>> 
>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>> and then we can find a solution for you?
>>>> 
>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>> tags out.  Perhaps the HTMLStripCharFilter?
>>>> 
>>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>> 
>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>> be able to search for individual words on your HTML input.  The entire
>>>> input string is treated as a single token, and therefore ONLY exact
>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>> 
>>>> 
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>> 
>>>> Note that no matter what you do to your data with the analysis chain,
>>>> Solr will always return the text that was originally indexed in search
>>>> results.  If you need to affect what gets stored as well, perhaps you
>>>> need an Update Processor.
>>>> 
>>>> Thanks,
>>>> Shawn
>>> 



Re: charfilter doesn't do anything

2013-09-08 Thread Andreas Owen
Yes, but that filters HTML in general, not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:

> Hmmm, have you looked at:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Not quite the , perhaps, but might it help?
> 
> 
> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen  wrote:
> 
>> ok i have html pages with .content i
>> want.. i want to extract (index, store) only
>> that between the body-comments. i thought regexTransformer would be the
>> best because xpath doesn't work in tika and i cant nest a
>> xpathEntetyProcessor to use xpath. what i have also found out is that the
>> htmlparser from tika cuts my body-comments out and tries to make well
>> formed html, which i would like to switch off.
>> 
>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>> 
>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>> i've managed to get it working if i use the regexTransformer and string
>> is on the same line in my tika entity. but when the string is multilined it
>> isn't working even though i tried ?s to set the flag dotall.
>>>> 
>>>> > dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>> transformer="RegexTransformer">
>>>> > replaceWith="QQQ" sourceColName="text"  />
>>>> 
>>>> 
>>>> then i tried it like this and i get a stackoverflow
>>>> 
>>>> > replaceWith="QQQ" sourceColName="text"  />
>>>> 
>>>> in javascript this works but maybe because i only used a small string.
>>> 
>>> Sounds like we've got an XY problem here.
>>> 
>>> http://people.apache.org/~hossman/#xyproblem
>>> 
>>> How about you tell us *exactly* what you'd actually like to have happen
>>> and then we can find a solution for you?
>>> 
>>> It sounds a little bit like you're interested in stripping all the HTML
>>> tags out.  Perhaps the HTMLStripCharFilter?
>>> 
>>> 
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>> 
>>> Something that I already said: By using the KeywordTokenizer, you won't
>>> be able to search for individual words on your HTML input.  The entire
>>> input string is treated as a single token, and therefore ONLY exact
>>> entire-field matches (or certain wildcard matches) will be possible.
>>> 
>>> 
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>> 
>>> Note that no matter what you do to your data with the analysis chain,
>>> Solr will always return the text that was originally indexed in search
>>> results.  If you need to affect what gets stored as well, perhaps you
>>> need an Update Processor.
>>> 
>>> Thanks,
>>> Shawn
>> 
>> 



Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
OK, I have HTML pages with .content I 
want.. I want to extract (index, store) only what is 
between the body comments. I thought RegexTransformer would be best, because 
XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. 
What I have also found out is that the HTML parser from Tika cuts my 
body comments out and tries to make well-formed HTML, which I would like to 
switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:

> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>> i've managed to get it working if i use the regexTransformer and string is 
>> on the same line in my tika entity. but when the string is multilined it 
>> isn't working even though i tried ?s to set the flag dotall.
>> 
>> > dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
>> transformer="RegexTransformer">
>>  > replaceWith="QQQ" sourceColName="text"  />
>> 
>>  
>> then i tried it like this and i get a stackoverflow
>> 
>> > replaceWith="QQQ" sourceColName="text"  />
>> 
>> in javascript this works but maybe because i only used a small string.
> 
> Sounds like we've got an XY problem here.
> 
> http://people.apache.org/~hossman/#xyproblem
> 
> How about you tell us *exactly* what you'd actually like to have happen
> and then we can find a solution for you?
> 
> It sounds a little bit like you're interested in stripping all the HTML
> tags out.  Perhaps the HTMLStripCharFilter?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
> 
> Something that I already said: By using the KeywordTokenizer, you won't
> be able to search for individual words on your HTML input.  The entire
> input string is treated as a single token, and therefore ONLY exact
> entire-field matches (or certain wildcard matches) will be possible.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
> 
> Note that no matter what you do to your data with the analysis chain,
> Solr will always return the text that was originally indexed in search
> results.  If you need to affect what gets stored as well, perhaps you
> need an Update Processor.
> 
> Thanks,
> Shawn



Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
I've managed to get it working if I use the RegexTransformer and the string is on 
the same line in my Tika entity. But when the string spans multiple lines it isn't 
working, even though I tried ?s to set the DOTALL flag.





Then I tried it like this and I get a StackOverflowError:



In JavaScript this works, but maybe only because I used a small string.
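(One hedged guess at the StackOverflowError: if the regex, which the archive stripped out, crossed lines with a per-character alternation such as (.|\n)*, java.util.regex recurses once per matched character and overflows the stack on a full HTML page, whereas an inline (?s) with a plain .* is handled iteratively. A sketch of the RegexTransformer entity under that assumption; BODY-START/BODY-END stand in for the real markers, which were lost:

    <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}"
            dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
            transformer="RegexTransformer">
        <!-- keep only what sits between the two placeholder markers; (?s) lets . match newlines -->
        <field column="text" sourceColName="text"
               regex="(?s)^.*BODY-START(.*)BODY-END.*$" replaceWith="$1"/>
    </entity>
)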



On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that your changed your schema since you indexed the data? 
> If so, re-index the data.
> 
> If a "*" query finds nothing, that implies that the default field is empty. 
> Are you sure the "df" parameter is set to the field containing your data? 
> Show us your request handler definition and a sample of your actual Solr 
> input (Solr XML or JSON?) so that we can see what fields are being populated.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> the input string is a normal html page with the word Zahlungsverkehr in it 
> and my query is ...solr/collection1/select?q=*
> 
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
> 
>> And show us an input string and a query that fail.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> i would like to filter / replace a word during indexing but it doesn't do 
>>> anything and i dont get a error.
>>> 
>>> in schema.xml i have the following:
>>> 
>>> >> multiValued="true"/>
>>> 
>>> 
>>> 
>>> 
>>> >> pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>> 
>>> 
>>>  
>>> 
>>> my 2. question is where can i say that the expression is multilined like in 
>>> javascript i can use /m at the end of the pattern?
>> 
>> I don't know about your second question.  I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>> 
>> As for the first question, here's what I have.  Did you reindex?  That
>> will be required.
>> 
>> http://wiki.apache.org/solr/HowToReindex
>> 
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"?  The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>> 
>> Note that both the pattern and replacement are case sensitive.  This is
>> how regex works.  You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>> 
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>> 
>> Thanks,
>> Shawn 



Re: charfilter doesn't do anything

2013-09-06 Thread Andreas Owen
The input string is a normal HTML page with the word Zahlungsverkehr in it, and 
my query is ...solr/collection1/select?q=*
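(One thing that helps when a bare q=* does not show what is expected is to qualify the field in the query; assuming the HTML lands in text_html, as the update debug output elsewhere in the thread suggests, a check that the field is populated at all could look roughly like this, with host, port and collection being the stock example values:

    curl "http://localhost:8983/solr/collection1/select?q=text_html:*&wt=json&indent=true"

If that returns the documents but a search for the replaced word does not, the problem sits in the analysis chain rather than in the request.)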

On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:

> And show us an input string and a query that fail.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Shawn Heisey
> Sent: Thursday, September 05, 2013 2:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>> i would like to filter / replace a word during indexing but it doesn't do 
>> anything and i dont get a error.
>> 
>> in schema.xml i have the following:
>> 
>> > multiValued="true"/>
>> 
>> 
>> 
>>  
>>  > pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>  
>> 
>>   
>> 
>> my 2. question is where can i say that the expression is multilined like in 
>> javascript i can use /m at the end of the pattern?
> 
> I don't know about your second question.  I don't know if that will be
> possible, but I'll leave that to someone who's more expert than I.
> 
> As for the first question, here's what I have.  Did you reindex?  That
> will be required.
> 
> http://wiki.apache.org/solr/HowToReindex
> 
> Assuming that you did reindex, are you trying to search for ASDFGHJK in
> a field that contains more than just "Zahlungsverkehr"?  The keyword
> tokenizer might not do what you expect - it tokenizes the entire input
> string as a single token, which means that you won't be able to search
> for single words in a multi-word field without wildcards, which are
> pretty slow.
> 
> Note that both the pattern and replacement are case sensitive.  This is
> how regex works.  You haven't used a lowercase filter, which means that
> you won't be able to search for asdfghjk.
> 
> Use the analysis tab in the UI on your core to see what Solr does to
> your field text.
> 
> Thanks,
> Shawn 



charfilter doesn't do anything

2013-09-05 Thread Andreas Owen
I would like to filter / replace a word during indexing, but it doesn't do 
anything and I don't get an error.

in schema.xml i have the following:





  
  
  

   

My second question is: where can I say that the expression is multi-line? In 
JavaScript I can use /m at the end of the pattern.
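(The schema snippet above lost its tags in the archive. Pieced together from the attribute fragments that survive in the replies, it presumably looked roughly like the following; the field name is taken from the update debug output elsewhere in the thread, while the type name here is invented:

    <field name="text_html" type="text_cutHtml" indexed="true" stored="true" multiValued="true"/>

    <fieldType name="text_cutHtml" class="solr.TextField">
        <analyzer>
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                        pattern="Zahlungsverkehr" replacement="ASDFGHJK"/>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
        </analyzer>
    </fieldType>
)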

Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
Or could I use a filter in schema.xml, where I define a fieldType and use some 
filter that understands XPath?
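(Stock Solr has no charfilter that evaluates XPath, but a fieldType can get close to "index only this part of the page" by chaining a regex charfilter, to keep the wanted region, with the HTML stripper, to drop the remaining markup. A rough sketch; the type name and the content markers are placeholders, not the real markup:

    <fieldType name="text_div" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                        pattern="(?s)^.*&lt;!--content--&gt;(.*)&lt;!--/content--&gt;.*$"
                        replacement="$1"/>
            <charFilter class="solr.HTMLStripCharFilterFactory"/>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Note this only affects what gets indexed; the stored value that comes back in results stays the full page, which matches Shawn's earlier point about needing an update processor to change what is stored.)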

On 4. Sep 2013, at 11:52 AM, Shalin Shekhar Mangar wrote:

> No that wouldn't work. It seems that you probably need a custom
> Transformer to extract the right div content. I do not know if
> TikaEntityProcessor supports such a thing.
> 
> On Wed, Sep 4, 2013 at 12:38 PM, Andreas Owen  wrote:
>> so could i just nest it in a XPathEntityProcessor to filter the html or is 
>> there something like xpath for tika?
>> 
>> > forEach="/div[@id='content']" dataSource="main">
>>> url="${htm}" dataSource="dataUrl" onError="skip" htmlMapper="identity" 
>> format="html" >
>>
>>
>>
>> 
>> but now i dont know how to pass the text to tika, what do i put in url and 
>> datasource?
>> 
>> 
>> On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:
>> 
>>> I don't know much about Tika but in the example data-config.xml that
>>> you posted, the "xpath" attribute on the field "text" won't work
>>> because the xpath attribute is used only by a XPathEntityProcessor.
>>> 
>>> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen  wrote:
>>>> I want tika to only index the content in ... for 
>>>> the field "text". unfortunately it's indexing the hole page. Can't xpath 
>>>> do this?
>>>> 
>>>> data-config.xml:
>>>> 
>>>> 
>>>>   
>>>>   
>>>>   
>>>> 
>>>>   >>> url="http://127.0.0.1/tkb/internet/docImportUrl.xml"; forEach="/docs/doc" 
>>>> dataSource="main"> 
>>>>   
>>>>   
>>>>   
>>>>   
>>>>   
>>>>   
>>>> 
>>>>   >>> url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" 
>>>> htmlMapper="identity" format="html" >
>>>>   
>>>> 
>>>>   
>>>>   
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



Re: dataimporter tika doesn't extract certain div

2013-09-04 Thread Andreas Owen
So could I just nest it in an XPathEntityProcessor to filter the HTML, or is 
there something like XPath for Tika?







But now I don't know how to pass the text to Tika; what do I put in url and 
dataSource?


On 3. Sep 2013, at 5:56 PM, Shalin Shekhar Mangar wrote:

> I don't know much about Tika but in the example data-config.xml that
> you posted, the "xpath" attribute on the field "text" won't work
> because the xpath attribute is used only by a XPathEntityProcessor.
> 
> On Thu, Aug 29, 2013 at 10:20 PM, Andreas Owen  wrote:
>> I want tika to only index the content in ... for the 
>> field "text". unfortunately it's indexing the hole page. Can't xpath do this?
>> 
>> data-config.xml:
>> 
>> 
>>
>>
>>
>> 
>>> url="http://127.0.0.1/tkb/internet/docImportUrl.xml"; forEach="/docs/doc" 
>> dataSource="main"> 
>>
>>
>>
>>
>>
>>
>> 
>>> url="${rec.path}${rec.file}" dataSource="dataUrl" onError="skip" 
>> htmlMapper="identity" format="html" >
>>
>> 
>>
>>
>> 
>> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



dataimporter tika doesn't extract certain div

2013-08-29 Thread Andreas Owen
I want Tika to only index the content in ... for the 
field "text". Unfortunately it's indexing the whole page. Can't XPath do this?

data-config.xml:






http://127.0.0.1/tkb/internet/docImportUrl.xml"; forEach="/docs/doc" 
dataSource="main"> 















Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
I changed the following line (xpath): 

On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because Tika processor does not support path extraction. You
> need to nest one more level.
> 
> Regards,
>  Alex
> On 22 Aug 2013 13:34, "Andreas Owen"  wrote:
> 
>> i can do it like this but then the content isn't copied to text. it's just
>> in text_test
>> 
>> > url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>
>>
>> 
>> 
>> 
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>> 
>>> i put it in the tika-entity as attribute, but it doesn't change
>> anything. my bigger concern is why text_test isn't populated at all
>>> 
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>> 
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>> 
>>>> Specifically, setting htmlMapper="identity" on the entity definition.
>> This
>>>> will tell Tika to send full HTML rather than a seriously stripped one.
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at
>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>>> 
>>>> 
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen  wrote:
>>>> 
>>>>> i'm trying to index a html page and only user the div with the
>>>>> id="content". unfortunately nothing is working within the tika-entity,
>> only
>>>>> the standard text (content) is populated.
>>>>> 
>>>>>  do i have to use copyField for test_text to get the data?
>>>>>  or is there a problem with the entity-hirarchy?
>>>>>  or is the xpath wrong, even though i've tried it without and just
>>>>> using text?
>>>>>  or should i use the updateextractor?
>>>>> 
>>>>> data-config.xml:
>>>>> 
>>>>> 
>>>>>  
>>>>>  
>>>>>  http://127.0.0.1/tkb/internet/"; name="main"/>
>>>>> 
>>>>>  >>>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> 
>>>>>  >>>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>>  
>>>>>  >>>> xpath="//div[@id='content']" />
>>>>>  
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> docImporterUrl.xml:
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  5
>>>>>  tkb
>>>>>  Startseite
>>>>>  blabla ...
>>>>>  http://localhost/tkb/internet/index.cfm
>>>>>  http://localhost/tkb/internet/index.cfm/url
>>>>>  http\specialConf
>>>>>  
>>>>>  
>>>>>  6
>>>>>  tkb
>>>>>  Eigenheim
>>>>>  Machen Sie sich erste Gedanken über den
>>>>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder
>> gar ein
>>>>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>>>>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in
>> finanzieller
>>>>> Hinsicht gelingt.
>>>>>  
>>>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm
>>>>>  
>>>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>>>>  
>>>>> 
>> 
>> 



Re: dataimporter tika fields empty

2013-08-23 Thread Andreas Owen
OK, but I'm not doing any path extraction, at least I don't think so.

htmlMapper="identity" isn't preserving the HTML.

It's reading the content of the pages, but it's not putting it into both "text_test" 
and "text"; it's only in "text_test", so the copyField isn't working. 
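(For reference, a copyField rule that pushes whatever lands in text_test into text looks like this in schema.xml; both fields have to be declared, and it only applies to documents indexed after the rule is in place, so a re-import is needed:

    <copyField source="text_test" dest="text"/>
)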

data-config.xml:






http://127.0.0.1/tkb/internet/docImportUrl.xml"; forEach="/docs/doc" 
dataSource="main"> 

















On 22. Aug 2013, at 10:06 PM, Alexandre Rafalovitch wrote:

> Ah. That's because Tika processor does not support path extraction. You
> need to nest one more level.
> 
> Regards,
>  Alex
> On 22 Aug 2013 13:34, "Andreas Owen"  wrote:
> 
>> i can do it like this but then the content isn't copied to text. it's just
>> in text_test
>> 
>> > url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>
>>
>> 
>> 
>> 
>> On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:
>> 
>>> i put it in the tika-entity as attribute, but it doesn't change
>> anything. my bigger concern is why text_test isn't populated at all
>>> 
>>> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
>>> 
>>>> Can you try SOLR-4530 switch:
>>>> https://issues.apache.org/jira/browse/SOLR-4530
>>>> 
>>>> Specifically, setting htmlMapper="identity" on the entity definition.
>> This
>>>> will tell Tika to send full HTML rather than a seriously stripped one.
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at
>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>> book)
>>>> 
>>>> 
>>>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen  wrote:
>>>> 
>>>>> i'm trying to index a html page and only user the div with the
>>>>> id="content". unfortunately nothing is working within the tika-entity,
>> only
>>>>> the standard text (content) is populated.
>>>>> 
>>>>>  do i have to use copyField for test_text to get the data?
>>>>>  or is there a problem with the entity-hirarchy?
>>>>>  or is the xpath wrong, even though i've tried it without and just
>>>>> using text?
>>>>>  or should i use the updateextractor?
>>>>> 
>>>>> data-config.xml:
>>>>> 
>>>>> 
>>>>>  
>>>>>  
>>>>>  http://127.0.0.1/tkb/internet/"; name="main"/>
>>>>> 
>>>>>  >>>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> 
>>>>>  >>>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>>>  
>>>>>  >>>> xpath="//div[@id='content']" />
>>>>>  
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> docImporterUrl.xml:
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>  5
>>>>>  tkb
>>>>>  Startseite
>>>>>  blabla ...
>>>>>  http://localhost/tkb/internet/index.cfm
>>>>>  http://localhost/tkb/internet/index.cfm/url
>>>>>  http\specialConf
>>>>>  
>>>>>  
>>>>>  6
>>>>>  tkb
>>>>>  Eigenheim
>>>>>  Machen Sie sich erste Gedanken über den
>>>>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder
>> gar ein
>>>>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>>>>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in
>> finanzieller
>>>>> Hinsicht gelingt.
>>>>>  
>>>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm
>>>>>  
>>>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>>>>  
>>>>> 
>> 
>> 



Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
I can do it like this, but then the content isn't copied to "text"; it's just in 
"text_test":







On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

> i put it in the tika-entity as attribute, but it doesn't change anything. my 
> bigger concern is why text_test isn't populated at all
> 
> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
> 
>> Can you try SOLR-4530 switch:
>> https://issues.apache.org/jira/browse/SOLR-4530
>> 
>> Specifically, setting htmlMapper="identity" on the entity definition. This
>> will tell Tika to send full HTML rather than a seriously stripped one.
>> 
>> Regards,
>> Alex.
>> 
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>> 
>> 
>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen  wrote:
>> 
>>> i'm trying to index a html page and only user the div with the
>>> id="content". unfortunately nothing is working within the tika-entity, only
>>> the standard text (content) is populated.
>>> 
>>>   do i have to use copyField for test_text to get the data?
>>>   or is there a problem with the entity-hirarchy?
>>>   or is the xpath wrong, even though i've tried it without and just
>>> using text?
>>>   or should i use the updateextractor?
>>> 
>>> data-config.xml:
>>> 
>>> 
>>>   
>>>   
>>>   http://127.0.0.1/tkb/internet/"; name="main"/>
>>> 
>>>   >> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>   
>>>   
>>>   
>>>   
>>>   
>>>   
>>> 
>>>   >> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>   
>>>   >> xpath="//div[@id='content']" />
>>>   
>>>   
>>> 
>>> 
>>> 
>>> docImporterUrl.xml:
>>> 
>>> 
>>> 
>>> 
>>>   5
>>>   tkb
>>>   Startseite
>>>   blabla ...
>>>   http://localhost/tkb/internet/index.cfm
>>>   http://localhost/tkb/internet/index.cfm/url
>>>   http\specialConf
>>>   
>>>   
>>>   6
>>>   tkb
>>>   Eigenheim
>>>   Machen Sie sich erste Gedanken über den
>>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
>>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
>>> Hinsicht gelingt.
>>>   
>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm
>>>   
>>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>>   
>>> 



Re: dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
I put it in the Tika entity as an attribute, but it doesn't change anything. My 
bigger concern is why text_test isn't populated at all.

On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:

> Can you try SOLR-4530 switch:
> https://issues.apache.org/jira/browse/SOLR-4530
> 
> Specifically, setting htmlMapper="identity" on the entity definition. This
> will tell Tika to send full HTML rather than a seriously stripped one.
> 
> Regards,
> Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen  wrote:
> 
>> i'm trying to index a html page and only user the div with the
>> id="content". unfortunately nothing is working within the tika-entity, only
>> the standard text (content) is populated.
>> 
>>do i have to use copyField for test_text to get the data?
>>or is there a problem with the entity-hirarchy?
>>or is the xpath wrong, even though i've tried it without and just
>> using text?
>>or should i use the updateextractor?
>> 
>> data-config.xml:
>> 
>> 
>>
>>
>>http://127.0.0.1/tkb/internet/"; name="main"/>
>> 
>>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>
>>
>>
>>
>>
>>
>> 
>>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>
>>> xpath="//div[@id='content']" />
>>
>>
>> 
>> 
>> 
>> docImporterUrl.xml:
>> 
>> 
>> 
>> 
>>5
>>tkb
>>Startseite
>>blabla ...
>>http://localhost/tkb/internet/index.cfm
>>http://localhost/tkb/internet/index.cfm/url
>>http\specialConf
>>
>>
>>6
>>tkb
>>Eigenheim
>>Machen Sie sich erste Gedanken über den
>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
>> Hinsicht gelingt.
>>
>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm
>>
>> http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url
>>
>> 



dataimporter tika fields empty

2013-08-22 Thread Andreas Owen
I'm trying to index an HTML page and only use the div with the id="content". 
Unfortunately nothing is working within the Tika entity; only the standard text 
(content) is populated. 

Do I have to use copyField for text_test to get the data? 
Or is there a problem with the entity hierarchy?
Or is the xpath wrong, even though I've tried it without and just using 
text?
Or should I use the update extractor?

data-config.xml:




http://127.0.0.1/tkb/internet/"; name="main"/>

 





  



   





docImporterUrl.xml:




5
tkb
Startseite
blabla ...
http://localhost/tkb/internet/index.cfm
http://localhost/tkb/internet/index.cfm/url
http\specialConf


6
tkb
Eigenheim
Machen Sie sich erste Gedanken über den Erwerb von 
Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein spruchreifes 
Projekt? Wir beraten Sie gerne in allen Fragen rund um den Erwerb oder Bau von 
Wohneigentum, damit Ihr Vorhaben auch in finanzieller Hinsicht 
gelingt.

http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm

http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url



Re: dataimporter, custom fields and parsing error

2013-07-23 Thread Andreas Owen
I have tried post.jar, and it works when I set the literal.id in solrconfig.xml. 
I can't pass the id with post.jar (-Dparams=literal.id=abc) because I get an 
error: "could not find or load main class .id=abc".
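(The "could not find or load main class .id=abc" message looks like the shell split the -D option apart; PowerShell is known to break arguments of the form -Dname=value.with.dots unless they are quoted, so quoting the whole option may already be enough. A sketch only; the URL and file name are placeholders:

    java -Durl=http://localhost:8983/solr/update/extract -Dtype=application/pdf "-Dparams=literal.id=abc" -jar post.jar example.pdf
)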


On 20. Jul 2013, at 7:05 PM, Andreas Owen wrote:

> path was set text wasn't, but it doesn't make a difference. my importer says 
> 1 row fetched, 0 docs processed, 0 docs skipped. i don't understand how it 
> can have 2 docs indexed with such a output.
> 
> 
> On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:
> 
>> Are the "path" and "text" fields set to "stored" in the schema.xml?
>> 
>> 
>> On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen  wrote:
>> 
>>> they are in my schema, path is typed correctly the others are default
>>> fields which already exist. all the other fields are populated and i can
>>> search for them, just path and text aren't.
>>> 
>>> 
>>> On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:
>>> 
>>>> Dumb question: they are in your schema? Spelled right, in the right
>>>> section, using types also defined? Can you populate them by hand with a
>>> CSV
>>>> file and post.jar?
>>>> 
>>>> Regards,
>>>> Alex.
>>>> 
>>>> Personal website: http://www.outerthoughts.com/
>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>> - Time is the quality of nature that keeps events from happening all at
>>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>> 
>>>> 
>>>> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen  wrote:
>>>> 
>>>>> i'm using solr 4.3 which i just downloaded today and am using only jars
>>>>> that came with it. i have enabled the dataimporter and it runs without
>>>>> error. but the field "path" (included in schema.xml) and "text" (file
>>>>> content) aren't indexed. what am i doing wrong?
>>>>> 
>>>>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>>>>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>>>>> pdf-doc-path: C:\web\development\tkb\internet\public
>>>>> 
>>>>> 
>>>>> data-config.xml:
>>>>> 
>>>>> 
>>>>>  
>>>>>  
>>>>>  http://127.0.0.1/tkb/internet/"; name="main"/>
>>>>> 
>>>>>  >>>> url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> 
>>>>>  
>>>>>  
>>>>>  
>>>>>  
>>>>> 
>>>>>  
>>>>> 
>>>>>  >>>> url="../../../../../web/development/tkb/internet/public/${rec.path}/${
>>>>> rec.id}"
>>>>> 
>>>>> dataSource="data" >
>>>>>  
>>>>> 
>>>>>  
>>>>>  
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> docImportUrl.xml:
>>>>> 
>>>>> 
>>>>> 
>>>>>  
>>>>>  Peter Z.
>>>>>  Beratungsseminar kundenbrief
>>>>>  wie kommuniziert man
>>>>> 
>>>>> 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf
>>>>>  download/online
>>>>>  
>>>>>  
>>>>>  Marcel X.
>>>>>  kuchen backen
>>>>>  torten, kuchen, geb‰ck ...
>>>>>  Kundenbrief.pdf
>>>>>  download/online
>>>>>  
>>>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Regards,
>> Shalin Shekhar Mangar.



Re: dataimporter, custom fields and parsing error

2013-07-20 Thread Andreas Owen
"path" was set to stored, "text" wasn't, but it doesn't make a difference. My 
importer says 1 row fetched, 0 docs processed, 0 docs skipped. I don't understand 
how it can have 2 docs indexed with such an output.
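(Just to spell it out: indexed="true" is what makes a field searchable, stored="true" is what makes its value come back in results, so both fields would be declared along these lines in schema.xml and the data re-imported afterwards; the field types here are assumptions:

    <field name="path" type="string" indexed="true" stored="true"/>
    <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
)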


On 20. Jul 2013, at 12:47 PM, Shalin Shekhar Mangar wrote:

> Are the "path" and "text" fields set to "stored" in the schema.xml?
> 
> 
> On Sat, Jul 20, 2013 at 3:37 PM, Andreas Owen  wrote:
> 
>> they are in my schema, path is typed correctly the others are default
>> fields which already exist. all the other fields are populated and i can
>> search for them, just path and text aren't.
>> 
>> 
>> On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:
>> 
>>> Dumb question: they are in your schema? Spelled right, in the right
>>> section, using types also defined? Can you populate them by hand with a
>> CSV
>>> file and post.jar?
>>> 
>>> Regards,
>>>  Alex.
>>> 
>>> Personal website: http://www.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at
>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>> 
>>> 
>>> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen  wrote:
>>> 
>>>> i'm using solr 4.3 which i just downloaded today and am using only jars
>>>> that came with it. i have enabled the dataimporter and it runs without
>>>> error. but the field "path" (included in schema.xml) and "text" (file
>>>> content) aren't indexed. what am i doing wrong?
>>>> 
>>>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>>>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>>>> pdf-doc-path: C:\web\development\tkb\internet\public
>>>> 
>>>> 
>>>> data-config.xml:
>>>> 
>>>> 
>>>>   
>>>>   
>>>>   http://127.0.0.1/tkb/internet/"; name="main"/>
>>>> 
>>>>   >>> url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> 
>>>>   
>>>>   
>>>>   
>>>>   
>>>> 
>>>>   
>>>> 
>>>>   >>> url="../../../../../web/development/tkb/internet/public/${rec.path}/${
>>>> rec.id}"
>>>> 
>>>> dataSource="data" >
>>>>   
>>>> 
>>>>   
>>>>   
>>>> 
>>>> 
>>>> 
>>>> 
>>>> docImportUrl.xml:
>>>> 
>>>> 
>>>> 
>>>>   
>>>>   Peter Z.
>>>>   Beratungsseminar kundenbrief
>>>>   wie kommuniziert man
>>>> 
>>>> 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf
>>>>   download/online
>>>>   
>>>>   
>>>>   Marcel X.
>>>>   kuchen backen
>>>>   torten, kuchen, geb‰ck ...
>>>>   Kundenbrief.pdf
>>>>   download/online
>>>>   
>>>> 
>> 
>> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



Re: dataimporter, custom fields and parsing error

2013-07-20 Thread Andreas Owen
They are in my schema; "path" is typed correctly and the others are default fields 
which already exist. All the other fields are populated and I can search for 
them; just "path" and "text" aren't.


On 19. Jul 2013, at 6:16 PM, Alexandre Rafalovitch wrote:

> Dumb question: they are in your schema? Spelled right, in the right
> section, using types also defined? Can you populate them by hand with a CSV
> file and post.jar?
> 
> Regards,
>   Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> 
> 
> On Fri, Jul 19, 2013 at 12:09 PM, Andreas Owen  wrote:
> 
>> i'm using solr 4.3 which i just downloaded today and am using only jars
>> that came with it. i have enabled the dataimporter and it runs without
>> error. but the field "path" (included in schema.xml) and "text" (file
>> content) aren't indexed. what am i doing wrong?
>> 
>> solr-path: C:\ColdFusion10\cfusion\jetty-new
>> collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
>> pdf-doc-path: C:\web\development\tkb\internet\public
>> 
>> 
>> data-config.xml:
>> 
>> 
>>
>>
>>http://127.0.0.1/tkb/internet/"; name="main"/>
>> 
>>> url="docImportUrl.xml" forEach="/albums/album" dataSource="main"> 
>>
>>
>>
>>
>> 
>>
>> 
>>> url="../../../../../web/development/tkb/internet/public/${rec.path}/${
>> rec.id}"
>> 
>> dataSource="data" >
>>
>> 
>>
>>
>> 
>> 
>> 
>> 
>> docImportUrl.xml:
>> 
>> 
>> 
>>
>>Peter Z.
>>Beratungsseminar kundenbrief
>>wie kommuniziert man
>> 
>> 0226520141_e-banking_Checkliste_CLX.Sentinel.pdf
>>download/online
>>
>>
>>Marcel X.
>>kuchen backen
>>torten, kuchen, geb‰ck ...
>>Kundenbrief.pdf
>>download/online
>>
>> 



dataimporter, custom fields and parsing error

2013-07-19 Thread Andreas Owen
I'm using Solr 4.3, which I just downloaded today, and am using only jars that 
came with it. I have enabled the dataimporter and it runs without error, but 
the fields "path" (included in schema.xml) and "text" (file content) aren't 
indexed. What am I doing wrong?

solr-path: C:\ColdFusion10\cfusion\jetty-new
collection-path: C:\ColdFusion10\cfusion\jetty-new\solr\collection1
pdf-doc-path: C:\web\development\tkb\internet\public


data-config.xml:




http://127.0.0.1/tkb/internet/"; name="main"/>

 
















docImportUrl.xml:




Peter Z.
Beratungsseminar kundenbrief
wie kommuniziert man
0226520141_e-banking_Checkliste_CLX.Sentinel.pdf
download/online


Marcel X.
kuchen backen
torten, kuchen, gebäck ...
Kundenbrief.pdf
download/online



Re: solr autodetectparser tikaconfig dataimporter error

2013-07-18 Thread Andreas Owen
I have now changed some things and the import runs without error. In schema.xml 
I haven't got the field "text" but "contentsExact". Unfortunately the text (from 
the file) isn't indexed, even though I mapped it to the proper field. What am I 
doing wrong?

data-config.xml:




http://127.0.0.1/tkb/internet/"; name="main"/>

 

















I noticed that when I move the field "author" into the Tika entity it isn't 
indexed. Can this have something to do with why the text from the file isn't 
indexed? Do I have to do something special about the entity levels in the 
data-config?

PS: how do I import tsstamp? It's a static value.
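(A static value can be injected with the TemplateTransformer, and Tika's extracted "text" column can be mapped onto a differently named schema field with column/name. A sketch under those assumptions; the url and the tsstamp value are placeholders:

    <entity name="tika" processor="TikaEntityProcessor" url="${rec.path}${rec.file}"
            dataSource="data" onError="skip" transformer="TemplateTransformer">
        <!-- tsstamp: a literal template, i.e. a static value (placeholder shown) -->
        <field column="tsstamp" template="2013-07-18T00:00:00Z"/>
        <!-- map the Tika column "text" onto the schema field contentsExact -->
        <field column="text" name="contentsExact"/>
    </entity>
)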




On 14. Jul 2013, at 10:30 PM, Jack Krupansky wrote:

> "Caused by: java.lang.NoSuchMethodError:"
> 
> That means you have some out of date jars or some newer jars mixed in with 
> the old ones.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Andreas Owen
> Sent: Sunday, July 14, 2013 3:07 PM
> To: solr-user@lucene.apache.org
> Subject: Re: solr autodetectparser tikaconfig dataimporter error
> 
> hi
> 
> is there nowone with a idea what this error is or even give me a pointer 
> where to look? If not is there a alternitave way to import documents from a 
> xml-file with meta-data and the filename to parse?
> 
> thanks for any help.
> 
> 
> On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:
> 
>> i am using solr 3.5, tika-app-1.4 and tagcloud 1.2.1. when i try to =
>> import a
>> file via xml i get this error, it doesn't matter what file format i try =
>> to index txt, cfm, pdf all the same error:
>> 
>> SEVERE: Exception while processing: rec document :
>> SolrInputDocument[{id=3Did(1.0)=3D{myTest.txt},
>> title=3Dtitle(1.0)=3D{Beratungsseminar kundenbrief}, =
>> contents=3Dcontents(1.0)=3D{wie
>> kommuniziert man}, author=3Dauthor(1.0)=3D{Peter Z.},
>> =
>> path=3Dpath(1.0)=3D{download/online}}]:org.apache.solr.handler.dataimport.=
>> DataImportHandlerException:
>> java.lang.NoSuchMethodError:
>> =
>> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
>> TikaConfig;)V
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
>> a:669)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
>> a:622)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:2=
>> 68)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)=
>> 
>> at
>> =
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.=
>> java:359)
>> at
>> =
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:4=
>> 27)
>> at
>> =
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:40=
>> 8)
>> Caused by: java.lang.NoSuchMethodError:
>> =
>> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
>> TikaConfig;)V
>> at
>> =
>> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityP=
>> rocessor.java:122)
>> at
>> =
>> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityPr=
>> ocessorWrapper.java:238)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
>> a:596)
>> ... 6 more
>> 
>> Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
>> SEVERE: Full Import
>> failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
>> java.lang.NoSuchMethodError:
>> =
>> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
>> TikaConfig;)V
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
>> a:669)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
>> a:622)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:2=
>> 68)
>> at
>> =
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)=
>> 
>> at
>> =
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.=
>> java:359)
>> at
>> =
&

Re: solr autodetectparser tikaconfig dataimporter error

2013-07-14 Thread Andreas Owen
Hi,

Is there no one with an idea of what this error is, or who can at least give me a 
pointer where to look? If not, is there an alternative way to import documents 
from an XML file with metadata and the filename to parse?

Thanks for any help.


On 12. Jul 2013, at 10:38 PM, Andreas Owen wrote:

> i am using solr 3.5, tika-app-1.4 and tagcloud 1.2.1. when i try to =
> import a
> file via xml i get this error, it doesn't matter what file format i try =
> to index txt, cfm, pdf all the same error:
> 
> SEVERE: Exception while processing: rec document :
> SolrInputDocument[{id=3Did(1.0)=3D{myTest.txt},
> title=3Dtitle(1.0)=3D{Beratungsseminar kundenbrief}, =
> contents=3Dcontents(1.0)=3D{wie
> kommuniziert man}, author=3Dauthor(1.0)=3D{Peter Z.},
> =
> path=3Dpath(1.0)=3D{download/online}}]:org.apache.solr.handler.dataimport.=
> DataImportHandlerException:
> java.lang.NoSuchMethodError:
> =
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
> TikaConfig;)V
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:669)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:622)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:2=
> 68)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)=
> 
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.=
> java:359)
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:4=
> 27)
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:40=
> 8)
> Caused by: java.lang.NoSuchMethodError:
> =
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
> TikaConfig;)V
>   at
> =
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityP=
> rocessor.java:122)
>   at
> =
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityPr=
> ocessorWrapper.java:238)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:596)
>   ... 6 more
> 
> Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
> SEVERE: Full Import
> failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.NoSuchMethodError:
> =
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
> TikaConfig;)V
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:669)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:622)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:2=
> 68)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)=
> 
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.=
> java:359)
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:4=
> 27)
>   at
> =
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:40=
> 8)
> Caused by: java.lang.NoSuchMethodError:
> =
> org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/=
> TikaConfig;)V
>   at
> =
> org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityP=
> rocessor.java:122)
>   at
> =
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityPr=
> ocessorWrapper.java:238)
>   at
> =
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.jav=
> a:596)
>   ... 6 more
> 
> Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 =
> rollback
> 
> data-config.xml:
> 
>   
>baseUrl=3D"http://127.0.0.1/tkb/internet/";
> name=3D"main"/>
> 
>url=3D"docImport.xml"
> forEach=3D"/albums/album" dataSource=3D"main">=20
>   
>   
>   
>   
>   
>   =09
>   =09
>   =09
>=
> url=3D"file:///C:\web\development\tkb\internet\public\download\online\${re=
> c.id}"
> dataSource=3D"data" onerror=3D"skip">
>
>   
>   
> 
> 
> 
> the lib are included and declared in the logs, i have also tried =
> tika-app
> 1.0 and tagsoup 1.2 with the same result. can someone please help, i =
> don't
> know where to start looking for the error.



solr autodetectparser tikaconfig dataimporter error

2013-07-12 Thread Andreas Owen
I am using Solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a 
file via XML I get this error; it doesn't matter what file format I try 
to index (txt, cfm, pdf), it's always the same error:

SEVERE: Exception while processing: rec document :
SolrInputDocument[{id=id(1.0)={myTest.txt}, title=title(1.0)={Beratungsseminar kundenbrief}, contents=contents(1.0)={wie kommuniziert man}, author=author(1.0)={Peter Z.}, path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError: org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml:


http://127.0.0.1/tkb/internet/";
name="main"/>







 





The libs are included and declared in the logs; I have also tried tika-app 
1.0 and tagsoup 1.2 with the same result. Can someone please help? I don't 
know where to start looking for the error.
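(A NoSuchMethodError of this kind normally means the Tika jars on the classpath are not the ones Solr's DataImportHandler extras were built against; Solr 3.5 ships its own, older Tika, so dropping in tika-app-1.4 is the likely culprit, which also matches Jack's reply about mixed-in jars. The usual cure is to load only the jars that came with the release, e.g. via solrconfig.xml; the dir paths depend on the install layout, so treat this as a sketch:

    <lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
    <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar"/>
)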