Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
I was asking about the field definitions from the schema.

It would also be helpful to see the debug info from the query. Just add
debug=true to see how the query and params were executed by solr and how
the calculation was done for each result.
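For example, a request like the following (the collection name and handler are
placeholders, not your actual setup) returns a "debug" section with the parsed
query and an "explain" entry per document:

  http://localhost:8983/solr/mycollection/select?q=Manufacturing&defType=edismax&debug=true

plus whatever qf/pf/bq parameters you are already sending.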

On Thu, Oct 26, 2017 at 1:33 PM ruby  wrote:

> Ok. Shouldn't pf be applied on top of bq? That way, among the object_types
> boosted, if one has "Manufacturing" then it should be listed first?
>
> following are my objects:
>
>
>
> 1
> Configuration
> typeA
> Manufacturing
> (catch-all field where the contents of all fields get copied to)
>
> 2
> Manufacturing
> typeA
> xyz
> (catch-all field where the contents of all fields get copied to)
>
>
> I'm hoping to get id=2 first and then id=1, but I'm not seeing that. Is my
> understanding of qf= not correct?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
What's the analysis configuration for the object_name field and fieldType?
Perhaps the query is matching your catch-all field, but not the object_name
field, and therefore the pf boost never happens.




On Thu, Oct 26, 2017 at 8:55 AM ruby  wrote:

> I'm noticing in my following query bq= is taking precedence over pf.
>
> q=Manufacturing
> qf=Catch_all_Copy_field
> pf=object_id^40+object_name^700
> bq=object_rating:(best)^10
> bq=object_rating:(candidate)^8
> bq=object_rating:(placeholder)^5
> bq=object_type_:(typeA)^10
> bq=object_type_:(typeB)^10
> bq=object_type_:(typeC)^10
>
> My intention is to show all objects of typeA having "Manufacturing" in the
> name first.
>
> But I'm seeing all typeA, typeB, typeC objects listed first, even though
> their name is not "Manufacturing".
>
> Is my query correct, and is my understanding of the pf and bq parameters correct?
>
> Thanks
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: no search results for specific search in solr 6.6.0

2017-09-19 Thread Josh Lincoln
Can you provide the fieldType definition for text_fr?

Also, when you use the Analysis page in the admin UI, what tokens are
generated during indexing for FRaoo using the text_fr fieldType?

On Tue, Sep 19, 2017 at 12:01 PM Sascha Tuschinski 
wrote:

> Hello Community,
>
> We are using a Solr Core with Solr 6.6.0 on Windows 10 (latest updates)
> with field names defined like "f_1179014266_txt". The number in the middle
> of the name differs for each field we use. For language specific fields we
> are adding an language specific extension e.g. "f_1179014267_txt_fr",
> "f_1179014268_txt_de", "f_1179014269_txt_en" and so on.
> We are having the following odd issue with the French "_fr" field only:
>
> Field: f_1197829835_txt_fr
> Dynamic Field: *_txt_fr
> Type: text_fr
>
>   *   The saved value, which was added to the Solr index without any
> problem, is "FRaoo".
>   *   When searching within the Solr query tool for
> "f_1197829839_txt_fr:*FRao*" it returns the items matching the term, as seen
> below - OK.
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:*FRao*",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"129",
> "f_1197829834_txt_en":"EnAir",
> "f_1197829822_txt_de":"Lufti",
> "f_1197829835_txt_fr":"FRaoi",
> "f_1197829836_txt_it":"ITAir",
> "f_1197829799_txt":["Lufti"],
> "f_1197829838_txt_en":"EnAir",
> "f_1197829839_txt_fr":"FRaoo",
> "f_1197829840_txt_it":"ITAir",
> "_version_":1578520424165146624}]
>   }}
>
>   *   When searching for "f_1197829839_txt_fr:*FRaoo*" NO item is found -
> Wrong!
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:*FRaoo*",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":0,"start":0,"docs":[]
>   }}
> When searching for "f_1197829839_txt_fr:FRaoo" (no wildcards) the matching
> items are found - OK
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:FRaoo",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"129",
> "f_1197829834_txt_en":"EnAir",
> "f_1197829822_txt_de":"Lufti",
> "f_1197829835_txt_fr":"FRaoi",
> "f_1197829836_txt_it":"ITAir",
> "f_1197829799_txt":["Lufti"],
> "f_1197829838_txt_en":"EnAir",
> "f_1197829839_txt_fr":"FRaoo",
> "f_1197829840_txt_it":"ITAir",
> "_version_":1578520424165146624}]
>   }}
> If we save exactly the same value into a different language field, e.g. one
> ending in "_en" such as "f_1197829834_txt_en", then the search
> "f_1197829834_txt_en:*FRaoo*" finds all items correctly!
> We have no idea what's wrong here; we even recreated the index and can
> reproduce this problem every time. I can only see that the value starts
> with "FR" and the field extension ends with "fr", but this is not a problem
> for "en", "de" and so on. All fields are used in the same way and have the
> same field properties.
> Any help or ideas are highly appreciated. I filed a bug for this,
> https://issues.apache.org/jira/browse/SOLR-11367, but was asked to
> post my question here. Thanks for reading.
> Greetings,
> ___
> Sascha Tuschinski
> Manager Quality Assurance // Canto GmbH
> Phone: +49 (0) 30 390 485 - 41
> E-mail: stuschin...@canto.com
> Web: canto.com
>
> Canto GmbH
> Lietzenburger Str. 46
> 10789 Berlin
> Phone: +49 (0)30 390485-0
> Fax: +49 (0)30 390485-55
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Jack McGannon, Thomas Mockenhaupt
>
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
The closest thing to an execution plan that I know of is debug=true. That'll
show timings of some of the components.
I also find it useful to add echoParams=all when troubleshooting. That'll
show every param Solr is using for the request, including params set in
solrconfig.xml and not passed in the request. This can help explain the
debug output (e.g. what query parser is being used, whether fields are being
expanded through field aliases, etc.).
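For example, a troubleshooting request might look like this (core name and
query are placeholders):

  http://localhost:8983/solr/mycore/select?q=test&debug=true&echoParams=all

The responseHeader then lists every parameter that was applied, including
defaults from solrconfig.xml, and the debug section shows the parsed query
and per-component timings.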

On Thu, Aug 31, 2017 at 1:35 PM suresh pendap 
wrote:

> Hello everybody,
>
> We are seeing that the below query is running very slow and taking almost 4
> seconds to finish
>
>
> [] webapp=/solr path=/select
>
> params={df=_text_=false=id=4=0=true=modified_dtm+desc=http://
> :8983/solr/flat_product_index_shard7_replica1/%7Chttp://:8983/solr/flat_product_index_shard7_replica2/%7Chttp://:8983/solr/flat_product_index_shard7_replica0/=11=2=product_identifier_type:DOTCOM_OFFER+AND+abstract_or_primary_product_id:*+AND+(gtin:)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N=1504196301534=true=25000=javabin}
> hits=0 status=0 QTime=3663
>
>
> It seems like the abstract_or_primary_product_id:* clause is contributing
> to the overall response time. It seems that the
> abstract_or_primary_product_id:* clause is not adding any value to the
> query criteria and can be safely removed. Is my understanding correct?
>
> I would like to know whether the order of the clauses in the AND query
> affects the response time of the query.
>
> For e.g. f1:3 AND f2:10 AND f3:* vs. f3:* AND f1:3 AND f2:10
>
> Doesn't Lucene/Solr pick up the optimal query execution plan?
>
> Is there anyway to look at the query execution plan generated by Lucene?
>
> Regards
> Suresh
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
As I understand it, using a different fq for each clause makes the
resultant caches more likely to be used in future requests.

For the query
fq=first:bob AND last:smith
a subsequent query for
fq=first:tim AND last:smith
won't be able to use the fq cache from the first query.

However, if the first query was
fq=first:bob
fq=last:smith
and subsequently
fq=first:tim
fq=last:smith
then the second query will at least benefit from the last:smith cache

Because fq clauses are always ANDed, this does not work for ORed clauses.

I suppose if some conditions are frequently used together it may be better
to put them in the same fq so there's only one cache entry, e.g. if an ecommerce
site regularly queried for featured:Y AND instock:Y.
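E.g. the combined form uses a single filterCache entry:

  fq=featured:Y AND instock:Y

while the separate form creates one cache entry per clause:

  fq=featured:Y
  fq=instock:Y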

On Thu, Aug 31, 2017 at 1:48 PM David Hastings <hastings.recurs...@gmail.com>
wrote:

> >
> > 2) Because all your clauses are more like filters and are ANDed together,
> > you'll likely get better performance by putting them _each_ in an fq
> > E.g.
> > fq=product_identifier_type:DOTCOM_OFFER
> > fq=abstract_or_primary_product_id:[* TO *]
>
>
> why is this the case?  is it just better to have no logic operators in the
> filter queries?
>
>
>
> On Thu, Aug 31, 2017 at 1:47 PM, Josh Lincoln <josh.linc...@gmail.com>
> wrote:
>
> > Suresh,
> > Two things I noticed.
> > 1) If your intent is to only match records where there's something,
> > anything, in abstract_or_primary_product_id, you should use fieldname:[*
> > TO
> > *]  but that will exclude records where that field is empty/missing. If
> you
> > want to match records even if that field is empty/missing, then you
> should
> > remove that clause entirely
> > 2) Because all your clauses are more like filters and are ANDed together,
> > you'll likely get better performance by putting them _each_ in an fq
> > E.g.
> > fq=product_identifier_type:DOTCOM_OFFER
> > fq=abstract_or_primary_product_id:[* TO *]
> > fq=gtin:
> > fq=product_class_type:BUNDLE
> > fq=hasProduct:N
> >
> >
> > On Thu, Aug 31, 2017 at 1:35 PM suresh pendap <sureshfors...@gmail.com>
> > wrote:
> >
> > > Hello everybody,
> > >
> > > We are seeing that the below query is running very slow and taking
> > almost 4
> > > seconds to finish
> > >
> > >
> > > [] webapp=/solr path=/select
> > >
> > > params={df=_text_=false=id=4&
> > start=0=true=modified_dtm+desc=http://
> > > :8983/solr/flat_product_index_shard7_replica1/
> > %7Chttp://:8983/solr/flat_product_index_shard7_
> > replica2/%7Chttp://:8983/solr/flat_product_index_
> >
> shard7_replica0/=11=2=product_identifier_type:DOTCOM_OFFER+
> > AND+abstract_or_primary_product_id:*+AND+(gtin:<
> > numericValue>)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N=
> > 1504196301534=true=25000=javabin}
> > > hits=0 status=0 QTime=3663
> > >
> > >
> > > It seems like the abstract_or_primary_product_id:* clause is
> > contributing
> > > to the overall response time. It seems that the
> > > abstract_or_primary_product_id:* . clause is not adding any value in
> the
> > > query criteria and can be safely removed.  Is my understanding correct?
> > >
> > > I would like to know if the order of the clauses in the AND query would
> > > affect the response time of the query?
> > >
> > > For e.g . f1: 3 AND f2:10 AND f3:* vs . f3:* AND f1:3 AND f2:10
> > >
> > > Doesn't Lucene/Solr pick up the optimal query execution plan?
> > >
> > > Is there anyway to look at the query execution plan generated by
> Lucene?
> > >
> > > Regards
> > > Suresh
> > >
> >
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
Suresh,
Two things I noticed.
1) If your intent is to only match records where there's something,
anything, in abstract_or_primary_product_id, you should use fieldname:[* TO
*]  but that will exclude records where that field is empty/missing. If you
want to match records even if that field is empty/missing, then you should
remove that clause entirely
2) Because all your clauses are more like filters and are ANDed together,
you'll likely get better performance by putting them _each_ in an fq
E.g.
fq=product_identifier_type:DOTCOM_OFFER
fq=abstract_or_primary_product_id:[* TO *]
fq=gtin:
fq=product_class_type:BUNDLE
fq=hasProduct:N


On Thu, Aug 31, 2017 at 1:35 PM suresh pendap 
wrote:

> Hello everybody,
>
> We are seeing that the below query is running very slow and taking almost 4
> seconds to finish
>
>
> [] webapp=/solr path=/select
>
> params={df=_text_=false=id=4=0=true=modified_dtm+desc=http://
> :8983/solr/flat_product_index_shard7_replica1/%7Chttp://:8983/solr/flat_product_index_shard7_replica2/%7Chttp://:8983/solr/flat_product_index_shard7_replica0/=11=2=product_identifier_type:DOTCOM_OFFER+AND+abstract_or_primary_product_id:*+AND+(gtin:)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N=1504196301534=true=25000=javabin}
> hits=0 status=0 QTime=3663
>
>
> It seems like the abstract_or_primary_product_id:* clause is contributing
> to the overall response time. It seems that the
> abstract_or_primary_product_id:* . clause is not adding any value in the
> query criteria and can be safely removed.  Is my understanding correct?
>
> I would like to know if the order of the clauses in the AND query would
> affect the response time of the query?
>
> For e.g . f1: 3 AND f2:10 AND f3:* vs . f3:* AND f1:3 AND f2:10
>
> Doesn't Lucene/Solr pick up the optimal query execution plan?
>
> Is there anyway to look at the query execution plan generated by Lucene?
>
> Regards
> Suresh
>


Re: Search by similarity?

2017-08-29 Thread Josh Lincoln
I reviewed the dismax docs and it doesn't support the fieldname:term
portion of the Lucene syntax.
To restrict a search to a field and use mm you can either
A) use edismax exactly as you're currently trying to use dismax
B) use dismax, with the following changes
* remove the title: portion of the query and just pass
q="title-123123123-end"
* set qf=title
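For example, option B as request parameters would look roughly like this
(handler path and URL encoding omitted):

  q="title-123123123-end"&defType=dismax&qf=title

so the whole quoted string is matched against the title field only, without
needing the fieldname:term syntax.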

On Tue, Aug 29, 2017 at 10:25 AM Josh Lincoln <josh.linc...@gmail.com>
wrote:

> Darko,
> Can you use edismax instead?
>
> When using dismax, solr is parsing the title field as if it's a query
> term. E.g. the query seems to be interpreted as
> title "title-123123123-end"
> (note the lack of a colon)...which results in querying all your qf fields
> for both "title" and "title-123123123-end"
> I haven't used dismax in a very long time, so I don't know if this is
> intentional, but it's not what I expected.
>
> I'm able to reproduce the issue in 6.4.2 using the default techproducts
> Notice that in the below the parsedquery expands to both text:title and
> text:name (df=text)
> http://localhost:8983/solr/techproducts/select?indent=on=title
> :"name"=json=true=dismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0))
> DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
> parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"
>
> But it's not an issue if you use edismax
> http://localhost:8983/solr/techproducts/select?indent=on=title
> :"name"=json=true=edismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+title:name)/no_coord",
> parsedquery_toString: "+title:name",
>
>
>
> On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric <todo...@mdpi.com> wrote:
>
>> Hi Erick,
>>
>> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
>> "querystring":"title:\"title-123123123-end\"",
>> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
>> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
>> (authors:title)^4.0 | (doi:title:)^1.0))
>> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
>> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
>> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
>> (authors:\"title 123123123 end\"~1)^4.0 |
>> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
>> "parsedquery_toString":"+author_full:title)^7.0 |
>> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
>> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
>> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
>> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
>> (authors:\"title 123123123 end\"~1)^4.0 |
>> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
>> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
>> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
>> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
>> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
>> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
>> parameter b\n 186.49593 = avgFieldLength\n 28.45 = fieldLength\n
>> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
>> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
>> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
>> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
>> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
>> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
>> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
>> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
>> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
>> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
>> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
>> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
>> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
>> 1.4457239E-5 = idf(docFreq=34584, doc

Re: Search by similarity?

2017-08-29 Thread Josh Lincoln
Darko,
Can you use edismax instead?

When using dismax, solr is parsing the title field as if it's a query term.
E.g. the query seems to be interpreted as
title "title-123123123-end"
(note the lack of a colon)...which results in querying all your qf fields
for both "title" and "title-123123123-end"
I haven't used dismax in a very long time, so I don't know if this is
intentional, but it's not what I expected.

I'm able to reproduce the issue in 6.4.2 using the default techproducts
Notice that in the below the parsedquery expands to both text:title and
text:name (df=text)
http://localhost:8983/solr/techproducts/select?indent=on=title
:"name"=json=true=dismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0))
DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"

But it's not an issue if you use edismax
http://localhost:8983/solr/techproducts/select?indent=on=title
:"name"=json=true=edismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+title:name)/no_coord",
parsedquery_toString: "+title:name",



On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric  wrote:

> Hi Erick,
>
> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
> "querystring":"title:\"title-123123123-end\"",
> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0))
> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
> "parsedquery_toString":"+author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 186.49593 = avgFieldLength\n 28.45 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max
> of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n
> 15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 =
> score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n
> 15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = 

Re: Solr Search Problem with Multiple Data-Import Handler

2017-06-22 Thread Josh Lincoln
I suspect Erik's right that clean=true is the problem. That's the default
in the DIH interface.


I find that when using DIH, it's best to set preImportDeleteQuery for every
entity. This safely scopes the clean variable to just that entity.
It doesn't look like the docs have examples of using preImportDeleteQuery,
so I put one here:
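A minimal sketch of what that looks like (the entity, table, and field names
below are made up for illustration):

  <entity name="products"
          query="SELECT id, name, 'products' AS source FROM products"
          preImportDeleteQuery="source:products">
  </entity>

With this, a full-import with clean=true for the products entity only deletes
documents matching source:products instead of wiping the whole index.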




On Wed, Jun 21, 2017 at 7:48 PM Erick Erickson 
wrote:

> First place I'd look is whether the jobs have clean=true set. If so the
> first thing DIH does is delete all documents.
>
> Best,
> Erick
>
> On Wed, Jun 21, 2017 at 3:52 PM, Pandey Brahmdev 
> wrote:
>
> > Hi,
> > I have setup Apache Solr 6.6.0 on Windows 10, 64-bit.
> >
> > I have created a simple core & configured DataImport Handlers.
> > I have configured 2 dataImport handlers in the Solr-config.xml file.
> >
> > First for to connect to DB & have data from DB Tables.
> > And Second for to have data from all pdf files using TikaEntityProcessor.
> >
> > Now the problem is that there is no error in the console or anywhere, but
> > whenever I search using the "Query" tab it only gives me the results of the
> > last data import.
> >
> > So let's say I last imported data for the tables; then it gives me results
> > from the tables, and if I imported the PDF files then it searches inside
> > the PDF files.
> >
> > But when I then want to search the DB table values again, it doesn't
> > give me the results; instead I need to run the data import for the file
> > DataImportHandler again, and vice-versa.
> >
> > Can you please help me out here?
> > Very sorry if I am doing anything wrong as I have started using Apache
> Solr
> > only 2 days back.
> >
> > Thanks & Regards,
> > Brahmdev Pandey
> > +46 767086309
> >
>


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Josh Lincoln
I had the same issue as Vrinda and found a hacky way to limit the number of
times deltaImportQuery was executed.

As designed, solr executes *deltaQuery* to get a list of ids that need to
be indexed. For each of those it executes *deltaImportQuery*, which is
typically very similar to the full *query*.

I constructed a deltaQuery that purposely returns only 1 row. E.g.

 deltaQuery = "SELECT id FROM table WHERE rownum=1"  // written for
Oracle, likely requires a different syntax for other dbs. Also, it occurred
to me that you could probably include the date >= '${dataimporter.last_index_time}'
filter here so this returns 0 rows if no data has changed.

Since *deltaImportQuery* now only gets called once, I needed to add the
filter logic to *deltaImportQuery* so it only selects the changed rows (that
logic is normally in *deltaQuery*). E.g.

deltaImportQuery = [normal import query] WHERE date >=
'${dataimporter.last_index_time}'


This significantly reduced the number of database queries for delta
imports, and sped up the processing.
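Put together, the entity definition ends up looking roughly like this (Oracle
syntax; the table and column names are placeholders):

  <entity name="item"
          query="SELECT * FROM item"
          deltaQuery="SELECT id FROM item
                      WHERE last_modified >= '${dataimporter.last_index_time}'
                      AND ROWNUM = 1"
          deltaImportQuery="SELECT * FROM item
                            WHERE last_modified >= '${dataimporter.last_index_time}'"/>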

On Thu, Jun 1, 2017 at 6:07 AM Amrit Sarkar  wrote:

> Erick,
>
> Thanks for the pointer. Getting astray from what Vrinda is looking for
> (sorry about that), what if there are no sub-entities? and no
> deltaImportQuery passed too. I looked into the code and determine it
> calculates the deltaImportQuery itself,
> SQLEntityProcessor:getDeltaImportQuery(..)::126.
>
> Ideally then, a full-import or the delta-import should take similar time to
> build the docs (fetch next row). I may very well be going entirely wrong
> here.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Thu, Jun 1, 2017 at 1:50 PM, vrindavda  wrote:
>
> > Thanks Erick,
> >
> > But how do I solve this? I tried creating a stored proc instead of a plain
> > query, but there was no change in performance.
> >
> > For delta import it is processing more documents than the total number of
> > documents.
> > In this case delta import is not helping at all; I cannot switch to full
> > import each time. This was working fine with less data.
> >
> > Thank you,
> > Vrinda Davda
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-
> > Import-tp4338162p4338444.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: Search for ISBN-like identifiers

2017-01-05 Thread Josh Lincoln
Sebastian,
You may want to try adding autoGeneratePhraseQueries="true" to the
fieldtype.
With that setting, a query for 978-3-8052-5094-8 will behave just like "978
3 8052 5094 8" (with the quotes)

A few notes about autoGeneratePhraseQueries:
a) it used to be set to true by default, but that was changed several years
ago
b) it does NOT require a reindex, so it is very easy to test
c) it is apparently not recommended for non-whitespace-delimited languages (CJK,
etc), but maybe that's not an issue in your use case.
d) I'm unsure how it'll impact wildcard queries on that field. E.g. will
978-3-8052* match 978-3-8052-5094-8? At the very least, partial ISBNs (e.g.
978-3-8052) would match the full ISBN without needing to use the wildcard. I'm
just not sure what happens if the user includes the wildcard.
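For reference, the attribute goes on the fieldType element itself; a minimal
sketch based on a stock text_general definition (not necessarily your exact
analyzer chain):

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>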

Josh

On Thu, Jan 5, 2017 at 1:41 PM Sebastian Riemer <s.rie...@littera.eu> wrote:

> Thank you very much for taking the time to help me!
>
> I'll definitely have a look at the link you've posted.
>
> @ShawnHeisey Thanks too for shedding light on the wildcard behaviour!
>
> Allow me one further question:
> - Assuming that I define a separate field for storing the ISBNs, using the
> awesome analyzer provided by Mr. Bill Dueber: how do I get that field
> copied into my general text field, which is used by my QuickSearch input?
> Won't that field be processed again by the analyzer defined on the text
> field?
> - Should I alternatively add more fields to the q parameter? For now, I
> always set q=text: but I guess one
> could try something like
> q=text:+isbnspeciallookupfield:
>
> I don't really know about that last idea though, since the searches are
> probably OR-combined, which is not what I'd like to have.
>
> A third option would be to pre-process the decision of where to look in Solr
> in my application, of course. I.e. anything matching a regex
> containing only numbers and hyphens with length 13 -> don't query the field
> text, instead use the field isbnspeciallookupfield.
>
>
> Many thanks again, and have a nice day!
> Sebastian
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
> Sent: Thursday, January 5, 2017 19:10
> To: solr-user@lucene.apache.org
> Subject: Re: Search for ISBN-like identifiers
>
> Sebastian -
>
> There’s some precedent out there for ISBN’s.  Bill Dueber and the
> UMICH/code4lib folks have done amazing work, check it out here -
>
> https://github.com/mlibrary/umich_solr_library_filters <
> https://github.com/mlibrary/umich_solr_library_filters>
>
>   - Erik
>
>
> > On Jan 5, 2017, at 5:08 AM, Sebastian Riemer <s.rie...@littera.eu>
> wrote:
> >
> > Hi folks,
> >
> >
> > TL;DR: Is there an easy way to copy ISBNs with hyphens to the general
> > text field, or to configure the analyzer on that field, so that a search
> > for the hyphenated ISBN returns exactly the matching document?
> >
> > Long version:
> > I've defined a field "text" of type "text_general", where I copy all
> > my other fields to, to be able to do a "quick search" where I set
> > q=text
> >
> > The definition of the type text_general is like this:
> >
> >
> >
> > <fieldType name="text_general" class="solr.TextField"
> >   positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" />
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >             words="stopwords.txt" />
> >     <filter class="solr.SynonymFilterFactory"
> >             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > I now face the problem that searching for a book with
> > text:978-3-8052-5094-8* does not return the single result I expect.
> > However, searching for text:9783805250948* does return a result.
> > Note that I am adding a wildcard at the end automatically, to further
> > broaden the result set. Note also that it does not seem to matter
> > whether I put backslashes in front of the hyphen or not (to be exact,
> > when sending via SolrJ from my application, I put in the backslashes,
> > but I don't see a difference when using SolrAdmin as I guess SolrAdmin
> > automatically inserts backslashes if needed?)
> >
> > When storing ISBNs, I do store them twice, once with hyphens
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search
> on both those values also returns the single document.
> >
> > I learned that the StandardToken

Boosting Question & Parser Selection

2015-11-27 Thread Josh Collins
All,

I have a few questions related to boosting and whether my use case makes sense 
for Dismax vs. the standard parser.

I have created a gist of my field definitions and current query structure here: 
https://gist.github.com/joshdcollins/0e3f24dd23c3fc6ac8e3

With the given configuration I am attempting to:

  *   Support partial and exact matches by indexing fields twice — once with 
ngram and once without
  *   Boost exact matches higher than partial matches
  *   Boost matches in the entity_name (and entity_name_exact) field higher 
than content and content_exact fields
  *   Boost matches with an entity_type of ‘company’ and ‘insight’ higher than 
other result types

1)  Does the field definition and query approach make sense given the above 
objectives?

2)  I have an additional use case to support a query syntax where terms wrapped 
in single quotes must be exact matches.  Example “hello ‘wor'”  would NOT match 
a document containing hello and world.

a) Using the dismax parser can you explicitly determine which terms will be 
checked against which fields?
In this case I would search “hello” against my general fields and “wor" against 
the _exact fields.

b) Does this level of structured query better lend itself to using the standard 
query parser?

3)  Does anyone have any experience or resources troubleshooting the fast 
vector highlighter?  It is working correctly in most cases, but some search 
terms (sized lower than the boundaryScanner.maxScan) return no content in the 
highlighter results; the highlighting section of the response simply contains 
empty entries for those documents.

In some other cases the highlighter will highlight a term once in a result, 
but not in another occurrence.

Appreciate any insight anyone can provide!

jc


Re: text search problem

2014-07-23 Thread Josh Lincoln
Ravi, for the hyphen issue, try setting autoGeneratePhraseQueries=true for
that fieldType (no re-index needed). As of 1.4, this defaults to false. One
word of caution: autoGeneratePhraseQueries may not work as expected for
languages that aren't whitespace delimited. As Erick mentioned, the
Analysis page will help you verify that your content and your queries are
handled the way you expect them to be.

See this thread for more info on autoGeneratePhraseQueries
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3c439f69a3-f292-482b-a102-7c011c576...@gmail.com%3E
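To illustrate the difference, a rough sketch of the parsed queries (assuming
your WordDelimiterFilter splits ABC-123 into two lowercased tokens; the real
parse may include catenated terms as well):

  autoGeneratePhraseQueries="false":  Text:ABC-123  ->  (Text:abc Text:123)
  autoGeneratePhraseQueries="true":   Text:ABC-123  ->  Text:"abc 123"

With the phrase form, both parts must appear next to each other, which is much
closer to an exact match on ABC-123.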


On Mon, Jul 21, 2014 at 8:42 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Try escaping the hyphen as \-. Or enclosing it all
 in quotes.

 But you _really_ have to spend some time with the debug option
 an admin/analysis page or you will find endless surprises.

 Best,
 Erick


 On Mon, Jul 21, 2014 at 11:12 AM, EXTERNAL Taminidi Ravi (ETI,
 Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:

 
  Thanks for the reply Erick, I will try as you suggested. There I have
   another question related to this lines.
 
  When I have - in my description or name, then the search results are
  different. For e.g.
 
  ABC-123: it looks for ABC or 123. I want to treat this search as an exact
  match, i.e. if my document has ABC-123 then I should get the results.
 
  When I check with hl=on, it has <em>ABC</em> and I get the results. How can
  I avoid this situation.
 
  Thanks
 
  Ravi
 
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Saturday, July 19, 2014 4:40 PM
  To: solr-user@lucene.apache.org
  Subject: Re: text search problem
 
  Try adding debug=all to the query and see what the parsed form of the
  query is. Likely you're
  1) using phrase queries, so broadway hotel requires both words in the
  text,
  or
  2) if you're not using phrases, you're searching for the AND of the two
  terms.
 
  But debug=all will show you.
 
  Plus, take a look at the admin/analysis page, your tokenization may not
 be
  what you expect.
 
  Best,
  Erick
 
 
  On Fri, Jul 18, 2014 at 2:00 PM, EXTERNAL Taminidi Ravi (ETI,
  Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com
 wrote:
 
   Hi, below is the text_general field type. When I search Text:Broadway
   it is not returning all the records, it returns only a few records.
   But when I search for Text:*Broadway*, it is getting more records.
   When I do a multiple-word search like Broadway Hotel, it may
   not get Broadway, Hotel or Broadway Hotel. Do you have any
   thoughts on how to handle this type of keyword search?
  
   Text:Broadway,Vehicle Detailing,Water Systems,Vehicle Detailing,Car
   Wash Water Recovery
  
   My Field type look like this.
  
    <fieldType name="text_general" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" />
        <filter class="solr.KStemFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" splitOnCaseChange="0"
                splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1"
                catenateNumbers="1" catenateAll="1" preserveOriginal="0"/>

        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->

      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.KStemFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" splitOnCaseChange="0"
                splitOnNumerics="0" stemEnglishPossessive="0" catenateWords="1"
                catenateNumbers="1" catenateAll="1" preserveOriginal="0"/>
      </analyzer>
    </fieldType>
  
  
  
   Do you have any thought the behavior or how to get this?
  
   Thanks
  
   Ravi
  
 



Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
It's a Windows installation using a Bitnami Solr installer. I incorrectly
put 64M into the configuration for this, as I had copied the test
configuration I was using to recreate the permgen issue we were seeing on
our production system (which is configured to 512M), since it takes a while
to recreate the issue with larger permgen values. In the test scenario
there was a small, static 180-document data core with 8 dynamic user
cores that are used to index the unique document ids in the user's view,
which are then merged into a single user core. The final user core contains
the same number of document ids as the data core, and the data core is
queried against with the ids in the final merged user core as the limiter.
The user cores are then unloaded and deleted from the drive, and then the
test is rerun with the user cores re-created.

We are also using core discovery mode to store/find our cores, and the
database data core is using dynamic fields with a mix of single-value and
multi-value fields. The user cores use a static configuration. The data is
indexed from SQL Server using jTDS for both the user and data cores. As a
note, we also reversed the test case I mention above, keeping the user
cores static and dynamically creating the database core, and this created the
same issue, only it leaked faster. We assume this is because that configuration
is larger/loads more classes than the simpler user core.

When I get the time I'm going to put together a SolrJ test app to recreate
the issue outside of our environment, to see if others see the same issue
we're seeing and to rule out any kind of configuration problem. Right now we're
interacting with Solr from POCO via the RESTful interface and it's not very
easy for us to spin this off into something someone else could use. In the
meantime we've made changes to make the user cores more static; this has
slowed down the build-up of permgen to something that can be managed by a
weekly reset.

Sorry about the confusion in my initial email, and I appreciate the
response. If there is anything about my configuration that you think might be
useful, just let me know and I can provide it. We have a workaround, but it
really hampers what our long-term goals were for our Solr implementation.
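A minimal sketch of what that SolrJ test loop might look like (the base URL,
core names, and instance directory are placeholders, and the indexing/query
steps are omitted):

  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;

  public class CoreChurnTest {
      public static void main(String[] args) throws Exception {
          // CoreAdmin requests go against the Solr root URL, not a core URL
          HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
          try {
              for (int i = 0; i < 1000; i++) {
                  String core = "usercore_" + i;
                  // create a user core from a pre-built instance directory
                  CoreAdminRequest.createCore(core, "usercore_template", admin);
                  // ... index the user's ids here, then run the limited query ...
                  // unload the core and delete its index, as we do at logout
                  CoreAdminRequest.unloadCore(core, true, admin);
              }
          } finally {
              admin.shutdown();
          }
      }
  }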

Thanks
Josh


On Mon, Mar 3, 2014 at 9:57 AM, Greg Walters greg.walt...@answers.com wrote:

 Josh,

 You've mentioned a couple of times that you've got PermGen set to 512M but
 then you say you're running with -XX:MaxPermSize=64M. These two statements
 are contradictory so are you *sure* that you're running with 512M of
 PermGen? Assuming your on a *nix box can you provide `ps` output proving
 this?

 Thanks,
 Greg

 On Feb 28, 2014, at 5:22 PM, Furkan KAMACI furkankam...@gmail.com wrote:

  Hi;
 
  You can also check here:
 
 http://stackoverflow.com/questions/3717937/cmspermgensweepingenabled-vs-cmsclassunloadingenabled
 
  Thanks;
  Furkan KAMACI
 
 
  2014-02-26 22:35 GMT+02:00 Josh jwda...@gmail.com:
 
  Thanks Timothy,
 
  I gave these a try and -XX:+CMSPermGenSweepingEnabled seemed to cause
 the
  error to happen more quickly. With this option on it didn't seemed to do
  any intermittent garbage collecting that delayed the issue in with it
 off.
  I was already using a max of 512MB, and I can reproduce it with it set
 this
  high or even higher. Right now because of how we have this implemented
 just
  increasing it to something high just delays the problem :/
 
  Anything else you could suggest I would really appreciate.
 
 
  On Wed, Feb 26, 2014 at 3:19 PM, Tim Potter tim.pot...@lucidworks.com
  wrote:
 
  Hi Josh,
 
  Try adding: -XX:+CMSPermGenSweepingEnabled as I think for some VM
  versions, permgen collection was disabled by default.
 
  Also, I use: -XX:MaxPermSize=512m -XX:PermSize=256m with Solr, so 64M
 may
  be too small.
 
 
  Timothy Potter
  Sr. Software Engineer, LucidWorks
  www.lucidworks.com
 
  
  From: Josh jwda...@gmail.com
  Sent: Wednesday, February 26, 2014 12:27 PM
  To: solr-user@lucene.apache.org
  Subject: Solr Permgen Exceptions when creating/removing cores
 
  We are using the Bitnami version of Solr 4.6.0-1 on a 64bit windows
  installation with 64bit Java 1.7U51 and we are seeing consistent issues
  with PermGen exceptions. We have the permgen configured to be 512MB.
  Bitnami ships with a 32bit version of Java for windows and we are
  replacing
  it with a 64bit version.
 
  Passed in Java Options:
 
  -XX:MaxPermSize=64M
  -Xms3072M
  -Xmx6144M
  -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+CMSClassUnloadingEnabled
  -XX:NewRatio=3
 
  -XX:MaxTenuringThreshold=8
 
  This is our use case:
 
  We have what we call a database core which remains fairly static and
  contains the imported contents of a table from SQL server. We then have
  user cores which contain the record ids of results from a text search
  outside of Solr. We then query for the data we want from

Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
In the user core there are two fields; the database core in question had
40, but in production environments the database core is dynamic. My time
has been pretty crazy trying to get this out the door and we haven't tried
a standard Solr install yet, but it's on my plate for the test app, and I
don't know enough about Solr/Bitnami to know if they've done any serious
modifications to it.

I had tried doing a dump from VisualVM previously but it didn't seem to
give me anything useful but then again I didn't know how to look for
interned strings. This is something I can take another look at in the
coming weeks when I do my test case against a standard solr install with
SolrJ. The exception with user cores happens after 80'ish runs, so 640'ish
user cores with the PermGen set to 64MB. The database core test was far
lower, it was in the 10-15 range. As a note once the permgen limit is hit,
if we simply restart the service with the same number of cores loaded the
permgen usage is minimal even with the amount of user cores being high in
our production environment (500-600).

If this does end up being the interning of strings, is there any way it can
be mitigated? Our production environment for our heavier users would see in
the range of 3200+ user cores created a day.

Thanks for the help.
Josh


On Mon, Mar 3, 2014 at 11:24 AM, Tri Cao tm...@me.com wrote:

 Hey Josh,

 I am not an expert in Java performance, but I would start with dumping
 the heap
 and investigating it with visualvm (the free tool that comes with the JDK).

 In my experience, the most common cause for PermGen exception is the app
 creates
 too many interned strings. Solr (actually Lucene) interns the field names
 so if you have
 too many fields, it might be the cause. How many fields in total across
 cores did you
 create before the exception?

 Can you reproduce the problem with the standard Solr? Is the bitnami
 distribution just
 Solr or do they have some other libraries?

 Hope this helps,
 Tri

 On Mar 03, 2014, at 07:28 AM, Josh jwda...@gmail.com wrote:

 It's a windows installation using a bitnami solr installer. I incorrectly
 put 64M into the configuration for this, as I had copied the test
 configuration I was using to recreate the permgen issue we were seeing on
 our production system (that is configured to 512M) as it takes awhile with
 to recreate the issue with larger permgen values. In the test scenario
 there was a small 180 document data core that's static with 8 dynamic user
 cores that are used to index the unique document ids in the users view,
 which is then merged into a single user core. The final user core contains
 the same number of document ids as the data core and the data core is
 queried against with the ids in the final merged user core as the limiter.
 The user cores are then unloaded, and deleted from the drive and then the
 test is reran again with the user cores re-created

 We are also using the core discovery mode to store/find our cores and the
 database data core is using dynamic fields with a mix of single value and
 multi value fields. The user cores use a static configuration. The data is
 indexed from SQL Server using jtDS for both the user and data cores. As a
 note we also reversed the test case I mention above where we keep the user
 cores static and dynamically create the database core and this created the
 same issue only it leaked faster. We assumed this because the configuration
 was larger/loaded more classes then the simpler user core.

 When I get the time I'm going to put together a SolrJ test app to recreate
 the issue outside of our environment to see if others see the same issue
 we're seeing to rule out any kind of configuration problem. Right now we're
 interacting with solr with POCO via the restful interface and it's not very
 easy for us to spin this off into something someone else could use. In the
 mean time we've made changes to make the user cores more static, this has
 slowed down the build up of permgen to something that can be managed by a
 weekly reset.

 Sorry about the confusion in my initial email and I appreciate the
 response. Anything about my configuration that you can think might be
 useful just let me know and I can provide it. We have a work around, but it
 really hampers what our long term goals were for our Solr implementation.

 Thanks
 Josh


 On Mon, Mar 3, 2014 at 9:57 AM, Greg Walters greg.walt...@answers.com
 wrote:

 Josh,

 You've mentioned a couple of times that you've got PermGen set to 512M but

 then you say you're running with -XX:MaxPermSize=64M. These two statements

 are contradictory so are you *sure* that you're running with 512M of

 PermGen? Assuming your on a *nix box can you provide `ps` output proving

 this?

 Thanks,

 Greg

 On Feb 28, 2014, at 5:22 PM, Furkan KAMACI furkankam...@gmail.com wrote:

  Hi;

 

  You can also check here:

 


 http://stackoverflow.com/questions/3717937/cmspermgensweepingenabled-vs-cmsclassunloadingenabled

 

  Thanks

Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
Thanks Tri,

I really appreciate the response. When I get some free time shortly I'll
start giving some of these a try and report back.


On Mon, Mar 3, 2014 at 12:42 PM, Tri Cao tm...@me.com wrote:

 If it's really the interned strings, you could try upgrading the JDK, as the
 newer HotSpot
 JVM puts interned strings in the regular heap:

 http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html
 (search for String.intern() in that release)

 I haven't got a chance to look into the new core auto discovery code, so I
 don't know
 if it's implemented with reflection or not. Reflection and dynamic class
 loading is another
 source of PermGen exception, in my experience.

 I don't see anything wrong with your JVM config, which is very much
 standard.

 Hope this helps,
 Tri


 On Mar 03, 2014, at 08:52 AM, Josh jwda...@gmail.com wrote:

 In the user core there are two fields, the database core in question was
 40, but in production environments the database core is dynamic. My time
 has been pretty crazy trying to get this out the door and we haven't tried
 a standard solr install yet but it's on my plate for the test app and I
 don't know enough about Solr/Bitnami to know if they've done any serious
 modifications to it.

 I had tried doing a dump from VisualVM previously but it didn't seem to
 give me anything useful but then again I didn't know how to look for
 interned strings. This is something I can take another look at in the
 coming weeks when I do my test case against a standard solr install with
 SolrJ. The exception with user cores happens after 80'ish runs, so 640'ish
 user cores with the PermGen set to 64MB. The database core test was far
 lower, it was in the 10-15 range. As a note once the permgen limit is hit,
 if we simply restart the service with the same number of cores loaded the
 permgen usage is minimal even with the amount of user cores being high in
 our production environment (500-600).

 If this does end up being the interning of strings, is there anyway it can
 be mitigated? Our production environment for our heavier users would see in
 the range of 3200+ user cores created a day.

 Thanks for the help.
 Josh


 On Mon, Mar 3, 2014 at 11:24 AM, Tri Cao tm...@me.com wrote:

 Hey Josh,

 I am not an expert in Java performance, but I would start with dumping a

 the heap

 and investigate with visualvm (the free tool that comes with JDK).

 In my experience, the most common cause for PermGen exception is the app

 creates

 too many interned strings. Solr (actually Lucene) interns the field names

 so if you have

 too many fields, it might be the cause. How many fields in total across

 cores did you

 create before the exception?

 Can you reproduce the problem with the standard Solr? Is the bitnami

 distribution just

 Solr or do they have some other libraries?

 Hope this helps,

 Tri

 On Mar 03, 2014, at 07:28 AM, Josh jwda...@gmail.com wrote:

 It's a windows installation using a bitnami solr installer. I incorrectly

 put 64M into the configuration for this, as I had copied the test

 configuration I was using to recreate the permgen issue we were seeing on

 our production system (that is configured to 512M) as it takes awhile with

 to recreate the issue with larger permgen values. In the test scenario

 there was a small 180 document data core that's static with 8 dynamic user

 cores that are used to index the unique document ids in the users view,

 which is then merged into a single user core. The final user core contains

 the same number of document ids as the data core and the data core is

 queried against with the ids in the final merged user core as the limiter.

 The user cores are then unloaded, and deleted from the drive and then the

 test is reran again with the user cores re-created

 We are also using the core discovery mode to store/find our cores and the

 database data core is using dynamic fields with a mix of single value and

 multi value fields. The user cores use a static configuration. The data is

 indexed from SQL Server using jtDS for both the user and data cores. As a

 note we also reversed the test case I mention above where we keep the user

 cores static and dynamically create the database core and this created the

 same issue only it leaked faster. We assumed this because the configuration

 was larger/loaded more classes then the simpler user core.

 When I get the time I'm going to put together a SolrJ test app to recreate

 the issue outside of our environment to see if others see the same issue

 we're seeing to rule out any kind of configuration problem. Right now we're

 interacting with solr with POCO via the restful interface and it's not very

 easy for us to spin this off into something someone else could use. In the

 mean time we've made changes to make the user cores more static, this has

 slowed down the build up of permgen to something that can

Re: network slows when solr is running - help

2014-02-28 Thread Josh
Is it indexing data from over the network? (High data throughput would
increase latency.) Is it a virtual machine? (Other machines could be causing
slowdowns.) Another possibility is that the network card is offloading processing
onto the CPU, which introduces latency when the CPU is under load.


On Fri, Feb 28, 2014 at 4:11 PM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:

 Hi guys,

 Got an odd thing going on right now.  Indexing into my master server (solr
 3.6.1) has slowed and it is because when solr runs ping shows latency.
  When I stop solr though, ping returns to normal.  This has been happening
 occasionally, rebooting didn't help.  This is the first time I noticed that
 stopping solr returns ping speeds to normal.  I was thinking it was
 something with our network.   Solr is not consuming all resources on the
 box or anything like that, and normally everything works fine.  Has anyone
 seen this type of thing before?  Let me know if more info of any kind is
 needed.

 Solr process is at 8% memory utilization and 35% cpu utilization in 'top'
 command.

 Note: solr is the only thing running on the box.

 C:\Users\robertpe>ping 10.12.132.101  -- Indexing

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  -- Solr stopped

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  -- Solr started but no indexing
 activity

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
 Reply from 10.12.132.101: bytes=32 time<1ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 0ms, Maximum = 0ms, Average = 0ms

 C:\Users\robertpe>ping 10.12.132.101  -- Solr started and indexing started

 Pinging 10.12.132.101 with 32 bytes of data:
 Reply from 10.12.132.101: bytes=32 time=53ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=51ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=48ms TTL=64
 Reply from 10.12.132.101: bytes=32 time=51ms TTL=64

 Ping statistics for 10.12.132.101:
 Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
 Approximate round trip times in milli-seconds:
 Minimum = 48ms, Maximum = 53ms, Average = 50ms

 Robert (Robi) Petersen
 Senior Software Engineer
 Search Department






Solr Permgen Exceptions when creating/removing cores

2014-02-26 Thread Josh
We are using the Bitnami version of Solr 4.6.0-1 on a 64-bit Windows
installation with 64-bit Java 1.7u51 and we are seeing consistent issues
with PermGen exceptions. We have the permgen configured to be 512MB.
Bitnami ships with a 32-bit version of Java for Windows and we are replacing
it with a 64-bit version.

Passed in Java Options:

-XX:MaxPermSize=64M
-Xms3072M
-Xmx6144M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+CMSClassUnloadingEnabled
-XX:NewRatio=3

-XX:MaxTenuringThreshold=8

This is our use case:

We have what we call a database core which remains fairly static and
contains the imported contents of a table from SQL server. We then have
user cores which contain the record ids of results from a text search
outside of Solr. We then query for the data we want from the database core
and limit the results to the content of the user core. This allows us to
combine facet data from Solr with the search results from another engine.
We are creating the user cores on demand and removing them when the user
logs out.

Our issue is the constant creation and removal of user cores combined with
the constant importing seems to push us over our PermGen limit. The user
cores are removed at the end of every session and as a test I made an
application that would loop creating the user core, import a set of data to
it, query the database core using it as a limiter and then remove the user
core. My expectation was that in this scenario all the permgen associated
with those user cores would be freed upon their unload, allowing permgen to
reclaim that memory during a garbage collection. This was not the case; it
would constantly go up until the application would exhaust the memory.
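
In case it helps, the test application is essentially doing this (a rough
SolrJ 4.x sketch; the core names, URLs and instanceDir are placeholders,
not our real values):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;
import org.apache.solr.common.SolrInputDocument;

public class CoreChurnTest {
  public static void main(String[] args) throws Exception {
    HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
    HttpSolrServer dbCore = new HttpSolrServer("http://localhost:8983/solr/database");
    for (int i = 0; i < 100000; i++) {
      String coreName = "user_core_" + i;
      // create the per-user core from a shared template instanceDir
      CoreAdminRequest.createCore(coreName, "user_core_template", admin);

      // import a set of record ids into the user core
      HttpSolrServer userCore = new HttpSolrServer("http://localhost:8983/solr/" + coreName);
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "result-1");
      userCore.add(doc);
      userCore.commit();

      // query the database core, limited by the contents of the user core
      dbCore.query(new SolrQuery("{!join fromIndex=" + coreName + " from=id to=id}*:*"));

      // unload the user core at the end of the "session"
      CoreAdminRequest.unloadCore(coreName, admin);
      userCore.shutdown();
    }
  }
}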

I also investigated whether there was a connection left behind between the
two cores because I was joining them together in a query, but even
unloading the database core after unloading all the user cores won't
prevent the limit from being hit or allow any memory to be garbage
collected from Solr.

Is this a known issue with creating and unloading a large number of cores?
Could it be configuration based for the core? Is there something other than
unloading that needs to happen to free the references?

Thanks

Notes: I've tried using tools to determine if it's a leak within Solr such
as Plumbr and my activities turned up nothing.


Re: Solr Permgen Exceptions when creating/removing cores

2014-02-26 Thread Josh
Thanks Timothy,

I gave these a try and -XX:+CMSPermGenSweepingEnabled seemed to cause the
error to happen more quickly. With this option on it didn't seem to do the
intermittent garbage collection that delayed the issue when the option was
off. I was already using a max of 512MB, and I can reproduce it with it set
this high or even higher. Right now, because of how we have this
implemented, increasing it to something higher just delays the problem :/

Anything else you could suggest I would really appreciate.


On Wed, Feb 26, 2014 at 3:19 PM, Tim Potter tim.pot...@lucidworks.comwrote:

 Hi Josh,

 Try adding: -XX:+CMSPermGenSweepingEnabled as I think for some VM
 versions, permgen collection was disabled by default.

 Also, I use: -XX:MaxPermSize=512m -XX:PermSize=256m with Solr, so 64M may
 be too small.


 Timothy Potter
 Sr. Software Engineer, LucidWorks
 www.lucidworks.com

 
 From: Josh jwda...@gmail.com
 Sent: Wednesday, February 26, 2014 12:27 PM
 To: solr-user@lucene.apache.org
 Subject: Solr Permgen Exceptions when creating/removing cores

 We are using the Bitnami version of Solr 4.6.0-1 on a 64bit windows
 installation with 64bit Java 1.7U51 and we are seeing consistent issues
 with PermGen exceptions. We have the permgen configured to be 512MB.
 Bitnami ships with a 32bit version of Java for windows and we are replacing
 it with a 64bit version.

 Passed in Java Options:

 -XX:MaxPermSize=64M
 -Xms3072M
 -Xmx6144M
 -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC
 -XX:CMSInitiatingOccupancyFraction=75
 -XX:+CMSClassUnloadingEnabled
 -XX:NewRatio=3

 -XX:MaxTenuringThreshold=8

 This is our use case:

 We have what we call a database core which remains fairly static and
 contains the imported contents of a table from SQL server. We then have
 user cores which contain the record ids of results from a text search
 outside of Solr. We then query for the data we want from the database core
 and limit the results to the content of the user core. This allows us to
 combine facet data from Solr with the search results from another engine.
 We are creating the user cores on demand and removing them when the user
 logs out.

 Our issue is the constant creation and removal of user cores combined with
 the constant importing seems to push us over our PermGen limit. The user
 cores are removed at the end of every session and as a test I made an
 application that would loop creating the user core, import a set of data to
 it, query the database core using it as a limiter and then remove the user
 core. My expectation was that in this scenario all the permgen associated
 with those user cores would be freed upon their unload, allowing permgen to
 reclaim that memory during a garbage collection. This was not the case; it
 would constantly go up until the application would exhaust the memory.

 I also investigated whether there was a connection left behind between the
 two cores because I was joining them together in a query, but even
 unloading the database core after unloading all the user cores won't
 prevent the limit from being hit or allow any memory to be garbage
 collected from Solr.

 Is this a known issue with creating and unloading a large number of cores?
 Could it be configuration based for the core? Is there something other than
 unloading that needs to happen to free the references?

 Thanks

 Notes: I've tried using tools to determine if it's a leak within Solr such
 as Plumbr and my activities turned up nothing.



Re: how to best convert some term in q to a fq

2013-12-27 Thread Josh Lincoln
What if you add your country field to qf with a strong boost? The search
experience would be slightly different than if you filtered on country, but
maybe still good enough for your users and certainly simpler to implement
and maintain. You'd likely only want exact matches. Assuming you are using
edismax and a stopword file for your main query fields, you'll run into an
issue if you just index your country field as a string and there's a
stopword anywhere in your query...see SOLR-3085. To avoid this, yet still
boost on country only when there's an exact match, you could index the
country field as text using KeywordTokenizerFactory and the same stopword
file as your other fields.
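
For what it's worth, here's a sketch of the kind of fieldType/field I have
in mind (the names are just examples, and the stopwords file should be the
same one your qf fields use):

<fieldType name="text_keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="country_boost" type="text_keyword" indexed="true" stored="false"/>

and then something like qf=title^2 description country_boost^20 on the
edismax handler.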

Regardless of the approach you take, unless there's only a small list of
countries you care about, multi-word countries might be too big an issue to
ignore, especially when the name contains common words (e.g. United States,
South Korea, New Zealand). This may be a good candidate for named entity
recognition on the query, possibly leveraging OpenNLP. I once saw a
presentation on how LinkedIn uses NLP on the query to detect the types of
entities the user is looking for. Seems similar to what you're trying to
accomplish. Of course, if countries are the only thing you're interested in
then you may be able to get away with client code for simple substring
matching using a static list of countries.

 On Dec 23, 2013 3:08 PM, Joel Bernstein joels...@gmail.com wrote:

 I  would suggest handling this in the client. You could write custom Solr
 code also but it would be more complicated because you'd be working with
 Solr's API's.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Mon, Dec 23, 2013 at 2:36 PM, jmlucjav jmluc...@gmail.com wrote:

  Hi,
 
  I have this scenario that I think is no unusual: solr will get a user
  entered query string like 'apple pear france'.
 
  I need to do this: if any of the terms is a country, then change the
 query
  params to move that term to a fq, i.e:
  q=apple pear france
  to
  q=apple pearfq=country:france
 
  What do you guys would be the best way to implement this?
  - custom searchcomponent or queryparser
  - servlet in same jetty as solr
  - client code
 
  To simplify, consider countries are just a single term.
 
  Any pointer to an example to base this on would be great. thanks
 



Re: simple tokenizer question

2013-12-08 Thread Josh Lincoln
Have you tried adding autoGeneratePhraseQueries=true to the fieldType,
without changing the index analysis behavior?

This works at query time only, and will convert 12-34 to "12 34", as if the
user had entered the query as a phrase. This gives the expected behavior as
long as the tokenization is the same at index and query time.
This'll work for the 80-IA structure, and I think it'll also work for the
9(1)(vii) example (converting it to "9 1 vii"), but I haven't tested it.
Also, I would think the 12AA example should already be working as you
expect, unless maybe you're already using the WordDelimiterFilterFactory.
When I test the StandardTokenizer on 12AA it preserves the string,
resulting in just one token of 12aa.

autoGeneratePhraseQueries is at least worth a quick try - it doesn't
require reindexing.
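
For example, something like this on the fieldType is all it should take (a
sketch - keep whatever analyzer chain you already have):

<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With that in place, a query term like 80-IA is parsed as the phrase "80 ia"
rather than 80 OR ia.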

Two things to note
1) don't use autoGeneratePhraseQueries if you have CJK languages...probably
applies to any language that's not whitespace delimited. You mentioned
Indian, I presume Hindi, which I don't think will be an issue
2) In very rare cases you may have a few odd results if the
non-alphanumeric characters differ but generate the same phrase query. E.g.
9(1)(vii) would produce the same phrase as 9-1(vii), but this doesn't seem
worth considering until you know it's a problem.


On Sun, Dec 8, 2013 at 10:29 AM, Upayavira u...@odoko.co.uk wrote:

 If you want to just split on whitespace, then the WhitespaceTokenizer
 will do the job.

 However, this will mean that these two tokens aren't the same, and won't
 match each other:

 cat
 cat.

 A simple regex filter could handle those cases, remove a comma or dot
 when at the end of a word. Although there are other similar situations
 (quotes, colons, etc) that you may want to handle eventually.

 Upayavira

 On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
  Thanks for your email.
 
  Great, I will look at the WordDelimiterFactory. Just to make clear, I
  DON'T
  want any other tokenizing on digits, specialchars, punctuations etc done
  other than word delimiting on whitespace.
 
  All I want for my first version is NO removal of punctuations/special
  characters at indexing time and during search time i.e., input as-is and
  search as-is (like a simple sql db?) . I was assuming this would be a
  trivial case with SOLR and not sure what I am missing here.
 
  thanks
  Vulcanoid
 
 
 
  On Sun, Dec 8, 2013 at 4:33 AM, Upayavira u...@odoko.co.uk wrote:
 
   Have you tried a WhitespaceTokenizerFactory followed by the
   WordDelimiterFilterFactory? The latter is perhaps more configurable at
   what it does. Alternatively, you could use a RegexFilterFactory to
   remove extraneous punctuation that wasn't removed by the Whitespace
   Tokenizer.
  
   Upayavira
  
   On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
Hi,
   
I am new to solr and I guess this is a basic tokenizer question so
 please
bear with me.
   
I am trying to use SOLR to index a few (Indian) legal judgments in
 text
form and search against them. One of the key points with these
 documents
is
that the sections/provisions of law usually have punctuation/special
characters in them. For example search queries will TYPICALLY be
 section
12AA, section 80-IA, section 9(1)(vii) and the text of the judgments
themselves will contain these sort of text with section references
 all
over
the place.
   
Now, using a default schema setup with standardtokenizer, which
 seems to
delimit on whitespace AND punctuations, I get really bad results
 because
it
looks like 12AA is split and results such having 12 and AA in them
 turn
up.
 It becomes worse with 9(1)(vii) with results containing 9 and 1 etc
 being
turned up.
   
What is the best solution here? I really just want to index the
 document
as-is and also to do whitespace tokenizing on the search and nothing
more.
   
So in other words:
a) I would like the text document to be indexed as-is with say 12AA
 and
9(1)(vii) in the document stored as it is mentioned.
b) I would like to be able to search for 12AA and for 9(1)(vii) and
 get
proper full matches on them without any splitting up/munging etc.
   
Any suggestions are appreciated.  Thank you for your time.
   
Thanks
Vulcanoid
  



HdfsDirectory Implementation

2013-10-31 Thread Josh Clum
Hello,

I refactored out the HDFS directory implementation from Solr to use in my
own project and was surprised by how it performed. I'm using both the
HdfsDirectory class and the HdfsDirectoryFactory class.

On my local machine there was a significant speed up when using the cache.
The index was small enough that each file making up the Lucene index (12
docs) fit into one block inside the cache.

When running it on a multinode cluster on AWS, the performance pulling back
1031 docs with the cache was not that much better than without. According
to my log statements, the cache was being hit every time, but the
difference from my local machine was that there were several blocks per
file.

When setting up the cache I used the default settings as specified in
HdfsDirectoryFactory.

Any ideas on how to speed up searches? Should I change the block size? Is
there something that Blur does to put a wrapper around the cache?
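
For context, the timing numbers below come from a loop that is essentially
this (a simplified sketch - the Directory construction and the query are
placeholders, not the exact test code):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

public class SearchTimer {
  public static void time(Directory dir, int tries) throws Exception {
    // dir is the refactored HdfsDirectory (optionally wrapped by the block cache)
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    for (int i = 1; i <= tries; i++) {
      long start = System.currentTimeMillis();
      TopDocs docs = searcher.search(new MatchAllDocsQuery(), 2000);
      System.out.println("Try #" + i + " - Total execution time: "
          + (System.currentTimeMillis() - start) + " (" + docs.totalHits + " hits)");
    }
    reader.close();
  }
}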

ON A MULTI NODE CLUSTER
Number of documents in directory[1031]
Try #1 - Total execution time: 3776
Try #2 - Total execution time: 2995
Try #3 - Total execution time: 2683
Try #4 - Total execution time: 2301
Try #5 - Total execution time: 2174
Try #6 - Total execution time: 2253
Try #7 - Total execution time: 2184
Try #8 - Total execution time: 2087
Try #9 - Total execution time: 2157
Try #10 - Total execution time: 2089
Cached try #1 - Total execution time: 2065
Cached try #2 - Total execution time: 2298
Cached try #3 - Total execution time: 2398
Cached try #4 - Total execution time: 2421
Cached try #5 - Total execution time: 2080
Cached try #6 - Total execution time: 2060
Cached try #7 - Total execution time: 2285
Cached try #8 - Total execution time: 2048
Cached try #9 - Total execution time: 2087
Cached try #10 - Total execution time: 2106

ON MY LOCAL
Number of documents in directory[12]
Try #1 - Total execution time: 627
Try #2 - Total execution time: 620
Try #3 - Total execution time: 637
Try #4 - Total execution time: 535
Try #5 - Total execution time: 486
Try #6 - Total execution time: 527
Try #7 - Total execution time: 363
Try #8 - Total execution time: 430
Try #9 - Total execution time: 431
Try #10 - Total execution time: 337
Cached try #1 - Total execution time: 38
Cached try #2 - Total execution time: 38
Cached try #3 - Total execution time: 36
Cached try #4 - Total execution time: 35
Cached try #5 - Total execution time: 135
Cached try #6 - Total execution time: 31
Cached try #7 - Total execution time: 36
Cached try #8 - Total execution time: 30
Cached try #9 - Total execution time: 29
Cached try #10 - Total execution time: 28

Thanks,
Josh


Re: DIH - stream file with solrEntityProcessor

2013-10-15 Thread Josh Lincoln
Ultimately I just temporarily increased the memory to handle this data set,
but that won't always be practical.

I did try the csv export/import and it worked well in this case. I hadn't
considered it at first. I am wary that the escaping and splitting may be
problematic with some data sets, so I'll look into adding XMLResponseParser
support to XPathEntityProcessor (essentially an option to
useSolrResponseSchema), though I have a feeling only a few other people
would be interested in this.
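
For reference, the XPathEntityProcessor route today looks roughly like the
config below when pointed at a saved response (the field names are made
up), which is why a useSolrResponseSchema-style option would save declaring
every field by hand:

<dataConfig>
  <dataSource type="URLDataSource"/>  <!-- or FileDataSource for a local dump -->
  <document>
    <entity name="doc"
            processor="XPathEntityProcessor"
            url="http://somehost/dump/select.xml"
            forEach="/response/result/doc"
            stream="true">
      <field column="id"    xpath="/response/result/doc/str[@name='id']"/>
      <field column="title" xpath="/response/result/doc/str[@name='title']"/>
    </entity>
  </document>
</dataConfig>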

Thanks for the replies.


On Mon, Oct 14, 2013 at 11:19 PM, Lance Norskog goks...@gmail.com wrote:

 Can you do this data in CSV format? There is a CSV reader in the DIH.
 The SEP was not intended to read from files, since there are already
 better tools that do that.

 Lance


 On 10/14/2013 04:44 PM, Josh Lincoln wrote:

 Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
 the POST buffer being the issue. Thanks for suggesting I test this. The
 full file is over a gig.

 Lance, I'm actually pointing SEP at a static file (I simply named the file
 select and put it on a Web server). SEP thinks it's a large solr
 response, which it was, though now it's just static xml. Works well until
 I
 hit the memory limit of the new solr instance.

 I can't query the old solr from the new one b/c they're on two different
 networks. I can't copy the index files b/c I only want a subset of the
 data
 (identified with a query and dumped to xml...all fields of interest were
 stored). To further complicate things, the old solr is 1.4. I was hoping
 to
 use the result xml format to backup the old, and DIH SEP to import to the
 new dev solr4.x. It's promising as a simple and repeatable migration
 process, except that SEP fails on largish files.

 It seems my options are 1) use the xpathprocessor and identify each field
 (there are many fields); 2) write a small script to act as a proxy to the
 xml file and accept the row and start parameters from the SEP iterative
 calls and return just a subset of the docs; 3) a script to process the xml
 and push to solr, not using DIH; 4) consider XSLT to transform the result
 xml to an update message and use XPathEntityProcessor
 with useSolrAddSchema=true and streaming. The latter seems like the most
 elegant and reusable approach, though I'm not certain it'll work.

 It'd be great if solrEntityProcessor could stream static files, or if I
 could specify the solr result format while using the xpathentityprocessor
 (i.e. a useSolrResultSchema option)

 Any other ideas?






 On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog goks...@gmail.com wrote:

  On 10/13/2013 10:02 AM, Shawn Heisey wrote:

  On 10/13/2013 10:16 AM, Josh Lincoln wrote:

  I have a large solr response in xml format and would like to import it
 into
 a new solr collection. I'm able to use DIH with solrEntityProcessor,
 but
 only if I first truncate the file to a small subset of the records. I
 was
 hoping to set stream=true to handle the full file, but I still get an
 out
 of memory error, so I believe stream does not work with
 solrEntityProcessor
 (I know the docs only mention the stream option for the
 XPathEntityProcessor, but I was hoping solrEntityProcessor just might
 have
 the same capability).

 Before I open a jira to request stream support for solrEntityProcessor
 in
 DIH, is there an alternate approach for importing large files that are
 in
 the solr results format?
 Maybe a way to use xpath to get the values and a transformer to set the
 field names? I'm hoping to not have to declare the field names in
 dataConfig so I can reuse the process across data sets.

  How big is the XML file?  You might be running into a size limit for
 HTTP POST.

 In newer 4.x versions, Solr itself sets the size of the POST buffer
 regardless of what the container config has.  That size defaults to 2MB
 but is configurable using the formdataUploadLimitInKB setting that you
 can find in the example solrconfig.xml file, on the requestParsers tag.

 In Solr 3.x, if you used the included jetty, it had a configured HTTP
 POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
 included Jetty that prevented the configuration element from working, so
 the actual limit was Jetty's default of 200KB.  With other containers
 and these older versions, you would need to change your container
 configuration.

 https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130
 


 Thanks,
 Shawn

  The SEP calls out to another Solr and reads. Are you importing data from
 another Solr and cross-connecting it with your uploaded XML?

 If the memory errors are a problem with streaming, you could try piping
 your uploaded documents through a processor that supports streaming. This
 would then push one document at a time into your processor that calls out
 to Solr and combines records.

Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Josh Lincoln
Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
select and put it on a Web server). SEP thinks it's a large solr
response, which it was, though now it's just static xml. Works well until I
hit the memory limit of the new solr instance.

I can't query the old solr from the new one b/c they're on two different
networks. I can't copy the index files b/c I only want a subset of the data
(identified with a query and dumped to xml...all fields of interest were
stored). To further complicate things, the old solr is 1.4. I was hoping to
use the result xml format to backup the old, and DIH SEP to import to the
new dev solr4.x. It's promising as a simple and repeatable migration
process, except that SEP fails on largish files.

It seems my options are 1) use the xpathprocessor and identify each field
(there are many fields); 2) write a small script to act as a proxy to the
xml file and accept the row and start parameters from the SEP iterative
calls and return just a subset of the docs; 3) a script to process the xml
and push to solr, not using DIH; 4) consider XSLT to transform the result
xml to an update message and use XPathEntityProcessor
with useSolrAddSchema=true and streaming. The latter seems like the most
elegant and reusable approach, though I'm not certain it'll work.

It'd be great if solrEntityProcessor could stream static files, or if I
could specify the solr result format while using the xpathentityprocessor
(i.e. a useSolrResultSchema option)

Any other ideas?






On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog goks...@gmail.com wrote:

 On 10/13/2013 10:02 AM, Shawn Heisey wrote:

 On 10/13/2013 10:16 AM, Josh Lincoln wrote:

 I have a large solr response in xml format and would like to import it
 into
 a new solr collection. I'm able to use DIH with solrEntityProcessor, but
 only if I first truncate the file to a small subset of the records. I was
 hoping to set stream=true to handle the full file, but I still get an
 out
 of memory error, so I believe stream does not work with
 solrEntityProcessor
 (I know the docs only mention the stream option for the
 XPathEntityProcessor, but I was hoping solrEntityProcessor just might
 have
 the same capability).

 Before I open a jira to request stream support for solrEntityProcessor in
 DIH, is there an alternate approach for importing large files that are in
 the solr results format?
 Maybe a way to use xpath to get the values and a transformer to set the
 field names? I'm hoping to not have to declare the field names in
 dataConfig so I can reuse the process across data sets.

 How big is the XML file?  You might be running into a size limit for
 HTTP POST.

 In newer 4.x versions, Solr itself sets the size of the POST buffer
 regardless of what the container config has.  That size defaults to 2MB
 but is configurable using the formdataUploadLimitInKB setting that you
 can find in the example solrconfig.xml file, on the requestParsers tag.

 In Solr 3.x, if you used the included jetty, it had a configured HTTP
 POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
 included Jetty that prevented the configuration element from working, so
 the actual limit was Jetty's default of 200KB.  With other containers
 and these older versions, you would need to change your container
 configuration.

 https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130

 Thanks,
 Shawn

  The SEP calls out to another Solr and reads. Are you importing data from
 another Solr and cross-connecting it with your uploaded XML?

 If the memory errors are a problem with streaming, you could try piping
 your uploaded documents through a processor that supports streaming. This
 would then push one document at a time into your processor that calls out
 to Solr and combines records.




DIH - stream file with solrEntityProcessor

2013-10-13 Thread Josh Lincoln
I have a large solr response in xml format and would like to import it into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream=true to handle the full file, but I still get an out
of memory error, so I believe stream does not work with solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
the same capability).
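
For reference, the data-config I'm using is essentially this (the url and
rows here are placeholders; the url is pointed at wherever the saved
response is served from):

<dataConfig>
  <document>
    <entity name="sep"
            processor="SolrEntityProcessor"
            url="http://somehost:8983/solr/oldcollection"
            query="*:*"
            rows="500"
            fl="*"/>
  </document>
</dataConfig>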

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.

Anyone have ideas? thanks


Request to be added to ContributorsGroup

2013-06-06 Thread Josh Lincoln
Hello Wiki Admins,

I have been using Solr for a few years now and I would like to
contribute back by making minor changes and clarifications to the wiki
documentation.

Wiki User Name : JoshLincoln


Thanks


Re: Deleting an individual document while delta index is running

2012-11-07 Thread Josh Turmel
Here's what we have set in our data-config.xml 

 <dataSource name="jdbc" driver="org.postgresql.Driver"
 url="jdbc:postgresql://localhost:5432/reader" user="data"
 batchSize="1000" readOnly="true" autoCommit="false"
 transactionIsolation="TRANSACTION_READ_COMMITTED"
 holdability="CLOSE_CURSORS_AT_COMMIT"
 />


Thanks,
Josh Turmel


On Wednesday, November 7, 2012 at 1:00 PM, Shawn Heisey wrote:

 On 11/7/2012 10:55 AM, Otis Gospodnetic wrote:
  Hi Shawn,
  
  Is the last part really correct? Optimization should be doable while
  updates are going on... or am I missing something?
  
 
 
 From what I recall when I was first putting my build system together, 
 which I will admit was on Solr 1.4.0, I couldn't do updates/deletes 
 while optimizing was underway. I don't think 3.x was a whole lot 
 different in this respect. From the little I understand about the 
 significant changes in 4.0, it is probably now possible to do everything 
 at the same time with no worry.
 
 Because they are using 3.3, I don't think they have access to this 
 ability. Given the limited amount of information available, it seemed 
 the most likely explanation. I could be wrong, and if I am, they will 
 have to keep looking for an explanation.
 
 I'm definitely no expert, and I have not tried optimizing and updating 
 at the same time since upgrading to 3.x. My indexing system 
 deliberately avoids doing the two at the same time because it caused 
 problems on 1.4.x.
 
 I would certainly love to know for sure whether it's possible on 4.0, 
 because I am in the process of updating my entire test environment in 
 preparation for a production rollout. If I can do updates/commits at 
 the same time as optimizing, my code will get smaller and a lot simpler.
 
 Thanks,
 Shawn
 
 




Re: Deleting an individual document while delta index is running

2012-11-07 Thread Josh Turmel
Okay, thanks for the help guys... I *think* that this can be resolved by 
kicking off the delta and passing optimize=false since the default was true in 
3.3.
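
In other words, kicking it off with something like the following (host and
handler path are ours; commit/optimize are the standard DIH request params
as far as I can tell):

http://localhost:8983/solr/dataimport?command=delta-import&commit=true&optimize=false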

I'll post back if I see the issue pop back up.

JT


On Wednesday, November 7, 2012 at 1:34 PM, Josh Turmel wrote:

 Here's what we have set in our data-config.xml 
 
 <dataSource name="jdbc" driver="org.postgresql.Driver"
 url="jdbc:postgresql://localhost:5432/reader" user="data"
 batchSize="1000" readOnly="true" autoCommit="false"
 transactionIsolation="TRANSACTION_READ_COMMITTED"
 holdability="CLOSE_CURSORS_AT_COMMIT"
 />
 
 
 Thanks,
 Josh Turmel
 
 
 On Wednesday, November 7, 2012 at 1:00 PM, Shawn Heisey wrote:
 
  On 11/7/2012 10:55 AM, Otis Gospodnetic wrote:
   Hi Shawn,
   
   It the last part really correct? Optimization should be doable while
   updates are going on... or am I missing something?
   
  
  
  From what I recall when I was first putting my build system together, 
  which I will admit was on Solr 1.4.0, I couldn't do updates/deletes 
  while optimizing was underway. I don't think 3.x was a whole lot 
  different in this respect. From the little I understand about the 
  significant changes in 4.0, it is probably now possible to do everything 
  at the same time with no worry.
  
  Because they are using 3.3, I don't think they have access to this 
  ability. Given the limited amount of information available, it seemed 
  the most likely explanation. I could be wrong, and if I am, they will 
  have to keep looking for an explanation.
  
  I'm definitely no expert, and I have not tried optimizing and updating 
  at the same time since upgrading to 3.x. My indexing system 
  deliberately avoids doing the two at the same time because it caused 
  problems on 1.4.x.
  
  I would certainly love to know for sure whether it's possible on 4.0, 
  because I am in the process of updating my entire test environment in 
  preparation for a production rollout. If I can do updates/commits at 
  the same time as optimizing, my code will get smaller and a lot simpler.
  
  Thanks,
  Shawn
  
  
  
 
 



fl Parameter and Wildcards for Dynamic Fields

2012-07-04 Thread Josh Harness
I'm using SOLR 3.3 and would like to know how to return a list of dynamic
fields in my search results using a wildcard with the fl parameter. I found
SOLR-2444 https://issues.apache.org/jira/browse/SOLR-2444 but this
appears to be for SOLR 4.0. Am I correct in assuming this isn't doable yet?
Please note that I don't want to query the dynamic fields, I just need them
returned in the search results. Using fl=myDynamicField_* doesn't seem to
work.

Many Thanks!

Josh


DataImportHandler Streaming XML Parse

2011-11-08 Thread Josh Harness
All -

 We're using DIH to import flat xml files. We're getting Heap memory
exceptions due to the file size. Is there any way to force DIH to do a
streaming parse rather than a DOM parse? I really don't want to chunk my
files up or increase the heap size.
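
For context, our data-config is roughly the following (the file path and
xpaths are illustrative, not our real ones):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="records"
            processor="XPathEntityProcessor"
            url="/data/feeds/big-file.xml"
            forEach="/records/record">
      <field column="id"    xpath="/records/record/id"/>
      <field column="title" xpath="/records/record/title"/>
    </entity>
  </document>
</dataConfig>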

Many Thanks!

Josh


Re: Managing solr machines (start/stop/status)

2011-09-14 Thread josh lucas
On Sep 13, 2011, at 5:05 PM, Jamie Johnson wrote:

 I know this isn't a solr specific question but I was wondering what
 folks do in regards to managing the machines in their solr cluster?
 Are there any recommendations for how to start/stop/manage these
 machines?  Any suggestions would be appreciated.


One thing I use is csshx (http://code.google.com/p/csshx/) on my Mac when 
dealing with the various boxes in our cluster.  You can issue commands in one 
terminal and they are duplicated in all other windows.  Very useful for global 
stop/starts and updates.

Re: using a function query with OR and spaces?

2011-09-13 Thread josh lucas
On Sep 13, 2011, at 8:37 AM, Jason Toy wrote:

 I had queries breaking on me when there were spaces in the text I was
 searching for. Originally I had :
 
 fq=state_s:New York
 and that would break, I found a work around by using:
 
 fq={!raw f=state_s}New York
 
 
 My problem now is doing this with an OR query,  this is what I have now, but
 it doesn't work:
 
 
 fq=({!raw f=country_s}United States OR {!raw f=city_s}New York

Couldn't you do:

fq=(country_s:(United States) OR city_s:(New York))

I think that should work though you probably will need to surround the queries 
with quotes to get the exact phrase match.
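
That is, something along these lines (untested):

fq=(country_s:"United States" OR city_s:"New York")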

Suggestions for copying fields across cores...

2011-08-05 Thread josh lucas
Is there a suggested way to copy data in fields to additional fields that will 
only be in a different core?  Obviously I could index the data separately and I 
could build that into my current indexing process but I'm curious if there 
might be an easier, more automated way.

Thanks!


josh

mapping pdf metadata

2009-02-20 Thread Josh Joy
Hi,

I'm having trouble figuring out how to map the Tika metadata fields to my
own solr schema document fields. I guess the first hurdle I need to
overcome is: where can I find a list of the Tika PDF metadata fields that
are available for mapping?
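
To be concrete, the kind of mapping I mean is along these lines (the target
field names are made up), but I don't know which Tika metadata names are
valid on the fmap.* side:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.Author=author_s&fmap.content=text&commit=true" -F "myfile=@some.pdf"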

Thanks,
Josh


show first couple sentences from found doc

2009-02-20 Thread Josh Joy
Hi,

I would like to do something similar to Google, in that for my list of hits,
I would like to grab the surrounding text around my query term so I can
include that in my search results. What's the easiest way to do this?
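
I'm wondering if the highlighting parameters are the intended way, e.g.
something like the following (the field name is a guess), or if there is a
simpler approach:

q=searchterm&hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=150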

Thanks,
Josh