Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
I was asking about the field definitions from the schema.

It would also be helpful to see the debug info from the query. Just add
debug=true to see how the query and params were executed by solr and how
the calculation was done for each result.
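For example (the core name is a placeholder; the other params mirror the query
from the message below):

  /solr/yourcore/select?defType=edismax&q=Manufacturing&qf=Catch_all_Copy_field
      &pf=object_id^40+object_name^700&bq=object_type_:(typeA)^10&debug=true

The "explain" entries in the debug output show which clauses (qf, pf, bq)
actually contributed to each document's score.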

On Thu, Oct 26, 2017 at 1:33 PM ruby  wrote:

> OK. Shouldn't pf be applied on top of bq=? That way, among the object_types
> boosted, if one has "Manufacturing" then it should be listed first?
>
> following are my objects:
>
>
> 
> 1
> Configuration
> typeA
> Manufacturing
>  <--catch all field where contents of all fields get
> copied to
> 
>
> 
> 2
> Manufacturing
> typeA
> xyz
>  <--catch all field where contents of all fields get
> copied to
> 
>
> I'm hoping to get id=2 first and then id=1, but I'm not seeing that. Is my
> understanding of qf= not correct?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Edismax - bq taking precedence over pf

2017-10-26 Thread Josh Lincoln
What's the analysis configuration for the object_name field and fieldType?
Perhaps the query is matching your catch-all field, but not the object_name
field, and therefore the pf boost never happens.




On Thu, Oct 26, 2017 at 8:55 AM ruby  wrote:

> I'm noticing in my following query bq= is taking precedence over pf.
>
> &q=Manufacturing
> &qf=Catch_all_Copy_field
> &pf=object_id^40+object_name^700
> &bq=object_rating:(best)^10
> &bq=object_rating:(candidate)^8
> &bq=object_rating:(placeholder)^5
> &bq=object_type_:(typeA)^10
> &bq=object_type_:(typeB)^10
> &bq=object_type_:(typeC)^10
>
> My intention is to show all objects of typeA having "Manufacturing" in name
> first
>
> But I'm seeing all typeA, typeB, typeC objects being listed first,
> even though their name is not "Manufacturing".
>
> Is my query correct, and is my understanding of the pf and bq parameters correct?
>
> Thanks
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: no search results for specific search in solr 6.6.0

2017-09-19 Thread Josh Lincoln
Can you provide the fieldType definition for text_fr?

Also, when you use the Analysis page in the admin UI, what tokens are
generated during indexing for FRaoo using the text_fr fieldType?

On Tue, Sep 19, 2017 at 12:01 PM Sascha Tuschinski 
wrote:

> Hello Community,
>
> We are using a Solr Core with Solr 6.6.0 on Windows 10 (latest updates)
> with field names defined like "f_1179014266_txt". The number in the middle
> of the name differs for each field we use. For language specific fields we
> are adding an language specific extension e.g. "f_1179014267_txt_fr",
> "f_1179014268_txt_de", "f_1179014269_txt_en" and so on.
> We are having the following odd issue within the french "_fr" field only:
> Field: f_1197829835_txt_fr
> (http://localhost:8983/solr/#/test_core/schema?field=f_1197829835_txt_fr)
> Dynamic Field: *_txt_fr
> (http://localhost:8983/solr/#/test_core/schema?dynamic-field=*_txt_fr)
> Type: text_fr
>
>   *   The saved value which had been added with no problem to the Solr
> index is "FRaoo".
>   *   When searching within the Solr query tool for
> "f_1197829839_txt_fr:*FRao*" it returns the items matching the term as seen
> below - OK.
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:*FRao*",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"129",
> "f_1197829834_txt_en":"EnAir",
> "f_1197829822_txt_de":"Lufti",
> "f_1197829835_txt_fr":"FRaoi",
> "f_1197829836_txt_it":"ITAir",
> "f_1197829799_txt":["Lufti"],
> "f_1197829838_txt_en":"EnAir",
> "f_1197829839_txt_fr":"FRaoo",
> "f_1197829840_txt_it":"ITAir",
> "_version_":1578520424165146624}]
>   }}
>
>   *   When searching for "f_1197829839_txt_fr:*FRaoo*" NO item is found -
> Wrong!
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:*FRaoo*",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":0,"start":0,"docs":[]
>   }}
> When searching for "f_1197829839_txt_fr:FRaoo" (no wildcards) the matching
> items are found - OK
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":1,
> "params":{
>   "q":"f_1197829839_txt_fr:FRaoo",
>   "indent":"on",
>   "wt":"json",
>   "_":"1505808887827"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "id":"129",
> "f_1197829834_txt_en":"EnAir",
> "f_1197829822_txt_de":"Lufti",
> "f_1197829835_txt_fr":"FRaoi",
> "f_1197829836_txt_it":"ITAir",
> "f_1197829799_txt":["Lufti"],
> "f_1197829838_txt_en":"EnAir",
> "f_1197829839_txt_fr":"FRaoo",
> "f_1197829840_txt_it":"ITAir",
> "_version_":1578520424165146624}]
>   }}
> If we save exactly the same value into a different language field, e.g. one
> ending in "_en", meaning "f_1197829834_txt_en", then the search
> "f_1197829834_txt_en:*FRaoo*" finds all items correctly!
> We have no idea what's wrong here, and we even recreated the index and can
> reproduce this problem every time. I can only see that the value starts
> with "FR" and the field extension ends with "fr", but this is not a problem
> for "en", "de" and so on. All fields are used in the same way and have the
> same field properties.
> Any help or ideas are highly appreciated. I filed a bug for this
> https://issues.apache.org/jira/browse/SOLR-11367 but had been asked to
> publish my question here. Thanks for reading.
> Greetings,
> ___
> Sascha Tuschinski
> Manager Quality Assurance // Canto GmbH
> Phone: +49 (0)30 390485-41
> E-mail: stuschin...@canto.com
> Web: canto.com
>
> Canto GmbH
> Lietzenburger Str. 46
> 10789 Berlin
> Phone: +49 (0)30 390485-0
> Fax: +49 (0)30 390485-55
> Amtsgericht Berlin-Charlottenburg HRB 88566
> Geschäftsführer: Jack McGannon, Thomas Mockenhaupt
>
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
The closest thing to an execution plan that I know of is debug=true. That'll
show the timings of some of the components.
I also find it useful to add echoParams=all when troubleshooting. That'll
show every param solr is using for the request, including params set in
solrconfig.xml and not passed in the request. This can help explain the
debug output (e.g. what queryparser is being used, if fields are being
expanded through field aliases, etc.).
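For example (core name and query are placeholders):

  /solr/yourcore/select?q=foo&debug=true&echoParams=all

The responseHeader then lists every param in effect, including defaults,
appends and invariants from solrconfig.xml, and the debug section shows the
parsed query plus a score explanation for each hit.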

On Thu, Aug 31, 2017 at 1:35 PM suresh pendap 
wrote:

> Hello everybody,
>
> We are seeing that the below query is running very slow and taking almost 4
> seconds to finish
>
>
> [] webapp=/solr path=/select
>
> params={df=_text_&distrib=false&fl=id&shards.purpose=4&start=0&fsv=true&sort=modified_dtm+desc&shard.url=http://
> :8983/solr/flat_product_index_shard7_replica1/%7Chttp://:8983/solr/flat_product_index_shard7_replica2/%7Chttp://:8983/solr/flat_product_index_shard7_replica0/&rows=11&version=2&q=product_identifier_type:DOTCOM_OFFER+AND+abstract_or_primary_product_id:*+AND+(gtin:)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N&NOW=1504196301534&isShard=true&timeAllowed=25000&wt=javabin}
> hits=0 status=0 QTime=3663
>
>
> It seems like the abstract_or_primary_product_id:* clause is contributing
> to the overall response time. It also seems that the
> abstract_or_primary_product_id:* clause is not adding any value to the
> query criteria and can be safely removed. Is my understanding correct?
>
> I would like to know if the order of the clauses in the AND query would
> affect the response time of the query?
>
> For e.g. f1:3 AND f2:10 AND f3:* vs. f3:* AND f1:3 AND f2:10
>
> Doesn't Lucene/Solr pick up the optimal query execution plan?
>
> Is there anyway to look at the query execution plan generated by Lucene?
>
> Regards
> Suresh
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
As I understand it, using a different fq for each clause makes the
resultant caches more likely to be used in future requests.

For the query
fq=first:bob AND last:smith
a subsequent query for
fq=first:tim AND last:smith
won't be able to use the fq cache from the first query.

However, if the first query was
fq=first:bob
fq=last:smith
and subsequently
fq=first:tim
fq=last:smith
then the second query will at least benefit from the last:smith cache

Because fq clauses are always ANDed, this does not work for ORed clauses.

I suppose if some conditions are frequently used together it may be better
to put them in the same fq so there's only one cache. E.g. if an ecommerce
site regularly queried for featured:Y AND instock:Y.
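To make that concrete, the two request shapes are:

  fq=featured:Y&fq=instock:Y            (two independently reusable cache entries)
  fq=featured:Y AND instock:Y           (one combined cache entry)

The first form lets other queries reuse the featured:Y or instock:Y entries on
their own; the second only helps requests that filter on exactly that pair.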

On Thu, Aug 31, 2017 at 1:48 PM David Hastings 
wrote:

> >
> > 2) Because all your clauses are more like filters and are ANDed together,
> > you'll likely get better performance by putting them _each_ in an fq
> > E.g.
> > fq=product_identifier_type:DOTCOM_OFFER
> > fq=abstract_or_primary_product_id:[* TO *]
>
>
> why is this the case?  is it just better to have no logic operators in the
> filter queries?
>
>
>
> On Thu, Aug 31, 2017 at 1:47 PM, Josh Lincoln 
> wrote:
>
> > Suresh,
> > Two things I noticed.
> > 1) If your intent is to only match records where there's something,
> > anything, in abstract_or_primary_product_id, you should use fieldname:[*
> > TO
> > *]  but that will exclude records where that field is empty/missing. If
> you
> > want to match records even if that field is empty/missing, then you
> should
> > remove that clause entirely
> > 2) Because all your clauses are more like filters and are ANDed together,
> > you'll likely get better performance by putting them _each_ in an fq
> > E.g.
> > fq=product_identifier_type:DOTCOM_OFFER
> > fq=abstract_or_primary_product_id:[* TO *]
> > fq=gtin:
> > fq=product_class_type:BUNDLE
> > fq=hasProduct:N
> >
> >
> > On Thu, Aug 31, 2017 at 1:35 PM suresh pendap 
> > wrote:
> >
> > > Hello everybody,
> > >
> > > We are seeing that the below query is running very slow and taking
> > almost 4
> > > seconds to finish
> > >
> > >
> > > [] webapp=/solr path=/select
> > >
> > > params={df=_text_&distrib=false&fl=id&shards.purpose=4&
> > start=0&fsv=true&sort=modified_dtm+desc&shard.url=http://
> > > :8983/solr/flat_product_index_shard7_replica1/
> > %7Chttp://:8983/solr/flat_product_index_shard7_
> > replica2/%7Chttp://:8983/solr/flat_product_index_
> >
> shard7_replica0/&rows=11&version=2&q=product_identifier_type:DOTCOM_OFFER+
> > AND+abstract_or_primary_product_id:*+AND+(gtin:<
> > numericValue>)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N&NOW=
> > 1504196301534&isShard=true&timeAllowed=25000&wt=javabin}
> > > hits=0 status=0 QTime=3663
> > >
> > >
> > > It seems like the abstract_or_primary_product_id:* clause is
> > contributing
> > > to the overall response time. It seems that the
> > > abstract_or_primary_product_id:* . clause is not adding any value in
> the
> > > query criteria and can be safely removed.  Is my understanding correct?
> > >
> > > I would like to know if the order of the clauses in the AND query would
> > > affect the response time of the query?
> > >
> > > For e.g . f1: 3 AND f2:10 AND f3:* vs . f3:* AND f1:3 AND f2:10
> > >
> > > Doesn't Lucene/Solr pick up the optimal query execution plan?
> > >
> > > Is there anyway to look at the query execution plan generated by
> Lucene?
> > >
> > > Regards
> > > Suresh
> > >
> >
>


Re: query with wild card with AND taking lot of time

2017-08-31 Thread Josh Lincoln
Suresh,
Two things I noticed.
1) If your intent is to only match records where there's something,
anything, in abstract_or_primary_product_id, you should use fieldname:[* TO
*]  but that will exclude records where that field is empty/missing. If you
want to match records even if that field is empty/missing, then you should
remove that clause entirely
2) Because all your clauses are more like filters and are ANDed together,
you'll likely get better performance by putting them _each_ in an fq
E.g.
fq=product_identifier_type:DOTCOM_OFFER
fq=abstract_or_primary_product_id:[* TO *]
fq=gtin:
fq=product_class_type:BUNDLE
fq=hasProduct:N
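Assembled into a request, that might look like the sketch below (the gtin value
stays a placeholder, as in the original log, and the last two clauses keep the
leading - because the original query excluded those values):

  q=*:*
  fq=product_identifier_type:DOTCOM_OFFER
  fq=abstract_or_primary_product_id:[* TO *]
  fq=gtin:
  fq=-product_class_type:BUNDLE
  fq=-hasProduct:N
  sort=modified_dtm desc
  rows=11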


On Thu, Aug 31, 2017 at 1:35 PM suresh pendap 
wrote:

> Hello everybody,
>
> We are seeing that the below query is running very slow and taking almost 4
> seconds to finish
>
>
> [] webapp=/solr path=/select
>
> params={df=_text_&distrib=false&fl=id&shards.purpose=4&start=0&fsv=true&sort=modified_dtm+desc&shard.url=http://
> :8983/solr/flat_product_index_shard7_replica1/%7Chttp://:8983/solr/flat_product_index_shard7_replica2/%7Chttp://:8983/solr/flat_product_index_shard7_replica0/&rows=11&version=2&q=product_identifier_type:DOTCOM_OFFER+AND+abstract_or_primary_product_id:*+AND+(gtin:)+AND+-product_class_type:BUNDLE+AND+-hasProduct:N&NOW=1504196301534&isShard=true&timeAllowed=25000&wt=javabin}
> hits=0 status=0 QTime=3663
>
>
> It seems like the abstract_or_primary_product_id:* clause is contributing
> to the overall response time. It seems that the
> abstract_or_primary_product_id:* . clause is not adding any value in the
> query criteria and can be safely removed.  Is my understanding correct?
>
> I would like to know if the order of the clauses in the AND query would
> affect the response time of the query?
>
> For e.g . f1: 3 AND f2:10 AND f3:* vs . f3:* AND f1:3 AND f2:10
>
> Doesn't Lucene/Solr pick up the optimal query execution plan?
>
> Is there anyway to look at the query execution plan generated by Lucene?
>
> Regards
> Suresh
>


Re: Search by similarity?

2017-08-29 Thread Josh Lincoln
I reviewed the dismax docs and it doesn't support the fieldname:term
portion of the lucene syntax.
To restrict a search to a field and use mm you can either
A) use edismax exactly as you're currently trying to use dismax
B) use dismax, with the following changes
* remove the title: portion of the query and just pass
q="title-123123123-end"
* set qf=title
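As a sketch (the core name is a placeholder; add your mm value to either one):

  A) /solr/yourcore/select?defType=edismax&q=title:"title-123123123-end"
  B) /solr/yourcore/select?defType=dismax&qf=title&q="title-123123123-end"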

On Tue, Aug 29, 2017 at 10:25 AM Josh Lincoln 
wrote:

> Darko,
> Can you use edismax instead?
>
> When using dismax, solr is parsing the title field as if it's a query
> term. E.g. the query seems to be interpreted as
> title "title-123123123-end"
> (note the lack of a colon)...which results in querying all your qf fields
> for both "title" and "title-123123123-end"
> I haven't used dismax in a very long time, so I don't know if this is
> intentional, but it's not what I expected.
>
> I'm able to reproduce the issue in 6.4.2 using the default techproducts
> Notice that in the below the parsedquery expands to both text:title and
> text:name (df=text)
> http://localhost:8983/solr/techproducts/select?indent=on&q=title
> :"name"&wt=json&debug=true&defType=dismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0))
> DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
> parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"
>
> But it's not an issue if you use edismax
> http://localhost:8983/solr/techproducts/select?indent=on&q=title
> :"name"&wt=json&debug=true&defType=edismax
> rawquerystring: "title:"name"",
> querystring: "title:"name"",
> parsedquery: "(+title:name)/no_coord",
> parsedquery_toString: "+title:name",
>
>
>
> On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric  wrote:
>
>> Hi Erick,
>>
>> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
>> "querystring":"title:\"title-123123123-end\"",
>> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
>> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
>> (authors:title)^4.0 | (doi:title:)^1.0))
>> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
>> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
>> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
>> (authors:\"title 123123123 end\"~1)^4.0 |
>> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
>> "parsedquery_toString":"+author_full:title)^7.0 |
>> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
>> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
>> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
>> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
>> (authors:\"title 123123123 end\"~1)^4.0 |
>> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
>> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
>> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
>> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
>> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
>> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
>> parameter b\n 186.49593 = avgFieldLength\n 28.45 = fieldLength\n
>> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
>> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
>> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
>> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
>> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
>> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
>> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
>> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
>> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
>> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
>> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
>> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
>> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 

Re: Search by similarity?

2017-08-29 Thread Josh Lincoln
Darko,
Can you use edismax instead?

When using dismax, solr is parsing the title field as if it's a query term.
E.g. the query seems to be interpreted as
title "title-123123123-end"
(note the lack of a colon)...which results in querying all your qf fields
for both "title" and "title-123123123-end"
I haven't used dismax in a very long time, so I don't know if this is
intentional, but it's not what I expected.

I'm able to reproduce the issue in 6.4.2 using the default techproducts
Notice that in the below the parsedquery expands to both text:title and
text:name (df=text)
http://localhost:8983/solr/techproducts/select?indent=on&q=title
:"name"&wt=json&debug=true&defType=dismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0))
DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"

But it's not an issue if you use edismax
http://localhost:8983/solr/techproducts/select?indent=on&q=title
:"name"&wt=json&debug=true&defType=edismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+title:name)/no_coord",
parsedquery_toString: "+title:name",



On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric  wrote:

> Hi Erick,
>
> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
> "querystring":"title:\"title-123123123-end\"",
> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0))
> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
> "parsedquery_toString":"+author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 186.49593 = avgFieldLength\n 28.45 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max
> of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n
> 15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 =
> score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n
> 15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parame

Re: Solr Search Problem with Multiple Data-Import Handler

2017-06-22 Thread Josh Lincoln
I suspect Erik's right that clean=true is the problem. That's the default
in the DIH interface.


I find that when using DIH, it's best to set preImportDeleteQuery for every
entity. This safely scopes the clean variable to just that entity.
It doesn't look like the docs have examples of using preImportDeleteQuery,
so I put one here:
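A minimal data-config.xml entity sketch of the idea (table, column and field
names here are placeholders, not from the original message):

  <entity name="products"
          query="SELECT id, name, 'products' AS source_s FROM products"
          preImportDeleteQuery="source_s:products">
    <field column="id" name="id"/>
    <field column="name" name="name_s"/>
    <field column="source_s" name="source_s"/>
  </entity>

Every document indexed by the entity carries a source_s value naming that
entity, so when clean=true runs, preImportDeleteQuery deletes only that
entity's documents instead of wiping the whole index.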




On Wed, Jun 21, 2017 at 7:48 PM Erick Erickson 
wrote:

> First place I'd look is whether the jobs have clean=true set. If so the
> first thing DIH does is delete all documents.
>
> Best,
> Erick
>
> On Wed, Jun 21, 2017 at 3:52 PM, Pandey Brahmdev 
> wrote:
>
> > Hi,
> > I have setup Apache Solr 6.6.0 on Windows 10, 64-bit.
> >
> > I have created a simple core & configured DataImport Handlers.
> > I have configured 2 dataImport handlers in the Solr-config.xml file.
> >
> > The first to connect to the DB and index data from DB tables.
> > And the second to index data from all PDF files using TikaEntityProcessor.
> >
> > Now the problem is that there is no error in the console or anywhere, but
> > whenever I search using the "Query" tab it only gives me the results of the
> > last Data Import.
> >
> > So let's say I last imported data for the tables; then it gives me results
> > from the tables, and if I imported the PDF files then it only searches
> > inside the PDF files.
> >
> > But when I again want to search for DB table values, it doesn't give me the
> > results; instead I again need to import data for the DataImportHandler for
> > files, & vice-versa.
> >
> > Can you please help me out here?
> > Very sorry if I am doing anything wrong as I have started using Apache
> Solr
> > only 2 days back.
> >
> > Thanks & Regards,
> > Brahmdev Pandey
> > +46 767086309
> >
>


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Josh Lincoln
I had the same issue as Vrinda and found a hacky way to limit the number of
times deltaImportQuery was executed.

As designed, solr executes *deltaQuery* to get a list of ids that need to
be indexed. For each of those it executes *deltaImportQuery*, which is
typically very similar to the full *query*.

I constructed a deltaQuery to purposely only return 1 row. E.g.

 deltaQuery = "SELECT id FROM table WHERE rownum=1"// written for
oracle, likely requires a different syntax for other dbs. Also, it occurred
to you could probably include the date>= '${dataimporter.last_index_time}'
filter here so this returns 0 rows if no data has changed

Since *deltaImportQuery* now only gets called once, I needed to add the
filter logic to *deltaImportQuery* to only select the changed rows (that
logic is normally in *deltaQuery*). E.g.

deltaImportQuery = [normal import query] WHERE date >=
'${dataimporter.last_index_time}'


This significantly reduced the number of database queries for delta
imports, and sped up the processing.
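A sketch of the resulting DIH entity (table and column names are placeholders;
the rownum trick is Oracle syntax, as noted above):

  <entity name="items"
          query="SELECT * FROM items"
          deltaQuery="SELECT id FROM items WHERE rownum = 1
                      AND last_modified >= '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM items
                            WHERE last_modified >= '${dataimporter.last_index_time}'"/>

Because deltaQuery returns at most one id, deltaImportQuery runs at most once
and does the real filtering on changed rows itself.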

On Thu, Jun 1, 2017 at 6:07 AM Amrit Sarkar  wrote:

> Erick,
>
> Thanks for the pointer. Getting astray from what Vrinda is looking for
> (sorry about that): what if there are no sub-entities and no
> deltaImportQuery is passed either? I looked into the code and determined it
> calculates the deltaImportQuery itself,
> SQLEntityProcessor:getDeltaImportQuery(..)::126.
>
> Ideally then, a full-import or the delta-import should take similar time to
> build the docs (fetch next row). I may very well be going entirely wrong
> here.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Thu, Jun 1, 2017 at 1:50 PM, vrindavda  wrote:
>
> > Thanks Erick,
> >
> > But how do I solve this? I tried creating a stored proc instead of a plain
> > query, but there was no change in performance.
> >
> > For delta import it is processing more documents than the total number of
> > documents. In this case delta import is not helping at all, and I cannot
> > switch to full import each time. This was working fine with less data.
> >
> > Thank you,
> > Vrinda Davda
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-tp4338162p4338444.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: Search for ISBN-like identifiers

2017-01-05 Thread Josh Lincoln
Sebastian,
You may want to try adding autoGeneratePhraseQueries="true" to the
fieldtype.
With that setting, a query for 978-3-8052-5094-8 will behave just like "978
3 8052 5094 8" (with the quotes)
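A sketch of where the attribute goes (everything else in the fieldType stays
whatever you already have):

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    ... existing index and query analyzers unchanged ...
  </fieldType>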

A few notes about autoGeneratePhraseQueries
a) it used to be set to true by default, but that was changed several years
ago
b) does NOT require a reindex, so very easy to test
c) apparently not recommended for non-whitespace delimited languages (CJK,
etc), but maybe that's not an issue in your use case.
d) i'm unsure how it'll impact wildcard queries on that field. E.g. will
978-3-8052* match 978-3-8052-5094-8? At the very least, partial ISBNs (e.g.
978-3-8052) would match full ISBN without needing to use the wildcard. I'm
just not sure what happens if the user includes the wildcard.

Josh

On Thu, Jan 5, 2017 at 1:41 PM Sebastian Riemer  wrote:

> Thank you very much for taking the time to help me!
>
> I'll definitely have a look at the link you've posted.
>
> @ShawnHeisey Thanks too for shedding light on the wildcard behaviour!
>
> Allow me one further question:
> - Assuming that I define a separate field for storing the ISBNs, using the
> awesome analyzer provided by Mr. Bill Dueber: how do I get that field
> copied into my general text field, which is used by my QuickSearch-Input?
> Won't that field be processed again by the analyser defined on the text
> field?
> - Should I alternatively add more fields to the q-Parameter? As for now, I
> always have set q=text: but I guess one
> could try something like
> q=text:+isbnspeciallookupfield:
>
> I don't really know about that last idea though, since the searches are
> probably OR-combined, which is not what I would like to have.
>
> The third option would be to pre-process, in my application, the decision of
> where to look in Solr. I.e. everything matching a regex of only numbers and
> hyphens with length 13 -> don't query on the field text, instead use the
> field isbnspeciallookupfield
>
>
> Many thanks again, and have a nice day!
> Sebastian
>
>
> -Ursprüngliche Nachricht-
> Von: Erik Hatcher [mailto:erik.hatc...@gmail.com]
> Gesendet: Donnerstag, 5. Januar 2017 19:10
> An: solr-user@lucene.apache.org
> Betreff: Re: Search for ISBN-like identifiers
>
> Sebastian -
>
> There’s some precedent out there for ISBN’s.  Bill Dueber and the
> UMICH/code4lib folks have done amazing work, check it out here -
>
> https://github.com/mlibrary/umich_solr_library_filters <
> https://github.com/mlibrary/umich_solr_library_filters>
>
>   - Erik
>
>
> > On Jan 5, 2017, at 5:08 AM, Sebastian Riemer 
> wrote:
> >
> > Hi folks,
> >
> >
> > TL;DR: Is there an easy way to copy ISBNs with hyphens to the general
> > text field, or to configure the analyser on that field, so that a
> > search for the hyphenated ISBN returns exactly the matching document?
> >
> > Long version:
> > I've defined a field "text" of type "text_general", where I copy all
> > my other fields to, to be able to do a "quick search" where I set
> > q=text
> >
> > The definition of the type text_general is like this:
> >
> >
> >
> > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> >
> >
> > I now face the problem that searching for a book with
> > text:978-3-8052-5094-8* does not return the single result I expect.
> > However, searching for text:9783805250948* instead returns a result.
> > Note that I am adding a wildcard at the end automatically, to further
> > broaden the result set. Note also that it does not seem to matter
> > whether I put backslashes in front of the hyphens or not (to be exact,
> > when sending via SolrJ from my application, I put in the backslashes,
> > but I don't see a difference when using SolrAdmin, as I guess SolrAdmin
> > automatically inserts backslashes if needed?)
> >
> > When storing ISBNs, I store them twice, once with hyphens
> > (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search
> > on both those values also returns the single document.
> >
> > I learned that the StandardToke

Boosting Question & Parser Selection

2015-11-27 Thread Josh Collins
All,

I have a few questions related to boosting and whether my use case makes sense 
for Dismax vs. the standard parser.

I have created a gist of my field definitions and current query structure here: 
https://gist.github.com/joshdcollins/0e3f24dd23c3fc6ac8e3

With the given configuration I am attempting to:

  *   Support partial and exact matches by indexing fields twice — once with 
ngram and once without
  *   Boost exact matches higher than partial matches
  *   Boost matches in the entity_name (and entity_name_exact) field higher 
than content and content_exact fields
  *   Boost matches with an entity_type of ‘company’ and ‘insight’ higher than 
other result types

1)  Does the field definition and query approach make sense given the above 
objectives?

2)  I have an additional use case to support a query syntax where terms wrapped 
in single quotes must be exact matches.  Example “hello ‘wor'”  would NOT match 
a document containing hello and world.

a) Using the dismax parser can you explicitly determine which terms will be 
checked against which fields?
In this case I would search “hello” against my general fields and “wor" against 
the _exact fields.

b) Does this level of structured query better lend itself to using the standard 
query parser?

3)  Does anyone have any experience or resources troubleshooting the fast 
vector highlighter?  It is working correctly in most cases, but some search 
terms (sized lower than the boundaryScanner.maxScan) return no content in the 
highlighter results.

In some other cases the highlighter will highlight a term once in a result, 
but not in another occurrence.

Appreciate any insight anyone can provide!

jc


Re: text search problem

2014-07-23 Thread Josh Lincoln
Ravi, for the hyphen issue, try setting autoGeneratePhraseQueries=true for
that fieldType (no re-index needed). As of 1.4, this defaults to false. One
word of caution, autoGeneratePhraseQueries may not work as expected for
langauges that aren't whitespace delimited. As Erick mentioned, the
Analysis page will help you verify that your content and your queries are
handled the way you expect them to be.

See this thread for more info on autoGeneratePhraseQueries
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3c439f69a3-f292-482b-a102-7c011c576...@gmail.com%3E


On Mon, Jul 21, 2014 at 8:42 PM, Erick Erickson 
wrote:

> Try escaping the hyphen as \-. Or enclosing it all
> in quotes.
>
> But you _really_ have to spend some time with the debug option
> and the admin/analysis page or you will find endless surprises.
>
> Best,
> Erick
>
>
> On Mon, Jul 21, 2014 at 11:12 AM, EXTERNAL Taminidi Ravi (ETI,
> Automotive-Service-Solutions)  wrote:
>
> >
> > Thanks for the reply Erick, I will try as you suggested. I have
> > another question related to these lines.
> >
> > When I have "-" in my description or name, then the search results are
> > different. For e.g.
> >
> > "ABC-123": it looks for ABC or 123. I want to treat this search as an exact
> > match, i.e. if my document has ABC-123 then I should get the results.
> >
> > When I check with &hl=on, it has ABC and gets the results. How can
> > I avoid this situation?
> >
> > Thanks
> >
> > Ravi
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Saturday, July 19, 2014 4:40 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: text search problem
> >
> > Try adding &debug=all to the query and see what the parsed form of the
> > query is, likely you're
> > 1> using phrase queries, so "broadway hotel" requires both words in the text
> > or
> > 2> if you're not using phrases, you're searching for the AND of the two terms.
> >
> > But debug=all will show you.
> >
> > Plus, take a look at the admin/analysis page, your tokenization may not
> be
> > what you expect.
> >
> > Best,
> > Erick
> >
> >
> > On Fri, Jul 18, 2014 at 2:00 PM, EXTERNAL Taminidi Ravi (ETI,
> > Automotive-Service-Solutions) 
> wrote:
> >
> > > Hi, below is the text_general field type. When I search Text:Broadway
> > > it is not returning all the records, it returns only a few records.
> > > But when I search for Text:*Broadway*, it gets more records.
> > > When I get into multiple-word search like "Broadway Hotel", it may
> > > not get "Broadway", "Hotel" & "Broadway Hotel". Do you have any
> > > thoughts on how to handle this type of keyword search?
> > >
> > > Text:"Broadway,Vehicle Detailing,Water Systems,Vehicle Detailing,Car
> > > Wash Water Recovery"
> > >
> > > My Field type look like this.
> > >
> > > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="..."/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> > >             generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0"
> > >             stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1"
> > >             catenateAll="1" preserveOriginal="0"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="..."/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >             ignoreCase="true" expand="true"/>
> > >     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> > >             generateNumberParts="0" splitOnCaseChange="0" splitOnNumerics="0"
> > >             stemEnglishPossessive="0" catenateWords="1" catenateNumbers="1"
> > >             catenateAll="1" preserveOriginal="0"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > >
> > >
> > > Do you have any thoughts on this behavior or how to address it?
> > >
> > > Thanks
> > >
> > > Ravi
> > >
> >
>


Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
Thanks Tri,

I really appreciate the response. When I get some free time shortly I'll
start giving some of these a try and report back.


On Mon, Mar 3, 2014 at 12:42 PM, Tri Cao  wrote:

> If it's really the interned strings, you could try upgrading the JDK, as the
> newer HotSpot
> JVM puts interned strings in regular heap:
>
> http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html
>
> (search for String.intern() in that release)
>
> I haven't got a chance to look into the new core auto discovery code, so I
> don't know
> if it's implemented with reflection or not. Reflection and dynamic class
> loading is another
> source of PermGen exception, in my experience.
>
> I don't see anything wrong with your JVM config, which is very much
> standard.
>
> Hope this helps,
> Tri
>
>
> On Mar 03, 2014, at 08:52 AM, Josh  wrote:
>
> In the user core there are two fields, the database core in question was
> 40, but in production environments the database core is dynamic. My time
> has been pretty crazy trying to get this out the door and we haven't tried
> a standard solr install yet but it's on my plate for the test app and I
> don't know enough about Solr/Bitnami to know if they've done any serious
> modifications to it.
>
> I had tried doing a dump from VisualVM previously but it didn't seem to
> give me anything useful but then again I didn't know how to look for
> interned strings. This is something I can take another look at in the
> coming weeks when I do my test case against a standard solr install with
> SolrJ. The exception with user cores happens after 80'ish runs, so 640'ish
> user cores with the PermGen set to 64MB. The database core test was far
> lower, it was in the 10-15 range. As a note once the permgen limit is hit,
> if we simply restart the service with the same number of cores loaded the
> permgen usage is minimal even with the amount of user cores being high in
> our production environment (500-600).
>
> If this does end up being the interning of strings, is there anyway it can
> be mitigated? Our production environment for our heavier users would see in
> the range of 3200+ user cores created a day.
>
> Thanks for the help.
> Josh
>
>
> On Mon, Mar 3, 2014 at 11:24 AM, Tri Cao  wrote:
>
> Hey Josh,
>
> I am not an expert in Java performance, but I would start with dumping a
> the heap
> and investigate with visualvm (the free tool that comes with JDK).
>
> In my experience, the most common cause for PermGen exception is the app
> creates
> too many interned strings. Solr (actually Lucene) interns the field names
> so if you have
> too many fields, it might be the cause. How many fields in total across
> cores did you
> create before the exception?
>
> Can you reproduce the problem with the standard Solr? Is the bitnami
> distribution just
> Solr or do they have some other libraries?
>
> Hope this helps,
> Tri
>
> On Mar 03, 2014, at 07:28 AM, Josh  wrote:
>
> It's a windows installation using a bitnami solr installer. I incorrectly
> put 64M into the configuration for this, as I had copied the test
> configuration I was using to recreate the permgen issue we were seeing on
> our production system (that is configured to 512M) as it takes awhile with
> to recreate the issue with larger permgen values. In the test scenario
> there was a small 180 document data core that's static with 8 dynamic user
> cores that are used to index the unique document ids in the users view,
> which is then merged into a single user core. The final user core contains
> the same number of document ids as the data core and the data core is
> queried against with the ids in the final merged user core as the limiter.
> The user cores are then unloaded, and deleted from the drive and then the
> test is reran again with the user cores re-created
>
> We are also using the core discovery mode to store/find our cores and the
> database data core is using dynamic fields with a mix of single value and
> multi value fields. The user cores use a static configuration. The data is
> indexed from SQL Server using jtDS for both the user and data cores. As a
> note we also reversed the test case I mention above where we keep the user
> cores static and dynamically create the database core and this created the
> same issue only it leaked faster. We assumed this because the configuration
> was larger/loaded more classes then the simpler 

Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
In the user core there are two fields, the database core in question was
40, but in production environments the database core is dynamic. My time
has been pretty crazy trying to get this out the door and we haven't tried
a standard solr install yet but it's on my plate for the test app and I
don't know enough about Solr/Bitnami to know if they've done any serious
modifications to it.

I had tried doing a dump from VisualVM previously but it didn't seem to
give me anything useful but then again I didn't know how to look for
interned strings. This is something I can take another look at in the
coming weeks when I do my test case against a standard solr install with
SolrJ. The exception with user cores happens after 80'ish runs, so 640'ish
user cores with the PermGen set to 64MB. The database core test was far
lower, it was in the 10-15 range. As a note once the permgen limit is hit,
if we simply restart the service with the same number of cores loaded the
permgen usage is minimal even with the amount of user cores being high in
our production environment (500-600).

If this does end up being the interning of strings, is there anyway it can
be mitigated? Our production environment for our heavier users would see in
the range of 3200+ user cores created a day.

Thanks for the help.
Josh


On Mon, Mar 3, 2014 at 11:24 AM, Tri Cao  wrote:

> Hey Josh,
>
> I am not an expert in Java performance, but I would start with  dumping a
> the heap
> and investigate with visualvm (the free tool that comes with JDK).
>
> In my experience, the most common cause for PermGen exception is the app
> creates
> too many interned strings. Solr (actually Lucene) interns the field names
> so if you have
> too many fields, it might be the cause. How many fields in total across
> cores did you
> create before the exception?
>
> Can you reproduce the problem with the standard Solr? Is the bitnami
> distribution just
> Solr or do they have some other libraries?
>
> Hope this helps,
> Tri
>
> On Mar 03, 2014, at 07:28 AM, Josh  wrote:
>
> It's a windows installation using a bitnami solr installer. I incorrectly
> put 64M into the configuration for this, as I had copied the test
> configuration I was using to recreate the permgen issue we were seeing on
> our production system (that is configured to 512M) as it takes awhile with
> to recreate the issue with larger permgen values. In the test scenario
> there was a small 180 document data core that's static with 8 dynamic user
> cores that are used to index the unique document ids in the users view,
> which is then merged into a single user core. The final user core contains
> the same number of document ids as the data core and the data core is
> queried against with the ids in the final merged user core as the limiter.
> The user cores are then unloaded, and deleted from the drive and then the
> test is reran again with the user cores re-created
>
> We are also using the core discovery mode to store/find our cores and the
> database data core is using dynamic fields with a mix of single value and
> multi value fields. The user cores use a static configuration. The data is
> indexed from SQL Server using jtDS for both the user and data cores. As a
> note we also reversed the test case I mention above where we keep the user
> cores static and dynamically create the database core and this created the
> same issue only it leaked faster. We assumed this because the configuration
> was larger/loaded more classes then the simpler user core.
>
> When I get the time I'm going to put together a SolrJ test app to recreate
> the issue outside of our environment to see if others see the same issue
> we're seeing to rule out any kind of configuration problem. Right now we're
> interacting with solr with POCO via the restful interface and it's not very
> easy for us to spin this off into something someone else could use. In the
> mean time we've made changes to make the user cores more static, this has
> slowed down the build up of permgen to something that can be managed by a
> weekly reset.
>
> Sorry about the confusion in my initial email and I appreciate the
> response. Anything about my configuration that you can think might be
> useful just let me know and I can provide it. We have a work around, but it
> really hampers what our long term goals were for our Solr implementation.
>
> Thanks
> Josh
>
>
> On Mon, Mar 3, 2014 at 9:57 AM, Greg Walters wrote:
>
> Josh,
>
> You've mentioned a couple of times that you've got PermGen set to 512M but
> then you say you're running with -XX:MaxPermSize=64M. These two statements
> are contradictory so are you *sure* that you're 

Re: Solr Permgen Exceptions when creating/removing cores

2014-03-03 Thread Josh
It's a windows installation using a bitnami solr installer. I incorrectly
put 64M into the configuration for this, as I had copied the test
configuration I was using to recreate the permgen issue we were seeing on
our production system (that is configured to 512M) as it takes awhile with
to recreate the issue with larger permgen values. In the test scenario
there was a small 180 document data core that's static with 8 dynamic user
cores that are used to index the unique document ids in the users view,
which is then merged into a single user core. The final user core contains
the same number of document ids as the data core and the data core is
queried against with the ids in the final merged user core as the limiter.
The user cores are then unloaded, and deleted from the drive and then the
test is reran again with the user cores re-created

We are also using the core discovery mode to store/find our cores and the
database data core is using dynamic fields with a mix of single value and
multi value fields. The user cores use a static configuration. The data is
indexed from SQL Server using jtDS for both the user and data cores. As a
note we also reversed the test case I mention above where we keep the user
cores static and dynamically create the database core and this created the
same issue only it leaked faster. We assumed this because the configuration
was larger/loaded more classes then the simpler user core.

When I get the time I'm going to put together a SolrJ test app to recreate
the issue outside of our environment to see if others see the same issue
we're seeing to rule out any kind of configuration problem. Right now we're
interacting with solr with POCO via the restful interface and it's not very
easy for us to spin this off into something someone else could use. In the
mean time we've made changes to make the user cores more static, this has
slowed down the build up of permgen to something that can be managed by a
weekly reset.

Sorry about the confusion in my initial email and I appreciate the
response. Anything about my configuration that you can think might be
useful just let me know and I can provide it. We have a work around, but it
really hampers what our long term goals were for our Solr implementation.

Thanks
Josh


On Mon, Mar 3, 2014 at 9:57 AM, Greg Walters wrote:

> Josh,
>
> You've mentioned a couple of times that you've got PermGen set to 512M but
> then you say you're running with -XX:MaxPermSize=64M. These two statements
> are contradictory so are you *sure* that you're running with 512M of
> PermGen? Assuming your on a *nix box can you provide `ps` output proving
> this?
>
> Thanks,
> Greg
>
> On Feb 28, 2014, at 5:22 PM, Furkan KAMACI  wrote:
>
> > Hi;
> >
> > You can also check here:
> >
> http://stackoverflow.com/questions/3717937/cmspermgensweepingenabled-vs-cmsclassunloadingenabled
> >
> > Thanks;
> > Furkan KAMACI
> >
> >
> > 2014-02-26 22:35 GMT+02:00 Josh :
> >
> >> Thanks Timothy,
> >>
> >> I gave these a try and -XX:+CMSPermGenSweepingEnabled seemed to cause
> the
> >> error to happen more quickly. With this option on it didn't seemed to do
> >> any intermittent garbage collecting that delayed the issue in with it
> off.
> >> I was already using a max of 512MB, and I can reproduce it with it set
> this
> >> high or even higher. Right now because of how we have this implemented
> just
> >> increasing it to something high just delays the problem :/
> >>
> >> Anything else you could suggest I would really appreciate.
> >>
> >>
> >> On Wed, Feb 26, 2014 at 3:19 PM, Tim Potter  >>> wrote:
> >>
> >>> Hi Josh,
> >>>
> >>> Try adding: -XX:+CMSPermGenSweepingEnabled as I think for some VM
> >>> versions, permgen collection was disabled by default.
> >>>
> >>> Also, I use: -XX:MaxPermSize=512m -XX:PermSize=256m with Solr, so 64M
> may
> >>> be too small.
> >>>
> >>>
> >>> Timothy Potter
> >>> Sr. Software Engineer, LucidWorks
> >>> www.lucidworks.com
> >>>
> >>> 
> >>> From: Josh 
> >>> Sent: Wednesday, February 26, 2014 12:27 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Solr Permgen Exceptions when creating/removing cores
> >>>
> >>> We are using the Bitnami version of Solr 4.6.0-1 on a 64bit windows
> >>> installation with 64bit Java 1.7U51 and we are seeing consistent issues
> >>> with PermGen exceptions. We have the permgen configured to be 512MB.

Re: network slows when solr is running - help

2014-02-28 Thread Josh
Is it indexing data from over the network? (High data throughput would
increase latency.) Is it a virtual machine? (Other machines could be causing
slowdowns.) Another possible option is that the network card is offloading
processing onto the CPU, which introduces latency when the CPU is under load.


On Fri, Feb 28, 2014 at 4:11 PM, Petersen, Robert <
robert.peter...@mail.rakuten.com> wrote:

> Hi guys,
>
> Got an odd thing going on right now.  Indexing into my master server (solr
> 3.6.1) has slowed and it is because when solr runs ping shows latency.
>  When I stop solr though, ping returns to normal.  This has been happening
> occasionally, rebooting didn't help.  This is the first time I noticed that
> stopping solr returns ping speeds to normal.  I was thinking it was
> something with our network.   Solr is not consuming all resources on the
> box or anything like that, and normally everything works fine.  Has anyone
> seen this type of thing before?  Let me know if more info of any kind is
> needed.
>
> Solr process is at 8% memory utilization and 35% cpu utilization in 'top'
> command.
>
> Note: solr is the only thing running on the box.
>
> C:\Users\robertpe>ping 10.12.132.101  <-- Indexing
>
> Pinging 10.12.132.101 with 32 bytes of data:
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
>
> Ping statistics for 10.12.132.101:
> Packets: Sent = 4, Received = 4, Lost = 0 (0% lo
> Approximate round trip times in milli-seconds:
> Minimum = 0ms, Maximum = 0ms, Average = 0ms
>
> C:\Users\robertpe>ping 10.12.132.101  <-- Solr stopped
>
> Pinging 10.12.132.101 with 32 bytes of data:
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
>
> Ping statistics for 10.12.132.101:
> Packets: Sent = 4, Received = 4, Lost = 0 (0% lo
> Approximate round trip times in milli-seconds:
> Minimum = 0ms, Maximum = 0ms, Average = 0ms
>
> C:\Users\robertpe>ping 10.12.132.101  <-- Solr started but no indexing
> activity
>
> Pinging 10.12.132.101 with 32 bytes of data:
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
> Reply from 10.12.132.101: bytes=32 time<1ms TTL=64
>
> Ping statistics for 10.12.132.101:
> Packets: Sent = 4, Received = 4, Lost = 0 (0% lo
> Approximate round trip times in milli-seconds:
> Minimum = 0ms, Maximum = 0ms, Average = 0ms
>
> C:\Users\robertpe>ping 10.12.132.101  <-- Solr started and indexing started
>
> Pinging 10.12.132.101 with 32 bytes of data:
> Reply from 10.12.132.101: bytes=32 time=53ms TTL=64
> Reply from 10.12.132.101: bytes=32 time=51ms TTL=64
> Reply from 10.12.132.101: bytes=32 time=48ms TTL=64
> Reply from 10.12.132.101: bytes=32 time=51ms TTL=64
>
> Ping statistics for 10.12.132.101:
> Packets: Sent = 4, Received = 4, Lost = 0 (0% lo
> Approximate round trip times in milli-seconds:
> Minimum = 48ms, Maximum = 53ms, Average = 50ms
>
> Robert (Robi) Petersen
> Senior Software Engineer
> Search Department
>
>
>
>


Re: Solr Permgen Exceptions when creating/removing cores

2014-02-26 Thread Josh
Thanks Timothy,

I gave these a try and -XX:+CMSPermGenSweepingEnabled seemed to cause the
error to happen more quickly. With this option on it didn't seem to do any of
the intermittent garbage collecting that delayed the issue with it off.
I was already using a max of 512MB, and I can reproduce it with it set this
high or even higher. Right now, because of how we have this implemented,
increasing it to something high just delays the problem :/

Anything else you could suggest I would really appreciate.


On Wed, Feb 26, 2014 at 3:19 PM, Tim Potter wrote:

> Hi Josh,
>
> Try adding: -XX:+CMSPermGenSweepingEnabled as I think for some VM
> versions, permgen collection was disabled by default.
>
> Also, I use: -XX:MaxPermSize=512m -XX:PermSize=256m with Solr, so 64M may
> be too small.
>
>
> Timothy Potter
> Sr. Software Engineer, LucidWorks
> www.lucidworks.com
>
> ____
> From: Josh 
> Sent: Wednesday, February 26, 2014 12:27 PM
> To: solr-user@lucene.apache.org
> Subject: Solr Permgen Exceptions when creating/removing cores
>
> We are using the Bitnami version of Solr 4.6.0-1 on a 64bit windows
> installation with 64bit Java 1.7U51 and we are seeing consistent issues
> with PermGen exceptions. We have the permgen configured to be 512MB.
> Bitnami ships with a 32bit version of Java for windows and we are replacing
> it with a 64bit version.
>
> Passed in Java Options:
>
> -XX:MaxPermSize=64M
> -Xms3072M
> -Xmx6144M
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+CMSClassUnloadingEnabled
> -XX:NewRatio=3
>
> -XX:MaxTenuringThreshold=8
>
> This is our use case:
>
> We have what we call a database core which remains fairly static and
> contains the imported contents of a table from SQL server. We then have
> user cores which contain the record ids of results from a text search
> outside of Solr. We then query for the data we want from the database core
> and limit the results to the content of the user core. This allows us to
> combine facet data from Solr with the search results from another engine.
> We are creating the user cores on demand and removing them when the user
> logs out.
>
> Our issue is the constant creation and removal of user cores combined with
> the constant importing seems to push us over our PermGen limit. The user
> cores are removed at the end of every session and as a test I made an
> application that would loop creating the user core, import a set of data to
> it, query the database core using it as a limiter and then remove the user
> core. My expectation was in this scenario that all the permgen associated
> with that user core would be freed upon its unload and allow permgen to
> reclaim that memory during a garbage collection. This was not the case, it
> would constantly go up until the application would exhaust the memory.
>
> I also investigated whether there was a connection left behind between the
> two cores because I was joining them together in a query, but even
> unloading the database core after unloading all the user cores won't
> prevent the limit from being hit, nor does it allow any memory to be
> garbage collected from Solr.
>
> Is this a known issue with creating and unloading a large number of cores?
> Could it be configuration based for the core? Is there something other than
> unloading that needs to happen to free the references?
>
> Thanks
>
> Notes: I've tried using tools to determine if it's a leak within Solr such
> as Plumbr and my activities turned up nothing.
>


Solr Permgen Exceptions when creating/removing cores

2014-02-26 Thread Josh
We are using the Bitnami version of Solr 4.6.0-1 on a 64bit windows
installation with 64bit Java 1.7U51 and we are seeing consistent issues
with PermGen exceptions. We have the permgen configured to be 512MB.
Bitnami ships with a 32bit version of Java for windows and we are replacing
it with a 64bit version.

Passed in Java Options:

-XX:MaxPermSize=64M
-Xms3072M
-Xmx6144M
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+CMSClassUnloadingEnabled
-XX:NewRatio=3

-XX:MaxTenuringThreshold=8

This is our use case:

We have what we call a database core which remains fairly static and
contains the imported contents of a table from SQL server. We then have
user cores which contain the record ids of results from a text search
outside of Solr. We then query for the data we want from the database core
and limit the results to the content of the user core. This allows us to
combine facet data from Solr with the search results from another engine.
We are creating the user cores on demand and removing them when the user
logs out.

Our issue is the constant creation and removal of user cores combined with
the constant importing seems to push us over our PermGen limit. The user
cores are removed at the end of every session and as a test I made an
application that would loop creating the user core, import a set of data to
it, query the database core using it as a limiter and then remove the user
core. My expectation was in this scenario that all the permgen associated
with that user core would be freed upon its unload, allowing permgen to
reclaim that memory during a garbage collection. This was not the case; it
would constantly go up until the application would exhaust the memory.

I also investigated whether there was a connection left behind between the
two cores because I was joining them together in a query, but even
unloading the database core after unloading all the user cores won't
prevent the limit from being hit, nor does it allow any memory to be
garbage collected from Solr.

Is this a known issue with creating and unloading a large number of cores?
Could it be configuration based for the core? Is there something other than
unloading that needs to happen to free the references?

Thanks

Notes: I've tried using tools to determine if it's a leak within Solr such
as Plumbr and my activities turned up nothing.


Re: how to best convert some term in q to a fq

2013-12-27 Thread Josh Lincoln
what if you add your country field to qf with a strong boost? the search
experience would be slightly different than if you filter on country, but
maybe still good enough for your users and certainly simpler to implement
and maintain. You'd likely only want exact matches. Assuming you are using
edismax and a stopword file for your main query fields, you'll run into an
issue if you just index your country field as a string and there's a
stopword anywhere in your query...see SOLR-3085. To avoid this, yet still
boost on country only when there's an exact match, you could index the
country field as text using KeywordTokenizerFactory and the same stopword
file as your other fields.
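
For concreteness, the fieldType I have in mind would look roughly like this
in schema.xml (untested sketch; the type and field names are made up, and the
lowercase filter is optional):

<fieldType name="country_exact" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
<field name="country" type="country_exact" indexed="true" stored="true"/>

Then add country^10 (or whatever boost works for you) to qf.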

Regardless of the approach you take, unless there's only a small list of
countries you care about, multi-word countries might be too big an issue to
ignore, especially when the name contains common words (e.g. United States,
South Korea, New Zealand). This may be a good candidate for named entity
recognition on the query, possibly leveraging openNLP. I once saw a
presentation on how linkedin uses nlp on the query to detect the types of
entities the user is looking for. Seems similar to what you're trying to
accomplish. Of course, if countries are the only thing you're interested in
then you may be able to get away with client code for simple substring
matching using a static list of countries.

 On Dec 23, 2013 3:08 PM, "Joel Bernstein"  wrote:

> I would suggest handling this in the client. You could write custom Solr
> code also, but it would be more complicated because you'd be working with
> Solr's APIs.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Mon, Dec 23, 2013 at 2:36 PM, jmlucjav  wrote:
>
> > Hi,
> >
> > I have this scenario that I think is not unusual: solr will get a user
> > entered query string like 'apple pear france'.
> >
> > I need to do this: if any of the terms is a country, then change the
> > query params to move that term to a fq, i.e:
> > q=apple pear france
> > to
> > q=apple pear&fq=country:france
> >
> > What do you guys would be the best way to implement this?
> > - custom searchcomponent or queryparser
> > - servlet in same jetty as solr
> > - client code
> >
> > To simplify, consider countries are just a single term.
> >
> > Any pointer to an example to base this on would be great. thanks
> >
>


Re: simple tokenizer question

2013-12-08 Thread Josh Lincoln
Have you tried adding autoGeneratePhraseQueries=true to the fieldType
without changing the index analysis behavior?

This works at query time only, and will convert 12-34 to "12 34", as if the
user had entered the query as a phrase. This gives the expected behavior as
long as the tokenization is the same at index and query time.
This'll work for the 80-IA structure, and I think it'll also work for the
9(1)(vii) example (converting it to "9 1 vii"), but I haven't tested it.
Also, I would think the 12AA example should already be working as you
expect, unless maybe you're already using the WordDelimiterFilterFactory.
When I test the StandardTokenizer on 12AA it preserves the string,
resulting in just one token of 12aa.

autoGeneratePhraseQueries is at least worth a quick try - it doesn't
require reindexing.
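
For reference, it's just an attribute on the fieldType in schema.xml, along
these lines (sketch only, the type name is made up):

<fieldType name="text_sections" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>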

Two things to note
1) don't use autoGeneratePhraseQueries if you have CJK languages...probably
applies to any language that's not whitespace delimited. You mentioned
Indian, I presume Hindi, which I don't think will be an issue
2) In very rare cases you may have a few odd results if the
non-alphanumeric characters differ but generate the same phrase query. E.g.
9(1)(vii) would produce the same phrase as 9-1(vii), but this doesn't seem
worth considering until you know it's a problem.


On Sun, Dec 8, 2013 at 10:29 AM, Upayavira  wrote:

> If you want to just split on whitespace, then the WhitespaceTokenizer
> will do the job.
>
> However, this will mean that these two tokens aren't the same, and won't
> match each other:
>
> cat
> cat.
>
> A simple regex filter could handle those cases, removing a comma or dot
> when it appears at the end of a word. Although there are other similar situations
> (quotes, colons, etc) that you may want to handle eventually.
>
> Upayavira
>
> On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> > Thanks for your email.
> >
> > Great, I will look at the WordDelimiterFilterFactory. Just to make clear,
> > I DON'T want any other tokenizing on digits, special chars, punctuation,
> > etc. done other than word delimiting on whitespace.
> >
> > All I want for my first version is NO removal of punctuation/special
> > characters at indexing time and during search time, i.e., input as-is and
> > search as-is (like a simple SQL db?). I was assuming this would be a
> > trivial case with SOLR and I'm not sure what I am missing here.
> >
> > thanks
> > Vulcanoid
> >
> >
> >
> > On Sun, Dec 8, 2013 at 4:33 AM, Upayavira  wrote:
> >
> > > Have you tried a WhitespaceTokenizerFactory followed by the
> > > WordDelimiterFilterFactory? The latter is perhaps more configurable at
> > > what it does. Alternatively, you could use a RegexFilterFactory to
> > > remove extraneous punctuation that wasn't removed by the Whitespace
> > > Tokenizer.
> > >
> > > Upayavira
> > >
> > > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > > Hi,
> > > >
> > > > I am new to solr and I guess this is a basic tokenizer question so
> > > > please bear with me.
> > > >
> > > > I am trying to use SOLR to index a few (Indian) legal judgments in
> > > > text form and search against them. One of the key points with these
> > > > documents is that the sections/provisions of law usually have
> > > > punctuation/special characters in them. For example search queries
> > > > will TYPICALLY be section 12AA, section 80-IA, section 9(1)(vii) and
> > > > the text of the judgments themselves will contain this sort of text
> > > > with section references all over the place.
> > > >
> > > > Now, using a default schema setup with the StandardTokenizer, which
> > > > seems to delimit on whitespace AND punctuation, I get really bad
> > > > results because it looks like 12AA is split and results having 12 and
> > > > AA in them turn up. It becomes worse with 9(1)(vii), with results
> > > > containing 9 and 1 etc. being turned up.
> > > >
> > > > What is the best solution here? I really just want to index the
> > > > document as-is and also to do whitespace tokenizing on the search and
> > > > nothing more.
> > > >
> > > > So in other words:
> > > > a) I would like the text document to be indexed as-is with say 12AA
> > > > and 9(1)(vii) in the document stored as it is mentioned.
> > > > b) I would like to be able to search for 12AA and for 9(1)(vii) and
> > > > get proper full matches on them without any splitting up/munging etc.
> > > >
> > > > Any suggestions are appreciated.  Thank you for your time.
> > > >
> > > > Thanks
> > > > Vulcanoid
> > >
>


HdfsDirectory Implementation

2013-10-31 Thread Josh Clum
Hello,

I refactored out the HDFS directory implementation from Solr to use in my
own project and was surprised to see how it performed. I'm using both the
HdfsDirectory class and the HdfsDirectoryFactory class.

On my local machine when using the cache there was a significant speed up.
It was small enough that each file making up the Lucene index (12 docs) fit
into one block inside the cache.

When running it on a multinode cluster on AWS the performance pulling back
1031 docs with the cache was not that much better than without. According
to my log statements, the cache was being hit every time, but the
difference between this and my local was that there were several blocks per
file.

When setting up the cache I used the default settings as specified in
HdfsDirectoryFactory.

Any ideas on how to speed up searches? Should I change the block size? Is
there something that Blur does to put a wrapper around the cache?
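
To be specific, the cache settings I mean are the ones HdfsDirectoryFactory
reads when it's configured in solrconfig.xml, roughly like this (a sketch
with the defaults I believe it falls back to; the hdfs paths are made up):

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
</directoryFactory>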

ON A MULTI NODE CLUSTER
Number of documents in directory[1031]
Try #1 -> Total execution time: 3776
Try #2 -> Total execution time: 2995
Try #3 -> Total execution time: 2683
Try #4 -> Total execution time: 2301
Try #5 -> Total execution time: 2174
Try #6 -> Total execution time: 2253
Try #7 -> Total execution time: 2184
Try #8 -> Total execution time: 2087
Try #9 -> Total execution time: 2157
Try #10 -> Total execution time: 2089
Cached try #1 -> Total execution time: 2065
Cached try #2 -> Total execution time: 2298
Cached try #3 -> Total execution time: 2398
Cached try #4 -> Total execution time: 2421
Cached try #5 -> Total execution time: 2080
Cached try #6 -> Total execution time: 2060
Cached try #7 -> Total execution time: 2285
Cached try #8 -> Total execution time: 2048
Cached try #9 -> Total execution time: 2087
Cached try #10 -> Total execution time: 2106

ON MY LOCAL
Number of documents in directory[12]
Try #1 -> Total execution time: 627
Try #2 -> Total execution time: 620
Try #3 -> Total execution time: 637
Try #4 -> Total execution time: 535
Try #5 -> Total execution time: 486
Try #6 -> Total execution time: 527
Try #7 -> Total execution time: 363
Try #8 -> Total execution time: 430
Try #9 -> Total execution time: 431
Try #10 -> Total execution time: 337
Cached try #1 -> Total execution time: 38
Cached try #2 -> Total execution time: 38
Cached try #3 -> Total execution time: 36
Cached try #4 -> Total execution time: 35
Cached try #5 -> Total execution time: 135
Cached try #6 -> Total execution time: 31
Cached try #7 -> Total execution time: 36
Cached try #8 -> Total execution time: 30
Cached try #9 -> Total execution time: 29
Cached try #10 -> Total execution time: 28

Thanks,
Josh


Re: DIH - stream file with solrEntityProcessor

2013-10-15 Thread Josh Lincoln
ultimately I just temporarily increased the memory to handle this data set,
but that won't always be practical.

I did try the csv export/import and it worked well in this case. I hadn't
considered it at first. I am wary that the escaping and splitting may be
problematic with some data sets, so I'll look into adding XMLResponseParser
support to XPathEntityProcessor (essentially an option to
useSolrResponseSchema), though I have a feeling only a few other people
would be interested in this.

Thanks for the replies.


On Mon, Oct 14, 2013 at 11:19 PM, Lance Norskog  wrote:

> Can you do this data in CSV format? There is a CSV reader in the DIH.
> The SEP was not intended to read from files, since there are already
> better tools that do that.
>
> Lance
>
>
> On 10/14/2013 04:44 PM, Josh Lincoln wrote:
>
>> Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
>> the POST buffer being the issue. Thanks for suggesting I test this. The
>> full file is over a gig.
>>
>> Lance, I'm actually pointing SEP at a static file (I simply named the file
>> "select" and put it on a Web server). SEP thinks it's a large solr
>> response, which it was, though now it's just static xml. Works well until
>> I
>> hit the memory limit of the new solr instance.
>>
>> I can't query the old solr from the new one b/c they're on two different
>> networks. I can't copy the index files b/c I only want a subset of the
>> data
>> (identified with a query and dumped to xml...all fields of interest were
>> stored). To further complicate things, the old solr is 1.4. I was hoping
>> to
>> use the result xml format to backup the old, and DIH SEP to import to the
>> new dev solr4.x. It's promising as a simple and repeatable migration
>> process, except that SEP fails on largish files.
>>
>> It seems my options are 1) use the XPathEntityProcessor and identify each field
>> (there are many fields); 2) write a small script to act as a proxy to the
>> xml file and accept the row and start parameters from the SEP iterative
>> calls and return just a subset of the docs; 3) a script to process the xml
>> and push to solr, not using DIH; 4) consider XSLT to transform the result
>> xml to an update message and use XPathEntityProcessor
>> with useSolrAddSchema=true and streaming. The latter seems like the most
>> elegant and reusable approach, though I'm not certain it'll work.
>>
>> It'd be great if solrEntityProcessor could stream static files, or if I
>> could specify the solr result format while using the XPathEntityProcessor
>> (i.e. a useSolrResultSchema option)
>>
>> Any other ideas?
>>
>>
>>
>>
>>
>>
>> On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog  wrote:
>>
>>  On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>>>
>>>  On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>>>
>>>>  I have a large solr response in xml format and would like to import it
>>>>> into
>>>>> a new solr collection. I'm able to use DIH with solrEntityProcessor,
>>>>> but
>>>>> only if I first truncate the file to a small subset of the records. I
>>>>> was
>>>>> hoping to set stream="true" to handle the full file, but I still get an
>>>>> out
>>>>> of memory error, so I believe stream does not work with
>>>>> solrEntityProcessor
>>>>> (I know the docs only mention the stream option for the
>>>>> XPathEntityProcessor, but I was hoping solrEntityProcessor just might
>>>>> have
>>>>> the same capability).
>>>>>
>>>>> Before I open a jira to request stream support for solrEntityProcessor
>>>>> in
>>>>> DIH, is there an alternate approach for importing large files that are
>>>>> in
>>>>> the solr results format?
>>>>> Maybe a way to use xpath to get the values and a transformer to set the
>>>>> field names? I'm hoping to not have to declare the field names in
>>>>> dataConfig so I can reuse the process across data sets.
>>>>>
>>>>>  How big is the XML file?  You might be running into a size limit for
>>>> HTTP POST.
>>>>
>>>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>>>> regardless of what the container config has.  That size defaults to 2MB
>>>> but is configurable using the formdataUploadLimitInKB setting that you
>>>> can find in the example solrconfig.xml file, on the requestParsers tag.

Re: DIH - stream file with solrEntityProcessor

2013-10-14 Thread Josh Lincoln
Shawn, I'm able to read in a 4mb file using SEP, so I think that rules out
the POST buffer being the issue. Thanks for suggesting I test this. The
full file is over a gig.

Lance, I'm actually pointing SEP at a static file (I simply named the file
"select" and put it on a Web server). SEP thinks it's a large solr
response, which it was, though now it's just static xml. Works well until I
hit the memory limit of the new solr instance.

I can't query the old solr from the new one b/c they're on two different
networks. I can't copy the index files b/c I only want a subset of the data
(identified with a query and dumped to xml...all fields of interest were
stored). To further complicate things, the old solr is 1.4. I was hoping to
use the result xml format to backup the old, and DIH SEP to import to the
new dev solr4.x. It's promising as a simple and repeatable migration
process, except that SEP fails on largish files.

It seems my options are:
1) use the XPathEntityProcessor and identify each field (there are many
fields);
2) write a small script to act as a proxy to the xml file and accept the row
and start parameters from the SEP iterative calls and return just a subset
of the docs;
3) write a script to process the xml and push to solr, not using DIH;
4) consider XSLT to transform the result xml to an update message and use
XPathEntityProcessor with useSolrAddSchema=true and streaming.
The latter seems like the most elegant and reusable approach, though I'm not
certain it'll work.
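
For option 4, I'm picturing a data-config roughly along these lines (untested
sketch; the file path and the xsl sheet are hypothetical):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="docs"
            processor="XPathEntityProcessor"
            url="/data/old-solr-results.xml"
            xsl="xslt/results-to-add.xsl"
            useSolrAddSchema="true"
            stream="true"/>
  </document>
</dataConfig>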

It'd be great if solrEntityProcessor could stream static files, or if I
could specify the solr result format while using the XPathEntityProcessor
(i.e. a useSolrResultSchema option)

Any other ideas?






On Mon, Oct 14, 2013 at 6:24 PM, Lance Norskog  wrote:

> On 10/13/2013 10:02 AM, Shawn Heisey wrote:
>
>> On 10/13/2013 10:16 AM, Josh Lincoln wrote:
>>
>>> I have a large solr response in xml format and would like to import it
>>> into
>>> a new solr collection. I'm able to use DIH with solrEntityProcessor, but
>>> only if I first truncate the file to a small subset of the records. I was
>>> hoping to set stream="true" to handle the full file, but I still get an
>>> out
>>> of memory error, so I believe stream does not work with
>>> solrEntityProcessor
>>> (I know the docs only mention the stream option for the
>>> XPathEntityProcessor, but I was hoping solrEntityProcessor just might
>>> have
>>> the same capability).
>>>
>>> Before I open a jira to request stream support for solrEntityProcessor in
>>> DIH, is there an alternate approach for importing large files that are in
>>> the solr results format?
>>> Maybe a way to use xpath to get the values and a transformer to set the
>>> field names? I'm hoping to not have to declare the field names in
>>> dataConfig so I can reuse the process across data sets.
>>>
>> How big is the XML file?  You might be running into a size limit for
>> HTTP POST.
>>
>> In newer 4.x versions, Solr itself sets the size of the POST buffer
>> regardless of what the container config has.  That size defaults to 2MB
>> but is configurable using the formdataUploadLimitInKB setting that you
>> can find in the example solrconfig.xml file, on the requestParsers tag.
>>
>> In Solr 3.x, if you used the included jetty, it had a configured HTTP
>> POST size limit of 1MB.  In early Solr 4.x, there was a bug in the
>> included Jetty that prevented the configuration element from working, so
>> the actual limit was Jetty's default of 200KB.  With other containers
>> and these older versions, you would need to change your container
>> configuration.
>>
>> https://bugs.eclipse.org/bugs/**show_bug.cgi?id=397130<https://bugs.eclipse.org/bugs/show_bug.cgi?id=397130>
>>
>> Thanks,
>> Shawn
>>
>>  The SEP calls out to another Solr and reads. Are you importing data from
> another Solr and cross-connecting it with your uploaded XML?
>
> If the memory errors are a problem with streaming, you could try "piping"
> your uploaded documents through a processor that supports streaming. This
> would then push one document at a time into your processor that calls out
> to Solr and combines records.
>
>


DIH - stream file with solrEntityProcessor

2013-10-13 Thread Josh Lincoln
I have a large solr response in xml format and would like to import it into
a new solr collection. I'm able to use DIH with solrEntityProcessor, but
only if I first truncate the file to a small subset of the records. I was
hoping to set stream="true" to handle the full file, but I still get an out
of memory error, so I believe stream does not work with solrEntityProcessor
(I know the docs only mention the stream option for the
XPathEntityProcessor, but I was hoping solrEntityProcessor just might have
the same capability).
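
For context, the entity I'm using is essentially this (simplified sketch; the
url is a placeholder for wherever the response is being served from):

<dataConfig>
  <document>
    <!-- stream="true" on this entity is what I was hoping would work -->
    <entity name="sep"
            processor="SolrEntityProcessor"
            url="http://localhost:8983/solr/old-results"
            query="*:*"
            rows="500"/>
  </document>
</dataConfig>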

Before I open a jira to request stream support for solrEntityProcessor in
DIH, is there an alternate approach for importing large files that are in
the solr results format?
Maybe a way to use xpath to get the values and a transformer to set the
field names? I'm hoping to not have to declare the field names in
dataConfig so I can reuse the process across data sets.

Anyone have ideas? thanks


Request to be added to ContributorsGroup

2013-06-06 Thread Josh Lincoln
Hello Wiki Admins,

I have been using Solr for a few years now and I would like to
contribute back by making minor changes and clarifications to the wiki
documentation.

Wiki User Name : JoshLincoln


Thanks


Re: Deleting an individual document while delta index is running

2012-11-07 Thread Josh Turmel
Okay, thanks for the help guys... I *think* that this can be resolved by 
kicking off the delta and passing optimize=false since the default was true in 
3.3.

I'll post back if I see the issue pop back up.

JT


On Wednesday, November 7, 2012 at 1:34 PM, Josh Turmel wrote:

> Here's what we have set in our data-config.xml 
> 
> <dataSource url="jdbc:postgresql://localhost:5432/reader" user="data" 
> batchSize="1000" readOnly="true" autoCommit="false"
> transactionIsolation="TRANSACTION_READ_COMMITTED" 
> holdability="CLOSE_CURSORS_AT_COMMIT"
> />
> 
> 
> Thanks,
> Josh Turmel
> 
> 
> On Wednesday, November 7, 2012 at 1:00 PM, Shawn Heisey wrote:
> 
> > On 11/7/2012 10:55 AM, Otis Gospodnetic wrote:
> > > Hi Shawn,
> > > 
> > > It the last part really correct? Optimization should be doable while
> > > updates are going on... or am I missing something?
> > > 
> > 
> > 
> > From what I recall when I was first putting my build system together, 
> > which I will admit was on Solr 1.4.0, I couldn't do updates/deletes 
> > while optimizing was underway. I don't think 3.x was a whole lot 
> > different in this respect. From the little I understand about the 
> > significant changes in 4.0, it is probably now possible to do everything 
> > at the same time with no worry.
> > 
> > Because they are using 3.3, I don't think they have access to this 
> > ability. Given the limited amount of information available, it seemed 
> > the most likely explanation. I could be wrong, and if I am, they will 
> > have to keep looking for an explanation.
> > 
> > I'm definitely no expert, and I have not tried optimizing and updating 
> > at the same time since upgrading to 3.x. My indexing system 
> > deliberately avoids doing the two at the same time because it caused 
> > problems on 1.4.x.
> > 
> > I would certainly love to know for sure whether it's possible on 4.0, 
> > because I am in the process of updating my entire test environment in 
> > preparation for a production rollout. If I can do updates/commits at 
> > the same time as optimizing, my code will get smaller and a lot simpler.
> > 
> > Thanks,
> > Shawn
> > 
> > 
> > 
> 
> 



Re: Deleting an individual document while delta index is running

2012-11-07 Thread Josh Turmel
Here's what we have set in our data-config.xml 

<dataSource url="jdbc:postgresql://localhost:5432/reader" user="data"
  batchSize="1000" readOnly="true" autoCommit="false"
  transactionIsolation="TRANSACTION_READ_COMMITTED"
  holdability="CLOSE_CURSORS_AT_COMMIT"
/>

Thanks,
Josh Turmel


On Wednesday, November 7, 2012 at 1:00 PM, Shawn Heisey wrote:

> On 11/7/2012 10:55 AM, Otis Gospodnetic wrote:
> > Hi Shawn,
> > 
> > It the last part really correct? Optimization should be doable while
> > updates are going on... or am I missing something?
> > 
> 
> 
> From what I recall when I was first putting my build system together, 
> which I will admit was on Solr 1.4.0, I couldn't do updates/deletes 
> while optimizing was underway. I don't think 3.x was a whole lot 
> different in this respect. From the little I understand about the 
> significant changes in 4.0, it is probably now possible to do everything 
> at the same time with no worry.
> 
> Because they are using 3.3, I don't think they have access to this 
> ability. Given the limited amount of information available, it seemed 
> the most likely explanation. I could be wrong, and if I am, they will 
> have to keep looking for an explanation.
> 
> I'm definitely no expert, and I have not tried optimizing and updating 
> at the same time since upgrading to 3.x. My indexing system 
> deliberately avoids doing the two at the same time because it caused 
> problems on 1.4.x.
> 
> I would certainly love to know for sure whether it's possible on 4.0, 
> because I am in the process of updating my entire test environment in 
> preparation for a production rollout. If I can do updates/commits at 
> the same time as optimizing, my code will get smaller and a lot simpler.
> 
> Thanks,
> Shawn
> 
> 




Deleting an individual document while delta index is running

2012-11-06 Thread Josh Turmel
Running Solr 3.3

We're running into issues where deleting individual documents (by ID) will
time out, but it only seems to happen when our hourly delta index is being
run to pull in new documents. Is there a way to work around this?
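
For clarity, the deletes are just individual by-id updates, i.e. the
equivalent of posting something like this to /update (the id is made up):

<delete>
  <id>12345</id>
</delete>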

Thank you,
Josh


fl Parameter and Wildcards for Dynamic Fields

2012-07-04 Thread Josh Harness
I'm using SOLR 3.3 and would like to know how to return a list of dynamic
fields in my search results using a wildcard with the fl parameter. I found
SOLR-2444 <https://issues.apache.org/jira/browse/SOLR-2444> but this
appears to be for SOLR 4.0. Am I correct in assuming this isn't doable yet?
Please note that I don't want to query the dynamic fields, I just need them
returned in the search results. Using fl=myDynamicField_* doesn't seem to
work.
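
For what it's worth, the fields in question come from a dynamicField rule
along these lines (name and type made up):

<dynamicField name="myDynamicField_*" type="string" indexed="true" stored="true"/>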

Many Thanks!

Josh


DataImportHandler Streaming XML Parse

2011-11-08 Thread Josh Harness
All -

 We're using DIH to import flat xml files. We're getting Heap memory
exceptions due to the file size. Is there any way to force DIH to do a
streaming parse rather than a DOM parse? I really don't want to chunk my
files up or increase the heap size.
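
Is it just a matter of setting stream="true" on the XPathEntityProcessor
entity? Roughly like this (sketch; the paths, xpaths and field names are
made up):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="records"
            processor="XPathEntityProcessor"
            url="/data/export/records.xml"
            forEach="/records/record"
            stream="true">
      <field column="id"    xpath="/records/record/id"/>
      <field column="title" xpath="/records/record/title"/>
    </entity>
  </document>
</dataConfig>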

Many Thanks!

Josh


Re: Managing solr machines (start/stop/status)

2011-09-14 Thread josh lucas
On Sep 13, 2011, at 5:05 PM, Jamie Johnson wrote:

> I know this isn't a solr specific question but I was wondering what
> folks do in regards to managing the machines in their solr cluster?
> Are there any recommendations for how to start/stop/manage these
> machines?  Any suggestions would be appreciated.


One thing I use is csshx (http://code.google.com/p/csshx/) on my Mac when 
dealing with the various boxes in our cluster.  You can issue commands in one 
terminal and they are duplicated in all other windows.  Very useful for global 
stop/starts and updates.

Re: using a function query with OR and spaces?

2011-09-13 Thread josh lucas
On Sep 13, 2011, at 8:37 AM, Jason Toy wrote:

> I had queries breaking on me when there were spaces in the text I was
> searching for. Originally I had :
> 
> fq=state_s:New York
> and that would break. I found a workaround by using:
> 
> fq={!raw f=state_s}New York
> 
> 
> My problem now is doing this with an OR query. This is what I have now, but
> it doesn't work:
> 
> 
> fq=({!raw f=country_s}United States OR {!raw f=city_s}New York

Couldn't you do:

fq=(country_s:(United States) OR city_s:(New York))

I think that should work, though you probably will need to surround the queries
with quotes to get the exact phrase match.

Suggestions for copying fields across cores...

2011-08-05 Thread josh lucas
Is there a suggested way to copy data in fields to additional fields that will 
only be in a different core?  Obviously I could index the data separately and I 
could build that into my current indexing process but I'm curious if there 
might be an easier, more automated way.

Thanks!


josh

show first couple sentences from found doc

2009-02-20 Thread Josh Joy
Hi,

I would like to do something similar to Google, in that for my list of hits,
I would like to grab the surrounding text around my query term so I can
include that in my search results. What's the easiest way to do this?
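
Is this what the highlighting parameters are meant for? Something roughly
like these defaults on the request handler (sketch only; the field name is
made up):

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">body</str>
    <str name="hl.snippets">1</str>
    <str name="hl.fragsize">200</str>
  </lst>
</requestHandler>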

Thanks,
Josh


mapping pdf metadata

2009-02-20 Thread Josh Joy
Hi,

I'm having trouble figuring out how to map the Tika metadata fields to my
own Solr schema document fields. I guess the first hurdle I need to
overcome is where I can find a list of the Tika PDF metadata fields that
are available for mapping?
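
To be concrete, I'm assuming the mapping itself is done with fmap.* style
defaults on the extracting handler, something like the sketch below (my
field names are made up) - what I don't know is which Tika metadata names
can go on the left-hand side:

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.content">doc_text</str>
    <str name="fmap.author">doc_author</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>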

Thanks,
Josh