Need Debug Direction on Performance Problem

2015-01-16 Thread Naresh Yadav
Hi all,

We have a single Solr index with three fixed fields (one of which is
tokenized on whitespace) and 10-20 dynamic string fields.

The current size of the index is 2 GB with around 12 lakh (1.2 million) docs,
and the Solr nodes are 4-core, 16 GB RAM Linux machines.

Write performance is good. We then tested one read query (a select query that
applies filter criteria on the tokenized field and reads only the score field;
no grouping or faceting) in two setups:

Setup 1: single-node cloud with shards=1, replication=1
In this setup all 12 lakh docs are on the same machine. Our filter query,
reading around 10 lakh docs with only the score field, takes 1 minute.

Setup 2: two-node cloud with shards=2, replication=1
In this setup 6 lakh docs are on node1 and 6 lakh on node2. The same filter
query, reading around 10 lakh docs with only the score field, takes 114
minutes.

Please guide us on the possible reasons for this performance degradation
after sharding the index, and on how we can check where the Solr server
is spending its time when returning results.

Thanks
Naresh
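
A note on the closing question, assuming Solr 4.x as elsewhere in this
digest: shards.info=true adds a per-shard block (numFound, maxScore,
elapsed time) to the response, and debug=timing breaks QTime down by
search component, which together show whether the time is spent inside
the shards or in the distributed merge. For example (host, port, and the
filter are placeholders):

http://host:8983/solr/collection1/select?q=*:*&fq=<filter on tokenized field>&fl=score&shards.info=true&debug=timing

If each shard reports a small time but the total is large, the cost is
likely the merge step plus shipping roughly 10 lakh result rows between
nodes, a price the single-node setup never pays.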


Solr Cloud Stress Test

2015-01-16 Thread david mitche
Hi,

   I am a student planning to learn and do a features-and-functionality
test of SolrCloud as one of my projects, and I would like to stress- and
performance-test SolrCloud on my local machine (16 GB RAM, 250 GB SSD,
2.2 GHz Intel Core i7), covering multiple cloud features. What is the
recommended way to get started with it?

Thanks.
David
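
A note for getting started: JMeter and SolrMeter are established tools
for driving load against the /select handler. For a first rough baseline
on one machine, even a shell loop over a file of sample queries works; a
minimal sketch follows (the port, core name, and queries.txt are
placeholders, and a real stress test needs concurrent clients, which the
tools above provide):

while read q; do
  curl -s "http://localhost:8983/solr/collection1/select?q=${q}&rows=10" > /dev/null
done < queries.txt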


Re: Solr numFound > 0 but doc list empty in Solr Cloud setup

2015-01-16 Thread Jaikit Savla
Anshuman,

You are right that the @shards param is not required. One of my shards was
down, and hence when I added &shards.tolerant=true it worked without the
shards param. However, the document list is still empty.


content of solrconfig.xml
http://pastebin.com/CJxD22t1

 




On Friday, January 16, 2015 1:24 PM, Jaikit Savla  
wrote:
I followed all the steps listed here: 
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster

I have not updated solrconfig.xml and it is same as what comes default with 
4.10.

The only thing I added extra was list of my fields in 
example/solr/collection1/conf/schema.xml

@shards: If I query without that param, it returns the below error:
http://localhost:/solr/collection1/select?q=*:*


status: 503, QTime: 3, q: *:*
error: "no servers hosting shard:" (code 503)
[XML response tags stripped by the mail archive; only the values above survive]


On Friday, January 16, 2015 12:37 PM, Anshum Gupta  
wrote:
Looks like a config issue to me more than anything else.
Can you share your solrconfig? You will not be able to attach a file here
but you could share it via pastebin or something similar.
Also, why are you adding the "shards=http://localhost:8983/solr/collection1"
part to your request? You don't need to do that in most cases.


On Fri, Jan 16, 2015 at 12:20 PM, Jaikit Savla <
jaikit.sa...@yahoo.com.invalid> wrote:

> One more point:
> In cloud mode: If I submit a request with fl=id, it returns doc list. But
> when I add any other field, I get an empty doc list.
>
>
> http://localhost:/solr/select?q=domain:ebay&wt=json&shards=http://localhost:/solr/&fl=id&rows=1
>
> {
> responseHeader: {
> status: 0,
> QTime: 7,
> params: {
> fl: "id",
> shards: "http://localhost:/solr/";,
> q: "domain:ebay",
> wt: "json",
> rows: "1"
> }
> },
> response: {
> numFound: 17,
> start: 0,
> maxScore: 3.8559604,
> docs: [
> {
> id: "d8406557-6cd8-46d9-9a5e-29844387afc4"
> }
> ]
> }
> }
>
>
> Note: all of above works in single core mode.
>
>
>
> On Friday, January 16, 2015 12:13 PM, Jaikit Savla
>  wrote:
> As I said earlier - single core set up works fine with same solrconfig.xml
> and schema.xml
>
> cd example
> java -Djetty.port= -Dsolr.data.dir=/index/path -jar start.jar
>
> I am running Solr-4.10. Do I need to change any other configuration for
> running in solr cloud mode ?
>
>
>
>
> On Friday, January 16, 2015 11:56 AM, Jaikit Savla
>  wrote:
> Verified that all my fields are stored and marked as indexed.
> [schema.xml field definition stripped by the mail archive; only
> multiValued="true" /> survives]
>
> http://localhost:/solr/collection1/query?q=body%3A%22from%22&wt=json&indent=true&shards=http://localhost:/solr/collection1&start=1&rows=10&shards.info=true
>
> {
> responseHeader: {
> status: 0,
> QTime: 19,
> params: {
> shards: "http://localhost:/solr/collection1";,
> indent: "true",
> start: "1",
> q: "body:"from"",
> shards.info: "true",
> wt: "json",
> rows: "10"
> }
> },
> shards.info: {
> http://localhost:/solr/collection1: {
> numFound: 1717,
> maxScore: 0.5327856,
> shardAddress: "http://localhost:/solr/collection1",
> time: 12
> }
> },
> response: {
> numFound: 1707,
> start: 1,
> maxScore: 0.5327856,
> docs: [ ]
> }
> }
>
>
>
>
> On Friday, January 16, 2015 9:56 AM, Erick Erickson <
> erickerick...@gmail.com> wrote:
> Any chance that you've defined &rows=0 in your handler? Or is it possible
> that you have not set stored="true" for any of your fields?
>
> Best,
> Erick
>
>
> On Fri, Jan 16, 2015 at 9:46 AM, Jaikit Savla
>  wrote:
> > I am using below tutorial for Solr Cloud setup with 2 shards
> >
> http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster
> >
> >
> > I am able to get the default set up working. However, I have a
> requirement where my index is not in default location (data/index) and
> hence when I start the jvm for each shard I run with -Dsolr.data.dir=<index path>. Now when I query I get results with numFound > 0 but doc list
> is always empty.
> >
> > I verified that my index does have fields stored and indexed. Anyone
> else faced similar issue or have an idea on what I am missing ? Verified
> that by loading single core.
> >
> > Appreciate any help.
> >
> > request:
> >
> >
> http://localhost:/solr/collection1/select?q=body%3A%22to%22&wt=json&indent=true&shards=http://localhost:/solr/collection1
> >
> >
> > response:
> > { "responseHeader": { "status": 0, "QTime": 18, "params": { "shards": "
> http://localhost:/solr/collection1", "indent": "true", "q":
> "body:\"to\"", "_": "1421390858638", "wt": "json" } }, "response": {
> "numFound": 2564, "start": 0, "maxScore": 0.4523638, "docs": [] } }
>



-- 
Anshum Gupta
http://about.me/anshumgupta
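
A note on the symptom in this thread: a distributed query runs in two
phases, first collecting ids and scores from each shard and then fetching
the stored fields for the winning ids, so numFound can be correct while
docs comes back empty if the second phase reads from a core whose data
does not match what was queried; that failure mode would fit the custom
-Dsolr.data.dir setup described above, though the thread does not confirm
the root cause. Erick's check is still the first thing to verify: every
field to be returned needs stored="true" in schema.xml, for example
(field name and type are placeholders):

<field name="body" type="text_general" indexed="true" stored="true" multiValued="true"/>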



Re: Solr example for Solr 4.10.2 gives warning about Multiple request handlers with same name

2015-01-16 Thread Michael Sokolov
I've seen the same thing, poked around a bit, and eventually decided to 
ignore it.  I think there may be a ticket related to that saying it's a 
logging bug (i.e. not a real issue), but I couldn't swear to it.


-Mike

On 01/16/2015 12:36 PM, Tom Burton-West wrote:

Hello,

I'm running Solr 4.10.2 out of the box with the Solr example.

i.e. ant example
cd solr/example
java -jar start.jar

in /example/log

At start-up the example gives this message in the log:

WARN  - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers;
Multiple requestHandler registered to the same name: /update ignoring:
org.apache.solr.handler.UpdateRequestHandler

Is this a bug?   Is there something wrong with the out of the box example
configuration?

Tom
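
The warning means two requestHandler registrations ended up sharing a
name; Solr keeps the first and ignores the later one, so nothing is
functionally lost. As an illustration of the mechanism (not necessarily
the exact lines in the 4.10.2 example config):

<requestHandler name="/update" class="solr.UpdateRequestHandler"/>
<!-- any second handler registered under the same "/update" name,
     whether via a duplicated block or an implicit registration,
     is ignored with the WARN shown above -->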






Re: Query ReRanking question

2015-01-16 Thread Erick Erickson
Ravi:

Yep, this is the standard way to have recency influence the rank rather than
take over absolute ordering via a sort=date_time or similar.

Of course, how strongly the rank is influenced is "more an art than a science"
as far as figuring out what actual constants to put in.

Best,
Erick

On Fri, Jan 16, 2015 at 8:03 AM, Ravi Solr  wrote:
> As per Erick's suggestion reposting my response to the group. Joel and
> Erick Thank you very much for helping me out with the ReRanking question a
> while ago.
>
> I have an alternative which seems to be working better for me than
> ReRanking, can you kindly let me know of any pitfalls that you guys can
> think of about the this approach ?? Since we value relevancy & recency at
> the same time even though both are mutually exclusive, I thought maybe I
> can use the function queries to adjust the boost as follows
>
> boost=max(recip(ms(NOW/HOUR,publish_date),7.889e-10,1,1),scale(query($q),0,1))
>
> What I intended to do here is - if it matched a more recent doc it will
> take recency into consideration, however if the relevancy is better than
> date boost we keep relevancy. What do you guys think ??
>
> Thanks,
>
> Ravi Kiran Bhaskar
>
>
> On Mon, Sep 8, 2014 at 12:35 PM, Ravi Solr  wrote:
>
>> Joel and Erick,
>>Thank you very much for explaining how the ReRanking works. Now
>> its a bit more clear.
>>
>> Thanks,
>>
>> Ravi Kiran Bhaskar
>>
>> On Sun, Sep 7, 2014 at 4:45 PM, Joel Bernstein  wrote:
>>
>>> Oops wrong usage pattern. It should be:
>>>
>>> 1) Main query is sorted by a field (scores tracked silently in the
>>> background).
>>> 2) Reranker is reRanking docs based on the score from the main query.
>>>
>>>
>>>
>>> Joel Bernstein
>>> Search Engineer at Heliosearch
>>>
>>>
>>> On Sun, Sep 7, 2014 at 4:43 PM, Joel Bernstein 
>>> wrote:
>>>
>>> > Ok, just reviewed the code. The ReRankingQParserPlugin always tracks the
>>> > scores from the main query. So this explains things. Speaking of
>>> explaining
>>> > things, the ReRankingParserPlugin also works with Lucene's explain. So
>>> if
>>> > you use debugQuery=true we should see that the score from the initial
>>> query
>>> > was combined with the score from the reRankQuery, which should be 1.
>>> >
>>> > You have stumbled on an interesting usage pattern which I never
>>> considered.
>>> > But basically what's happening is:
>>> >
>>> > 1) Main query is sorted by score.
>>> > 2) Reranker is reRanking docs based on the score from the main query.
>>> >
>>> > No, worries Erick, you've taught me a lot over the past couple of years!
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Joel Bernstein
>>> > Search Engineer at Heliosearch
>>> >
>>> >
>>> > On Sun, Sep 7, 2014 at 11:37 AM, Erick Erickson <
>>> erickerick...@gmail.com>
>>> > wrote:
>>> >
>>> >> Joel:
>>> >>
>>> >> I find that whenever I say something totally wrong publicly, I
>>> >> remember the correction really really well...
>>> >>
>>> >> Thanks for straightening that out!
>>> >> Erick
>>> >>
>>> >> On Sat, Sep 6, 2014 at 12:58 PM, Joel Bernstein 
>>> >> wrote:
>>> >> > The following query:
>>> >> >
>>> >> > http://localhost:8080/solr/select?q=malaysian airline
>>> crash&rq={!rerank
>>> >> > reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
>>> >> > desc&fl=headline,publish_date,score
>>> >> >
>>> >> > Is doing the following:
>>> >> >
>>> >> > The main query is sorted by publish_date. Then the results are
>>> reranked
>>> >> by
>>> >> > *:*, which in theory would have no effect at all.
>>> >> >
>>> >> > The reRankQuery only uses the reRankQuery to re-rank the results. The
>>> >> sort
>>> >> > param will always apply to the main query.
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> >
>>> >> > Joel Bernstein
>>> >> > Search Engineer at Heliosearch
>>> >> >
>>> >> >
>>> >> > On Sat, Sep 6, 2014 at 2:33 PM, Ravi Solr 
>>> wrote:
>>> >> >
>>> >> >> Erick,
>>> >> >> Your idea about reversing Joel's suggestion seems to give
>>> the
>>> >> best
>>> >> >> results of all the options I tried...but I can't seem to understand
>>> >> why. I
>>> >> >> thought the query shown below should give irrelevant results as
>>> >> sorting by
>>> >> >> date would throw relevancy off...but somehow its getting relevant
>>> >> results
>>> >> >> with fair enough reverse chronology. It is as if the sort is applied
>>> >> after
>>> >> >> the docs are collected and reranked (which is what I wanted). One
>>> more
>>> >> >> thing that baffled me was, if I change reRankDocs from 1000 to 100
>>> the
>>> >> >> results become irrelevant, which doesnt make sense.
>>> >> >>
>>> >> >> So can you kindly explain whats going on in the following query.
>>> >> >>
>>> >> >> http://localhost:8080/solr/select?q=malaysian airline
>>> >> crash&rq={!rerank
>>> >> >> reRankQuery=$rqq reRankDocs=1000}&rqq=*:*&sort=publish_date
>>> >> >> desc&fl=headline,publish_date,score
>>> >> >>
>>> >> >> I love the solr community,



Re: Apache Solr quickstart tutorial - error while loading main class SimplePostTool

2015-01-16 Thread Shubhanshu Gupta
Thanks a lot. It did work. One last favor - can you please explain why
the old command didn't work and why this one did? I do know that the
command you gave assumes I had not set the environment through
"export CLASSPATH=dist/solr-core-4.10.2.jar". But I had already set the
environment, and still there was no effect. Please correct me if I am
wrong anywhere.


Thanks.


LinkedIn | Twitter

On Fri, Jan 16, 2015 at 7:26 PM, Ahmet Arslan 
wrote:

> Hi Shubhanshu,
>
> How about this one?
>
> java -classpath dist/solr-core-*jar -Dauto -Drecursive
> org.apache.solr.util.SimplePostTool docs/
>
> Ahmet
>
>
> On Friday, January 16, 2015 3:13 PM, Shubhanshu Gupta <
> shubhanshu.gupt...@gmail.com> wrote:
> I am following Apache Solr quickstart tutorial
> . The tutorial comes across
> indexing a directory of rich files which requires implementing java -Dauto
> -Drecursive org.apache.solr.util.SimplePostTool docs/ .
>
> I am getting an error which says: Could not find or load main class
> org.apache.solr.util.SimplePostTool inspite of following the quickstart
> tutorial closely. I am not getting how to resolve the error and proceed
> ahead with the tutorial.
> I would whole-heartedly appreciate any help. Thanks in advance.
>
>
> Regards,
> Shubhanshu Gupta
>
> LinkedIn
> <
> https://www.linkedin.com/profile/view?id=310143808&snapshotID=0&authType=name&authToken=cDRN&trk=NUS-body-member-name&sl=NPU_REG%3Bno_results%3B-1%3Bactivity%3A5903287270026268672%3B
> >
> |
> Twitter 
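
A likely explanation, assuming the stock 4.10 layout: java consults the
CLASSPATH environment variable only when no -classpath/-cp option is
given, and a relative entry such as dist/solr-core-4.10.2.jar resolves
only from the directory that contains dist/, so an export done in a
different shell or a different directory silently has no effect. Ahmet's
form makes both the jar and the working directory explicit:

cd solr-4.10.2   # install directory (placeholder); dist/ must be relative to here
java -classpath dist/solr-core-4.10.2.jar -Dauto -Drecursive org.apache.solr.util.SimplePostTool docs/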
>



Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Erick Erickson
Here's an example of using Tika in a stand-alone Java program.

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Fri, Jan 16, 2015 at 7:42 AM, Jack Krupansky
 wrote:
> It would be nice to have a SolrJ-level implementation as well as a
> command-line implementation of the extraction request handler so that app
> ingestion code could do the extraction outside of Solr at the app level and
> even as a separate process to stream to the app or Solr. That would permit
> the app to do customization, entity extraction, boiler-plate removal, etc. in
> app-friendly code, before transport to the Solr server.
>
> The extraction request handler is a really cool feature and quite
> sufficient for a lot of scenarios, but additional architectural flexibility
> would be a big win.
>
> -- Jack Krupansky
>
> On Fri, Jan 16, 2015 at 10:21 AM, Charlie Hull  wrote:
>
>> On 16/01/2015 04:02, Dan Davis wrote:
>>
>>> Why re-write all the document conversion in Java ;)  Tika is very slow.
>>>  5
>>> GB PDF is very big.
>>>
>>
>> Or you can run Tika in a separate process, or even on a separate machine,
>> wrapped with something to cope if it dies due to some horrible input...we
>> generally avoid document format translation within Solr and do it
>> externally before feeding documents to Solr.
>>
>> Charlie
>>
>>
>>> If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
>>> mode.   The HTML mode captures some meta-data that would otherwise be
>>> lost.
>>>
>>>
>>> If you need to go faster still, you can  also write some stuff linked
>>> directly against poppler library.
>>>
>>> Before you jump down my throat about Tika being slow - I wrote a PDF
>>> indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
>>> getjmp/longjmp.   But fast...
>>>
>>>
>>>
>>> On Thu, Jan 15, 2015 at 1:54 PM,  wrote:
>>>
>>>  Siegfried and Michael Thank you for your replies and help.

 -Original Message-
 From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
 Sent: Thursday, January 15, 2015 3:45 AM
 To: solr-user@lucene.apache.org
 Subject: Re: OutOfMemoryError for PDF document upload into Solr

 Hi Ganesh,

 you can increase the heap size but parsing a 4 GB PDF document will very
 likely consume A LOT OF memory - I think you need to check if that large
 PDF can be parsed at all :-)

 Cheers,

 Siegfried Goeschl

 On 14.01.15 18:04, Michael Della Bitta wrote:

> Yep, you'll have to increase the heap size for your Tomcat container.
>
> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
> -heap-size-correctly
>
> Michael Della Bitta
>
> Senior Software Engineer
>
> o: +1 646 532 3062
>
> appinions inc.
>
> “The Science of Influence Marketing”
>
> 18 East 41st Street
>
> New York, NY 10017
>
> t: @appinions  | g+:
> plus.google.com/appinions
>  3336/posts>
> w: appinions.com 
>
> On Wed, Jan 14, 2015 at 12:00 PM,  wrote:
>
>  Hello,
>>
>> Can someone pass on the hints to get around following error? Is there
>> any Heap Size parameter I can set in Tomcat or in Solr webApp that
>> gets deployed in Solr?
>>
>> I am running Solr webapp inside Tomcat on my local machine which has
>> RAM of 12 GB. I have PDF document which is 4 GB max in size that
>> needs to be loaded into Solr
>>
>>
>>
>>
>> Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space

>   at java.util.AbstractCollection.toArray(Unknown Source)
>>   at java.util.ArrayList.<init>(Unknown Source)
>>   at
>> org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>>   at
>>
> org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)

>   at
>>
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)

>   at
>>
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)

>   at
>>
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)

>   at
>>
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)

>   at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>   at
>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>   at
>> org.apache.tika.parser.AutoDetectParser.parse(
>> AutoDetectParser.java:120)
>>   at
>>
>>  org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(
 ExtractingDocumentLoader.java:219)

>   at
>>
>>  org.apache.solr.handler.ContentStreamHandlerBase.handleRequ
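
A sketch of the pattern in that post (client-side Tika extraction plus
SolrJ indexing), assuming SolrJ 4.x and Tika 1.x jars on the classpath;
the core URL and the "id"/"body" field names are placeholders. Running
Tika in your own process means a pathological PDF exhausts the client
JVM, not the Solr server:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class TikaIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    AutoDetectParser parser = new AutoDetectParser();
    for (String path : args) {
      // -1 disables the default 100k-character write limit
      BodyContentHandler text = new BodyContentHandler(-1);
      Metadata metadata = new Metadata();
      try (InputStream in = new FileInputStream(path)) {
        parser.parse(in, text, metadata);    // extraction happens here, not in Solr
      }
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", path);              // any unique key
      doc.addField("body", text.toString()); // assumes a stored "body" field
      solr.add(doc);
    }
    solr.commit();
    solr.shutdown();
  }
}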


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Markus Jelsma
Tika 1.6 has PDFBox 1.8.4, which has memory issues, eating excessive RAM! 
Either upgrade to Tika 1.7 (out now) or manually use the PDFBox 1.8.8 
dependency.

M.

On Friday 16 January 2015 15:21:55 Charlie Hull wrote:
> On 16/01/2015 04:02, Dan Davis wrote:
> > Why re-write all the document conversion in Java ;)  Tika is very slow.  
> > 5
> > GB PDF is very big.
> 
> Or you can run Tika in a separate process, or even on a separate
> machine, wrapped with something to cope if it dies due to some horrible
> input...we generally avoid document format translation within Solr and
> do it externally before feeding documents to Solr.
> 
> Charlie
> 
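
For those staying on Tika 1.6, a sketch of the pin Markus describes, in
Maven coordinates (assuming a Maven build; declaring PDFBox 1.8.8
explicitly makes it win over the 1.8.4 that Tika 1.6 pulls in):

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>pdfbox</artifactId>
  <version>1.8.8</version>
</dependency>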
> > If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
> > mode.   The HTML mode captures some meta-data that would otherwise be
> > lost.
> > 
> > 
> > If you need to go faster still, you can  also write some stuff linked
> > directly against poppler library.
> > 
> > Before you jump down by through about Tika being slow - I wrote a PDF
> > indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
> > getjmp/longjmp.   But fast...
> > 
> > On Thu, Jan 15, 2015 at 1:54 PM,  wrote:
> >> Siegfried and Michael Thank you for your replies and help.
> >> 
> >> -Original Message-
> >> From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
> >> Sent: Thursday, January 15, 2015 3:45 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: OutOfMemoryError for PDF document upload into Solr
> >> 
> >> Hi Ganesh,
> >> 
> >> you can increase the heap size but parsing a 4 GB PDF document will very
> >> likely consume A LOT OF memory - I think you need to check if that large
> >> PDF can be parsed at all :-)
> >> 
> >> Cheers,
> >> 
> >> Siegfried Goeschl
> >> 
> >> On 14.01.15 18:04, Michael Della Bitta wrote:
> >>> Yep, you'll have to increase the heap size for your Tomcat container.
> >>> 
> >>> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
> >>> -heap-size-correctly
> >>> 
> >>> Michael Della Bitta
> >>> 
> >>> Senior Software Engineer
> >>> 
> >>> o: +1 646 532 3062
> >>> 
> >>> appinions inc.
> >>> 
> >>> “The Science of Influence Marketing”
> >>> 
> >>> 18 East 41st Street
> >>> 
> >>> New York, NY 10017
> >>> 
> >>> t: @appinions  | g+:
> >>> plus.google.com/appinions
> >>>  >>> 3336/posts>
> >>> w: appinions.com 
> >>> 
> >>> On Wed, Jan 14, 2015 at 12:00 PM,  wrote:
>  Hello,
>  
>  Can someone pass on the hints to get around following error? Is there
>  any Heap Size parameter I can set in Tomcat or in Solr webApp that
>  gets deployed in Solr?
>  
>  I am running Solr webapp inside Tomcat on my local machine which has
>  RAM of 12 GB. I have PDF document which is 4 GB max in size that
>  needs to be loaded into Solr
>  
>  
>  
>  
>  Exception in thread "http-apr-8983-exec-6" java.lang.: Java heap
> >> 
> >> space
> >> 
>    at java.util.AbstractCollection.toArray(Unknown Source)
>    at java.util.ArrayList.(Unknown Source)
>    at
>  
>  org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
>  
>    at
> >> 
> >> org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
> >> 
>    at
> >> 
> >> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
> >> 
>    at
> >> 
> >> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
> >> 
>    at
> >> 
> >> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
> >> 
>    at
> >> 
> >> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
> >> 
>    at
>  
>  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>  
>    at
>  
>  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>  
>    at
>  
>  org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120
>  )
>  
>    at
> >> 
> >> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(Extracti
> >> ngDocumentLoader.java:219)>> 
>    at
> >> 
> >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Conten
> >> tStreamHandlerBase.java:74)>> 
>    at
> >> 
> >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBa
> >> se.java:135)>> 
>    at
> >> 
> >> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequ
> >> est(RequestHandlers.java:246)>> 
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>    at
> >> 
> >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.jav
> >> a:777)>> 
>    at
> >> 
> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispa

Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Charlie Hull

On 16/01/2015 04:02, Dan Davis wrote:

Why re-write all the document conversion in Java ;)  Tika is very slow.   5
GB PDF is very big.


Or you can run Tika in a separate process, or even on a separate 
machine, wrapped with something to cope if it dies due to some horrible 
input...we generally avoid document format translation within Solr and 
do it externally before feeding documents to Solr.


Charlie


If you have a lot of PDF like that try pdftotext in HTML and UTF-8 output
mode.   The HTML mode captures some meta-data that would otherwise be lost.


If you need to go faster still, you can  also write some stuff linked
directly against poppler library.

Before you jump down by through about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
getjmp/longjmp.   But fast...



On Thu, Jan 15, 2015 at 1:54 PM,  wrote:


Siegfried and Michael Thank you for your replies and help.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, January 15, 2015 3:45 AM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemoryError for PDF document upload into Solr

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very
likely consume A LOT OF memory - I think you need to check if that large
PDF can be parsed at all :-)

Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:

Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial
-heap-size-correctly

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions

w: appinions.com 

On Wed, Jan 14, 2015 at 12:00 PM,  wrote:


Hello,

Can someone pass on the hints to get around following error? Is there
any Heap Size parameter I can set in Tomcat or in Solr webApp that
gets deployed in Solr?

I am running Solr webapp inside Tomcat on my local machine which has
RAM of 12 GB. I have PDF document which is 4 GB max in size that
needs to be loaded into Solr




Exception in thread "http-apr-8983-exec-6" java.lang.: Java heap

space

  at java.util.AbstractCollection.toArray(Unknown Source)
  at java.util.ArrayList.(Unknown Source)
  at
org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
  at

org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)

  at

org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)

  at

org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)

  at

org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)

  at

org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)

  at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
  at


org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)

  at


org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

  at


org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

  at


org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)

  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
  at


org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)

  at


org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)

  at


org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)

  at


org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)

  at


org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

  at


org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)

  at


org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)

  at


org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)

  at


org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)

  at


org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)

  at


org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)

  at


org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)

  at


org.apache.coyote.http1

Re: Apache Solr quickstart tutorial - error while loading main class SimplePostTool

2015-01-16 Thread Ahmet Arslan
Hi Shubhanshu,

How about this one?

java -classpath dist/solr-core-*jar -Dauto -Drecursive 
org.apache.solr.util.SimplePostTool docs/

Ahmet


On Friday, January 16, 2015 3:13 PM, Shubhanshu Gupta 
 wrote:
I am following Apache Solr quickstart tutorial
. The tutorial comes across
indexing a directory of rich files which requires implementing java -Dauto
-Drecursive org.apache.solr.util.SimplePostTool docs/ .

I am getting an error which says: Could not find or load main class
org.apache.solr.util.SimplePostTool inspite of following the quickstart
tutorial closely. I am not getting how to resolve the error and proceed
ahead with the tutorial.
I would whole-heartedly appreciate any help. Thanks in advance.


Regards,
Shubhanshu Gupta

LinkedIn

|
Twitter 


Apache Solr quickstart tutorial - error while loading main class SimplePostTool

2015-01-16 Thread Shubhanshu Gupta
I am following the Apache Solr quickstart tutorial. The tutorial covers
indexing a directory of rich files, which requires running java -Dauto
-Drecursive org.apache.solr.util.SimplePostTool docs/.

I am getting an error which says: Could not find or load main class
org.apache.solr.util.SimplePostTool, despite following the quickstart
tutorial closely. I do not see how to resolve the error and proceed
with the tutorial.
I would whole-heartedly appreciate any help. Thanks in advance.


Regards,
Shubhanshu Gupta



Re: How to select the correct number of Shards in SolrCloud

2015-01-16 Thread Manohar Sripada
Thanks Daniel and Shawn for your valuable suggestions,

Daniel,
If you have a query and it needs to get results from 64 cores, if 63 return
in 100ms but the last core is in GC pause and takes 500ms, your query will
take just over 500ms.
> There is only a single JVM running per machine. I will get the QTime from
each Solr core and check whether this is the root cause.

Lastly, you mentioned you allocated 32Gb to "solr", do you mean to the
JVM heap?
That's quite a lot of a 64Gb machine, you haven't left much for the page
cache.
> Yes, 32GB to Solr's JVM heap. I wanted to enable the Filter & FieldValue
caches, as most of my search queries revolve around filters and facets.
Also, I am planning to use the Document cache.

Shawn,
Each server has 8 CPU cores and 64GB of RAM.  Solr requires a 6GB heap
> Can you please tell me the size of your index? And what is the
size of the large cold shards?
> Can you please suggest a tool that you use for collecting
statistics, like the QTimes for the queries?

Thanks,
Manohar


On Fri, Jan 16, 2015 at 3:23 PM, Shawn Heisey  wrote:

> On 1/15/2015 10:58 PM, Manohar Sripada wrote:
> > The reason I have created 64 Shards is there are 4 CPU cores on each VM;
> > while querying I can make use of all the CPU cores. On an average, Solr
> > QTime is around 500ms here.
> >
> > Last time to my other discussion, Erick suggested that I might be over
> > sharding, So, I tried reducing the number of shards to 32 and then 16. To
> > my surprise, it started performing better. It came down to 300 ms (for 32
> > shards) and 100 ms (for 16 shards). I haven't tested with filters and
> > facets yet here. But, the simple search queries had shown lot of
> > improvement.
> >
> > So, how come the less number of shards performing better?? Is it because
> > there are less number of posting lists to search on OR less merges that
> are
> > happening? And how to determine the correct number of shards?
>
> Daniel has replied with good information.
>
> One additional problem I can think of when there are too many shards: If
> your Solr server is busy enough to have any possibility of simultaneous
> requests, then you will find that it's NOT a good idea to create enough
> shards to use all your CPU cores.  In that situation, when you do a
> single query, all your CPU cores will be in use.  When multiple queries
> happen at the same time, they have to share the available CPU resources,
> slowing them down.  With a smaller number of shards, the additional CPU
> cores can handle simultaneous queries.
>
> I have an index with nearly 100 million documents.  I've divided it into
> six large cold shards and one very small hot shard.  It's not SolrCloud.
>  I put three large shards on each of two servers, and the small shard on
> one of those two servers.  The distributed query normally happens on the
> server without the small shard.  Each server has 8 CPU cores and 64GB of
> RAM.  Solr requires a 6GB heap.
>
> My median QTime over the last 231836 queries is 25 milliseconds and my
> 95th percentile QTime is 376 milliseconds.  My query rate is pretty low
> - I've never seen Solr's statistics for the 15 minute query rate go
> above a single digit per second.
>
> Thanks,
> Shawn
>
>


Re: Solr groups not matching with terms in a field

2015-01-16 Thread Naresh Yadav
Thanks Ahmet, my problem is solved. The reason for the slow performance of
the facet query was that I was not calling setRows(0).
Once I did that, it came back in seconds, just like the terms query.
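
For reference, a minimal SolrJ sketch of that fix. The core URL is an
assumption, and facet.limit is raised here so the complete term list comes
back; neither detail is stated in the thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetOnlyQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("type:1");
    q.setRows(0);            // no stored documents fetched; only facet counts come back
    q.setFacet(true);
    q.addFacetField("tenant_pool");
    q.setFacetLimit(-1);     // -1 = return every term, not just the default top 100
    QueryResponse rsp = server.query(q);
    FacetField ff = rsp.getFacetField("tenant_pool");
    for (FacetField.Count c : ff.getValues()) {
      System.out.println(c.getName() + " -> " + c.getCount());
    }
    server.shutdown();
  }
}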

On Fri, Jan 16, 2015 at 3:25 PM, Ahmet Arslan 
wrote:

> Hi,
>
> Thats a different problem : speed-up faceting.
> Faceting used all over the place and it is fast. I suggest you looks for
> faceting improvements.
>
> Ahmet
>
>
>
> On Friday, January 16, 2015 11:17 AM, Naresh Yadav 
> wrote:
> I tried facetting also but not worked smoothly for me. Case i had mentioned
> in email is dummy one and my actual index is with
> 12 lakh docs and 2 GB size on single machine. Each of tenant_pool field
> value has 20-30 tokens.
> Getting all terms in tenant_pool is fast in seconds but when i go with
> facet path after filter criteria then that is very slow. Because
> it is reading whole field from disk and i am only interested in terms.
>
>
> On Fri, Jan 16, 2015 at 1:48 PM, Ahmet Arslan 
> wrote:
>
> > Hi Naresh,
> >
> > Yup terms component does not respect q or fq parameter.
> > Luckily, thats easy with facet component. Example :
> > facet=true&facet.field=tenant_pool&q=type:1
> >
> > Please see more here :
> > https://cwiki.apache.org/confluence/display/solr/Faceting
> >
> > happy faceting,
> > ahmet
> >
> >
> >
> > On Friday, January 16, 2015 10:13 AM, Naresh Yadav  >
> > wrote:
> > Hi ahmet,
> >
> > Thanks, now i understand better, i will not try my usecase with grouping.
> > Actually i am interested in unique terms in a field i.e tenant_pool.
> That i
> > get perfectly with http://www.imagesup.net/?di=614212438580
> >
> > But i am not able to get terms after applying some filter say "type":"1".
> > That is I need unique terms in "tenant_pool" field for "type":"1" query
> and
> > answer will be P1, L1.
> > Please suggest me if i can get this with out reading each doc from disk.
> >
> >
> > On Fri, Jan 16, 2015 at 1:28 PM, Ahmet Arslan  >
> > wrote:
> >
> > > Hi Naresh,
> > >
> > > I have never grouped on a tokenised field and I am not sure it makes
> > sense
> > > to do so.
> > >
> > > Reading back ref-guide it says this about group.field parameter
> > >
> > > "The name of the field by which to group results. The field must be
> > > single-valued, and either be indexed or a field type that has a value
> > > source and works in a function query, such as ExternalFileField. It
> must
> > > also be a string-based field, such as StrField or TextField"
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> > >
> > > Therefore, it should be single valued. P.S. Don't get confused with
> > > TextField type, for example it could create single token when used with
> > > keyword tokenizer.
> > >
> > > Ahmet
> > >
> > > On Friday, January 16, 2015 4:43 AM, Naresh Yadav <
> nyadav@gmail.com>
> > > wrote:
> > > Hi ahmet,
> > >
> > > If you observe output ngroups is 1 and returning only one group P1.
> > > But my expectation is it should return three groups P1, L1, L2 as my
> > > field is tokenized with space.
> > >
> > > Please correct me if wrong?
> > >
> > >
> > > On 1/15/15, Ahmet Arslan  wrote:
> > > >
> > > >
> > > > Hi Naresh,
> > > >
> > > > Everything looks correct, what is the problem here?
> > > >
> > > > If you want to see more than one document per group, there is a
> > parameter
> > > > for that which defaults to 1.
> > > >
> > > > Ahmet
> > > >
> > > >
> > > >
> > > > On Thursday, January 15, 2015 9:02 AM, Naresh Yadav <
> > > nyadav@gmail.com>
> > > > wrote:
> > > > Hi all,
> > > >
> > > > I had done following configuration to test Solr grouping concept.
> > > >
> > > > solr version :  4.6.1 (tried in latest version 4.10.3 also)
> > > > Schema : http://www.imagesup.net/?di=10142124357616
> > > > Solrj code to insert docs :
> > > http://www.imagesup.net/?di=10142124381116
> > > > Response Group's :  http://www.imagesup.net/?di=1114212438351
> > > > Response Terms' : http://www.imagesup.net/?di=614212438580
> > > >
> > > > Please let me know if am i doing something wrong her
> > >
> >
>


Re: Easiest way to embed solr in a desktop application

2015-01-16 Thread Ramkumar R. Aiyengar
That's correct. Even though it should still be possible to embed Jetty,
that could change in the future, and that's why support for pluggable
containers is being taken away.

If you need to deal with the index at a lower level, there's always Lucene
you can use as a library instead of Solr.

But I am assuming you need to use the search engine at a higher level than
that, and hence you ask for Solr. In which case, I urge you to think through
whether you really can't run this out of process; maybe this is an XY problem.
Keep in mind that Solr has the ability to provide higher level
functionality because it can control almost the entirety of the application
(which is the philosophical reason behind removal of the war as well), and
that's the reason something like EmbeddedSolrServer will always have
caveats.
On 15 Jan 2015 15:09, "Robert Krüger"  wrote:

> I was considering the programmatic Jetty option but then I read that Solr 5
> no longer supports being run with an external servlet container but maybe
> they still support programmatic jetty use in some way. atm I am using solr
> 4.x, so this would work. No idea if this gets messy classloader-wise in any
> way.
>
> I have been using exactly the approach you described in the past, i.e. I
> built a really, really simple swing dialogue to input queries and display
> results in a table but was just guessing that the built-in ui was far
> superior but maybe I should just live with it for the time being.
>
> On Thu, Jan 15, 2015 at 3:56 PM, Erik Hatcher 
> wrote:
>
> > It’d certainly be easiest to just embed Jetty into your application.  You
> > don’t need to have Jetty as a separate process, you could launch it
> through
> > it’s friendly Java API, configured to use solr.war.
> >
> > If all you needed was to make HTTP(-like) queries to Solr instead of the
> > full admin UI, your application could stick to using EmbeddedSolrServer
> and
> > also provide a UI that takes in a Solr query string (or builds one up)
> and
> > then sends it to the embedded Solr and displays the result.
> >
> > Erik
> >
> > > On Jan 15, 2015, at 9:44 AM, Robert Krüger 
> wrote:
> > >
> > > Hi Andrea,
> > >
> > > you are assuming correctly. It is a local, non-distributed index that
> is
> > > only accessed by the containing desktop application. Do you know if
> there
> > > is a possibility to run the Solr admin UI on top of an embedded
> instance
> > > somehow?
> > >
> > > Thanks a lot,
> > >
> > > Robert
> > >
> > > On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini <
> a.gazzar...@gmail.com
> > >
> > > wrote:
> > >
> > >> Hi Robert,
> > >> I've used the EmbeddedSolrServer in a scenario like that and I never
> had
> > >> problems.
> > >> I assume you're talking about a standalone application, where the
> whole
> > >> index resides locally and you don't need any cluster / cloud /
> > distributed
> > >> feature.
> > >>
> > >> I think the usage of EmbeddedSolrServer is discouraged in a
> > (distributed)
> > >> service scenario, because it is a direct connection to a SolrCore
> > >> instance...but this is not a problem in the situation you described
> (as
> > far
> > >> as I know)
> > >>
> > >> Best,
> > >> Andrea
> > >>
> > >>
> > >> On 01/15/2015 03:10 PM, Robert Krüger wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I have been using an embedded instance of solr in my desktop
> > application
> > >>> for a long time and it works fine. At the time when I made that
> > decision
> > >>> (vs. firing up a solr web application within my swing application) I
> > got
> > >>> the impression embedded use is somewhat unsupported and I should
> expect
> > >>> problems.
> > >>>
> > >>> My first question is, is this still the case now (4 years later),
> that
> > >>> embedded solr is discouraged?
> > >>>
> > >>> The one limitation I am running into is that I cannot use the solr
> > admin
> > >>> UI
> > >>> for debugging purposes (mainly for running queries). Is there any
> other
> > >>> way
> > >>> to do this other than no longer using embedded solr and
> > programmatically
> > >>> firing up a web application (e.g. using jetty)? Should I do the
> latter
> > >>> anyway?
> > >>>
> > >>> Any insights/advice greatly appreciated.
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Robert
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > Robert Krüger
> > > Managing Partner
> > > Lesspain GmbH & Co. KG
> > >
> > > www.lesspain-software.com
> >
> >
>
>
> --
> Robert Krüger
> Managing Partner
> Lesspain GmbH & Co. KG
>
> www.lesspain-software.com
>
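
For completeness, a minimal sketch of the EmbeddedSolrServer route discussed
in this thread, against the SolrJ 4.x API. The solr home path and core name
are placeholders, not values from the thread; the solr home must contain a
solr.xml plus a core directory with its conf/:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
  public static void main(String[] args) throws Exception {
    // Boot a core container straight from a solr home directory on disk.
    CoreContainer container = new CoreContainer("/path/to/solr-home");
    container.load();
    EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
    QueryResponse rsp = server.query(new SolrQuery("*:*"));
    System.out.println("hits: " + rsp.getResults().getNumFound());
    server.shutdown();  // also shuts down the enclosing CoreContainer
  }
}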


Re: Solr groups not matching with terms in a field

2015-01-16 Thread Ahmet Arslan
Hi,

That's a different problem: speeding up faceting.
Faceting is used all over the place and it is fast. I suggest you look for
faceting improvements.

Ahmet



On Friday, January 16, 2015 11:17 AM, Naresh Yadav  wrote:
I tried faceting also, but it did not work smoothly for me. The case I
mentioned in the email is a dummy one; my actual index has
12 lakh docs and is 2 GB in size on a single machine. Each tenant_pool field
value has 20-30 tokens.
Getting all terms in tenant_pool is fast (a matter of seconds), but when I go
the facet route after applying filter criteria it is very slow, because
it reads the whole field from disk and I am only interested in the terms.


On Fri, Jan 16, 2015 at 1:48 PM, Ahmet Arslan 
wrote:

> Hi Naresh,
>
> Yup terms component does not respect q or fq parameter.
> Luckily, thats easy with facet component. Example :
> facet=true&facet.field=tenant_pool&q=type:1
>
> Please see more here :
> https://cwiki.apache.org/confluence/display/solr/Faceting
>
> happy faceting,
> ahmet
>
>
>
> On Friday, January 16, 2015 10:13 AM, Naresh Yadav 
> wrote:
> Hi ahmet,
>
> Thanks, now i understand better, i will not try my usecase with grouping.
> Actually i am interested in unique terms in a field i.e tenant_pool. That i
> get perfectly with http://www.imagesup.net/?di=614212438580
>
> But i am not able to get terms after applying some filter say "type":"1".
> That is I need unique terms in "tenant_pool" field for "type":"1" query and
> answer will be P1, L1.
> Please suggest me if i can get this with out reading each doc from disk.
>
>
> On Fri, Jan 16, 2015 at 1:28 PM, Ahmet Arslan 
> wrote:
>
> > Hi Naresh,
> >
> > I have never grouped on a tokenised field and I am not sure it makes
> sense
> > to do so.
> >
> > Reading back ref-guide it says this about group.field parameter
> >
> > "The name of the field by which to group results. The field must be
> > single-valued, and either be indexed or a field type that has a value
> > source and works in a function query, such as ExternalFileField. It must
> > also be a string-based field, such as StrField or TextField"
> >
> >
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >
> > Therefore, it should be single valued. P.S. Don't get confused with
> > TextField type, for example it could create single token when used with
> > keyword tokenizer.
> >
> > Ahmet
> >
> > On Friday, January 16, 2015 4:43 AM, Naresh Yadav 
> > wrote:
> > Hi ahmet,
> >
> > If you observe output ngroups is 1 and returning only one group P1.
> > But my expectation is it should return three groups P1, L1, L2 as my
> > field is tokenized with space.
> >
> > Please correct me if wrong?
> >
> >
> > On 1/15/15, Ahmet Arslan  wrote:
> > >
> > >
> > > Hi Naresh,
> > >
> > > Everything looks correct, what is the problem here?
> > >
> > > If you want to see more than one document per group, there is a
> parameter
> > > for that which defaults to 1.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Thursday, January 15, 2015 9:02 AM, Naresh Yadav <
> > nyadav@gmail.com>
> > > wrote:
> > > Hi all,
> > >
> > > I had done following configuration to test Solr grouping concept.
> > >
> > > solr version :  4.6.1 (tried in latest version 4.10.3 also)
> > > Schema : http://www.imagesup.net/?di=10142124357616
> > > Solrj code to insert docs :
> > http://www.imagesup.net/?di=10142124381116
> > > Response Group's :  http://www.imagesup.net/?di=1114212438351
> > > Response Terms' : http://www.imagesup.net/?di=614212438580
> > >
> > > Please let me know if am i doing something wrong her
> >
>


Re: How to select the correct number of Shards in SolrCloud

2015-01-16 Thread Shawn Heisey
On 1/15/2015 10:58 PM, Manohar Sripada wrote:
> The reason I have created 64 Shards is there are 4 CPU cores on each VM;
> while querying I can make use of all the CPU cores. On an average, Solr
> QTime is around 500ms here.
> 
> Last time to my other discussion, Erick suggested that I might be over
> sharding, So, I tried reducing the number of shards to 32 and then 16. To
> my surprise, it started performing better. It came down to 300 ms (for 32
> shards) and 100 ms (for 16 shards). I haven't tested with filters and
> facets yet here. But, the simple search queries had shown lot of
> improvement.
> 
> So, how come the less number of shards performing better?? Is it because
> there are less number of posting lists to search on OR less merges that are
> happening? And how to determine the correct number of shards?

Daniel has replied with good information.

One additional problem I can think of when there are too many shards: If
your Solr server is busy enough to have any possibility of simultaneous
requests, then you will find that it's NOT a good idea to create enough
shards to use all your CPU cores.  In that situation, when you do a
single query, all your CPU cores will be in use.  When multiple queries
happen at the same time, they have to share the available CPU resources,
slowing them down.  With a smaller number of shards, the additional CPU
cores can handle simultaneous queries.

I have an index with nearly 100 million documents.  I've divided it into
six large cold shards and one very small hot shard.  It's not SolrCloud.
 I put three large shards on each of two servers, and the small shard on
one of those two servers.  The distributed query normally happens on the
server without the small shard.  Each server has 8 CPU cores and 64GB of
RAM.  Solr requires a 6GB heap.

My median QTime over the last 231836 queries is 25 milliseconds and my
95th percentile QTime is 376 milliseconds.  My query rate is pretty low
- I've never seen Solr's statistics for the 15 minute query rate go
above a single digit per second.

Thanks,
Shawn



Re: Solr groups not matching with terms in a field

2015-01-16 Thread Naresh Yadav
I tried faceting also, but it did not work smoothly for me. The case I
mentioned in the email is a dummy one; my actual index has
12 lakh docs and is 2 GB in size on a single machine. Each tenant_pool field
value has 20-30 tokens.
Getting all terms in tenant_pool is fast (a matter of seconds), but when I go
the facet route after applying filter criteria it is very slow, because
it reads the whole field from disk and I am only interested in the terms.

On Fri, Jan 16, 2015 at 1:48 PM, Ahmet Arslan 
wrote:

> Hi Naresh,
>
> Yup terms component does not respect q or fq parameter.
> Luckily, thats easy with facet component. Example :
> facet=true&facet.field=tenant_pool&q=type:1
>
> Please see more here :
> https://cwiki.apache.org/confluence/display/solr/Faceting
>
> happy faceting,
> ahmet
>
>
>
> On Friday, January 16, 2015 10:13 AM, Naresh Yadav 
> wrote:
> Hi ahmet,
>
> Thanks, now i understand better, i will not try my usecase with grouping.
> Actually i am interested in unique terms in a field i.e tenant_pool. That i
> get perfectly with http://www.imagesup.net/?di=614212438580
>
> But i am not able to get terms after applying some filter say "type":"1".
> That is I need unique terms in "tenant_pool" field for "type":"1" query and
> answer will be P1, L1.
> Please suggest me if i can get this with out reading each doc from disk.
>
>
> On Fri, Jan 16, 2015 at 1:28 PM, Ahmet Arslan 
> wrote:
>
> > Hi Naresh,
> >
> > I have never grouped on a tokenised field and I am not sure it makes
> sense
> > to do so.
> >
> > Reading back ref-guide it says this about group.field parameter
> >
> > "The name of the field by which to group results. The field must be
> > single-valued, and either be indexed or a field type that has a value
> > source and works in a function query, such as ExternalFileField. It must
> > also be a string-based field, such as StrField or TextField"
> >
> >
> > https://cwiki.apache.org/confluence/display/solr/Result+Grouping
> >
> > Therefore, it should be single valued. P.S. Don't get confused with
> > TextField type, for example it could create single token when used with
> > keyword tokenizer.
> >
> > Ahmet
> >
> > On Friday, January 16, 2015 4:43 AM, Naresh Yadav 
> > wrote:
> > Hi ahmet,
> >
> > If you observe output ngroups is 1 and returning only one group P1.
> > But my expectation is it should return three groups P1, L1, L2 as my
> > field is tokenized with space.
> >
> > Please correct me if wrong?
> >
> >
> > On 1/15/15, Ahmet Arslan  wrote:
> > >
> > >
> > > Hi Naresh,
> > >
> > > Everything looks correct, what is the problem here?
> > >
> > > If you want to see more than one document per group, there is a
> parameter
> > > for that which defaults to 1.
> > >
> > > Ahmet
> > >
> > >
> > >
> > > On Thursday, January 15, 2015 9:02 AM, Naresh Yadav <
> > nyadav@gmail.com>
> > > wrote:
> > > Hi all,
> > >
> > > I had done following configuration to test Solr grouping concept.
> > >
> > > solr version :  4.6.1 (tried in latest version 4.10.3 also)
> > > Schema : http://www.imagesup.net/?di=10142124357616
> > > Solrj code to insert docs :
> > http://www.imagesup.net/?di=10142124381116
> > > Response Group's :  http://www.imagesup.net/?di=1114212438351
> > > Response Terms' : http://www.imagesup.net/?di=614212438580
> > >
> > > Please let me know if am i doing something wrong her
> >
>


Re: How to select the correct number of Shards in SolrCloud

2015-01-16 Thread Daniel Collins
Sharding a query lets you parallelize the actual index-querying part of
the search. But remember that as soon as you spread the query out more, you
also need to bring all 64 results sets back together and consolidate them
into a single result set for the end user.  At some point, the gain of
being able to search the data quicker is outweighed by the cost of this
consolidation activity.

One other point to mention, which we noticed as a by-product of some
large-scale sharding we were testing (256 shards, no caches, whole
different kettle of fish!).

The resulting query is only as fast as the slowest shard.  If you have 64
shards, and 8 shards/cores per machine, how many JVMs are you running per
machine?  If you have a single JVM with 8 cores in it, then remember as
soon as that JVM enters a GC cycle, all those 8 cores will stall
processing.  If you have a query and it needs to get results from 64 cores,
if 63 return in 100ms but the last core is in GC pause and takes 500ms,
your query will take just over 500ms.

With respect to sharding, I would never start with a large number of shards
(and 64 is reasonably large in Solr terms). You might be able to get away
without sharding at all; if that meets your latency requirements, then why
bother with the complexity of sharding? Use those extra CPUs for
processing more QPS instead of making a single query faster.

Lastly, you mentioned you allocated 32Gb to "solr", do you mean to the JVM
heap?  That's quite a lot of a 64Gb machine, you haven't left much for the
page cache.  The general rule for Solr is to make the JVM heap as small as
you can get away with, to let the OS page cache (which is needed to cache
all the index files) with as much memory as possible.

On 16 January 2015 at 05:58, Manohar Sripada  wrote:

> Hi All,
>
> My Setup is as follows. There are 16 nodes in my SolrCloud and 4 CPU cores
> on each Solr Node VM. Each having 64 GB of RAM, out of which I have
> allocated 32 GB to Solr. I have a collection which contains around 100
> million Docs, which I created with 64 shards, replication factor 2, and 8
> shards per node. Each shard is getting around 1.6 Million Documents.
>
> The reason I have created 64 Shards is there are 4 CPU cores on each VM;
> while querying I can make use of all the CPU cores. On an average, Solr
> QTime is around 500ms here.
>
> Last time to my other discussion, Erick suggested that I might be over
> sharding, So, I tried reducing the number of shards to 32 and then 16. To
> my surprise, it started performing better. It came down to 300 ms (for 32
> shards) and 100 ms (for 16 shards). I haven't tested with filters and
> facets yet here. But, the simple search queries had shown lot of
> improvement.
>
> So, how come the less number of shards performing better?? Is it because
> there are less number of posting lists to search on OR less merges that are
> happening? And how to determine the correct number of shards?
>
> Thanks,
> Manohar
>


Re: OutOfMemoryError for PDF document upload into Solr

2015-01-16 Thread Siegfried Goeschl

Hi Dan,

neat idea - made a mental note :-)

That brings us back to the point that in complex setups you should not
do the document pre-processing directly in SOLR but have an import
process which can safely crash when processing a 4 GB PDF file.


Cheers,

Siegfried Goeschl

On 16.01.15 05:02, Dan Davis wrote:

Why re-write all the document conversion in Java ;)  Tika is very slow.   5
GB PDF is very big.

If you have a lot of PDFs like that, try pdftotext in HTML and UTF-8 output
mode.   The HTML mode captures some metadata that would otherwise be lost.

If you need to go faster still, you can also write some stuff linked
directly against the poppler library.

Before you jump down my throat about Tika being slow - I wrote a PDF
indexer that ran at 36 MB/s per core.   Different indexer, all C, lots of
setjmp/longjmp.   But fast...



On Thu, Jan 15, 2015 at 1:54 PM,  wrote:


Siegfried and Michael, thank you for your replies and help.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, January 15, 2015 3:45 AM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemoryError for PDF document upload into Solr

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very
likely consume A LOT OF memory - I think you need to check if that large
PDF can be parsed at all :-)

Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:

Yep, you'll have to increase the heap size for your Tomcat container.

http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
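
For Tomcat 7 that typically means setting CATALINA_OPTS, for example in a
bin/setenv.sh that catalina.sh picks up automatically on startup; the sizes
here are purely illustrative, not a recommendation:

# bin/setenv.sh
export CATALINA_OPTS="-Xms2g -Xmx8g"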

Michael Della Bitta

Senior Software Engineer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions

w: appinions.com 

On Wed, Jan 14, 2015 at 12:00 PM,  wrote:


Hello,

Can someone pass on hints to get around the following error? Is there
any heap size parameter I can set in Tomcat, or in the Solr webapp that
gets deployed?

I am running the Solr webapp inside Tomcat on my local machine, which has
12 GB of RAM. I have a PDF document, 4 GB max in size, that
needs to be loaded into Solr.




Exception in thread "http-apr-8983-exec-6" java.lang.: Java heap

space

  at java.util.AbstractCollection.toArray(Unknown Source)
  at java.util.ArrayList.(Unknown Source)
  at
org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
  at

org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)

  at

org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)

  at

org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)

  at

org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)

  at

org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)

  at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
  at


org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)

  at


org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)

  at


org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

  at


org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)

  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
  at


org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)

  at


org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)

  at


org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)

  at


org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)

  at


org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)

  at


org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)

  at


org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)

  at


org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)

  at


org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)

  at


org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)

  at


org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)

  at


org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)

  at


org.apache.coyote.http1
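
Tying the two suggestions above together, here is a sketch of such an
out-of-process import: the conversion runs in a child pdftotext process, so a
pathological PDF can crash or exhaust memory in that process alone, and the
extracted text is then sent to Solr over HTTP with SolrJ. The core URL and
the field names are assumptions, and pdftotext must be on the PATH:

import java.io.File;
import java.nio.file.Files;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ExternalPdfImporter {
  public static void main(String[] args) throws Exception {
    File pdf = new File(args[0]);
    File txt = File.createTempFile("extracted-", ".txt");
    // Run the conversion outside our JVM; neither the importer nor Solr
    // is harmed if pdftotext dies on a huge or broken PDF.
    Process p = new ProcessBuilder("pdftotext", "-enc", "UTF-8",
        pdf.getPath(), txt.getPath()).inheritIO().start();
    if (p.waitFor() != 0) {
      System.err.println("conversion failed: " + pdf);
      return;
    }
    String body = new String(Files.readAllBytes(txt.toPath()), "UTF-8");
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", pdf.getName());  // assumed unique-key field
    doc.addField("text", body);         // assumed indexed text field
    solr.add(doc);
    solr.commit();
    solr.shutdown();
  }
}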

Re: Solr groups not matching with terms in a field

2015-01-16 Thread Ahmet Arslan
Hi Naresh,

Yup, the terms component does not respect the q or fq parameters.
Luckily, that's easy with the facet component. Example:
facet=true&facet.field=tenant_pool&q=type:1

Please see more here : https://cwiki.apache.org/confluence/display/solr/Faceting
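
Spelled out as a full request, with rows=0 and facet.limit=-1 added so that
only the complete term list comes back (host, port, and core name are
assumptions):

http://localhost:8983/solr/collection1/select?q=type:1&rows=0&facet=true&facet.field=tenant_pool&facet.limit=-1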

happy faceting,
ahmet



On Friday, January 16, 2015 10:13 AM, Naresh Yadav  wrote:
Hi Ahmet,

Thanks, now I understand better; I will not try my use case with grouping.
Actually I am interested in the unique terms in a field, i.e. tenant_pool.
Those I get perfectly with http://www.imagesup.net/?di=614212438580

But I am not able to get the terms after applying some filter, say "type":"1".
That is, I need the unique terms in the "tenant_pool" field for a "type":"1"
query, and the answer should be P1, L1.
Please suggest how I can get this without reading each doc from disk.


On Fri, Jan 16, 2015 at 1:28 PM, Ahmet Arslan 
wrote:

> Hi Naresh,
>
> I have never grouped on a tokenised field and I am not sure it makes sense
> to do so.
>
> Reading back ref-guide it says this about group.field parameter
>
> "The name of the field by which to group results. The field must be
> single-valued, and either be indexed or a field type that has a value
> source and works in a function query, such as ExternalFileField. It must
> also be a string-based field, such as StrField or TextField"
>
>
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>
> Therefore, it should be single valued. P.S. Don't get confused with
> TextField type, for example it could create single token when used with
> keyword tokenizer.
>
> Ahmet
>
> On Friday, January 16, 2015 4:43 AM, Naresh Yadav 
> wrote:
> Hi ahmet,
>
> If you observe output ngroups is 1 and returning only one group P1.
> But my expectation is it should return three groups P1, L1, L2 as my
> field is tokenized with space.
>
> Please correct me if wrong?
>
>
> On 1/15/15, Ahmet Arslan  wrote:
> >
> >
> > Hi Naresh,
> >
> > Everything looks correct, what is the problem here?
> >
> > If you want to see more than one document per group, there is a parameter
> > for that which defaults to 1.
> >
> > Ahmet
> >
> >
> >
> > On Thursday, January 15, 2015 9:02 AM, Naresh Yadav <
> nyadav@gmail.com>
> > wrote:
> > Hi all,
> >
> > I had done following configuration to test Solr grouping concept.
> >
> > solr version :  4.6.1 (tried in latest version 4.10.3 also)
> > Schema : http://www.imagesup.net/?di=10142124357616
> > Solrj code to insert docs :
> http://www.imagesup.net/?di=10142124381116
> > Response Group's :  http://www.imagesup.net/?di=1114212438351
> > Response Terms' : http://www.imagesup.net/?di=614212438580
> >
> > Please let me know if am i doing something wrong her
>


Re: Solr groups not matching with terms in a field

2015-01-16 Thread Naresh Yadav
Hi Ahmet,

Thanks, now I understand better; I will not try my use case with grouping.
Actually I am interested in the unique terms in a field, i.e. tenant_pool.
Those I get perfectly with http://www.imagesup.net/?di=614212438580

But I am not able to get the terms after applying some filter, say "type":"1".
That is, I need the unique terms in the "tenant_pool" field for a "type":"1"
query, and the answer should be P1, L1.
Please suggest how I can get this without reading each doc from disk.

On Fri, Jan 16, 2015 at 1:28 PM, Ahmet Arslan 
wrote:

> Hi Naresh,
>
> I have never grouped on a tokenised field and I am not sure it makes sense
> to do so.
>
> Reading back ref-guide it says this about group.field parameter
>
> "The name of the field by which to group results. The field must be
> single-valued, and either be indexed or a field type that has a value
> source and works in a function query, such as ExternalFileField. It must
> also be a string-based field, such as StrField or TextField"
>
>
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping
>
> Therefore, it should be single valued. P.S. Don't get confused with
> TextField type, for example it could create single token when used with
> keyword tokenizer.
>
> Ahmet
>
> On Friday, January 16, 2015 4:43 AM, Naresh Yadav 
> wrote:
> Hi ahmet,
>
> If you observe output ngroups is 1 and returning only one group P1.
> But my expectation is it should return three groups P1, L1, L2 as my
> field is tokenized with space.
>
> Please correct me if wrong?
>
>
> On 1/15/15, Ahmet Arslan  wrote:
> >
> >
> > Hi Naresh,
> >
> > Everything looks correct, what is the problem here?
> >
> > If you want to see more than one document per group, there is a parameter
> > for that which defaults to 1.
> >
> > Ahmet
> >
> >
> >
> > On Thursday, January 15, 2015 9:02 AM, Naresh Yadav <
> nyadav@gmail.com>
> > wrote:
> > Hi all,
> >
> > I had done following configuration to test Solr grouping concept.
> >
> > solr version :  4.6.1 (tried in latest version 4.10.3 also)
> > Schema : http://www.imagesup.net/?di=10142124357616
> > Solrj code to insert docs :
> http://www.imagesup.net/?di=10142124381116
> > Response Group's :  http://www.imagesup.net/?di=1114212438351
> > Response Terms' : http://www.imagesup.net/?di=614212438580
> >
> > Please let me know if am i doing something wrong her
>