Is the filter cache separate for each host and then for each collection and then for each shard and then for each replica in SolrCloud?

2017-06-01 Thread Daniel Angelov
Is the filter cache separate for each host and then for each collection and
then for each shard and then for each replica in SolrCloud?
For example, on host1 we have coll1 shard1 replica1 and coll2 shard1
replica1; on host2 we have coll1 shard2 replica2 and coll2 shard2
replica2. Does this mean that we have 4 filter caches, i.e. separate
memory for each core?
If they are separate and, for example, query1 is handled by coll1 shard1
replica1 and 1 sec later the same query is handled by coll2 shard1
replica1, this means that the later query will not use the result set
cached by the first query...

BR
Daniel
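
For reference, the filterCache is declared in solrconfig.xml, which Solr loads
once per core; a minimal sketch of the stock declaration (the sizes shown are
illustrative defaults, not a recommendation):

  <!-- solrconfig.xml: loaded per core, so every replica gets its own cache -->
  <query>
    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
  </query>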


Re: Performance Issue in Streaming Expressions

2017-06-01 Thread Susmit Shukla
Hi,

Which version of Solr are you on?
Increasing memory may not be useful, as the streaming API does not keep stuff in
memory (except maybe hash joins).
Increasing replicas (not sharding) and pushing the join computation onto a
worker Solr cluster with #workers > 1 would definitely make things faster.
Are you limiting your results at some cutoff? If yes, then SOLR-10698 can be a
useful fix. Also, the binary response format for streaming would be faster
(available in 6.5, probably).
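
As a hedged sketch of the workers suggestion (the collection, field and worker
names here are hypothetical, not taken from the original setup), the expensive
join can be wrapped in parallel() so it is partitioned across N workers, with
partitionKeys set on the /export searches:

  parallel(workerCollection,
    innerJoin(
      search(collection1, q="*:*", fl="id,joinKey", sort="joinKey asc",
             qt="/export", partitionKeys="joinKey"),
      search(collection4, q="*:*", fl="joinKey,amount", sort="joinKey asc",
             qt="/export", partitionKeys="joinKey"),
      on="joinKey"),
    workers="4",
    sort="joinKey asc")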



On Thu, Jun 1, 2017 at 3:04 PM, thiaga rajan <
ecethiagu2...@yahoo.co.in.invalid> wrote:

> We are working on a proposal and feel that the streaming API along with the
> export handler will best fit our use cases. We already have a structure in
> Solr in which we are using graph queries to produce a hierarchical
> structure. Now from that structure we need to join a couple more
> collections. We have 5 different collections:
>   Collection 1 - 800k records
>   Collection 2 - 200k records
>   Collection 3 - 7k records
>   Collection 4 - 6 million records
>   Collection 5 - 150k records
> We are using the below strategy:
>   innerJoin( intersect( innerJoin(collection 1, collection 2),
>   innerJoin(Collection 3, Collection 4)), collection 5)
> We are seeing that performance is too slow once we include collection 4.
> Just with collections 1, 2 and 5 the results come back in 2 secs. The
> moment I include collection 4 in the query I see a performance impact. I
> believe exporting large results from collection 4 is causing the issue.
> Currently I am using single-sharded collections with no replicas. I am
> thinking of increasing the memory as a first option to improve performance,
> since processing docValues needs more memory. If that does not work I can
> check using parallel streams / sharding. Kindly advise if there could be
> anything else I am missing.
> Sent from Yahoo Mail on Android


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Nutch was built for that, but it is a pain to use. I’m still sad that I 
couldn’t get Mike Lynch to open source Ultraseek. So easy and much more 
powerful than Nutch.

Ignoring robots.txt is often a bad idea. You may get into a REST API or into a 
calendar that generates an unending number of valid, different pages. Or the 
combinatorial explosion of diffs between revisions of a wiki page. Those are 
really fun.

There are some web servers that put a session ID in the path, so you get an 
endless set of URLs for the exact same page. We called those a “black hole” 
because it sucked spiders in and never let them out.

The comments in the Wikipedia robots.txt are instructive. For example, they 
allow access to the documentation for the REST API (Allow: /api/rest_v1/?doc), 
then disallow the other paths in the API (Disallow: /api).

https://en.wikipedia.org/robots.txt 
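
As an illustrative excerpt (not the full file), that pattern looks like:

  User-agent: *
  Allow: /api/rest_v1/?doc
  Disallow: /api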

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 4:58 PM, Mike Drob  wrote:
> 
> Isn't this exactly what Apache Nutch was built for?
> 
> On Thu, Jun 1, 2017 at 6:56 PM, David Choi  wrote:
> 
>> In any case after digging further I have found where it checks for
>> robots.txt. Thanks!
>> 
>> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
>> wrote:
>> 
>>> Which was exactly what I suggested.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
 
 In the mean time I have found a better solution at the moment is to
>> test
>>> on
 a site that allows users to crawl their site.
 
 On Thu, Jun 1, 2017 at 5:26 PM David Choi 
>>> wrote:
 
> I think you misunderstand the argument was about stealing content.
>> Sorry
> but I think you need to read what people write before making bold
> statements.
> 
> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
>> wun...@wunderwood.org>
> wrote:
> 
>> Let’s not get snarky right away, especially when you are wrong.
>> 
>> Corporations do not generally ignore robots.txt. I worked on a
>>> commercial
>> web spider for ten years. Occasionally, our customers did need to
>>> bypass
>> portions of robots.txt. That was usually because of a
>>> poorly-maintained web
>> server, or because our spider could safely crawl some content that
>>> would
>> cause problems for other crawlers.
>> 
>> If you want to learn crawling, don’t start by breaking the
>> conventions
>>> of
>> good web citizenship. Instead, start with sitemap.xml and crawl the
>> preferred portions of a site.
>> 
>> https://www.sitemaps.org/index.html <
>>> https://www.sitemaps.org/index.html>
>> 
>> If the site blocks you, find a different site to learn on.
>> 
>> I like the looks of “Scrapy”, written in Python. I haven’t used it
>> for
>> anything big, but I’d start with that for learning.
>> 
>> https://scrapy.org/ 
>> 
>> If you want to learn on a site with a lot of content, try ours,
>>> chegg.com
>> But if your crawler gets out of hand, crawling too fast, we’ll block
>>> it.
>> Any other site will do the same.
>> 
>> I would not base the crawler directly on Solr. A crawler needs a
>> dedicated database to record the URLs visited, errors, duplicates,
>>> etc. The
>> output of the crawl goes to Solr. That is how we did it with
>> Ultraseek
>> (before Solr existed).
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jun 1, 2017, at 3:01 PM, David Choi 
>>> wrote:
>>> 
>>> Oh well I guess its ok if a corporation does it but not someone
>>> wanting
>> to
>>> learn more about the field. I actually have written a crawler before
>>> as
>>> well as the you know Inverted Index of how solr works but I just
>>> thought
>>> its architecture was better suited for scaling.
>>> 
>>> On Thu, Jun 1, 2017 at 4:47 PM Dave 
>> wrote:
>>> 
 And I mean that in the context of stealing content from sites that
 explicitly declare they don't want to be crawled. Robots.txt is to
>> be
 followed.
 
> On Jun 1, 2017, at 5:31 PM, David Choi 
>> wrote:
> 
> Hello,
> 
> I was wondering if anyone could guide me on how to crawl the web
>> and
> ignore the robots.txt since I can not index some big sites. Or if
>> someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Mike Drob
Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi  wrote:

> In any case after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> > >
> > > In the mean time I have found a better solution at the moment is to
> test
> > on
> > > a site that allows users to crawl their site.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> > wrote:
> > >
> > >> I think you misunderstand the argument was about stealing content.
> Sorry
> > >> but I think you need to read what people write before making bold
> > >> statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
> wun...@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a
> > commercial
> > >>> web spider for ten years. Occasionally, our customers did need to
> > bypass
> > >>> portions of robots.txt. That was usually because of a
> > poorly-maintained web
> > >>> server, or because our spider could safely crawl some content that
> > would
> > >>> cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the
> conventions
> > of
> > >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> > >>> preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html <
> > https://www.sitemaps.org/index.html>
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it
> for
> > >>> anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/ 
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours,
> > chegg.com
> > >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> > it.
> > >>> Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates,
> > etc. The
> > >>> output of the crawl goes to Solr. That is how we did it with
> Ultraseek
> > >>> (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>
> >  On Jun 1, 2017, at 3:01 PM, David Choi 
> > wrote:
> > 
> >  Oh well I guess its ok if a corporation does it but not someone
> > wanting
> > >>> to
> >  learn more about the field. I actually have written a crawler before
> > as
> >  well as the you know Inverted Index of how solr works but I just
> > thought
> >  its architecture was better suited for scaling.
> > 
> >  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> > >>> wrote:
> > 
> > > And I mean that in the context of stealing content from sites that
> > > explicitly declare they don't want to be crawled. Robots.txt is to
> be
> > > followed.
> > >
> > >> On Jun 1, 2017, at 5:31 PM, David Choi 
> > >>> wrote:
> > >>
> > >> Hello,
> > >>
> > >> I was wondering if anyone could guide me on how to crawl the web
> and
> > >> ignore the robots.txt since I can not index some big sites. Or if
> > >>> someone
> > >> could point how to get around it. I read somewhere about a
> > >> protocol.plugin.check.robots
> > >> but that was for nutch.
> > >>
> > >> The way I index is
> > >> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>
> > >> but I can't index the site I'm guessing because of the robots.txt.
> > >> I can index with
> > >> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>
> > >> which I am guessing allows it. I was also wondering how to find
> the
> > >>> name
> > > of
> > >> the crawler bin/post uses.
> > >
> > >>>
> > >>>
> >
> >
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In any case after digging further I have found where it checks for
robots.txt. Thanks!

On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ 
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
>  On Jun 1, 2017, at 3:01 PM, David Choi 
> wrote:
> 
>  Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
>  learn more about the field. I actually have written a crawler before
> as
>  well as the you know Inverted Index of how solr works but I just
> thought
>  its architecture was better suited for scaling.
> 
>  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> >>> wrote:
> 
> > And I mean that in the context of stealing content from sites that
> > explicitly declare they don't want to be crawled. Robots.txt is to be
> > followed.
> >
> >> On Jun 1, 2017, at 5:31 PM, David Choi 
> >>> wrote:
> >>
> >> Hello,
> >>
> >> I was wondering if anyone could guide me on how to crawl the web and
> >> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >> could point how to get around it. I read somewhere about a
> >> protocol.plugin.check.robots
> >> but that was for nutch.
> >>
> >> The way I index is
> >> bin/post -c gettingstarted https://en.wikipedia.org/
> >>
> >> but I can't index the site I'm guessing because of the robots.txt.
> >> I can index with
> >> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>
> >> which I am guessing allows it. I was also wondering how to find the
> >>> name
> > of
> >> the crawler bin/post uses.
> >
> >>>
> >>>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Doug Turnbull
Scrapy is fantastic, and I use it to scrape search results pages for clients to
take quality snapshots for relevance work.

Ignoring robots.txt sometimes legitimately comes up because a staging site might
be telling Google not to crawl, while nobody cares about a developer crawling it
for internal purposes.

Doug
On Thu, Jun 1, 2017 at 6:34 PM Walter Underwood 
wrote:

> Which was exactly what I suggested.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> >
> > In the mean time I have found a better solution at the moment is to test
> on
> > a site that allows users to crawl their site.
> >
> > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> wrote:
> >
> >> I think you misunderstand the argument was about stealing content. Sorry
> >> but I think you need to read what people write before making bold
> >> statements.
> >>
> >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Let’s not get snarky right away, especially when you are wrong.
> >>>
> >>> Corporations do not generally ignore robots.txt. I worked on a
> commercial
> >>> web spider for ten years. Occasionally, our customers did need to
> bypass
> >>> portions of robots.txt. That was usually because of a
> poorly-maintained web
> >>> server, or because our spider could safely crawl some content that
> would
> >>> cause problems for other crawlers.
> >>>
> >>> If you want to learn crawling, don’t start by breaking the conventions
> of
> >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> >>> preferred portions of a site.
> >>>
> >>> https://www.sitemaps.org/index.html <
> https://www.sitemaps.org/index.html>
> >>>
> >>> If the site blocks you, find a different site to learn on.
> >>>
> >>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> >>> anything big, but I’d start with that for learning.
> >>>
> >>> https://scrapy.org/ 
> >>>
> >>> If you want to learn on a site with a lot of content, try ours,
> chegg.com
> >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> it.
> >>> Any other site will do the same.
> >>>
> >>> I would not base the crawler directly on Solr. A crawler needs a
> >>> dedicated database to record the URLs visited, errors, duplicates,
> etc. The
> >>> output of the crawl goes to Solr. That is how we did it with Ultraseek
> >>> (before Solr existed).
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
> >>>
>  On Jun 1, 2017, at 3:01 PM, David Choi 
> wrote:
> 
>  Oh well I guess its ok if a corporation does it but not someone
> wanting
> >>> to
>  learn more about the field. I actually have written a crawler before
> as
>  well as the you know Inverted Index of how solr works but I just
> thought
>  its architecture was better suited for scaling.
> 
>  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> >>> wrote:
> 
> > And I mean that in the context of stealing content from sites that
> > explicitly declare they don't want to be crawled. Robots.txt is to be
> > followed.
> >
> >> On Jun 1, 2017, at 5:31 PM, David Choi 
> >>> wrote:
> >>
> >> Hello,
> >>
> >> I was wondering if anyone could guide me on how to crawl the web and
> >> ignore the robots.txt since I can not index some big sites. Or if
> >>> someone
> >> could point how to get around it. I read somewhere about a
> >> protocol.plugin.check.robots
> >> but that was for nutch.
> >>
> >> The way I index is
> >> bin/post -c gettingstarted https://en.wikipedia.org/
> >>
> >> but I can't index the site I'm guessing because of the robots.txt.
> >> I can index with
> >> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>
> >> which I am guessing allows it. I was also wondering how to find the
> >>> name
> > of
> >> the crawler bin/post uses.
> >
> >>>
> >>>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> 
> In the mean time I have found a better solution at the moment is to test on
> a site that allows users to crawl their site.
> 
> On Thu, Jun 1, 2017 at 5:26 PM David Choi  wrote:
> 
>> I think you misunderstand the argument was about stealing content. Sorry
>> but I think you need to read what people write before making bold
>> statements.
>> 
>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
>> wrote:
>> 
>>> Let’s not get snarky right away, especially when you are wrong.
>>> 
>>> Corporations do not generally ignore robots.txt. I worked on a commercial
>>> web spider for ten years. Occasionally, our customers did need to bypass
>>> portions of robots.txt. That was usually because of a poorly-maintained web
>>> server, or because our spider could safely crawl some content that would
>>> cause problems for other crawlers.
>>> 
>>> If you want to learn crawling, don’t start by breaking the conventions of
>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>> preferred portions of a site.
>>> 
>>> https://www.sitemaps.org/index.html 
>>> 
>>> If the site blocks you, find a different site to learn on.
>>> 
>>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>>> anything big, but I’d start with that for learning.
>>> 
>>> https://scrapy.org/ 
>>> 
>>> If you want to learn on a site with a lot of content, try ours, chegg.com
>>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>>> Any other site will do the same.
>>> 
>>> I would not base the crawler directly on Solr. A crawler needs a
>>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>> (before Solr existed).
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
 On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
 
 Oh well I guess its ok if a corporation does it but not someone wanting
>>> to
 learn more about the field. I actually have written a crawler before as
 well as the you know Inverted Index of how solr works but I just thought
 its architecture was better suited for scaling.
 
 On Thu, Jun 1, 2017 at 4:47 PM Dave 
>>> wrote:
 
> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
> 
>> On Jun 1, 2017, at 5:31 PM, David Choi 
>>> wrote:
>> 
>> Hello,
>> 
>> I was wondering if anyone could guide me on how to crawl the web and
>> ignore the robots.txt since I can not index some big sites. Or if
>>> someone
>> could point how to get around it. I read somewhere about a
>> protocol.plugin.check.robots
>> but that was for nutch.
>> 
>> The way I index is
>> bin/post -c gettingstarted https://en.wikipedia.org/
>> 
>> but I can't index the site I'm guessing because of the robots.txt.
>> I can index with
>> bin/post -c gettingstarted http://lucene.apache.org/solr
>> 
>> which I am guessing allows it. I was also wondering how to find the
>>> name
> of
>> the crawler bin/post uses.
> 
>>> 
>>> 



Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
In the meantime I have found that a better solution, for the moment, is to test
on a site that allows users to crawl it.

On Thu, Jun 1, 2017 at 5:26 PM David Choi  wrote:

> I think you misunderstand the argument was about stealing content. Sorry
> but I think you need to read what people write before making bold
> statements.
>
> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
> wrote:
>
>> Let’s not get snarky right away, especially when you are wrong.
>>
>> Corporations do not generally ignore robots.txt. I worked on a commercial
>> web spider for ten years. Occasionally, our customers did need to bypass
>> portions of robots.txt. That was usually because of a poorly-maintained web
>> server, or because our spider could safely crawl some content that would
>> cause problems for other crawlers.
>>
>> If you want to learn crawling, don’t start by breaking the conventions of
>> good web citizenship. Instead, start with sitemap.xml and crawl the
>> preferred portions of a site.
>>
>> https://www.sitemaps.org/index.html 
>>
>> If the site blocks you, find a different site to learn on.
>>
>> I like the looks of “Scrapy”, written in Python. I haven’t used it for
>> anything big, but I’d start with that for learning.
>>
>> https://scrapy.org/ 
>>
>> If you want to learn on a site with a lot of content, try ours, chegg.com
>> But if your crawler gets out of hand, crawling too fast, we’ll block it.
>> Any other site will do the same.
>>
>> I would not base the crawler directly on Solr. A crawler needs a
>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>> (before Solr existed).
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
>> >
>> > Oh well I guess its ok if a corporation does it but not someone wanting
>> to
>> > learn more about the field. I actually have written a crawler before as
>> > well as the you know Inverted Index of how solr works but I just thought
>> > its architecture was better suited for scaling.
>> >
>> > On Thu, Jun 1, 2017 at 4:47 PM Dave 
>> wrote:
>> >
>> >> And I mean that in the context of stealing content from sites that
>> >> explicitly declare they don't want to be crawled. Robots.txt is to be
>> >> followed.
>> >>
>> >>> On Jun 1, 2017, at 5:31 PM, David Choi 
>> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>>  I was wondering if anyone could guide me on how to crawl the web and
>> >>> ignore the robots.txt since I can not index some big sites. Or if
>> someone
>> >>> could point how to get around it. I read somewhere about a
>> >>> protocol.plugin.check.robots
>> >>> but that was for nutch.
>> >>>
>> >>> The way I index is
>> >>> bin/post -c gettingstarted https://en.wikipedia.org/
>> >>>
>> >>> but I can't index the site I'm guessing because of the robots.txt.
>> >>> I can index with
>> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
>> >>>
>> >>> which I am guessing allows it. I was also wondering how to find the
>> name
>> >> of
>> >>> the crawler bin/post uses.
>> >>
>>
>>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
I think you misunderstand; the argument was about stealing content. Sorry,
but I think you need to read what people write before making bold
statements.

On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood 
wrote:

> Let’s not get snarky right away, especially when you are wrong.
>
> Corporations do not generally ignore robots.txt. I worked on a commercial
> web spider for ten years. Occasionally, our customers did need to bypass
> portions of robots.txt. That was usually because of a poorly-maintained web
> server, or because our spider could safely crawl some content that would
> cause problems for other crawlers.
>
> If you want to learn crawling, don’t start by breaking the conventions of
> good web citizenship. Instead, start with sitemap.xml and crawl the
> preferred portions of a site.
>
> https://www.sitemaps.org/index.html 
>
> If the site blocks you, find a different site to learn on.
>
> I like the looks of “Scrapy”, written in Python. I haven’t used it for
> anything big, but I’d start with that for learning.
>
> https://scrapy.org/ 
>
> If you want to learn on a site with a lot of content, try ours, chegg.com
> But if your crawler gets out of hand, crawling too fast, we’ll block it.
> Any other site will do the same.
>
> I would not base the crawler directly on Solr. A crawler needs a dedicated
> database to record the URLs visited, errors, duplicates, etc. The output of
> the crawl goes to Solr. That is how we did it with Ultraseek (before Solr
> existed).
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
> >
> > Oh well I guess its ok if a corporation does it but not someone wanting
> to
> > learn more about the field. I actually have written a crawler before as
> > well as the you know Inverted Index of how solr works but I just thought
> > its architecture was better suited for scaling.
> >
> > On Thu, Jun 1, 2017 at 4:47 PM Dave 
> wrote:
> >
> >> And I mean that in the context of stealing content from sites that
> >> explicitly declare they don't want to be crawled. Robots.txt is to be
> >> followed.
> >>
> >>> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> >>>
> >>> Hello,
> >>>
> >>>  I was wondering if anyone could guide me on how to crawl the web and
> >>> ignore the robots.txt since I can not index some big sites. Or if
> someone
> >>> could point how to get around it. I read somewhere about a
> >>> protocol.plugin.check.robots
> >>> but that was for nutch.
> >>>
> >>> The way I index is
> >>> bin/post -c gettingstarted https://en.wikipedia.org/
> >>>
> >>> but I can't index the site I'm guessing because of the robots.txt.
> >>> I can index with
> >>> bin/post -c gettingstarted http://lucene.apache.org/solr
> >>>
> >>> which I am guessing allows it. I was also wondering how to find the
> name
> >> of
> >>> the crawler bin/post uses.
> >>
>
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Walter Underwood
Let’s not get snarky right away, especially when you are wrong.

Corporations do not generally ignore robots.txt. I worked on a commercial web 
spider for ten years. Occasionally, our customers did need to bypass portions 
of robots.txt. That was usually because of a poorly-maintained web server, or 
because our spider could safely crawl some content that would cause problems 
for other crawlers.

If you want to learn crawling, don’t start by breaking the conventions of good 
web citizenship. Instead, start with sitemap.xml and crawl the preferred 
portions of a site.

https://www.sitemaps.org/index.html 

If the site blocks you, find a different site to learn on.

I like the looks of “Scrapy”, written in Python. I haven’t used it for anything 
big, but I’d start with that for learning.

https://scrapy.org/ 

If you want to learn on a site with a lot of content, try ours, chegg.com. But 
if your crawler gets out of hand, crawling too fast, we’ll block it. Any other 
site will do the same.

I would not base the crawler directly on Solr. A crawler needs a dedicated 
database to record the URLs visited, errors, duplicates, etc. The output of the 
crawl goes to Solr. That is how we did it with Ultraseek (before Solr existed).
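
A minimal sketch of that separation, assuming Java 11+ and SolrJ (the class,
field names, seed URL and core name are hypothetical): the crawl state lives
outside Solr and only the extracted documents are sent to it. A real crawler
would back the frontier and visited set with a persistent database and honor
robots.txt.

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.ArrayDeque;
  import java.util.HashSet;
  import java.util.Queue;
  import java.util.Set;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class TinyCrawler {
      public static void main(String[] args) throws Exception {
          Queue<String> frontier = new ArrayDeque<>();   // URLs still to fetch
          Set<String> visited = new HashSet<>();         // crawl bookkeeping, kept out of Solr
          frontier.add("https://example.com/");          // hypothetical seed

          HttpClient http = HttpClient.newHttpClient();
          try (HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/gettingstarted").build()) {
              while (!frontier.isEmpty() && visited.size() < 100) {
                  String url = frontier.poll();
                  if (!visited.add(url)) continue;       // skip duplicates
                  HttpResponse<String> resp = http.send(
                          HttpRequest.newBuilder(URI.create(url)).GET().build(),
                          HttpResponse.BodyHandlers.ofString());
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", url);
                  doc.addField("content_txt", resp.body());
                  solr.add(doc);                         // only the crawl output goes to Solr
                  // link extraction, error tracking and robots.txt handling omitted
              }
              solr.commit();
          }
      }
  }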

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 1, 2017, at 3:01 PM, David Choi  wrote:
> 
> Oh well I guess its ok if a corporation does it but not someone wanting to
> learn more about the field. I actually have written a crawler before as
> well as the you know Inverted Index of how solr works but I just thought
> its architecture was better suited for scaling.
> 
> On Thu, Jun 1, 2017 at 4:47 PM Dave  wrote:
> 
>> And I mean that in the context of stealing content from sites that
>> explicitly declare they don't want to be crawled. Robots.txt is to be
>> followed.
>> 
>>> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
>>> 
>>> Hello,
>>> 
>>>  I was wondering if anyone could guide me on how to crawl the web and
>>> ignore the robots.txt since I can not index some big sites. Or if someone
>>> could point how to get around it. I read somewhere about a
>>> protocol.plugin.check.robots
>>> but that was for nutch.
>>> 
>>> The way I index is
>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>> 
>>> but I can't index the site I'm guessing because of the robots.txt.
>>> I can index with
>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>> 
>>> which I am guessing allows it. I was also wondering how to find the name
>> of
>>> the crawler bin/post uses.
>> 



Performance Issue in Streaming Expressions

2017-06-01 Thread thiaga rajan
We are working on a proposal and feel that the streaming API along with the export 
handler will best fit our use cases. We already have a structure in Solr in which 
we are using graph queries to produce a hierarchical structure. Now from that 
structure we need to join a couple more collections. We have 5 different collections:
  Collection 1 - 800k records
  Collection 2 - 200k records
  Collection 3 - 7k records
  Collection 4 - 6 million records
  Collection 5 - 150k records
We are using the below strategy:
  innerJoin( intersect( innerJoin(collection 1, collection 2),
  innerJoin(Collection 3, Collection 4)), collection 5)
We are seeing that performance is too slow once we include collection 4. Just with 
collections 1, 2 and 5 the results come back in 2 secs. The moment I include 
collection 4 in the query I see a performance impact. I believe exporting large 
results from collection 4 is causing the issue. Currently I am using single-sharded 
collections with no replicas. I am thinking of increasing the memory as a first 
option to improve performance, since processing docValues needs more memory. If 
that does not work I can check using parallel streams / sharding. Kindly advise 
if there could be anything else I am missing.
Sent from Yahoo Mail on Android

Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Oh well, I guess it's OK if a corporation does it but not someone wanting to
learn more about the field. I have actually written a crawler before, as well
as, you know, an inverted index like the one Solr uses, but I just thought its
architecture was better suited for scaling.

On Thu, Jun 1, 2017 at 4:47 PM Dave  wrote:

> And I mean that in the context of stealing content from sites that
> explicitly declare they don't want to be crawled. Robots.txt is to be
> followed.
>
> > On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> >
> > Hello,
> >
> >   I was wondering if anyone could guide me on how to crawl the web and
> > ignore the robots.txt since I can not index some big sites. Or if someone
> > could point how to get around it. I read somewhere about a
> > protocol.plugin.check.robots
> > but that was for nutch.
> >
> > The way I index is
> > bin/post -c gettingstarted https://en.wikipedia.org/
> >
> > but I can't index the site I'm guessing because of the robots.txt.
> > I can index with
> > bin/post -c gettingstarted http://lucene.apache.org/solr
> >
> > which I am guessing allows it. I was also wondering how to find the name
> of
> > the crawler bin/post uses.
>


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
And I mean that in the context of stealing content from sites that explicitly 
declare they don't want to be crawled. Robots.txt is to be followed. 

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Vivek Pathak
I can help.  We can chat in some freenode chatroom in an hour or so.  
Let me know where you hang out.


Thanks

Vivek


On 6/1/17 5:45 PM, Dave wrote:

If you are not capable of even writing your own indexing code, let alone 
crawler, I would prefer that you just stop now.  No one is going to help you 
with this request, at least I'd hope not.


On Jun 1, 2017, at 5:31 PM, David Choi  wrote:

Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt since I can not index some big sites. Or if someone
could point how to get around it. I read somewhere about a
protocol.plugin.check.robots
but that was for nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index the site I'm guessing because of the robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.




Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
If you are not capable of even writing your own indexing code, let alone a 
crawler, I would prefer that you just stop now.  No one is going to help you 
with this request, at least I'd hope not.

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Solr Web Crawler - Robots.txt

2017-06-01 Thread David Choi
Hello,

   I was wondering if anyone could guide me on how to crawl the web and
ignore the robots.txt, since I cannot index some big sites, or if someone
could point out how to get around it. I read somewhere about
protocol.plugin.check.robots, but that was for Nutch.

The way I index is
bin/post -c gettingstarted https://en.wikipedia.org/

but I can't index that site, I'm guessing because of its robots.txt.
I can index with
bin/post -c gettingstarted http://lucene.apache.org/solr

which I am guessing allows it. I was also wondering how to find the name of
the crawler bin/post uses.


Re: Solr query with more than one field

2017-06-01 Thread Chris Hostetter
: I could have sworn I was paraphrasing _your_ presentation Hoss. I
: guess I did not learn my lesson well enough.
: 
: Thank you for the correction.

Trust but verify! ... we're both wrong.

Boolean functions (like lt(), gt(), etc...) behave just like sum() -- they 
"exist" for a document if and only if all the args exist for the document 
-- and that's what matters for whether a '{!func}' query considers a 
document a match. (if the function "exists" then the query "matches")

I think my second suggestion of using frange is the only thing that works 
-- you have to explicitly use 'frange', which will only match a document if 
the function "exists" *AND* if the resulting value is in the range...

fq={!frange l=0}sub(value,cost)

What would be nice is the inverse of the "exists()" function ... something that 
returns true/false depending on whether the function it wraps "exists" for 
a document -- but is always a "match" for every doc.  We need *something* 
that can wrap a function that returns a boolean and only "exists" if the 
boolean is true; otherwise it's considered a non-exists/non-match for the doc.

then you could do:  fq={!func}something(gt(value,cost))

Or perhaps just a nomatch()/noexist() function that takes no args and does 
nothing but never exists/matches any doc.  Then you could do...

fq={!func}if(gt(value,cost),42,nomatch())

?

-Hoss
http://www.lucidworks.com/
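
A hedged, concrete form of that frange filter (the core name "mycollection" is
hypothetical; incl=false makes the lower bound strict, i.e. value strictly
greater than cost):

  curl http://localhost:8983/solr/mycollection/select \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'fq={!frange l=0 incl=false}sub(value,cost)'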


Replace a solr node which is using a block storage

2017-06-01 Thread Minu Theresa Thomas
Hi,

I am new to Solr. I have a use case to add a new node when an existing node
goes down. The new node, with a new IP, should contain all the replicas that
the previous node had. So I am using network storage (Cinder) on
which the data directory (where solr.xml and the core directories
reside) is created when a node starts up. The new node with a new
IP after the replacement will contain the same set of directories that the
old node had. I have noticed that the new node is added to the cluster without
the need for an ADDREPLICA.

Is this expected behavior in Solr? Does ZK still hold references to the
old node? What's the recommended solution if I want to re-use the data
directory associated with the old node while spinning up a new one? The
goal is to avoid data loss and to reduce the time taken to recover a node.

Thanks in advance!

-Minu


Re: Solr query with more than one field

2017-06-01 Thread Alexandre Rafalovitch
Bother,

I could have sworn I was paraphrasing _your_ presentation Hoss. I
guess I did not learn my lesson well enough.

Thank you for the correction.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 1 June 2017 at 15:26, Chris Hostetter  wrote:
>
> : Because the value of the function will be treated as a relevance value
> : and relevance value of 0 (and less?) will cause the record to be
> : filtered out.
>
> I don't believe that's true? ... IIRC 'fq' doesn't care what the scores
> are as long as the query is a "match" and a 'func' query will match as
> long as the function says it matches ... something like sub() should be a
> match as long sa both fields exist in the document.
>
> Pretty sure the simplest version of what you want is
> 'fq={!func}gt(value,cost)' .. of if you need more complex functions/rules
> you can use the 'frange' QParser to only match documents where the result
> of an equation is in a specific range of values...
>
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
>
> : On 1 June 2017 at 13:46, Mikhail Ibraheem
> :  wrote:
> : > Hi,I have 2 fields "cost" and "value" at my records. I want to get all 
> documents that have "value" greater than "cost". Something likeq=value:[cost 
> TO *]
> : > Please advise.
> : > Thanks
> :
>
> -Hoss
> http://www.lucidworks.com/


Re: Solr query with more than one field

2017-06-01 Thread Chris Hostetter

: Because the value of the function will be treated as a relevance value
: and relevance value of 0 (and less?) will cause the record to be
: filtered out.

I don't believe that's true? ... IIRC 'fq' doesn't care what the scores 
are as long as the query is a "match", and a 'func' query will match as 
long as the function says it matches ... something like sub() should be a 
match as long as both fields exist in the document.

Pretty sure the simplest version of what you want is 
'fq={!func}gt(value,cost)' ... or if you need more complex functions/rules 
you can use the 'frange' QParser to only match documents where the result 
of an equation is in a specific range of values...

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser

: On 1 June 2017 at 13:46, Mikhail Ibraheem
:  wrote:
: > Hi,I have 2 fields "cost" and "value" at my records. I want to get all 
documents that have "value" greater than "cost". Something likeq=value:[cost TO 
*]
: > Please advise.
: > Thanks
: 

-Hoss
http://www.lucidworks.com/


Re: Solr query with more than one field

2017-06-01 Thread Alexandre Rafalovitch
Function queries:
https://cwiki.apache.org/confluence/display/solr/Function+Queries
The function would be sub().
Then you want its result mapped to an fq; it could probably be as simple
as fq={!func}sub(value,cost).

Because the value of the function will be treated as a relevance value,
and a relevance value of 0 (and less?) will cause the record to be
filtered out.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 1 June 2017 at 13:46, Mikhail Ibraheem
 wrote:
> Hi,I have 2 fields "cost" and "value" at my records. I want to get all 
> documents that have "value" greater than "cost". Something likeq=value:[cost 
> TO *]
> Please advise.
> Thanks


Solr query with more than one field

2017-06-01 Thread Mikhail Ibraheem
Hi,
I have 2 fields, "cost" and "value", in my records. I want to get all
documents that have "value" greater than "cost". Something like q=value:[cost TO *]
Please advise.
Thanks

DateUtil in SOLR-6

2017-06-01 Thread SOLR4189
In Solr 4.10.1 I use DateUtil.parse in my UpdateProcessor for different
datetime formats.
When indexing a document the datetime format is *yyyy-MM-dd'T'HH:mm:ss'Z'*, and when
reindexing a document the datetime format is *EEE MMM d hh:mm:ss z yyyy*. And it
works fine.

But what can I do in Solr 6? I don't understand this issue. How will *using new
Date(Instant.parse(d).toEpochMilli()); for parsing and
DateTimeFormatter.ISO_INSTANT.format(d.toInstant()) for formatting*
help if I want the same behavior?
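
A minimal java.time sketch of handling both shapes without DateUtil (the legacy
pattern, locale and sample values below are assumptions; adjust to the actual
data):

  import java.time.Instant;
  import java.time.ZonedDateTime;
  import java.time.format.DateTimeFormatter;
  import java.time.format.DateTimeParseException;
  import java.util.Locale;

  public class DateParseSketch {
      // assumed legacy shape, e.g. "Thu Jun 1 10:15:30 UTC 2017"
      private static final DateTimeFormatter LEGACY =
              DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss z yyyy", Locale.ENGLISH);

      static String toSolrDate(String raw) {
          Instant instant;
          try {
              instant = Instant.parse(raw);                      // ISO form, e.g. 2017-06-01T10:15:30Z
          } catch (DateTimeParseException e) {
              instant = ZonedDateTime.parse(raw, LEGACY).toInstant();
          }
          return DateTimeFormatter.ISO_INSTANT.format(instant);  // the form Solr 6 date fields expect
      }

      public static void main(String[] args) {
          System.out.println(toSolrDate("2017-06-01T10:15:30Z"));
          System.out.println(toSolrDate("Thu Jun 1 10:15:30 UTC 2017"));
      }
  }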



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DateUtil-in-SOLR-6-tp4338503.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: _version_ / Versioning using timespan

2017-06-01 Thread Susheel Kumar
Which version of Solr are you using? I tested on 6.0, and if I supply the same
version, it overwrites/updates the document exactly as per the wiki
documentation.

Thanks,
Susheel

On Thu, Jun 1, 2017 at 7:57 AM, marotosg  wrote:

> Thanks a lot Susheel.
> I see this is actually what I need.  I have been testing it and  notice the
> value of the field has to be always greater for a new document to get
> indexed. if you send the same version number it doesn't work.
>
> Is it possible somehow to overwrite documents with the same version?
>
> Thanks
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Error with polygon search

2017-06-01 Thread BenCall
Thanks! This helped me out as well.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-with-polygon-search-tp4326117p4338496.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Configuration of parallel indexing threads

2017-06-01 Thread Susheel Kumar
How are you indexing currently? Are you using DIH or SolrJ/Java? And
are you indexing with multiple threads/machines simultaneously, or just
one thread/machine?

Thnx
Susheel

On Thu, Jun 1, 2017 at 11:45 AM, Erick Erickson 
wrote:

> That's been removed in LUCENE-6659. I regularly max out my CPUs by
> having multiple _clients_ send update simultaneously rather than
> trying to up the number of threads the indexing process takes.
>
> But Mike McCandless can answer authoritatively...
>
> Best,
> Erick
>
> On Thu, Jun 1, 2017 at 4:16 AM, gigo314  wrote:
> > During performance testing a question was raised whether Solr indexing
> > performance could be improved by adding more concurrent index writer
> > threads. I discovered traces of such functionality  here
> >   , but not sure how
> to use
> > it in Solr 6.2. Hopefully there is a setting in Solr configuration file,
> but
> > I cannot find it.
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> nabble.com/Configuration-of-parallel-indexing-threads-tp4338466.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Document Routing

2017-06-01 Thread Erick Erickson
Can you check if those IDs are on shard8? You can do this by pointing
the URL at the core and specifying distrib=false...

Best,
Erick
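
For example (the core name and document id below are taken from the log in the
quoted message; the host, and the assumption that the uniqueKey field is "id",
are guesses):

  curl 'http://HOST:8983/solr/testcollection_shard8_replica1/select?q=id:BQECDwZGTCEBHZZBBiIP&distrib=false'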

On Thu, Jun 1, 2017 at 1:42 AM, Amrit Sarkar  wrote:
> Sorry, The confluence link:
> https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Thu, Jun 1, 2017 at 2:11 PM, Amrit Sarkar  wrote:
>
>> Sathyam,
>>
>> It seems your interpretation is wrong as CloudSolrClient calculates
>> (hashes the document id and determine the range it belongs to) which shard
>> the document incoming belongs to. As you have 10 shards, the document will
>> belong to one of them, that is what being calculated and eventually pushed
>> to the leader of that shard.
>>
>> The confluence link provides the insights in much detail:
>> https://lucidworks.com/2013/06/13/solr-cloud-document-routing/
>> Another useful link: https://lucidworks.com/2013/06/13/solr-cloud-
>> document-routing/
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Thu, Jun 1, 2017 at 11:52 AM, Sathyam 
>> wrote:
>>
>>> HI,
>>>
>>> I am indexing documents to a 10 shard collection (testcollection, having
>>> no
>>> replicas) in solr6 cluster using CloudSolrClient. I saw that there is a
>>> lot
>>> of peer to peer document distribution going on when I looked at the solr
>>> logs.
>>>
>>> An example log statement is as follows:
>>> 2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection
>>> s:shard8 r:core_node7 x:testcollection_shard8_replica1]
>>> o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1]
>>>  webapp=/solr path=/update params={update.distrib=TOLEADER=
>>> http://10.199.42.29:8983/solr/testcollection_shard7_replica1
>>> /=javabin=2}{add=[BQECDwZGTCEBHZZBBiIP
>>> (1568981383488995328), BQEBBQZB2il3wGT/0/mB (1568981383490043904),
>>> BQEBBQZFnhOJRj+m9RJC (1568981383491092480), BQEGBgZIeBE1klHS4fxk
>>> (1568981383492141056), BQEBBQZFVTmRx2VuCgfV (1568981383493189632)]} 0 25
>>>
>>> When I went through the code of CloudSolrClient on grepcode I saw that the
>>> client itself finds out which server it needs to hit by using the message
>>> id hash and getting the shard range information from state.json.
>>> Then it is quite confusing to me why there is a distribution of data
>>> between peers as there is no replication and each shard is a leader.
>>>
>>> I would like to know why this is happening and how to avoid it or if the
>>> above log statement means something else and I am misinterpreting
>>> something.
>>>
>>> --
>>> Sathyam Doraswamy
>>>
>>
>>


Re: Configuration of parallel indexing threads

2017-06-01 Thread Erick Erickson
That's been removed in LUCENE-6659. I regularly max out my CPUs by
having multiple _clients_ send updates simultaneously rather than
trying to up the number of threads the indexing process takes.

But Mike McCandless can answer authoritatively...

Best,
Erick
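
A minimal SolrJ sketch of that approach (URL, collection, field names and
thread count are assumptions): one shared, thread-safe client, with the
parallelism coming from the calling threads rather than from Solr itself.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class ParallelIndexer {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/mycollection").build();
          ExecutorService pool = Executors.newFixedThreadPool(8);  // tune to your CPUs

          for (int t = 0; t < 8; t++) {
              final int worker = t;
              pool.submit(() -> {
                  for (int i = 0; i < 10_000; i++) {
                      SolrInputDocument doc = new SolrInputDocument();
                      doc.addField("id", worker + "-" + i);
                      doc.addField("title_s", "doc " + i + " from worker " + worker);
                      solr.add(doc);  // batching via solr.add(Collection<...>) is usually faster
                  }
                  return null;
              });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
          solr.commit();
          solr.close();
      }
  }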

On Thu, Jun 1, 2017 at 4:16 AM, gigo314  wrote:
> During performance testing a question was raised whether Solr indexing
> performance could be improved by adding more concurrent index writer
> threads. I discovered traces of such functionality  here
>   , but not sure how to use
> it in Solr 6.2. Hopefully there is a setting in Solr configuration file, but
> I cannot find it.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Configuration-of-parallel-indexing-threads-tp4338466.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Erick Erickson
Well, personally I like to use SolrJ rather than DIH for both
debugging ease and the reasons outlined here:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

FWIW
Erick

On Thu, Jun 1, 2017 at 7:59 AM, Josh Lincoln  wrote:
> I had the same issue as Vrinda and found a hacky way to limit the number of
> times deltaImportQuery was executed.
>
> As designed, solr executes *deltaQuery* to get a list of ids that need to
> be indexed. For each of those it executes *deltaImportQuery*, which is
> typically very similar to the full *query*.
>
> I constructed a deltaQuery to purposely only return 1 row. E.g.
>
>  deltaQuery = "SELECT id FROM table WHERE rownum=1"// written for
> oracle, likely requires a different syntax for other dbs. Also, it occurred
> to you could probably include the date>= '${dataimporter.last_index_time}'
> filter here so this returns 0 rows if no data has changed
>
> Since *deltaImportQuery now *only gets called once I needed to add the
> filter logic to *deltaImportQuery *to only select the changed rows (that
> logic is normally in *deltaQuery*). E.g.
>
> deltaImportQuery = [normal import query] WHERE date >=
> '${dataimporter.last_index_time}'
>
>
> This significantly reduced the number of database queries for delta
> imports, and sped up the processing.
>
> On Thu, Jun 1, 2017 at 6:07 AM Amrit Sarkar  wrote:
>
>> Erick,
>>
>> Thanks for the pointer. Getting astray from what Vrinda is looking for
>> (sorry about that), what if there are no sub-entities? and no
>> deltaImportQuery passed too. I looked into the code and determine it
>> calculates the deltaImportQuery itself,
>> SQLEntityProcessor:getDeltaImportQuery(..)::126.
>>
>> Ideally then, a full-import or the delta-import should take similar time to
>> build the docs (fetch next row). I may very well be going entirely wrong
>> here.
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269 <(415)%20589-9269>
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Thu, Jun 1, 2017 at 1:50 PM, vrindavda  wrote:
>>
>> > Thanks Erick,
>> >
>> >  But how do I solve this? I tried creating Stored proc instead of plain
>> > query, but no change in performance.
>> >
>> > For delta import it in processing more documents than the total
>> documents.
>> > In this case delta import is not helping at all, I cannot switch to full
>> > import each time. This was working fine with less data.
>> >
>> > Thank you,
>> > Vrinda Davda
>> >
>> >
>> >
>> > --
>> > View this message in context: http://lucene.472066.n3.
>> > nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-
>> > Import-tp4338162p4338444.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>>


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Josh Lincoln
I had the same issue as Vrinda and found a hacky way to limit the number of
times deltaImportQuery was executed.

As designed, solr executes *deltaQuery* to get a list of ids that need to
be indexed. For each of those it executes *deltaImportQuery*, which is
typically very similar to the full *query*.

I constructed a deltaQuery to purposely only return 1 row. E.g.

 deltaQuery = "SELECT id FROM table WHERE rownum=1"// written for
oracle, likely requires a different syntax for other dbs. Also, it occurred
to you could probably include the date>= '${dataimporter.last_index_time}'
filter here so this returns 0 rows if no data has changed

Since *deltaImportQuery now *only gets called once I needed to add the
filter logic to *deltaImportQuery *to only select the changed rows (that
logic is normally in *deltaQuery*). E.g.

deltaImportQuery = [normal import query] WHERE date >=
'${dataimporter.last_index_time}'


This significantly reduced the number of database queries for delta
imports, and sped up the processing.
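
A hedged data-config.xml sketch of that arrangement (table, column and field
names are hypothetical; the ROWNUM trick is Oracle syntax, as noted above):

  <entity name="item" pk="id"
          query="SELECT id, name FROM item"
          deltaQuery="SELECT id FROM item
                      WHERE ROWNUM = 1 AND updated >= '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT id, name FROM item
                            WHERE updated >= '${dataimporter.last_index_time}'">
    <field column="id" name="id"/>
    <field column="name" name="name_s"/>
  </entity>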

On Thu, Jun 1, 2017 at 6:07 AM Amrit Sarkar  wrote:

> Erick,
>
> Thanks for the pointer. Getting astray from what Vrinda is looking for
> (sorry about that), what if there are no sub-entities? and no
> deltaImportQuery passed too. I looked into the code and determine it
> calculates the deltaImportQuery itself,
> SQLEntityProcessor:getDeltaImportQuery(..)::126.
>
> Ideally then, a full-import or the delta-import should take similar time to
> build the docs (fetch next row). I may very well be going entirely wrong
> here.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269 <(415)%20589-9269>
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Thu, Jun 1, 2017 at 1:50 PM, vrindavda  wrote:
>
> > Thanks Erick,
> >
> >  But how do I solve this? I tried creating Stored proc instead of plain
> > query, but no change in performance.
> >
> > For delta import it in processing more documents than the total
> documents.
> > In this case delta import is not helping at all, I cannot switch to full
> > import each time. This was working fine with less data.
> >
> > Thank you,
> > Vrinda Davda
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> > nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-
> > Import-tp4338162p4338444.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Aw: Re: Re: Facet ranges and stats

2017-06-01 Thread Per Newgro
Thank you for your offer. But I think I need to rethink the concept altogether.
I need to configure the limit in the database and use it in all appropriate
places.
I already have a clue how to do it - but lack the time :-)

> Gesendet: Donnerstag, 01. Juni 2017 um 15:00 Uhr
> Von: "Susheel Kumar" 
> An: solr-user@lucene.apache.org
> Betreff: Re: Re: Facet ranges and stats
>
> Great, it worked out.  If you want to share where and in what code you have
> 90 configured, we can brainstorm if we can simplify it to have only one
> place.
> 
> On Thu, Jun 1, 2017 at 3:16 AM, Per Newgro  wrote:
> 
> > Thanks for your support.
> >
> > Because the null handling is one of the important things i decided to use
> > another way.
> >
> > I added a script in my data import handler that decides if object was
> > audited
> >   function auditComplete(row) {
> > var total = row.get('TOTAL');
> > if (total == null || total < 90) {
> >   row.remove('audit_complete');
> > } else {
> >   row.put('audit_complete', 1);
> > }
> > return row;
> >   }
> >
> > When i add/update a document the same will be done in my code.
> > So i can do my query based on audit_complete field, because i only need to
> > know how many are complete and how many not.
> >
> > A drawback is surely that the "complete" limit of 90 is now implemented in
> > two places (DIH script and my code).
> > But so far i can life with it.
> >
> > Thank you
> > Per
> >
> > Sent: Wednesday, 31 May 2017 at 17:28
> > From: "Susheel Kumar" 
> > To: solr-user@lucene.apache.org
> > Subject: Re: Facet ranges and stats
> > >
> > > Hi,
> > >
> > > You may want to explore the JSON facets.  The closest I can get to the
> > > above requirement is the query below (replace inStock with your rank
> > > field and price below with total). Null handling is something you will
> > > also have to look at.
> > >
> > > --
> > > Susheel
> > >
> > > curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
> > >
> > >   json.facet={inStocks:{ terms:{
> > > field: inStock,
> > > limit: 5,
> > > facet:{
> > > priceRange:{ range:{  // nested terms
> > > facet will be executed for the top 5 genre buckets of the parent
> > >   field: price,
> > >   start : 0,
> > >   end : 90,
> > >   gap : 90,
> > >   other : "after"
> > > }}
> > >   }
> > >   }}
> > >   }'
> > >
> > >
> > > On Wed, May 31, 2017 at 7:33 AM, Per Newgro  wrote:
> > >
> > > > Hello,
> > > >
> > > > i would like to generate some stats on my facets. This is working so
> > far.
> > > > My problem is that i don't know how to generate Ranges on my facets and
> > > > calculate the stats for it.
> > > >
> > > > I have two fields in my schema -> rank(string) and total(float,
> > nullable)
> > > > Rank can be A or B or C. In case my object was audited document
> > contains a
> > > > total value (78 or 45 or ...). Otherwise the value is null.
> > > >
> > > > What i need to calculate per Rank is the count of documents having a
> > total
> > > > value >= 90 and the count of the other documents (null or < 90).
> > > >
> > > > My solution would be to implement 2 queries. But from what I have learned
> > > > so far, Solr is built to avoid that.
> > > >
> > > > Can you please give me a hint on how I could solve this problem?
> > > >
> > > > Thanks for your support
> > > > Per
> > > >
> > >
> >
> 


Re: Re: Facet ranges and stats

2017-06-01 Thread Susheel Kumar
Great, it worked out.  If you want to share where and in what code you have
90 configured, we can brainstorm if we can simplify it to have only one
place.

On Thu, Jun 1, 2017 at 3:16 AM, Per Newgro  wrote:

> Thanks for your support.
>
> Because the null handling is one of the important things i decided to use
> another way.
>
> I added a script in my data import handler that decides if object was
> audited
>   function auditComplete(row) {
> var total = row.get('TOTAL');
> if (total == null || total < 90) {
>   row.remove('audit_complete');
> } else {
>   row.put('audit_complete', 1);
> }
> return row;
>   }
>
> When i add/update a document the same will be done in my code.
> So i can do my query based on audit_complete field, because i only need to
> know how many are complete and how many not.
>
> A drawback is surely that the "complete" limit of 90 is now implemented in
> two places (DIH script and my code).
> But so far i can life with it.
>
> Thank you
> Per
>
> > Sent: Wednesday, 31 May 2017 at 17:28
> > From: "Susheel Kumar" 
> > To: solr-user@lucene.apache.org
> > Subject: Re: Facet ranges and stats
> >
> > Hi,
> >
> > You may want to explore the JSON facets.  The closest I can get to the
> > above requirement is the query below (replace inStock with your rank field
> > and price below with total). Null handling is something you will also have
> > to look at.
> >
> > --
> > Susheel
> >
> > curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
> >
> >   json.facet={inStocks:{ terms:{
> > field: inStock,
> > limit: 5,
> > facet:{
> > priceRange:{ range:{  // nested terms
> > facet will be executed for the top 5 genre buckets of the parent
> >   field: price,
> >   start : 0,
> >   end : 90,
> >   gap : 90,
> >   other : "after"
> > }}
> >   }
> >   }}
> >   }'
> >
> >
> > On Wed, May 31, 2017 at 7:33 AM, Per Newgro  wrote:
> >
> > > Hello,
> > >
> > > i would like to generate some stats on my facets. This is working so
> far.
> > > My problem is that i don't know how to generate Ranges on my facets and
> > > calculate the stats for it.
> > >
> > > I have two fields in my schema -> rank(string) and total(float,
> nullable)
> > > Rank can be A or B or C. In case my object was audited document
> contains a
> > > total value (78 or 45 or ...). Otherwise the value is null.
> > >
> > > What i need to calculate per Rank is the count of documents having a
> total
> > > value >= 90 and the count of the other documents (null or < 90).
> > >
> > > My solution would be to implement 2 queries. But from what I have learned
> > > so far, Solr is built to avoid that.
> > >
> > > Can you please give me a hint on how I could solve this problem?
> > >
> > > Thanks for your support
> > > Per
> > >
> >
>


Re: _version_ / Versioning using timespan

2017-06-01 Thread marotosg
Thanks a lot Susheel.
I see this is actually what I need. I have been testing it and noticed that
the value of the field always has to be greater for a new document to get
indexed; if you send the same version number it doesn't work.

Is it possible somehow to overwrite documents with the same version?

Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/version-Versioning-using-timespan-tp4338171p4338475.html
Sent from the Solr - User mailing list archive at Nabble.com.


Configuration of parallel indexing threads

2017-06-01 Thread gigo314
During performance testing a question was raised whether Solr indexing
performance could be improved by adding more concurrent index writer
threads. I discovered traces of such functionality here, but am not sure how
to use it in Solr 6.2. Hopefully there is a setting in the Solr configuration
file, but I cannot find it.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Configuration-of-parallel-indexing-threads-tp4338466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Analyzer for Vietnamese

2017-06-01 Thread Eirik Hungnes
Thanks Erick,

Dat:

Do you have more info about the subject?

2017-05-22 17:08 GMT+02:00 Erick Erickson :

> Eirik:
>
> That code is 4 years old and for Lucene 4. I doubt it applies cleanly
> to the current code base, but feel free to give it a try but it's not
> guaranteed.
>
> I know of no other Vietnamese analyzers available.
>
> Dat is active in the community, don't know whether he has plans to
> update/commit that bit of code.
>
> Best,
> Erick
>
> On Mon, May 22, 2017 at 12:25 AM, Eirik Hungnes
>  wrote:
> > Hi,
> >
> > There doesn't seem to be any Tokenizer / Analyzer for Vietnamese built in
> > to Lucene at the moment. Does anyone know if something like this exists
> > today or is planned for? We found this
> > https://github.com/CaoManhDat/VNAnalyzer made by Cao Mahn Dat, but not
> sure
> > if it's up to date. Any info highly appreciated!
> >
> > Thanks,
> >
> > Eirik
>



-- 
Best regards,

Eirik Hungnes
CTO
Rubrikk Group AS

Cell: +4797027732
skypeid: blindkorn44


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread Amrit Sarkar
Erick,

Thanks for the pointer. Straying a bit from what Vrinda is looking for
(sorry about that): what if there are no sub-entities and no
deltaImportQuery is passed either? I looked into the code and determined
that it calculates the deltaImportQuery itself, in
SQLEntityProcessor:getDeltaImportQuery(..)::126.

Ideally, then, a full-import and a delta-import should take a similar amount
of time to build the docs (fetch next row). I may very well be going
entirely wrong here.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Thu, Jun 1, 2017 at 1:50 PM, vrindavda  wrote:

> Thanks Erick,
>
>  But how do I solve this? I tried creating Stored proc instead of plain
> query, but no change in performance.
>
> For delta import it in processing more documents than the total documents.
> In this case delta import is not helping at all, I cannot switch to full
> import each time. This was working fine with less data.
>
> Thank you,
> Vrinda Davda
>
>
>
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-
> Import-tp4338162p4338444.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Document Routing

2017-06-01 Thread Amrit Sarkar
Sorry, The confluence link:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Thu, Jun 1, 2017 at 2:11 PM, Amrit Sarkar  wrote:

> Sathyam,
>
> It seems your interpretation is wrong as CloudSolrClient calculates
> (hashes the document id and determine the range it belongs to) which shard
> the document incoming belongs to. As you have 10 shards, the document will
> belong to one of them, that is what being calculated and eventually pushed
> to the leader of that shard.
>
> The confluence link provides the insights in much detail:
> https://lucidworks.com/2013/06/13/solr-cloud-document-routing/
> Another useful link: https://lucidworks.com/2013/06/13/solr-cloud-
> document-routing/
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Thu, Jun 1, 2017 at 11:52 AM, Sathyam 
> wrote:
>
>> HI,
>>
>> I am indexing documents to a 10 shard collection (testcollection, having
>> no
>> replicas) in solr6 cluster using CloudSolrClient. I saw that there is a
>> lot
>> of peer to peer document distribution going on when I looked at the solr
>> logs.
>>
>> An example log statement is as follows:
>> 2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection
>> s:shard8 r:core_node7 x:testcollection_shard8_replica1]
>> o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1]
>>  webapp=/solr path=/update params={update.distrib=TOLEADER=
>> http://10.199.42.29:8983/solr/testcollection_shard7_replica1
>> /=javabin=2}{add=[BQECDwZGTCEBHZZBBiIP
>> (1568981383488995328), BQEBBQZB2il3wGT/0/mB (1568981383490043904),
>> BQEBBQZFnhOJRj+m9RJC (1568981383491092480), BQEGBgZIeBE1klHS4fxk
>> (1568981383492141056), BQEBBQZFVTmRx2VuCgfV (1568981383493189632)]} 0 25
>>
>> When I went through the code of CloudSolrClient on grepcode I saw that the
>> client itself finds out which server it needs to hit by using the message
>> id hash and getting the shard range information from state.json.
>> Then it is quite confusing to me why there is a distribution of data
>> between peers as there is no replication and each shard is a leader.
>>
>> I would like to know why this is happening and how to avoid it or if the
>> above log statement means something else and I am misinterpreting
>> something.
>>
>> --
>> Sathyam Doraswamy
>>
>
>


Re: Solr Document Routing

2017-06-01 Thread Amrit Sarkar
Sathyam,

It seems your interpretation is wrong: CloudSolrClient calculates which
shard an incoming document belongs to (it hashes the document id and
determines the range it falls into). As you have 10 shards, the document
will belong to one of them; that is what is being calculated, and the
document is then pushed directly to the leader of that shard.

The confluence link provides the insights in much detail:
https://lucidworks.com/2013/06/13/solr-cloud-document-routing/
Another useful link:
https://lucidworks.com/2013/06/13/solr-cloud-document-routing/
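
For what it's worth, a bare-bones SolrJ sketch of that flow (Solr 6.x API
assumed; the ZooKeeper address, collection and field names are placeholders
only):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DirectRoutingSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zkhost:2181")       // placeholder ZooKeeper ensemble
        .build();
    client.setDefaultCollection("testcollection");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "some-doc-id");   // hashed against the shard ranges in state.json
    doc.addField("title_s", "example");  // placeholder field

    // CloudSolrClient resolves the target shard from the hash range and sends
    // the update straight to that shard's leader, so no extra hop is needed
    // unless its cached cluster state is stale.
    client.add(doc);
    client.commit();
    client.close();
  }
}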

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Thu, Jun 1, 2017 at 11:52 AM, Sathyam 
wrote:

> HI,
>
> I am indexing documents to a 10 shard collection (testcollection, having no
> replicas) in solr6 cluster using CloudSolrClient. I saw that there is a lot
> of peer to peer document distribution going on when I looked at the solr
> logs.
>
> An example log statement is as follows:
> 2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection
> s:shard8 r:core_node7 x:testcollection_shard8_replica1]
> o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1]
>  webapp=/solr path=/update params={update.distrib=TOLEADER=
> http://10.199.42.29:8983/solr/testcollection_shard7_
> replica1/=javabin=2}{add=[BQECDwZGTCEBHZZBBiIP
> (1568981383488995328), BQEBBQZB2il3wGT/0/mB (1568981383490043904),
> BQEBBQZFnhOJRj+m9RJC (1568981383491092480), BQEGBgZIeBE1klHS4fxk
> (1568981383492141056), BQEBBQZFVTmRx2VuCgfV (1568981383493189632)]} 0 25
>
> When I went through the code of CloudSolrClient on grepcode I saw that the
> client itself finds out which server it needs to hit by using the message
> id hash and getting the shard range information from state.json.
> Then it is quite confusing to me why there is a distribution of data
> between peers as there is no replication and each shard is a leader.
>
> I would like to know why this is happening and how to avoid it or if the
> above log statement means something else and I am misinterpreting
> something.
>
> --
> Sathyam Doraswamy
>


Solr Document Routing

2017-06-01 Thread Sathyam
HI,

I am indexing documents to a 10 shard collection (testcollection, having no
replicas) in solr6 cluster using CloudSolrClient. I saw that there is a lot
of peer to peer document distribution going on when I looked at the solr
logs.

An example log statement is as follows:
2017-06-01 06:07:28.378 INFO  (qtp1358444045-3673692) [c:testcollection
s:shard8 r:core_node7 x:testcollection_shard8_replica1]
o.a.s.u.p.LogUpdateProcessorFactory [testcollection_shard8_replica1]
 webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=
http://10.199.42.29:8983/solr/testcollection_shard7_replica1/&wt=javabin&version=2}{add=[BQECDwZGTCEBHZZBBiIP
(1568981383488995328), BQEBBQZB2il3wGT/0/mB (1568981383490043904),
BQEBBQZFnhOJRj+m9RJC (1568981383491092480), BQEGBgZIeBE1klHS4fxk
(1568981383492141056), BQEBBQZFVTmRx2VuCgfV (1568981383493189632)]} 0 25

When I went through the code of CloudSolrClient on grepcode I saw that the
client itself finds out which server it needs to hit by using the message
id hash and getting the shard range information from state.json.
Then it is quite confusing to me why there is a distribution of data
between peers, as there is no replication and each shard consists only of
its leader.

I would like to know why this is happening and how to avoid it or if the
above log statement means something else and I am misinterpreting something.

-- 
Sathyam Doraswamy


Re: Number of requests spike up, when i do the delta Import.

2017-06-01 Thread vrindavda
Thanks Erick,

 But how do I solve this? I tried creating a stored proc instead of a plain
query, but there was no change in performance.

For delta import it is processing more documents than the total number of
documents. In this case delta import is not helping at all; I cannot switch
to full import each time. This was working fine with less data.

Thank you,
Vrinda Davda



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Number-of-requests-spike-up-when-i-do-the-delta-Import-tp4338162p4338444.html
Sent from the Solr - User mailing list archive at Nabble.com.


Aw: Re: Facet ranges and stats

2017-06-01 Thread Per Newgro
Thanks for your support.

Because the null handling is one of the important things, I decided to take
another approach.

I added a script in my data import handler that decides if the object was audited:
  function auditComplete(row) {
var total = row.get('TOTAL');
if (total == null || total < 90) {
  row.remove('audit_complete');
} else {
  row.put('audit_complete', 1);
}
return row;
  }

When I add/update a document, the same is done in my code.
So I can base my query on the audit_complete field, because I only need to
know how many are complete and how many are not.
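
A plain facet query along these lines (the collection name is just a
placeholder) then gives me both counts in one request - the "1" bucket for
complete and the facet.missing count for not complete:

  curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=audit_complete&facet.missing=true"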

A drawback is surely that the "complete" limit of 90 is now implemented in two
places (the DIH script and my code).
But so far I can live with it.

Thank you
Per

> Sent: Wednesday, 31 May 2017 at 17:28
> From: "Susheel Kumar" 
> To: solr-user@lucene.apache.org
> Subject: Re: Facet ranges and stats
>
> Hi,
> 
> You may want to explore the JSON facets.  The closest I can get to the
> above requirement is the query below (replace inStock with your rank field
> and price below with total). Null handling is something you will also have
> to look at.
> 
> -- 
> Susheel
> 
> curl http://localhost:8983/solr/techproducts/query -d 'q=*:*&
> 
>   json.facet={inStocks:{ terms:{
> field: inStock,
> limit: 5,
> facet:{
> priceRange:{ range:{  // nested terms
> facet will be executed for the top 5 genre buckets of the parent
>   field: price,
>   start : 0,
>   end : 90,
>   gap : 90,
>   other : "after"
> }}
>   }
>   }}
>   }'
> 
> 
> On Wed, May 31, 2017 at 7:33 AM, Per Newgro  wrote:
> 
> > Hello,
> >
> > i would like to generate some stats on my facets. This is working so far.
> > My problem is that i don't know how to generate Ranges on my facets and
> > calculate the stats for it.
> >
> > I have two fields in my schema -> rank(string) and total(float, nullable)
> > Rank can be A or B or C. In case my object was audited document contains a
> > total value (78 or 45 or ...). Otherwise the value is null.
> >
> > What i need to calculate per Rank is the count of documents having a total
> > value >= 90 and the count of the other documents (null or < 90).
> >
> > My solution would be to implement 2 queries. But from what I have learned
> > so far, Solr is built to avoid that.
> >
> > Can you please give me a hint on how I could solve this problem?
> >
> > Thanks for your support
> > Per
> >
>