Indexing gets significantly slower after every batch commit
hi guys, I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Thanks ! Angel
Re: Indexing gets significantly slower after every batch commit
On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
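For reference, the heap change Shawn describes would look roughly like this with the stock bin/solr script (port and size are just examples):

    bin/solr stop -p 8983
    # -m sets both the minimum and maximum heap; the 5.1 default is 512m
    bin/solr start -m 4g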
Search for numbers
Hi, I try to search numbers with a certain deviation. My parser is ExtendedDisMax. A possible search expression could be 'twist drill 1.23 mm'. It will not match any documents, because the document contains the keywords 'twist drill', '1.2' and 'mm'. In order to reach my goal, I've indexed all numbers as points with the solr.SpatialRecursivePrefixTreeFieldType, for example '1.2' as <field name="feature_nr">1.2 0.0</field>. A search with 'drill mm' and a filter query 'fq={!geofilt pt=0,1.23 sfield=feature_nr d=5}' delivers the expected results. Now I have two problems:

1. How can I get ExtendedDisMax to 'replace' the value 1.2 with the '{!geofilt}' function? My first attempts were:
- Build a field type in schema.xml and replace the field content with a regular expression '... replacement="_query_:&quot;{!geofilt pt=0,$1 sfield=feature_nr d=5}&quot;"'. The idea was to use a nested query. But edismax searches 'feature_nr:_query_:{!geofilt pt=0,$1 sfield=feature_nr d=5}'. No documents are found.
- Program a new parser that analyzes the query terms, finds all numbers and does the geospatial stuff. I added this parser in the 'appends' section of the 'requestHandler' definition. But I can get this parser only to filter my results, not to extend them.

2. I want to calculate the distance (d) of the '{!geofilt}' function relative to the value, for example 1%. Could there be a simple solution?

Thank you in advance. Holger
Re: Need help with Nested docs situation
This scenario is a perfect fit to play with Solr Joins [1] . As you observed, you would prefer to go with a query time join. THis kind of join can be done inter-collection . You can have you deal collection and product collection . Every product will have one field dealId to match all the parent deals. When you add,remove,update a new deal, you have to update in the product index all the related products. Then you can query over the products and get related parent deals in the response. Can you give me a little bit more details about your expected use case ? Example of queries and a better explanation of the product previews ? Cheers [1] https://www.youtube.com/watch?v=-OiIlIijWH0feature=youtu.be , http://blog.griddynamics.com/2013/09/solr-block-join-support.html 2015-05-20 18:56 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com: data scale and request rate can judge between block, plain joins and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr royrutten1...@gmail.com wrote: Hello, I have a situation and i'm a little bit stuck on the way how to fix it. For example the following data structure: *Deal* All Coca Cola 20% off *Products* Coca Cola light Coca Cola Zero 1L Coca Cola Zero 20CL Coca Cola 1L When somebody search to Cola discount i want the result of the deal with related products. Solution #1: I could index it with nested docs(solr 4.9). But the problem is when a product has some changes(let's say Zero gets a new name Extra Light) i have to re-index every deal with these products. Solution #2: I could make 2 collections, one with deals and one with products. A Product will get a parentid(dealid). Then i have to do 2 queries to get the information? When i have a resultpage with 10 deals i want to preview the first 2 products. That means a lot of queries but it's doesn't have the update problem from solution #1. Does anyone have a good solution for this? Thanks, any help is appreciated. Roy -- View this message in context: http://lucene.472066.n3.nabble.com/Need-help-with-Nested-docs-situation-tp4203190.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
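A rough sketch of the query-time join Alessandro describes, assuming a 'deals' collection and a 'products' core where each product carries a dealId field (all names are illustrative). Note that a join with fromIndex requires the 'from' core to live on the same node as the collection being queried:

    # run against the deals collection: find products matching "cola",
    # follow their dealId values, and return the matching parent deals
    q={!join fromIndex=products from=dealId to=id}name:cola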
Re: Solr suggester
right. File-based suggestions should be much faster to build, but it's certainly the case with large indexes that you have to build it periodically so they won't be completely up to date. However, this stuff is way cool. AnalyzingInfixSuggester, for instance, suggests entire fields rather than isolated words, returning the original case, punctuation etc. The index-based spellcheck/suggest just reads terms from the indexed fields which takes no time to build but suffers from reading _indexed_ terms, i.e. terms that have gone through the analysis process that may have been stemmed, lowercased, all that. On Thu, May 21, 2015 at 9:03 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi Erick, I have read your blog and it is really helpful.I'm thinking about upgrading to Solr 5.1 but it won't solve all my problems with this issue, as you said each build will have to read all docs, and analyze it's fields. The only advantage is that I can skip default suggest.build on start up. Thank you for your reply. Jon Kerling. On Thursday, May 21, 2015 6:38 PM, Erick Erickson erickerick...@gmail.com wrote: Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using solr 4.10 and I'm trying to add autosuggest ability to my application. I'm currently using this kind of configuration: searchComponent name=suggest class=solr.SuggestComponent lst name=suggester str name=namemySuggester/str str name=lookupImplFuzzyLookupFactory/str str name=storeDirsuggester_fuzzy_dir/str str name=dictionaryImplDocumentDictionaryFactory/str str name=fieldfield2/str str name=weightFieldweightField/str str name=suggestAnalyzerFieldTypetext_general/str /lst /searchComponent requestHandler name=/suggest class=solr.SearchHandler startup=lazy lst name=defaults str name=suggesttrue/str str name=suggest.count10/str str name=suggest.dictionarymySuggester/str /lst arr name=components strsuggest/str /arr /requestHandler I wanted to know how the suggester Index/file is being rebuilt. Is it suppose to have all the terms of the desired field in the suggester? Yes. if not, is it related to this kind of lookup implementation? if I'll use other lookup implementation which suggest also infix terms of fields, doesn't it has to hold all terms of the field? Yes. When i call suggest.build, does it build from scratch the suggester Index/file, or is it just doing something like sort of delta indexing suggestions? Builds from scratch Thank You, Jon
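For reference, a minimal AnalyzingInfixSuggester definition along the lines Erick mentions might look like this (field and suggester names are placeholders, not taken from the thread):

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">infixSuggester</str>
        <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">title</str>
        <str name="weightField">popularity</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
        <str name="buildOnStartup">false</str>
      </lst>
    </searchComponent>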
Re: Reindex of document leaves old fields behind
I'm relying on an autocommit of 60 secs. I just ran the same test via my SolrJ client and result was the same, SolrCloud query always returns correct number of fields. Is there a way to find out which shard and replica a particular document lives on?
Re: Reindex of document leaves old fields behind
My guess is that you're not committing from your SolrJ program. That's automatic when you post. Best, Erick On Thu, May 21, 2015 at 10:13 AM, tuxedomoon dancolem...@yahoo.com wrote: OK it is composite I've just used post.sh to index a test doc with 3 fields to leader 1 of my SolrCloud. I then reindexed it with 1 field removed and the query on it shows 2 fields. I repeated this a few times and always get the correct field count from Solr. I'm now wondering if SolrJ is somehow involved in performing an atomic update rather than replacement. I will try the above test via SolrJ. -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206886.html Sent from the Solr - User mailing list archive at Nabble.com.
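A minimal sketch of the explicit commit Erick is referring to, using the same SolrJ classes that appear later in this thread:

    SolrServer solrServer = new HttpSolrServer(solrUrl);
    solrServer.add(solrDoc);
    // without an explicit commit (or a configured autoCommit that opens a new
    // searcher), the replacement document is not yet visible to queries
    solrServer.commit();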
Solr suggester
Hi, I'm using solr 4.10 and I'm trying to add autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? When I call suggest.build, does it build the suggester index/file from scratch, or is it just doing something like a delta indexing of suggestions? Thank You, Jon
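With a configuration like the one above, the rebuild and lookup requests would look roughly as follows (core name and query text are placeholders):

    # rebuild the suggester dictionary from the documents in the index
    curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true"
    # fetch up to 10 suggestions for a prefix
    curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.q=ban&suggest.count=10"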
Logic on Term Frequency Calculation : Bug or Functionality
Hi, I am puzzled on the Term Frequency behaviour of the DefaultSimilarity implementation. I have suppressed the IDF by setting it to 1, so TF-IDF would in turn reflect the same value as the Term Frequency. Below are the inferences; the red-coloured rows were expected to give a hit count (Term Frequency) of 2 but gave one. *Is it a bug or is it how the behaviour is?*

Search Query: AAA BBB
Parsed Query: PhraseQuery(Contents:"aaa bbb"~5000)

    Document | Content     | Slop | TF | Slop | TF | Slop | TF
    1        | AAA BBB     |  -   | 1  |  0   | 1  |  2   | 1
    2        | BBB AAA     |  -   | 1  |  0   | -  |  2   | 1
    3        | AAA AAA BBB |  -   | 1  |  0   | 1  |  2   | 1
    4        | AAA BBB AAA |  -   | 2  |  0   | 1  |  2   | 2
    5        | BBB AAA AAA |  -   | 1  |  0   | -  |  2   | 1
    6        | AAA BBB BBB |  -   | 1  |  0   | 1  |  2   | 1
    7        | BBB AAA BBB |  -   | 1  |  0   | 1  |  2   | 1
    8        | BBB BBB AAA |  -   | 1  |  0   | -  |  2   | 1

*Am I missing something?!* Cheers *Ariya*
Re: [solr 5.1] Looking for full text + collation search field
Thanks for the advice. I have tried the field type and it seems to do what it is supposed to in combination with a lower case filter. However, that raises another slight problem: German umlauts are supposed to be treated slightly different for the purpose of searching than for sorting. For sorting a normal ICUCollationField with standard rules should suffice*, for the purpose of searching I cannot just replace an ü with a u, ü is supposed to equal ue, or, in terms of RuleBasedCollators, there is a secondary difference. The rules for the collator include: ue , ü ae , ä oe , ö ss , ß (again, that applies to searching *only*, for the sorting the rule a , ä would apply, which is implied in the default rules.) I can of course program a filter that does these rudimentary replacements myself, at best after the lower case filter but before the ASCIIFoldingFilter, I am just wondering if there isn't some way to use collations keys for full text search. * even though Latin script and specifically German is my primary concern, I want some rudimentary support for all European languages, including ones that use Cyrillic and Greek script, special symbols in Icelandic that are not strictly Latin and ligatures like Æ, which collation keys could easily provide. Ahmet Arslan iori...@yahoo.com.INVALID schrieb am 22:10 Mittwoch, 20.Mai 2015: Hi Bjorn, solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics insensitive search. Please see : ASCIIFoldingFilterFactory Ahmet On Wednesday, May 20, 2015 2:53 PM, Björn Keil deeph...@web.de wrote: Hello, might anyone suggest a field type with which I may do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation? An example for what I want to do: There is a field composer for which I passed the value Dvořák, Antonín. I want the following queries to match: composer:(antonín dvořák) composer:dvorak composer:dvorak, antonin the latter case is possible using a solr.ICUCollationField, but that type does not support an Analyzer and consequently no tokenizer, thus, it is not helpful. Unlike former versions of solr there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField... so I am a bit at a loss how I get *both* a tokenizer and a collation at the same time. Thanks for help, Björn
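One way to get the 'ü = ue' behaviour at search time without collation keys is a mapping char filter in front of the tokenizer; a sketch, with invented field type and file names. Since a char filter runs before tokenization (and therefore before lower-casing), the mapping file would also need the upper-case variants:

    <fieldType name="text_de_search" class="solr.TextField">
      <analyzer>
        <!-- mapping-german.txt: lines such as  "ü" => "ue", "ä" => "ae", "ö" => "oe", "ß" => "ss" -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-german.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>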
Index optimize runs in background.
Hi, I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. Thanks, Modassar
Re: solr 5.x on glassfish/tomcat instead of jetty
Hi TK, Can you share the thread you found on this WAR topic? Thanks, Steve On Wed, May 20, 2015 at 8:58 PM, TK Solr tksol...@sonic.net wrote: Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still at this subject, I have been aware there has been an anti-WAR movement in the tech but I don't quite understand where this movement is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
Is it possible to do term Search for the filtered result set
Hi all, Is it possible to do a term search for a filtered result set? We can do a term search over all documents. Can we do the term search only for a specified filtered result set? Let's say we have:

Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5

Can we do a term search for tags only in type:A documents, so that it gives the results as:

T1 - 02
T2 - 01
T3 - 01

Is this possible? If so, can you please share documentation on this. Thanks Danesh
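A sketch of one standard way to get counts like these — field faceting restricted by a filter query, since facet counts only consider documents matching q and fq (collection name is a placeholder):

    curl "http://localhost:8983/solr/collection1/select?q=*:*&fq=type:A&rows=0&facet=true&facet.field=tags"
    # facet_fields would then list T1=2, T2=1, T3=1 for the example above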
Price Range Faceting Based on Date Constraints
Hi, I have a unique requirement to facet on product prices based on date constraints, for which I have been thinking about a solution for a couple of days now, but to no avail. The details are as follows:
1. Each product can have multiple prices; each price has a start-date and an end-date.
2. At search time, we need to facet on price ranges ($0 - $5, $5 - $20, $20 - $50...)
3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X)
4. For each product, the price to be used for faceting has to meet the following condition: start-date < date X and date X < end-date, in other words, date X has to fall within start-date and end-date.
5. My Solr version: 3.5

Hopefully I explained the requirement clearly. I have tried a single multivalued price field where each price value has the start date and end date appended. I also tried one field per price with the field name containing both the start date and end date. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help!
Re: SolrCloud Leader Election
This shouldn't happen, but if it does, there's no good way currently for Solr to automatically fix it. There are a couple of issues being worked on to do that currently. But till then, your best bet is to restart the node which you expect to be the leader (you can look at ZK to see who is at the head of the queue it maintains). If you can't figure that out, safest is to just stop/start all nodes in sequence, and if that doesn't work, stop all nodes and start them back one after the other. On 21 May 2015 00:24, Ryan Steele ryan.ste...@pgi.com wrote: My SolrCloud cluster isn't reassigning the collections leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in the state for a few hours and the logs continue to report No registered leader was found after waiting for 4000ms. Is there a way to force it to reassign the leader? I'm running SolrCloud 5.0. I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections. Thanks, Ryan
Re: Confused about whether Real-time Gets must be sent to leader?
On Thu, May 21, 2015 at 3:15 PM, Timothy Potter thelabd...@gmail.com wrote: I'm seeing that RTG requests get routed to any active replica of the shard hosting the doc requested by /get ... I was thinking only the leader should handle that request since there's a brief window of time where the latest update may not be on the replica (albeit usually very brief) and the latest update is definitely on the leader. There are different levels of consistency. You are guaranteed that after an update completes, a RTG will retrieve that version of the update (or later). The fact that a replica gets the update after the leader is not material to this guarantee since the update has not yet completed. What can happen is that if you are doing multiple RTG requests, you can see a later version of a document, then see a previous version (because you're hitting different shards). This will only be an issue in certain types of use-cases. Optimistic concurrency, for example, will *not* be bothered by this phenomenon. In the past, we've talked about an option to route search requests to the leader. But really, any type of server affinity would work to ensure a monotonic view of a document's history. Off the top of my head, I'm not really sure what types of apps require it, but I'd be interested in hearing about them. -Yonik
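As a concrete illustration of the optimistic-concurrency case Yonik mentions (IDs, fields and version values are made up):

    # real-time get of the current version of a document
    curl "http://localhost:8983/solr/collection1/get?id=doc1&fl=id,_version_"
    # resend the update carrying that _version_; if another client has updated
    # the document in the meantime, Solr rejects the update with a 409 conflict
    curl -H "Content-Type: application/json" \
      "http://localhost:8983/solr/collection1/update?commit=true" \
      -d '[{"id":"doc1","price":10,"_version_":123456789}]'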
Re: Reindex of document leaves old fields behind
A few further clues to this unresolved problem:
1. I found one of my 5 zookeeper instances was down
2. I tried another reindex of a bad document but no change on the SOLR side
3. I deleted and reindexed the same doc, that worked (obviously, but at this point I don't know what to expect)
Re: Price Range Faceting Based on Date Constraints
Another more modern option, very related to this, is to use DateRangeField in 5.0. You have full 64 bit precision. More info is in the Solr Ref Guide. If Alessandro sticks with RPT, then the best reference to give is this: http://wiki.apache.org/solr/SpatialForTimeDurations ~ David https://www.linkedin.com/in/davidwsmiley

On May 21, 2015, at 11:49 AM, Holger Rieß holger.ri...@werkzeug-eylert.de wrote:

Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type, set 'geo' to false. The date is located on the X-axis, prices on the Y-axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting possibly can be done with a facet query for every one of your price ranges. For example day 20, price range 0-5$, rectangle: <field name="pdr">20.0 0.0 21.0 5.0</field>. Regards Holger
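If moving to Solr 5.x is an option, the DateRangeField side of this might look roughly like the sketch below. Note this only covers filtering prices by their validity window; it does not by itself solve the price-bucket faceting, which is what the RPT rectangle trick above addresses. Field names and dates are invented for illustration:

    <!-- schema: a date range type and a field holding each price's validity window -->
    <fieldType name="dateRange" class="solr.DateRangeField"/>
    <field name="price_valid" type="dateRange" multiValued="true"/>

    <!-- index time: the window during which a price applies -->
    <field name="price_valid">[2015-06-01 TO 2015-08-31]</field>

    <!-- query time: keep documents whose window contains date X -->
    fq={!field f=price_valid op=Contains}2015-07-15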
Re: Price Range Faceting Based on Date Constraints
Thanks Holger and Alessandro, SpatialRecursivePrefixTreeFieldType is a new concept to me, and I need some time to dig into it and see how it can help solve my problem. Alex Wang Technical Architect Crossview, Inc. C: (647) 409-3066 aw...@crossview.com

On Thu, May 21, 2015 at 11:50 AM, Holger Rieß [via Lucene] ml-node+s472066n4206868...@n3.nabble.com wrote: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type, set 'geo' to false. The date is located on the X-axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001( 1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like (not tested): fieldType name=price_date_range class=solr.SpatialRecursivePrefixTreeFieldType geo=false distErrPct=0.025 maxDistErr=0.09 units=degrees worldBounds=1 0 366 1 / Faceting possibly can be done with a facet query for every of your price ranges. For example day 20, price range 0-5$, rectangle: field name=pdr20.0 0.0 21.0 5.0/field. Regards Holger
Re: Is it possible to search for the empty string?
: Subject: Re: Is it possible to search for the empty string?
:
: Not out of the box.
:
: Fields are parsed into tokens and queries search on tokens. An empty
: string has no tokens for that field and a missing field has no tokens
: for that field.

that's a misleading oversimplification of what *normally* happens. it is absolutely possible to have documents with fields whose indexed terms consist of the empty string, and to search for those empty strings -- the most trivial way being with a simple StrField -- but using TextField with some creative analyzers it's also very possible..

    $ curl 'http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=foo_s&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":3,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608},
          {"id":"foo_non_blank", "foo_s":"bar", "_version_":1501816583564034048},
          {"id":"foo_missing", "_version_":1501816591383265280}]
      },
      "facet_counts":{
        "facet_queries":{},
        "facet_fields":{
          "foo_s":[
            "",1,
            "bar",1]},
        "facet_dates":{},
        "facet_ranges":{},
        "facet_intervals":{},
        "facet_heatmaps":{}}}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=foo_s:""&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":1,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608}]
      }}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=foo_s:*&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":2,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608},
          {"id":"foo_non_blank", "foo_s":"bar", "_version_":1501816583564034048}]
      }}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=-foo_s:*&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":1,"start":0,"docs":[
          {"id":"foo_missing", "_version_":1501816591383265280}]
      }}

-Hoss http://www.lucidworks.com/
optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Hi, I'd like some feedback on how I'd like to solve the following sharding problem. I have a collection that will eventually become big. Average document size is 1.5 KB, and every year 30 million documents will be indexed. Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer who can only query his own documents, so sharding by document producer seems a good choice.

There are 3 types of document producer:
- type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
- type B: cardinality ~10k, producing 4M docs/year
- type C: cardinality ~10M, producing 9M docs/year

I'm thinking about using compositeId (solrDocId = producerId!docId) to send all docs of the same producer to the same shards. When a shard becomes too large I can use shard splitting.

Problem: documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105), see the appendix. As a solution I could do this when a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmur hashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers), otherwise go to 1) and try another code

When I add documents or perform searches for producerA1, I use its producer code respectively in the compositeId or in the route parameter. What do you think?

--- Appendix: murmurhash shard assignment simulation ---

    import mmh3
    hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
    num_shards = 16
    shards = [0] * num_shards
    for hash in hashes:
        idx = hash % num_shards
        shards[idx] += 1
    print shards
    print sum(shards)

Result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

So with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys.
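For reference, the compositeId scheme described above would be used roughly like this (IDs and field names invented):

    # index a document for producer A1; the part before '!' determines the shard
    curl -H "Content-Type: application/json" "http://localhost:8983/solr/collection1/update" \
      -d '[{"id":"producerA1!doc123","producer":"producerA1"}]'
    # restrict a query to the shard(s) holding that producer's documents
    curl "http://localhost:8983/solr/collection1/select?q=producer:producerA1&_route_=producerA1!"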
Re: Price Range Faceting Based on Date Constraints
Hi Alex, Thanks for the link to the presentation. I am going through the slides and trying to figure out the time-sensitive search it talks about and how it relates to the problem I am facing. It looks like it tries to solve the problem of sku availability based on date, while in my case, all skus are available, but the prices are time-sensitive, and faceting logic needs to pick the right price for each sku when counting.
Re: Price Range Faceting Based on Date Constraints
Did you look at Gilt's presentation from a while ago: http://www.slideshare.net/trenaman/personalized-search-on-the-largest-flash-sale-site-in-america Slides 33 on might be most relevant. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 21 May 2015 at 22:58, alexw aw...@crossview.com wrote: Hi, I have an unique requirement to facet on product prices based on date constraints, for which I have been thinking for a solution for a couple of days now, but to no avail. The details are as follows: 1. Each product can have multiple prices, each price has a start-date and an end-date. 2. At search time, we need to facet on price ranges ($0 - $5, $5-$20, $20-$50...) 3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X) 4. For each product, the price to be used for faceting has to meet the following condition: start-date date X, and date X end-date, in other words, date X has to fall within start-date and end-date. 5. My Solr version: 3.5 Hopefully I explained the requirement clearly. I have tried single price field with multivalue and each price value has startdate and enddate appended. I also tried one field per price with the field name containing both startdate and enddate. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Reindex of document leaves old fields behind
"If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable."

If a document with uniqueKey 1234 is assigned to a shard by SolrCloud implicit routing, won't a reindex of 1234 be assigned to the same shard? If not, you'd have dups all over the cluster.
Clarification on Collections API for 5.x
Hi, In the guide for moving from Solr 4.x to 5.x, it states the following: "Solr 5.0 only supports creating and removing SolrCloud collections through the Collections API (https://cwiki.apache.org/confluence/display/solr/Collections+API), unlike previous versions. While not using the collections API may still work in 5.0, it is unsupported, not recommended, and the behavior will change in a 5.x release." Currently, we launch several solr nodes with identical cores defined using the new Core Discovery process. These nodes are also connected to a zookeeper ensemble. Part of the core definition is to set the configSet to use. This configSet is uploaded to zookeeper separately. This effectively creates a Collection. Is this method no longer supported in 5.x? Thanks! Jim Musil
Re: Reindex of document leaves old fields behind
"let's see the code."

Simplified code and some comments:
1. solrUrl points at leader 1 of 3 leaders, each with a replica
2. createSolrDoc takes a full Mongo doc and returns a valid SolrInputDocument
3. I have done dumps of the returned solrDoc and verified it does not have the unwanted fields

    SolrServer solrServer = new HttpSolrServer(solrUrl);
    SolrInputDocument solrDoc = solrDocFactory.createSolrDoc(mongoDoc, dbName);
    UpdateResponse uresponse = solrServer.add(solrDoc);

When I issue a query on some of the unique ids in question, SolrCloud is returning only 1 document per uniqueKey.

"Did you push your schema up to Zookeeper and reload (or restart) your collection before re-indexing things?"
No. The config was pushed up to Zookeeper only once, a few months ago. The documents in question were updated in Mongo and given an updated create_date. Based on this new create_date my SolrJ client detects and reindexes them.

"are you sure the documents are actually getting indexed and that the update is succeeding?"
Yes, I see a new value in the timestamp field each time I reindex.
Re: Price Range Faceting Based on Date Constraints
Thanks Alessandro. I am implementing this in the Hybris framework. It is not easy to create nested documents during indexing using the Hybris Solr indexer. So I am trying to avoid additional documents and cores if at all possible.
Re: Price Range Faceting Based on Date Constraints
Hi Alex, this is not a simple problem. In your domain we can consider a Product as a document and the list of Price nested Documents. Ideally we would model the Product as the father and the prices as children. Each Price will be defined by : - *start_date * - *end_date * - *price * - *productId* We can define 2 collections this way and play with Joins and faceting. Take a look here : http://lucene.472066.n3.nabble.com/How-do-I-get-faceting-to-work-with-Solr-JOINs-td4147785.html#a4148838 If redundancy of data is not a problem for you, you can proceed with a simple approach where you add redundant documents. Each document will have the start_date,end_date and price as single value fields. In the redundant scenario, the approach to follow is quite easy : - always filtering by date the docs and then proceed faceting . Cheers 2015-05-21 13:58 GMT+01:00 alexw aw...@crossview.com: Hi, I have an unique requirement to facet on product prices based on date constraints, for which I have been thinking for a solution for a couple of days now, but to no avail. The details are as follows: 1. Each product can have multiple prices, each price has a start-date and an end-date. 2. At search time, we need to facet on price ranges ($0 - $5, $5-$20, $20-$50...) 3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X) 4. For each product, the price to be used for faceting has to meet the following condition: start-date date X, and date X end-date, in other words, date X has to fall within start-date and end-date. 5. My Solr version: 3.5 Hopefully I explained the requirement clearly. I have tried single price field with multivalue and each price value has startdate and enddate appended. I also tried one field per price with the field name containing both startdate and enddate. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Logic on Term Frequency Calculation : Bug or Functionality
Hi Ariya, DefaultSimilarity does not use raw term frequency, but instead it uses square root of raw term frequency. If you want to observe raw term frequency information in explain section, I suggest you to play with org.apache.lucene.search.similarities.SimilarityBase and its sub-classes. ahmet On Thursday, May 21, 2015 3:59 PM, ariya bala ariya...@gmail.com wrote: Hi, I am puzzled on the Term Frequency Behaviour of the DefaultSimilarity implementation I have suppressed the IDF by setting to 1. TF-IDF would inturn reflect the same value as in Term Frequency Below are the inferences: Red coloured are expected to give a hit count(Term Frequency) of 2 but was one. *Is it bug or is it how the behaviour is?* Search Query: AAA BBB Parsed Query: PhraseQuery(Contents:\aaa bbb\~5000) DocumentContentSlopTFslop0TFslop2TF1AAA BBB-101212BBB AAA-10-213AAA AAA BBB- 101214AAA BBB AAA-201225BBB AAA AAA-10-216AAA BBB BBB-101217BBB AAA BBB-1012 18BBB BBB AAA-10-21 *Am I missing something?!* Cheers *Ariya *
Re: Indexing gets significantly slower after every batch commit
bq: Which is logical as index growth and time needed to put something to it is log(n) Not really. Solr indexes to segments, each segment is a fully consistent mini index. When a segment gets flushed to disk, a new one is started. Of course there'll be a _little bit_ of added overyead, but it shouldn't be all that noticeable. Furthermore, they're append only. In the past, when I've indexed the Wiki example, my indexing speed actually goes faster. So on the surface this sounds very strange to me. Are you seeing anything at all in the Solr logs that's supsicious? Best, Erick On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets ser...@bintime.com wrote: Hi Angel We also noticed that kind of performance degrade in our workloads. Which is logical as index growth and time needed to put something to it is log(n) четверг, 21 мая 2015 г. пользователь Angel Todorov написал: hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if i index, say , a million docs, and the throughput is about 300 docs per sec, and then I shut down solr completely - after I start indexing again, the throughput is dropping below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org javascript:; wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
Re: Reindex of document leaves old fields behind
I'm posting the fields from one of my problem document, based on this comment I found from Shawn on Grokbase. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. This is just a solrDoc.toString() with linebreaks where commas were. Maybe some of these are being seen as map fields by SOLR. = SolrInputDocument[ mynamespaces_s_mv=[drama], changedates_s_mv=[Tue May 19 17:21:26 EDT 2015, Thu Dec 30 19:00:00 EST ], networks_t_mv=[{ abcitem-id : 288578fd-6596-47bc-af95-80daecd1f24a , abccontentType : Standard:SocialHandle , SocialNetwork : { $uuid : 73553c4c-4919-4ba9-b16c-fb340f3e4c31} , Handle : in my imaginationseries}], links_s_mv=[ { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e} , { $uuid : 9fd75c26-35f2-4f48-b55a-6e82089cc3ba} , { $uuid : 150e43ed-9ebe-41b4-86cc-bdf4885a50fe} , { $uuid : e20b0040-561f-4c34-9dd3-df85250b5a5b} , { $uuid : 0cff75d0-4f32-46c9-9092-60eec2dc847a} , { $uuid : 73553c4c-4919-4ba9-b16c-fb340f3e4c31}], ratings_t_mv=[{ abcitem-id : 56058649-579a-4160-9439-e59448eb3dff , abccontentType : Standard:TVPG , Rating : { $uuid : 150e43ed-9ebe-41b4-86cc-bdf4885a50fe}}], title_ci_t=in my imagination, urlkey_s=in-my imagination, title_cs_t=In My Imagination, dp2_1_s_mv=[ { _id : { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e} , _rules : [ { _startDate : { $date : 2015-03-23T14:58:00.000Z} , _endDate : { $date : -12-31T00:00:00.000Z} , _r : { $uuid : 47b6b31d-d690-437a-9bab-6eeb7be3c8a4} , _p : { $uuid : d478874f-8fc7-4b3d-97f3-f7e63222d633} , _o : { $uuid : 983b6ae9-7882-4af8-bb2f-cff342be99b3} , _a : null }]}], seriestype_s=e20b0040-561f-4c34-9dd3-df85250b5a5b, shortid_s=x5jqqf, i shorttitle_t=In My Imagination, uuid_s=90a1fbbf-ddf8-47a7-9f00-55f05e7dc297, status_s=DEFAULT, updatedby_s=maceirar, description_t=sometext, review_s_mv=[{ abcpublished : { $date : 2015-05-19T21:21:30.930Z} , abcpublishedBy : jelly , abctargetEnvironment : entertainment-staging , abcrequestId : { $uuid : 56769138-4a03-4ed6-8b29-8030d0941b08} , abcsourceEnvironment : fishing , abcstate : true}, { abcpublished : { $date : 2015-05-19T21:21:31.731Z} , abcpublishedBy : jelly , abctargetEnvironment : myshow-live , abcrequestId : { $uuid : 56769138-4a03-4ed6-8b29-8030d0941b08} , abcsourceEnvironment : myshow-staging , abcstate : true}], sorttitle_t=In My Imagination, images_s_mv=[ { $uuid : 9fd75c26-35f2-4f48-b55a-6e82089cc3ba} , { $uuid : 0cff75d0-4f32-46c9-9092-60eec2dc847a}], title_ci_s=in my imagination, firmuuids_s_mv=[ { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e}], id=mongo-v2.abcnservices.com-fishing-90a1fbbf-ddf8-47a7-9f00-55f05e7dc297, timestamp=Thu May 21 17:29:58 EDT 2015 ] -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206963.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Price Range Faceting Based on Date Constraints
Thanks David. Unfortunately we are on Solr 3.5, so I am not sure whether RPT is available. If not, is there a way to patch 3.5 to make it work?
Re: Price Range Faceting Based on Date Constraints
Indeed: https://github.com/dsmiley/SOLR-2155 On Thu, May 21, 2015 at 8:59 PM alexw aw...@crossview.com wrote: Thanks David. Unfortunately we are on Solr 3.5, so I am not sure whether RPT is available. If not, is there a way to patch 3.5 to make it work? -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817p4207003.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud with local configs
On 5/21/2015 7:24 PM, Steven Bower wrote: Is it possible to run in cloud mode with zookeeper managing collections/state/etc.. but to read all config files (solrconfig, schema, etc..) from local disk? Obviously this implies that you'd have to keep them in sync.. My thought here is of running Solr in a docker container, but instead of having to manage schema changes/etc via zk I can just build the config into the container.. and then just produce a new docker image with a solr version and the new config and just do rolling restarts of the containers.. As far as I am aware, this is not possible. As I think about it, I'm not convinced that it's a good idea. If you're going to be using zookeeper for ANY purpose, the config should be centralized in zookeeper. The ZK chroot (or new ZK ensemble, if you choose to go that route) will be dedicated to that specific cluster. It won't be shared with any other cluster. Any automation you've got that fires up a new cluster can simply upload the cluster-specific config into the new ZK chroot as it builds the container(s) for the cluster. Teardown automation can delete the chroot. The idea is probably worth an issue in jira. I won't veto the implementation, but as I said above, I'm not yet convinced that it's a good idea -- ZK is already in use for the clusterstate, using it for the config completely eliminates the need for config synchronization. Do you have a larger compelling argument? Thanks, Shawn
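A sketch of the config-upload step mentioned above, using the zkcli script that ships with Solr 5.x (hosts, paths and names are placeholders):

    server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/mycluster \
      -cmd upconfig -confdir /path/to/configset/conf -confname myconf
    # then create the collection against that config via the Collections API
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"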
Re: Index Sizes
On 1/7/2014 7:48 AM, Steven Bower wrote: I was looking at the code for getIndexSize() on the ReplicationHandler to get at the size of the index on disk. From what I can tell, because this does directory.listAll() to get all the files in the directory, the size on disk includes not only what is searchable at the moment but potentially also files that are being created by background merges/etc.. I am wondering if there is an API that would give me the size of the currently searchable index files (doubt this exists, but maybe).. If not what is the most appropriate way to get a list of the segments/files that are currently in use by the active searcher such that I could then ask the directory implementation for the size of all those files? For a more complete picture of what I'm trying to accomplish, I am looking at building a quota/monitoring component that will trigger when index size on disk gets above a certain size. I don't want to trigger if index is doing a merge and ephemerally uses disk for that process. If anyone has any suggestions/recommendations here too I'd be interested.. Dredging up a VERY old thread here. As I was replying to your most recent query, I was looking through my email archive for your previous messages and this one caught my eye, especially because it never got a reply. It must have escaped my notice last year. This is a very good idea. I imagine that the active searcher object directly or indirectly knows exactly which files are in use for that searcher, so I think it should be relatively easy for it to retrieve a list, and the index size code should be able to return both the active index size as well as the total directory size. I've been putting a little bit of work in to get the index size code moved out of the replication handler so that it is available even if replication is completely disabled, but my free time has been limited. I don't recall the issue number(s) for that work. Thanks, Shawn
Re: Index optimize runs in background.
Hi An insight on the question will be really helpful. Thanks, Modassar On Thu, May 21, 2015 at 5:51 PM, Modassar Ather modather1...@gmail.com wrote: Hi, I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. Thanks, Modassar
SolrCloud with local configs
Is it possible to run in cloud mode with zookeeper managing collections/state/etc.. but to read all config files (solrconfig, schema, etc..) from local disk? Obviously this implies that you'd have to keep them in sync.. My thought here is of running Solr in a docker container, but instead of having to manage schema changes/etc via zk I can just build the config into the container.. and then just produce a new docker image with a solr version and the new config and just do rolling restarts of the containers.. Thanks, Steve
Re: Search for numbers
Hi Holger, It’s not apparent to me why you are using the spatial field to index a number. Why not simply a “tfloat” or whatever numeric field? Then you could use {!frange} with a function to get the difference and filter it to be in the range you want. RE query parsing (problem #1): you should write a custom query parser… perhaps by forking ExtendedDisMaxQParser to meet your needs. But I think you’ll have something cleaner / more maintainable if you write one from scratch while looking at that QParser for tips/inspiration; not porting the features you don’t want. RE problem #2: I’m a little unclear on what you want to do, but it’s likely you can express it with {!frange} on a number field (not spatial) with the right functions. If you can’t), you could write either a custom function (AKA ValueSource) or if needed a frange like thing for your custom needs. ~ David http://www.linkedin.com/in/davidwsmiley On Thu, May 21, 2015 at 3:22 AM Holger Rieß holger.ri...@werkzeug-eylert.de wrote: Hi, I try to search numbers with a certain deviation. My parser is ExtendedDisMax. A possible search expression could be 'twist drill 1.23 mm'. It will not match any documents, because the document contains the keywords 'twist drill', '1.2' and 'mm'. In order to reach my goal, I've indexed all numbers as points with the solr.SpatialRecursivePrefixTreeFieldType. For example '1.2' as field name=feature_nr1.2 0.0/field. A search with 'drill mm' and a filter query 'fq={!geofilt pt=0,1.23 sfield=feature_nr d=5}' delivers the expected results. Now I have two problems: 1. How can I get ExtendedDisMax, to 'replace' the value 1.2 with the '{!geofilt}' function? My first attemts were - Build a field type in schema.xml and replace the field content with a regular expression '... replacement=_query_:quot;{!geofilt pt=0,$1 sfield=feature_nr d=5}quot;'. The idea was to use a nested query. But edismax searches 'feature_nr:_query_:{!geofilt pt=0,$1 sfield=feature_nr d=5}'. No documents are found. - Program a new parser that analyzes the query terms, finds all numbers and does the geospatial stuff. Added this parser in the 'appends' section of the 'requestHandler' definition. But I can get this parser only to filter my results, not to extend them. 2. I want to calculate the distance (d) of the '{!geofilt}' function relative to the value, for example 1%. Could there be a simple solution? Thank you in advance. Holger
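A sketch of the {!frange} idea for "number ± deviation" on a plain numeric field (reusing the feature_nr name from the thread; the 1% window would be computed client-side and is just illustrative):

    # keep documents whose feature_nr lies within 0.0123 of 1.23
    fq={!frange l=0 u=0.0123}abs(sub(feature_nr,1.23))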
Applying gzip compression in Solr 5.1
Hi, I'm trying to apply gzip compression in Solr 5.1. I understand that running Solr on Tomcat is no longer supported from Solr 5.0, so I've tried to implement it in Solr. I've downloaded jetty-servlets-9.3.0.RC0.jar and placed it in my webapp\WEB-INF folder, and have added the following in webapp\WEB-INF\web.xml:

    <filter>
      <filter-name>GzipFilter</filter-name>
      <filter-class>org.eclipse.jetty.servlets.GzipFilter</filter-class>
      <init-param>
        <param-name>methods</param-name>
        <param-value>GET,POST</param-value>
        <param-name>mimeTypes</param-name>
        <param-value>text/html;charset=UTF-8,text/plain,text/xml,text/json,text/javascript,text/css,text/plain;charset=UTF-8,application/xhtml+xml,application/javascript,image/svg+xml,application/json,application/xml; charset=UTF-8</param-value>
      </init-param>
    </filter>
    <filter-mapping>
      <filter-name>GzipFilter</filter-name>
      <url-pattern>/*</url-pattern>
    </filter-mapping>

However, when I start Solr and check the browser, there's no gzip compression. Is there anything which I configured wrongly or might have missed out? I'm also running zookeeper-3.4.6. Regards, Edwin
Re: Java upgrade for solr in master-slave configuration
Hi, Anybody tried upgrading master first prior to slave Java upgrade. Please suggest. On Tue, May 19, 2015 at 6:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/19/2015 12:21 AM, Kamal Kishore Aggarwal wrote: I am currently working with Java-1.7, Solr-4.8.1 with tomcat 7. The solr configuration has slave master architecture. I am looking forward to upgrade Java from 1.7 to 1.8 version in order to take advantage of memory optimization done in latest version. So, I am confused if I should upgrade java first on master server and then on slave server or the other way round. What should be the ideal steps, so that existing solr index and other things should not get corrupted . Please suggest. I am not aware of any changes in index format resulting from changing your Java version. It should not matter which machines you upgrade first. Thanks, Shawn
Confused about whether Real-time Gets must be sent to leader?
I'm seeing that RTG requests get routed to any active replica of the shard hosting the doc requested by /get ... I was thinking only the leader should handle that request since there's a brief window of time where the latest update may not be on the replica (albeit usually very brief) and the latest update is definitely on the leader. Am I overthinking this since we've always maintained that Solr is eventually consistent or ??? Cheers, Tim
Re: solr uima and opennlp
Hi Andreaa, 2015-05-21 18:12 GMT+02:00 hossmaa andreea.hossm...@gmail.com: Hi everyone I'm trying to plug in a new UIMA annotator into solr. What is necessary for this? Is is enough to build a Jar similarly to the ones from the uima-addons package? yes, exactly. Actually you just need a jar containing the Annotator class (and dependencies) that you reference from within the UIMAUpdateRequestProcessor. More specifically, are the uima-addona Jars identical to the ones found in solr's contrib folder? they are the 2.3.1 versions of those jars. Regards, Tommaso Thanks! Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-uima-and-opennlp-tp4206873.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing gets significantly slower after every batch commit
Hi Angel, We also noticed that kind of performance degradation in our workloads, which is logical: as the index grows, the time needed to add something to it grows as log(n). On Thursday, May 21, 2015, Angel Todorov wrote: hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if I index, say, a million docs, and the throughput is about 300 docs per sec, and then I shut down Solr completely - after I start indexing again, the throughput drops below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
Re: Solr suggester
Hi Erick, I have read your blog and it is really helpful. I'm thinking about upgrading to Solr 5.1, but it won't solve all my problems with this issue; as you said, each build will have to read all docs and analyze their fields. The only advantage is that I can skip the default suggest.build on startup. Thank you for your reply. Jon Kerling. On Thursday, May 21, 2015 6:38 PM, Erick Erickson erickerick...@gmail.com wrote: Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using Solr 4.10 and I'm trying to add an autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? Yes. If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? Yes. When I call suggest.build, does it build the suggester index/file from scratch, or is it doing something like delta indexing of suggestions? Builds from scratch Thank You, Jon
Re: Reindex of document leaves old fields behind
I'm doing all my indexing to leader 1 and have not specified any router configuration, but there is an equal distribution of 240M docs across 5 shards. I think I've been stating that I have 3 shards in these posts; I have 5, sorry. How do I know what kind of routing I am using? -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206869.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
I question your base assumption: bq: So shard by document producer seems a good choice Because what this _also_ does is force all of the work for a query onto one node, and all indexing for a particular producer ditto. And it will cause you to manually monitor your shards to see if some of them grow out of proportion to others. I think it would be much less hassle to just let Solr distribute the docs as it may based on the uniqueKey and forget about it. Unless you want, say, to do joins etc. There will, of course, be some overhead that you pay here, but unless you can measure it and it's a pain I wouldn't add the complexity you're talking about, especially at the volumes you're talking about. Best, Erick On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote: Hi, I'd like some feedback on how I'd like to solve the following sharding problem. I have a collection that will eventually become big: average document size is 1.5kb, and every year 30 million documents will be indexed. Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer, who can only query his own documents. So sharding by document producer seems a good choice. There are 3 types of doc producer:
- type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
- type B: cardinality ~10k, producing 4M docs/year
- type C: cardinality ~10M, producing 9M docs/year
I'm thinking about using compositeId (solrDocId = producerId!docId) to send all docs of the same producer to the same shards. When a shard becomes too large I can use shard splitting. Problems: documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105), see Appendix. As a solution I could do this when a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmur hashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers), otherwise go to 1) and try with another code
When I add documents or perform searches for producerA1, I use its producer code, respectively in the compositeId or in the route parameter. What do you think?
---Appendix: murmurhash shard assignment simulation---

    import mmh3

    # simulate the shard-key part of the hash for 105 producer codes
    hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
    num_shards = 16
    shards = [0] * num_shards
    for hash in hashes:
        idx = hash % num_shards
        shards[idx] += 1
    print shards
    print sum(shards)

result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
so with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys
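For concreteness, a small sketch of what the compositeId scheme described above looks like in practice (producerA1, doc123 and the producer field name are placeholders):

    Document id sent at index time:   producerA1!doc123
    (everything before the '!' is hashed to pick the shard)

    Querying only that producer's data:
    q=*:*&fq=producer:producerA1&_route_=producerA1!

The _route_ parameter tells Solr to send the query only to the shard(s) that the key producerA1 hashes to, instead of fanning the request out to every shard.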
Re: Reindex of document leaves old fields behind
On 5/21/2015 9:02 AM, tuxedomoon wrote: l If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable. If a document with uniqueKey 1234 is assigned to a shard by SolrCloud, implicit routing won't a reindex of 1234 be assigned to the same shard? If not you'd have dups all over the cluster. The implicit router basically means manual routing. Whatever shard actually receives the request will be the one that indexes it. If you want documents automatically routed according to their hash, you need the compositeId router. Thanks, Shawn
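As a point of reference, the router is fixed when a collection is created; a rough sketch of the two Collections API variants (collection and shard names here are placeholders):

    Hash routing (the default):
    /admin/collections?action=CREATE&name=mycoll&numShards=5&router.name=compositeId

    Manual ("implicit") routing:
    /admin/collections?action=CREATE&name=mycoll&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5

With implicit routing a document stays on whichever shard received it (or on the shard named in a _route_ parameter), which is exactly how a re-sent document can end up duplicated on another shard.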
Re: Is it possible to do term Search for the filtered result set
Have you tried fq=type:A Best, Erick On Thu, May 21, 2015 at 5:49 AM, Danesh Kuruppu dknkuru...@gmail.com wrote: Hi all, Is it possible to do a term search for a filtered result set? We can do a term search across all documents; can we do the term search only for a specified filtered result set? Let's say we have:
Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5
Can we do a term search for tags only in type:A documents, so that it gives the results as
T1 - 02
T2 - 01
T3 - 01
Is this possible? If so, can you please share documentation on this. Thanks Danesh
Re: Solr suggester
Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using Solr 4.10 and I'm trying to add an autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? Yes. If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? Yes. When I call suggest.build, does it build the suggester index/file from scratch, or is it doing something like delta indexing of suggestions? Builds from scratch Thank You, Jon
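For context, the suggest.build that Jon asks about is normally issued as a request parameter against the handler configured above (the core name mycore is a placeholder):

    http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true

With the configuration shown, every such call rebuilds the dictionary from scratch by re-reading field2 from all documents, which is why builds get slower as the index grows.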
Re: Is it possible to do term Search for the filtered result set
and then facet on the tags field: facet=on&facet.field=tags Upayavira On Thu, May 21, 2015, at 04:34 PM, Erick Erickson wrote: Have you tried fq=type:A Best, Erick On Thu, May 21, 2015 at 5:49 AM, Danesh Kuruppu dknkuru...@gmail.com wrote: Hi all, Is it possible to do a term search for a filtered result set? We can do a term search across all documents; can we do the term search only for a specified filtered result set? Let's say we have:
Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5
Can we do a term search for tags only in type:A documents, so that it gives the results as
T1 - 02
T2 - 01
T3 - 01
Is this possible? If so, can you please share documentation on this. Thanks Danesh
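Putting Erick's and Upayavira's suggestions together, a minimal sketch of the full request (the field names type and tags are taken from Danesh's example):

    q=*:*&fq=type:A&facet=on&facet.field=tags&rows=0

The facet counts in the response are computed only over the documents matching the fq, which yields exactly the per-tag counts asked for above.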
AW: Price Range Faceting Based on Date Constraints
Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger
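As a rough illustration of that last suggestion (the field name pdr and the exact rectangle syntax are my assumptions, not something from Holger's tested setup): with an RPT field a rectangle can be expressed as a WKT-style ENVELOPE, so one facet query per price range might look something like

    facet=on&facet.query={!field f=pdr}Intersects(ENVELOPE(20.0, 21.0, 5.0, 0.0))

where ENVELOPE takes minX, maxX, maxY, minY - here day 20 to 21 on the X axis and price 0 to 5 on the Y axis. Repeating the facet.query for each price range gives one count per range.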
solr uima and opennlp
Hi everyone, I'm trying to plug a new UIMA annotator into Solr. What is necessary for this? Is it enough to build a jar similar to the ones from the uima-addons package? More specifically, are the uima-addons jars identical to the ones found in Solr's contrib folder? Thanks! Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-uima-and-opennlp-tp4206873.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing gets significantly slower after every batch commit
hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if I index, say, a million docs, and the throughput is about 300 docs per sec, and then I shut down Solr completely - after I start indexing again, the throughput drops below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
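As a side note, the heap experiment Shawn suggests is a one-line change when Solr is started with the bundled script (4g matches his suggested starting point for 10 million docs; adjust for your own data):

    bin/solr start -m 4g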
Re: Reindex of document leaves old fields behind
On 5/21/2015 9:54 AM, tuxedomoon wrote: I'm doing all my index to leader 1 and have not specified any router configuration. But there is an equal distribution of 240M docs across 5 shards. I think I've been stating I have 3 shards in these posts, I have 5, sorry. How do I know what kind of routing I am using? If all your indexing is going to the same place and the docs are distributed evenly, then quite possibly your router is compositeId. To see for sure, go to the admin UI, click on Cloud, then Tree. Click the little arrow next to collections, then click on the collection name. In the far right pane, there will be a small snippet of JSON below the other attributes, defining the configName and router. Thanks, Shawn
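For what it's worth, the JSON snippet Shawn mentions looks roughly like this for a hash-routed collection (a sketch only; the exact attributes vary a little between versions):

    {
      "configName": "myconf",
      "router": {"name": "compositeId"}
    }

If the router name shows up as "implicit", documents are not hashed to shards at all.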
Re: Price Range Faceting Based on Date Constraints
Just thinking a little bit more on it, I should investigate the SpatialRecursivePrefixTreeFieldType further. Is each value of that field a Point? Actually each of our values must be the rectangle, because the time frame and the price together form a single value (not only the duration of the price, 'end date - start date'). Could you give an example of the indexing as well? Cheers 2015-05-21 17:28 GMT+01:00 Alessandro Benedetti benedetti.ale...@gmail.com : The geo-spatial idea is brilliant! What do you think about translating the date into ms? Alex, you should try that approach, it can work! Cheers 2015-05-21 16:49 GMT+01:00 Holger Rieß holger.ri...@werkzeug-eylert.de: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Price Range Faceting Based on Date Constraints
The geo-spatial idea is brilliant! What do you think about translating the date into ms? Alex, you should try that approach, it can work! Cheers 2015-05-21 16:49 GMT+01:00 Holger Rieß holger.ri...@werkzeug-eylert.de: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Reindex of document leaves old fields behind
OK, it is compositeId. I've just used post.sh to index a test doc with 3 fields to leader 1 of my SolrCloud. I then reindexed it with 1 field removed, and the query on it shows 2 fields. I repeated this a few times and always get the correct field count from Solr. I'm now wondering if SolrJ is somehow involved, performing an atomic update rather than a replacement. I will try the above test via SolrJ. -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206886.html Sent from the Solr - User mailing list archive at Nabble.com.
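For what it's worth, a minimal SolrJ sketch of the difference being suspected here (the URL, collection name, id and field names are placeholders, not the actual indexing code from this thread):

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexVsAtomic {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");

            // Full replacement: the new document completely overwrites the old one,
            // so any field not present here disappears from the stored document.
            SolrInputDocument full = new SolrInputDocument();
            full.addField("id", "1234");
            full.addField("title", "new title");
            client.add(full);

            // Atomic update: only the listed field is changed; all other fields
            // of doc 1234 are carried over from the existing document.
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", "1234");
            partial.addField("title", Collections.singletonMap("set", "new title"));
            client.add(partial);

            client.commit();
            client.close();
        }
    }

If the indexing code builds documents the second way (field values wrapped in a "set" or "add" map), old fields would indeed be left in place, which matches the symptom described in this thread.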