Re: SOLR 4.4 - Slave always replicates full index

2014-01-24 Thread Erick Erickson
How are you committing? Are you committing every document? (you shouldn't).

Or, sin of all sins, are you _optimizing_ frequently? That'll cause your
entire index to be replicated every time.

Best,
Erick
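
For reference, a minimal SolrJ sketch (illustrative, not from the original thread) of the pattern Erick is hinting at: batch the adds and commit once at the end instead of committing every document. The URL and field names are assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Illustrative URL; point this at the master core.
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/core0");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {      // send in batches, not one request per document
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                     // one commit at the end, not per document
        server.shutdown();
    }
}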

On Thu, Jan 23, 2014 at 3:26 PM, sureshrk19 sureshr...@gmail.com wrote:
 Hi,

 I have configured a single-core master and slave on 2 different machines.
 The replication configuration is fine and it is working, but what I observed
 is that every change to the master index triggers a full replication on the
 slave.
 I was expecting only incremental replication on every change.

 *Master config:*

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="master">
     <str name="replicateAfter">startup</str>
     <str name="replicateAfter">commit</str>
     <str name="confFiles">schema.xml,stopwords.txt,elevate.xml</str>
     <str name="commitReserveDuration">00:00:20</str>
   </lst>
   <str name="maxNumberOfBackups">1</str>
 </requestHandler>

 *Slave config:*

 <requestHandler name="/replication" class="solr.ReplicationHandler">
   <lst name="slave">
     <str name="masterUrl">http://IP:Port/solr/core0/replication</str>
     <str name="pollInterval">00:00:20</str>
   </lst>
 </requestHandler>


 What I observed is that the index directory name on the slave instance is
 appended with a timestamp, i.e. /index.<timestamp>/.

 I have seen a similar issue on an older version of SOLR that was fixed in 4.2
 (per the description), so I am not sure if this is related to the same:

 https://issues.apache.org/jira/browse/SOLR-4471
 http://lucene.472066.n3.nabble.com/Slaves-always-replicate-entire-index-amp-Index-versions-td4041256.html#a4041808


 Any pointers would be highly appreciated.

 Thanks,
 Suresh





Invitation to connect on LinkedIn

2014-01-24 Thread somer81
LinkedIn




vibhoreng04 Lucene],

I'd like to add you to my professional network on LinkedIn.

- ömer sevinç

ömer sevinç
Lecturer (Öğr. Gör.) in Computer Engineering at the Ondokuz Mayıs University Distance Education Center
Samsun, Turkey

Confirm that you know ömer sevinç:
https://www.linkedin.com/e/-raxvo-hqtkwlbr-67/isd/12853395556/gr7DJb-a/?hs=false&tok=2dovPxUUV6F641


Re: solrcloud shards backup/restoration

2014-01-24 Thread Greg Walters
We've had some success restoring existing/backed-up indexes into SolrCloud,
and even building the indexes offline and dumping the Lucene files into the
directories that Solr expects. The general steps we follow are:

1) Round up your files. It doesn't matter if you pull from a master or slave so 
long as you've committed and get a consistent copy of the data. 

2) Use the collection api to create a collection in solr. The collection you're 
creating must have the same number of shards as the collection you've backed up 
and are restoring.

3) Stop all solr nodes. 

4) Remove the index_name/data/ directory from the shards you're going to make 
the leader. In our case we've got 6 shards and a replication factor of 3 on a 6 
node cluster so each server/jvm has three shards on it. Conveniently the shards 
are all either even or odd per jvm.

5) Populate the index_name/data/ directories on your intended leaders. As 
mentioned above since we've got six shards and any two jvm contain the entire 
index we only populate the data on two servers.

6) Start up *JUST* the servers that you've just populated. The goal here is to 
make these servers you've populated the leaders for the new collection and to 
have the official full copy of the index. Upon startup you might have to wait 
$leaderVoteWait for previously non-leader servers to timeout and become leaders

7) Once you've got at least one core up in each shard of your collection go 
ahead and start the others up.

I think Aditya was failing by removing all the zookeeper data and starting 
everything up at once. If you force solr's hand a bit to pick leaders with the 
data that you want you'll have success when it replicates out to other nodes. 
It might also be possible to do this on-line by not stopping solr after 
creating the empty collection then copying the files into place on the leaders 
and issuing a RELOAD to pick up the changed indexes. I'm not sure how replicas 
would handle that though.
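
For illustration only, a hedged sketch of the collection-create call from step 2, issued from Java over plain HTTP. The collection and config names are placeholders; numShards and replicationFactor should match the collection being restored (6 and 3 in the setup above):

import java.net.HttpURLConnection;
import java.net.URL;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        // Assumed values -- match numShards/replicationFactor to the collection being restored.
        String url = "http://localhost:8983/solr/admin/collections?action=CREATE"
                + "&name=restored_collection"
                + "&numShards=6"
                + "&replicationFactor=3"
                + "&collection.configName=myconf";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();   // 200 means the create request was accepted
        System.out.println("Collections API returned HTTP " + status);
        conn.disconnect();
    }
}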

Thanks,
Greg


On Jan 24, 2014, at 12:47 AM, Allan Mascarenhas 
allan.mascarenhas1...@gmail.com wrote:

 Any update on this ? 
 
 I am also stuck with the same problem. I want to install a snapshot of the master Solr
 server in my local environment, but I couldn't get it to work. :(

 I have spent almost 2 days trying to figure out how. Please help!!
 
 
 



Re: Solr solr.JSONResponseWriter not escaping backslash '\' characters

2014-01-24 Thread stevenNabble
Hello,

thanks to all for the help :-)

we have managed to narrow down what exactly is going wrong. My initial
thinking that backslashes within field values were the problem was
incorrect. The actual cause is submitting a document with a blank field
value; the problematic JSON is returned when a facet search involves that
value. Details below:

# cat test.xml
<add>
  <doc>
    <field name="id">9553524</field>
    <field name="year"></field>
  </doc>
</add>



# curl 'http://localhost:8983/solr/collection1/update?commit=true' --data-binary @test.xml -H 'Content-Type: application/xml'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">369</int></lst>
</response>



# curl 'http://localhost:8983/solr/collection1/select?wt=json&facet=true&facet.field=year&facet.mincount=1&json.nl=map&q=id%3A9553524&start=0&rows=3&indent=true'
{
  "responseHeader": {
    "status": 0,
    "QTime": 8669,
    "params": {
      "facet": "true",
      "facet.mincount": "1",
      "start": "0",
      "q": "id:9553524",
      "facet.field": ["year"],
      "json.nl": "map",
      "wt": "json",
      "rows": "3"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "id": "9553524",
      "year": [],
      "_version_": 1458116227706650624
    }]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "year": {
        "": 1
      }
    },
    "facet_dates": {},
    "facet_ranges": {}
  }
}



As you can see above, the facet counts for the '*year*' field contain a
blank JSON field name. This causes an error when parsing with *PHP's json_decode*
(...).

*Fatal error*: Cannot access empty property in 



The workaround is to not submit empty field values into the index but this
isn't a great solution :-(
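
For anyone stuck with the client-side workaround, a minimal illustrative sketch (not from the original thread) that skips blank values when building a SolrJ document, so empty strings never reach the index:

import java.util.Map;

import org.apache.solr.common.SolrInputDocument;

public class BlankFieldFilter {
    // Copy only non-blank values into the document so empty strings are never indexed.
    public static SolrInputDocument build(Map<String, Object> source) {
        SolrInputDocument doc = new SolrInputDocument();
        for (Map.Entry<String, Object> e : source.entrySet()) {
            Object value = e.getValue();
            if (value == null) {
                continue;
            }
            if (value instanceof String && ((String) value).trim().isEmpty()) {
                continue;   // skip blank strings such as the empty "year" field
            }
            doc.addField(e.getKey(), value);
        }
        return doc;
    }
}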

Kind Regards
Steven



On 23 January 2014 18:49, Chris Hostetter-3 [via Lucene] 
ml-node+s472066n4113050...@n3.nabble.com wrote:


 : The problem I have is if I try to parse this response in *php *using
 : *json_decode()* I get a syntax error because of the '*\n*' s that are in
 : the response. I could escape the before doing the *json_decode() *or at
 the
 : point of submitting to the index but this seems wrong...

 I don't really know anything about PHP, but i managed to muddle my way
 through both of the little experiments below and couldn't reproduce any
 error from json_decode when the response contains \n (ie: the two byte
 sequence representing an escaped newline character) inside of a JSON
 string, but i do get the expected error if a literal, one byte, newline
 character is in the string. (something that Solr doesn't do)

 are you sure when you fetch the data from Solr you aren't pre-parsing it
 in some way that's evaluating the \n and converting it to a real
 newline?

 : I am probably doing something silly and a good nights sleep will reveal
 : what I am doing wrong ;-)

 Good luck.

 ### Experiment #1, locally created strings, one bogus json

 hossman@frisbee:~$ php -a
 Interactive shell

 php > $valid = '{"id": "newline: (\n)"}';
 php > $bogus = "{\"id\": \"newline: (\n)\"}";
 php > var_dump($valid);
 string(23) "{"id": "newline: (\n)"}"
 php > var_dump($bogus);
 string(22) "{"id": "newline: (
 )"}"
 php > var_dump(json_decode($valid));
 object(stdClass)#1 (1) {
   ["id"]=>
   string(12) "newline: (
 )"
 }
 php > var_dump(json_decode($bogus));
 NULL
 php > var_dump(json_last_error());
 int(4)


 ### Experiment #2, fetching json data from Solr...

 hossman@frisbee:~$ curl 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true'

 {
   "response":{"numFound":1,"start":0,"docs":[
       {
         "id":"HOSS",
         "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
         "_version_":1458038130437259264}]
   }}
 hossman@frisbee:~$ php -a
 Interactive shell

 php > $data = file_get_contents('http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true');

 php > var_dump($data);
 string(227) "{
   "response":{"numFound":1,"start":0,"docs":[
       {
         "id":"HOSS",
         "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
         "_version_":1458038130437259264}]
   }}
 "
 
 php > var_dump(json_decode($data));
 object(stdClass)#1 (1) {
   ["response"]=>
   object(stdClass)#2 (3) {
     ["numFound"]=>
     int(1)
     ["start"]=>
     int(0)
     ["docs"]=>
     array(1) {
       [0]=>
       object(stdClass)#3 (3) {
         ["id"]=>
         string(4) "HOSS"
         ["name"]=>
         string(78) "quote: (") backslash: (\) backslash-quote: (\") newline: (
 ) backslash-n: (\n)"
         ["_version_"]=>
         int(1458038130437259264)
       }
     }
   }
 }



 -Hoss
 http://www.lucidworks.com/



Re: Solr solr.JSONResponseWriter not escaping backslash '\' characters

2014-01-24 Thread Ahmet Arslan


How about using 
http://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/update/processor/RemoveBlankFieldUpdateProcessorFactory.html



On Friday, January 24, 2014 5:39 PM, stevenNabble ste...@actual-systems.com 
wrote:
Hello,

thanks to all for the help :-)

we have managed to narrow it down what is exactly going wrong. My initial
thinking on the backslashes within field values being the problem were
incorrect. The source of the problem is in-fact submitting a document with
a blank field value. The JSON returned by a query containing the
problematic value, is when doing a facet search. Details below:

# cat test.xml
<add>
  <doc>
    <field name="id">9553524</field>
    <field name="year"></field>
  </doc>
</add>



# curl 'http://localhost:8983/solr/collection1/update?commit=true' --data-binary @test.xml -H 'Content-Type: application/xml'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">369</int></lst>
</response>



# curl 'http://localhost:8983/solr/collection1/select?wt=json&facet=true&facet.field=year&facet.mincount=1&json.nl=map&q=id%3A9553524&start=0&rows=3&indent=true'
{
  "responseHeader": {
    "status": 0,
    "QTime": 8669,
    "params": {
      "facet": "true",
      "facet.mincount": "1",
      "start": "0",
      "q": "id:9553524",
      "facet.field": ["year"],
      "json.nl": "map",
      "wt": "json",
      "rows": "3"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [{
      "id": "9553524",
      "year": [],
      "_version_": 1458116227706650624
    }]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "year": {
        "": 1
      }
    },
    "facet_dates": {},
    "facet_ranges": {}
  }
}



As you can see above the facet count for the '*year*' field contains a
blank JSON field name. This errors when parsing with *PHP's json_decode*
(...).

*Fatal error*: Cannot access empty property in 



The workaround is to not submit empty field values into the index but this
isn't a great solution :-(

Kind Regards
Steven



On 23 January 2014 18:49, Chris Hostetter-3 [via Lucene] 
ml-node+s472066n4113050...@n3.nabble.com wrote:


 : The problem I have is if I try to parse this response in *php *using
 : *json_decode()* I get a syntax error because of the '*\n*' s that are in
 : the response. I could escape the before doing the *json_decode() *or at
 the
 : point of submitting to the index but this seems wrong...

 I don't really know anything about PHP, but i managed to muddle my way
 through both of the little experiments below and couldn't reporoduce any
 error from json_decode when the response contains \n (ie: the two byte
 sequence represnting an escaped newline character) inside of a JSON
 string, but i do get the expected error if a literal, one byte, newline
 character is in the string. (something that Solr doesn't do)

 are you sure when you fetch the data from Solr you aren't pre-parsing it
 in some what that's evaluating hte \n and converting it to a real
 newline?

 : I am probably doing something silly and a good nights sleep will reveal
 : what I am doing wrong ;-)

 Good luck.

 ### Experiment #1, locally crated strings, one bogus json

 hossman@frisbee:~$ php -a
 Interactive shell

 php > $valid = '{"id": "newline: (\n)"}';
 php > $bogus = "{\"id\": \"newline: (\n)\"}";
 php > var_dump($valid);
 string(23) "{"id": "newline: (\n)"}"
 php > var_dump($bogus);
 string(22) "{"id": "newline: (
 )"}"
 php > var_dump(json_decode($valid));
 object(stdClass)#1 (1) {
   ["id"]=>
   string(12) "newline: (
 )"
 }
 php > var_dump(json_decode($bogus));
 NULL
 php > var_dump(json_last_error());
 int(4)


 ### Experiment #2, fetching json data from Solr...

 hossman@frisbee:~$ curl 'http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true'

 {
   "response":{"numFound":1,"start":0,"docs":[
       {
         "id":"HOSS",
         "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
         "_version_":1458038130437259264}]
   }}
 hossman@frisbee:~$ php -a
 Interactive shell

 php > $data = file_get_contents('http://localhost:8983/solr/collection1/select?q=id:HOSS&wt=json&indent=true&omitHeader=true');

 php > var_dump($data);
 string(227) "{
   "response":{"numFound":1,"start":0,"docs":[
       {
         "id":"HOSS",
         "name":"quote: (\") backslash: (\\) backslash-quote: (\\\") newline: (\n) backslash-n: (\\n)",
         "_version_":1458038130437259264}]
   }}
 "
 php > var_dump(json_decode($data));
 object(stdClass)#1 (1) {
   ["response"]=>
   object(stdClass)#2 (3) {
     ["numFound"]=>
     int(1)
     ["start"]=>
     int(0)
     ["docs"]=>
     array(1) {
       [0]=>
       object(stdClass)#3 (3) {
         ["id"]=>
         string(4) "HOSS"
         ["name"]=>
         string(78) "quote: (") backslash: (\) backslash-quote: (\") newline: (
 ) backslash-n: (\n)"
         ["_version_"]=>
         int(1458038130437259264)
       }
     }
   }
 }



 -Hoss
 http://www.lucidworks.com/



Loading resources from Zookeeper

2014-01-24 Thread Ugo Matrangolo
Hi,

I'm in the process of moving our organization's search infrastructure to
SOLR4/SolrCloud. One of the main points is to centralize our cores'
configuration in Zookeeper in order to roll out changes without redeploying
all the nodes in our cluster.

Unfortunately I have some code (custom indexers extending
org.apache.solr.handler.dataimport.EntityProcessorBase) that assumes
resources can be loaded from the filesystem, and this is now a problem given
that everything under solr.home/core/conf is hosted in Zookeeper.

My question is: what is the best way to load a resource from Zookeeper
using SOLR APIs?

Regards,
Ugo


Loading resources from Zookeeper using SolrCloud API

2014-01-24 Thread Ugo Matrangolo
Hi,

we have a quite large SOLR 3.6 installation and we are trying to upgrade to
4.6.x.

One of the main points in doing this is to get SolrCloud and centralized
configuration using Zookeeper.

Unfortunately, some custom code we have (custom indexers extending
org.apache.solr.handler.dataimport.EntityProcessorBase) tries to load
resources from the file system, and this is now a problem given that
everything under solr.home/core/conf is under Zookeeper.

What is the best way to load resources from Zookeeper using the SolrCloud API?

Regards,
Ugo


Re: Loading resources from Zookeeper

2014-01-24 Thread Alan Woodward
Hi Ugo,

You can load things from the conf/ directory via SolrResourceLoader, which will 
load either from the filesystem or from zookeeper, depending on whether or not 
you're running in SolrCloud mode.

Alan Woodward
www.flax.co.uk


On 24 Jan 2014, at 16:02, Ugo Matrangolo wrote:

 Hi,
 
 I'm in the process to move our organization search infrastructure to
 SOLR4/SolrCloud. One of the main point is to centralize our cores
 configuration in Zookeeper in order to roll out changes wout redeploying
 all the nodes in our cluster.
 
 Unfortunately I have some code (custom indexers extending
 org.apache.solr.handler.dataimport.EntityProcessorBase) that are assuming
 to load resources from the filesystem and this is now a problem given that
 everything under solr.home/core/conf is hosted in Zookeeper.
 
 My question is : what is the best way to load a resource from Zookeeper
 using SOLR APIs ??
 
 Regards,
 Ugo



Re: Loading resources from Zookeeper using SolrCloud API

2014-01-24 Thread Mark Miller
The best way is to use the ResourceLoader without relying on 
ResourceLoader#getConfigDir (which will fail in SolrCloud mode).

For example, see openSchema, openConfig, openResource.

If you use these APIs, your code will work both with those files being on the
local filesystem in non-SolrCloud mode and with them being in ZooKeeper in
SolrCloud mode.

There are also low level API’s you could use, but I wouldn’t normally recommend 
that.
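
A minimal sketch of the approach described above, assuming the code runs inside a Solr component that already has a reference to the SolrCore (the resource name is a placeholder):

import java.io.InputStream;

import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrResourceLoader;

public class ResourceLoadingExample {
    // Works in both standalone mode (reads the core's conf/ on disk) and
    // SolrCloud mode (reads the config from ZooKeeper).
    public static InputStream openMyResource(SolrCore core) throws Exception {
        SolrResourceLoader loader = core.getResourceLoader();
        // "my-resource.txt" is an assumed file name living alongside the core's config.
        return loader.openResource("my-resource.txt");
    }
}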

- Mark

On Jan 24, 2014, at 11:16 AM, Ugo Matrangolo ugo.matrang...@gmail.com wrote:

 Hi,
 
 we have a quite large SOLR 3.6 installation and we are trying to update to
 4.6.x.
 
 One of the main point in doing this is to get SolrCloud and centralized
 configuration using Zookeeper.
 
 Unfortunately, some custom code we have (custom indexer extending
 org.apache.solr.handler.dataimport.EntityProcessorBase) are trying to load
 resources from the file system and this is now a problem given that
 everything under solr.home/core/conf is under Zookeeper.
 
 What is the best way to load resources from Zookeeper using SolrCloud API ?
 
 Regards,
 Ugo



What is the right way to bring a failed SolrCloud node back online?

2014-01-24 Thread Nathan Neulinger
I have an environment where new collections are being added frequently (isolated per customer), and the backup is 
virtually guaranteed to be missing some of them.


As it stands, bringing up the restored/out-of-date instance results in those collections being stuck in the 'Recovering'
state, because the cores don't exist on the resulting server. This also extends to the case of restoring a
completely blank instance.


Is there any way to tell SolrCloud "Try recreating any missing cores for this collection based on where you know they
should be located"?


Or do I need to actually determine a list of cores (..._shardX_replicaY) and trigger the core creates myself, at which 
point I gather that it will start recovery for each of them?
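
For what it's worth, a hedged sketch of the manual per-core create described in the previous paragraph, using the CoreAdmin API over plain HTTP; the host, core, collection, and shard names are placeholders:

import java.net.HttpURLConnection;
import java.net.URL;

public class RecreateMissingCore {
    public static void main(String[] args) throws Exception {
        // Assumed names -- one call per missing ..._shardX_replicaY core.
        String url = "http://restored-node:8983/solr/admin/cores?action=CREATE"
                + "&name=customer1_shard1_replica2"
                + "&collection=customer1"
                + "&shard=shard1";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        int status = conn.getResponseCode();   // the new core should then recover from the shard leader
        System.out.println("CoreAdmin CREATE returned HTTP " + status);
        conn.disconnect();
    }
}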


-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Distributed search with Terms Component and Solr Cloud.

2014-01-24 Thread Uwe Reh

Hi Ryan,

just take a look at the thread "TermsComponent/SolrCloud".
Setting your parameters as defaults in solrconfig.xml should help.

Uwe


Am 13.01.2014 20:24, schrieb Ryan Fox:

Hello,

I am running Solr 4.6.0.  I am experiencing some difficulties using the
terms component across multiple shards.  I see according to the
documentation, it should work, but I am unable to do so with solr cloud.

When I have one shard, queries using the terms component respond as I would
expect.  However, when I split my index across two shards, I get empty
results for the same query.

I am querying solr with a CloudSolrServer object.  When I manually add the
query params shards and shards.qt to my SolrQuery, I get the expected
response.  It's not ideal, but if there's a way to get a list of all shards
programmatically, I could set that parameter.


 From the documentation, it appears to me the terms component should be
 supported by solr cloud, but I can't find anything that explicitly says one
way or the other.  If there is a better way to do it, or perhaps something
I have misconfigured, any advice would be much appreciated.  If it's just
not possible, I will manage.  I can provide more configuration or
specifically how I am running the query if that would help.

Ryan Fox
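
For reference, a hedged SolrJ sketch of the workaround Ryan describes (setting shards and shards.qt explicitly); the ZooKeeper address, shard URLs, and field name are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedTerms {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181/solr");
        server.setDefaultCollection("collection1");

        SolrQuery query = new SolrQuery();
        query.setRequestHandler("/terms");          // the terms handler defined on each shard
        query.set("terms", true);
        query.set("terms.fl", "name");              // placeholder field
        // Workaround: list the shards explicitly and tell them which handler to use.
        query.set("shards", "host1:8983/solr/collection1_shard1_replica1,"
                          + "host2:8983/solr/collection1_shard2_replica1");
        query.set("shards.qt", "/terms");

        QueryResponse rsp = server.query(query);
        System.out.println(rsp.getTermsResponse().getTermMap());
        server.shutdown();
    }
}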





Re: Searching and scoring with block join

2014-01-24 Thread dev


Quoting Mikhail Khludnev mkhlud...@griddynamics.com:


nesting query parsers is shown at
http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html

try to start from the following:
title:Test _query_:"{!parent which=is_parent:true}{!dismax qf=content_de}Test"
mind about local params referencing, eg {!... v=$nest}&nest=...


Thank you for the hint.
I don't really understand how {!dismax ...} and local-parameter referencing
solve my problem.
I read your blog entry, but I'm having trouble understanding how to apply
your explanations.
Would you mind giving me a short example of how these query params help me
get a proper result with a combined score for parent and children?


Thank you very much.
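
For what it's worth, a minimal hedged sketch of how the nested-parser hint above (including the v=$param referencing) might be issued from SolrJ; the field names come from the hint, everything else is illustrative, and this by itself does not combine parent and child scores:

import org.apache.solr.client.solrj.SolrQuery;

public class BlockJoinQueryExample {
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery();
        // Parent-level clause plus a nested child query, referenced via a local param.
        q.setQuery("title:Test _query_:\"{!parent which=is_parent:true v=$childq}\"");
        // The child side: a dismax query over the content_de field from the hint.
        q.set("childq", "{!dismax qf=content_de}Test");
        return q;
    }
}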


there is no such param in
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L67
Raise a feature request issue, at least; don't hesitate to contribute.


Ah, okay, it was a misunderstanding then.
I created an issue: https://issues.apache.org/jira/browse/SOLR-5662


Sorry if I ask stupid questions, but I have just started to work with Solr
and some techniques are not very familiar yet.




Thanks
-Gesh



Re: SOLR 4.4 - Slave always replicates full index

2014-01-24 Thread sureshrk19
Erick,

Thanks for the reply..

I'm not committing each document, but I have the following configuration in
solrconfig.xml (commit every 5 minutes).

<autoCommit>
  <maxTime>30</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

Also, if you look at my master config, I do not have 'optimize'.

 <str name="replicateAfter">startup</str>
 <str name="replicateAfter">commit</str>

Is there any other option which could be triggering an 'optimize'?

Thanks,
Suresh







Solr server requirements for 100+ million documents

2014-01-24 Thread Susheel Kumar
Hi,

Currently we are indexing 10 million documents from a database (10 DB data
entities) and the index size is around 8 GB on a Windows virtual box. Indexing in one
shot takes 12+ hours, while indexing in parallel in separate cores and merging them
together takes 4+ hours.

We are looking to scale to 100+ million documents and are looking for
recommendations on server requirements for the parameters below, for a Production
environment. There can be 200+ users performing searches at the same time.

No. of physical servers (considering SolrCloud)
Memory requirement
Processor requirement (# cores)
Linux as OS, as opposed to Windows

Thanks in advance. 
Susheel



Re: SOLR 4.4 - Slave always replicates full index

2014-01-24 Thread Shawn Heisey

On 1/24/2014 10:36 AM, sureshrk19 wrote:

I'm not committing each document but, have following configuration in
solrconfig.xml (commit every 5mins).

     <autoCommit>
       <maxTime>30</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

Also, if you look at my master config, I do not have 'optimize'.

  <str name="replicateAfter">startup</str>
  <str name="replicateAfter">commit</str>

Is there any way other option which triggers 'optimize'?


I think Erick was actually asking if you are optimizing your index 
frequently, not whether you have replication configured to replicate 
after optimize.


Optimizing your index (a forced merge down to one Lucene index segment) 
is something you have to do yourself.  It won't happen automatically.  
If you optimize your index, all old segments are gone and only a single 
new segment remains.  Even if you don't replicate immediately, the next 
time you commit, the entire index will need to be copied to the slave.


Your autoCommit cannot be the only committing that you do, because that 
configuration will not make new documents visible - it has 
openSearcher=false.  Therefore if you are adding new content, you must 
be doing additional soft commits, or hard commits with 
openSearcher=true.  This might be accomplished with a parameter on your 
updates, like commit, softCommit, or commitWithin. It might also be an 
explicit commit.
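

A minimal SolrJ sketch of the two visibility options Shawn mentions (commitWithin on the update, or an explicit soft commit); the document and timings are illustrative:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class VisibilityExamples {

    // Option 1: let Solr make the document visible within 60 seconds of the add.
    static void addWithCommitWithin(SolrServer server, SolrInputDocument doc) throws Exception {
        server.add(doc, 60000);
    }

    // Option 2: an explicit soft commit (waitFlush=true, waitSearcher=true, softCommit=true).
    static void softCommit(SolrServer server) throws Exception {
        server.commit(true, true, true);
    }
}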


Optimizing *IS* a useful feature, but if you optimize very frequently 
(especially if it's done every time you add new documents), Solr's 
performance will really suffer.


Personal anecdote: One of my shards is very tiny and holds all new 
content.  That gets optimized once an hour.  In general, this is pretty 
frequently, but it happens very quickly, so in my setup it's not 
excessive.  That is a LOT more often than what I do for my other shards, 
the large ones.  I optimize one of those once every day, so each one 
only gets optimized once every six days.


Thanks,
Shawn



Complex nested structure in solr

2014-01-24 Thread Utkarsh Sengar
Hi guys,

I have to load extra meta data to an existing collection.

This is what I am looking for:
For a UPC: Store availability by merchantId per location (which has lat/lon)

My query pattern will be: Given a keyword, find all available products for
a merchantId around the given lat/lon.

Example:
Input: keyword=ipod, merchantId=922,lat/lon=28.222,82.333
Output: List of UPCs which match the criteria

So how should I go about doing it? Any suggestions?
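
One possible shape for the query, purely as a hedged sketch: it assumes availability is denormalized into each UPC document and that a spatial field holds the store location; none of these field names come from the thread:

import org.apache.solr.client.solrj.SolrQuery;

public class AvailabilityQueryExample {
    public static SolrQuery build(String keyword, String merchantId, String latLon) {
        SolrQuery q = new SolrQuery(keyword);                       // e.g. "ipod"
        q.addFilterQuery("merchantId:" + merchantId);               // e.g. 922
        // Assumed "store_location" spatial field; restrict to ~10 km around the point.
        q.addFilterQuery("{!geofilt sfield=store_location pt=" + latLon + " d=10}");
        q.setFields("upc");                                         // return just the UPCs
        return q;
    }
}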

-- 
Thanks,
-Utkarsh


Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Erick Erickson
Can't be done with the information you provided, and can only
be guessed at even with more comprehensive information.

Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data
acquisition; I rather doubt
you're being limited by raw Solr indexing. If you're using SolrJ, try
commenting out the
server.add() bit and running again. My guess is that your indexing
speed will be almost
unchanged, in which case it's the data acquisition process where
you should concentrate
efforts. As a comparison, I can index 11M Wikipedia docs on my laptop
in 45 minutes without
any attempts at parallelization.
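
A hedged sketch of the timing test Erick describes: run the loop once as-is, then again with the add commented out; if the elapsed times are close, data acquisition is the bottleneck. The data-fetch method here is a hypothetical placeholder:

import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AcquisitionTimingTest {

    // Hypothetical placeholder for the real database fetch (DIH query, JDBC, etc.).
    static List<SolrInputDocument> fetchBatchFromDatabase(int batchNumber) {
        throw new UnsupportedOperationException("plug in the real data acquisition here");
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/core0");
        long start = System.currentTimeMillis();

        for (int batch = 0; batch < 100; batch++) {
            List<SolrInputDocument> docs = fetchBatchFromDatabase(batch);
            server.add(docs);            // comment this line out for the second run
        }
        server.commit();

        System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));
        server.shutdown();
    }
}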


Best,
Erick

On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar
susheel.ku...@thedigitalgroup.net wrote:
 Hi,

 Currently we are indexing 10 million document from database (10 db data 
 entities)  index size is around 8 GB on windows virtual box. Indexing in one 
 shot taking 12+ hours while indexing parallel in separate cores  merging 
 them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for 
 recommendation on servers requirements on below parameters for a Production 
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud)
 Memory requirement
 Processor requirement (# cores)
 Linux as OS oppose to windows

 Thanks in advance.
 Susheel



RE: Solr server requirements for 100+ million documents

2014-01-24 Thread Susheel Kumar
Thanks, Erick for the info.

For indexing, I agree that most of the time is consumed in data acquisition, which
in our case is from the database. For indexing we currently use a manual process,
i.e. the Solr dashboard Data Import, but we are now looking to automate it. How do
you suggest we automate the indexing part? Do you recommend using SolrJ, or should
we try to automate it using curl?


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, January 24, 2014 2:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Can't be done with the information you provided, and can only be guessed at 
even with more comprehensive information.

Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data acquisition; I 
rather doubt you're being limited by raw Solr indexing. If you're using SolrJ, 
try commenting out the
server.add() bit and running again. My guess is that your indexing speed will 
be almost unchanged, in which case it's the data acquisition process is where 
you should concentrate efforts. As a comparison, I can index 11M Wikipedia docs 
on my laptop in 45 minutes without any attempts at parallelization.


Best,
Erick

On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:
 Hi,

 Currently we are indexing 10 million document from database (10 db data 
 entities)  index size is around 8 GB on windows virtual box. Indexing in one 
 shot taking 12+ hours while indexing parallel in separate cores  merging 
 them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for 
 recommendation on servers requirements on below parameters for a Production 
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud) Memory requirement 
 Processor requirement (# cores) Linux as OS oppose to windows

 Thanks in advance.
 Susheel



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread svante karlsson
I just indexed 100 million db docs (records) with 22 fields (4 multivalued)
in 9524 sec using libcurl.
11 million took 763 seconds so the speed drops somewhat with increasing
dbsize.

We write 1000 docs (just an arbitrary number) in each request from two
threads. If you will be using solrcloud you will want more writer threads.

The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one SSD
and 32 GB, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.

/svante




2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

 Thanks, Erick for the info.

 For indexing I agree the more time is consumed in data acquisition which
 in our case from Database.  For indexing currently we are using the manual
 process i.e. Solr dashboard Data Import but now looking to automate.  How
 do you suggest to automate the index part. Do you recommend to use SolrJ or
 should we try to automate using Curl?


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, January 24, 2014 2:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Can't be done with the information you provided, and can only be guessed
 at even with more comprehensive information.

 Here's why:


 http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

 Also, at a guess, your indexing speed is so slow due to data acquisition;
 I rather doubt you're being limited by raw Solr indexing. If you're using
 SolrJ, try commenting out the
 server.add() bit and running again. My guess is that your indexing speed
 will be almost unchanged, in which case it's the data acquisition process
 is where you should concentrate efforts. As a comparison, I can index 11M
 Wikipedia docs on my laptop in 45 minutes without any attempts at
 parallelization.


 Best,
 Erick

 On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
 susheel.ku...@thedigitalgroup.net wrote:
  Hi,
 
  Currently we are indexing 10 million document from database (10 db data
 entities)  index size is around 8 GB on windows virtual box. Indexing in
 one shot taking 12+ hours while indexing parallel in separate cores 
 merging them together taking 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a Production
 environment. There can be 200+ users performing search same time.
 
  No of physical servers (considering solr cloud) Memory requirement
  Processor requirement (# cores) Linux as OS oppose to windows
 
  Thanks in advance.
  Susheel
 



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Otis Gospodnetic
Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but
making a few assumptions and combining them with experience (+ a licked
finger in the air):
* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being
reindexed over and over, that you don't have a search rate higher than a
few dozen QPS, assuming your queries are not a page long, etc. assuming
best practices are followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Hi,

 Currently we are indexing 10 million document from database (10 db data
 entities)  index size is around 8 GB on windows virtual box. Indexing in
 one shot taking 12+ hours while indexing parallel in separate cores 
 merging them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a Production
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud)
 Memory requirement
 Processor requirement (# cores)
 Linux as OS oppose to windows

 Thanks in advance.
 Susheel




Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

How can we issue an update request and be certain that all of the replicas in 
the SolrCloud cluster are up to date?

I found this post:

http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed before it returns to client that the operation 
succeeded - but we've been seeing behavior lately (until we configured automatic soft commits) where the replicas were 
almost always not current - i.e. the replicas were missing documents/etc.


Is this something wrong with our cloud setup/replication, or am I misinterpreting the way that updates in a cloud 
deployment are supposed to function?


If it's a problem with our cloud setup, do you have any suggestions on 
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-24 Thread Joel Bernstein
If you're on Solr 4.6 then this is likely the issue:
https://issues.apache.org/jira/browse/SOLR-4260.

The issue is resolved for Solr 4.6.1 which should be out next week.


Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 24, 2014 at 9:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:

 How can we issue an update request and be certain that all of the replicas
 in the SolrCloud cluster are up to date?

 I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
 solr.user/79886

 which seems to indicate that all replicas for a shard must finish/succeed
 before it returns to client that the operation succeeded - but we've been
 seeing behavior lately (until we configured automatic soft commits) where
 the replicas were almost always not current - i.e. the replicas were
 missing documents/etc.

 Is this something wrong with our cloud setup/replication, or am I
 misinterpreting the way that updates in a cloud deployment are supposed to
 function?

 If it's a problem with our cloud setup, do you have any suggestions on
 diagnostics?

 Alternatively, are we perhaps just using it wrong?

 -- Nathan

 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412



Re: Replica not consistent after update request?

2014-01-24 Thread Anshum Gupta
Hi Nathan,

It'd be great to have more information about your setup, Solr Version?
Depending upon your version, you might want to also look at:
https://issues.apache.org/jira/browse/SOLR-4260 (which is now fixed).


On Fri, Jan 24, 2014 at 6:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:

 How can we issue an update request and be certain that all of the replicas
 in the SolrCloud cluster are up to date?

 I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
 solr.user/79886

 which seems to indicate that all replicas for a shard must finish/succeed
 before it returns to client that the operation succeeded - but we've been
 seeing behavior lately (until we configured automatic soft commits) where
 the replicas were almost always not current - i.e. the replicas were
 missing documents/etc.

 Is this something wrong with our cloud setup/replication, or am I
 misinterpreting the way that updates in a cloud deployment are supposed to
 function?

 If it's a problem with our cloud setup, do you have any suggestions on
 diagnostics?

 Alternatively, are we perhaps just using it wrong?

 -- Nathan

 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

Wow, the detail in that jira issue makes my brain hurt... Great to see it's got 
a quick answer/fix!

Thank you!

-- Nathan

On 01/24/2014 09:43 PM, Joel Bernstein wrote:

If you're on Solr 4.6 then this is likely the issue:
https://issues.apache.org/jira/browse/SOLR-4260.

The issue is resolved for Solr 4.6.1 which should be out next week.


Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 24, 2014 at 9:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:


How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always not current - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412





--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-24 Thread Nathan Neulinger

It's 4.6.0. Pair of servers with an external 3-node zk ensemble.

SOLR-4260 looks like a very promising answer. Will check it out as soon as 
4.6.1 is released.

May also check out the nightly builds since this is still just 
development/prototype usage.

-- Nathan

On 01/24/2014 09:45 PM, Anshum Gupta wrote:

Hi Nathan,

It'd be great to have more information about your setup, Solr Version?
Depending upon your version, you might want to also look at:
https://issues.apache.org/jira/browse/SOLR-4260 (which is now fixed).


On Fri, Jan 24, 2014 at 6:52 PM, Nathan Neulinger nn...@neulinger.orgwrote:


How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.
solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always not current - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412







--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


Re: Replica not consistent after update request?

2014-01-24 Thread Erick Erickson
Right. The updates are guaranteed to be on the replicas and in their
transaction logs. That doesn't mean they're searchable, however. For a
document to be found in a search there must be a commit, either soft,
or hard with openSearcher=true. Here's a post that outlines all this.



If you see discrepancies after commits, that's a problem.

Best,
Erick

On Fri, Jan 24, 2014 at 8:52 PM, Nathan Neulinger nn...@neulinger.org wrote:
 How can we issue an update request and be certain that all of the replicas
 in the SolrCloud cluster are up to date?

 I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886

 which seems to indicate that all replicas for a shard must finish/succeed
 before it returns to client that the operation succeeded - but we've been
 seeing behavior lately (until we configured automatic soft commits) where
 the replicas were almost always not current - i.e. the replicas were
 missing documents/etc.

 Is this something wrong with our cloud setup/replication, or am I
 misinterpreting the way that updates in a cloud deployment are supposed to
 function?

 If it's a problem with our cloud setup, do you have any suggestions on
 diagnostics?

 Alternatively, are we perhaps just using it wrong?

 -- Nathan

 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412


RE: Solr server requirements for 100+ million documents

2014-01-24 Thread Susheel Kumar
Thanks, Svante. Your indexing speed using the DB seems really fast. Can you
please provide some more detail on how you are indexing DB records. Is it through
DataImportHandler? And what database? Is it a local DB? We are indexing around
70 fields (60 multivalued), but data is not always populated in all fields. The
average document size is 5-10 KB.

-Original Message-
From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante 
karlsson
Sent: Friday, January 24, 2014 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 
9524 sec using libcurl.
11 million took 763 seconds so the speed drops somewhat with increasing dbsize.

We write 1000 docs (just an arbitrary number) in each request from two threads. 
If you will be using solrcloud you will want more writer threads.

The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD and 
32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.

/svante




2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

 Thanks, Erick for the info.

 For indexing I agree the more time is consumed in data acquisition 
 which in our case from Database.  For indexing currently we are using 
 the manual process i.e. Solr dashboard Data Import but now looking to 
 automate.  How do you suggest to automate the index part. Do you 
 recommend to use SolrJ or should we try to automate using Curl?


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, January 24, 2014 2:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Can't be done with the information you provided, and can only be 
 guessed at even with more comprehensive information.

 Here's why:


 http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we
 -dont-have-a-definitive-answer/

 Also, at a guess, your indexing speed is so slow due to data 
 acquisition; I rather doubt you're being limited by raw Solr indexing. 
 If you're using SolrJ, try commenting out the
 server.add() bit and running again. My guess is that your indexing 
 speed will be almost unchanged, in which case it's the data 
 acquisition process is where you should concentrate efforts. As a 
 comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes 
 without any attempts at parallelization.


 Best,
 Erick

 On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar  
 susheel.ku...@thedigitalgroup.net wrote:
  Hi,
 
  Currently we are indexing 10 million document from database (10 db 
  data
 entities)  index size is around 8 GB on windows virtual box. Indexing 
 in one shot taking 12+ hours while indexing parallel in separate cores 
  merging them together taking 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a 
 Production environment. There can be 200+ users performing search same time.
 
  No of physical servers (considering solr cloud) Memory requirement 
  Processor requirement (# cores) Linux as OS oppose to windows
 
  Thanks in advance.
  Susheel
 



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Kranti Parisa
Can you post the complete solrconfig.xml and schema.xml files so we can
review all of the settings that would impact your indexing performance?

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Thanks, Svante. Your indexing speed using db seems to really fast. Can you
 please provide some more detail on how you are indexing db records. Is it
 thru DataImportHandler? And what database? Is that local db?  We are
 indexing around 70 fields (60 multivalued) but data is not populated always
 in all fields. The average size of document is in 5-10 kbs.

 -Original Message-
 From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
 svante karlsson
 Sent: Friday, January 24, 2014 5:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 I just indexed 100 million db docs (records) with 22 fields (4
 multivalued) in 9524 sec using libcurl.
 11 million took 763 seconds so the speed drops somewhat with increasing
 dbsize.

 We write 1000 docs (just an arbitrary number) in each request from two
 threads. If you will be using solrcloud you will want more writer threads.

 The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD
 and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.

 /svante




 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

  Thanks, Erick for the info.
 
  For indexing I agree the more time is consumed in data acquisition
  which in our case from Database.  For indexing currently we are using
  the manual process i.e. Solr dashboard Data Import but now looking to
  automate.  How do you suggest to automate the index part. Do you
  recommend to use SolrJ or should we try to automate using Curl?
 
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, January 24, 2014 2:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr server requirements for 100+ million documents
 
  Can't be done with the information you provided, and can only be
  guessed at even with more comprehensive information.
 
  Here's why:
 
 
  http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we
  -dont-have-a-definitive-answer/
 
  Also, at a guess, your indexing speed is so slow due to data
  acquisition; I rather doubt you're being limited by raw Solr indexing.
  If you're using SolrJ, try commenting out the
  server.add() bit and running again. My guess is that your indexing
  speed will be almost unchanged, in which case it's the data
  acquisition process is where you should concentrate efforts. As a
  comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
  without any attempts at parallelization.
 
 
  Best,
  Erick
 
  On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
  susheel.ku...@thedigitalgroup.net wrote:
   Hi,
  
   Currently we are indexing 10 million document from database (10 db
   data
  entities)  index size is around 8 GB on windows virtual box. Indexing
  in one shot taking 12+ hours while indexing parallel in separate cores
   merging them together taking 4+ hours.
  
   We are looking to scale to 100+ million documents and looking for
  recommendation on servers requirements on below parameters for a
  Production environment. There can be 200+ users performing search same
 time.
  
   No of physical servers (considering solr cloud) Memory requirement
   Processor requirement (# cores) Linux as OS oppose to windows
  
   Thanks in advance.
   Susheel