Merged segment warmer Solr 4.4

2013-07-29 Thread Manuel Le Normand
Hi,
I have a machine with slow storage and not enough RAM to hold the whole index.
This causes the first queries (~5000) to be very slow
(they are read from disk and my CPU is most of the time in iowait), and after
that, reads from the index become very fast and are served mainly from
memory, as the OS cache has cached the most used parts of the index.
My concern is about new segments that are committed to disk, either merged
segments or newly formed segments.
My first thought was to tune the Linux caching policy (to favor the
caching of index files over the least frequently used uninverted files)
so that the right parts end up in the OS cache without having to
explicitly query the index for this to happen.
Secondly I thought of adding a newSearcher event listener that queries
docs that were inserted since the last hard commit.
A new feature of Solr 4.4 (SOLR-4761) is the ability to configure a mergedSegmentWarmer
- how does this component work, and is it a good fit for my use case?
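For context, I assume the configuration would be something like the following in
solrconfig.xml (a minimal sketch, using Lucene's stock SimpleMergedSegmentWarmer),
though part of my question is whether this is the right way to use it:

<indexConfig>
  <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>
</indexConfig>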

Are there any other ideas for dealing with this use case? What would you
propose as the most effective way to deal with it?


Re: processing documents in solr

2013-07-29 Thread Aditya
Hi,

The easiest solution would be to have the timestamp indexed. Is there any issue
with re-indexing?
If you want to process records in batches then you need an ordered list and a
bookmark. You require a field to sort on, and you maintain a counter / last id as the
bookmark. This is mandatory to solve your problem.

If you don't want to re-index, then you need to maintain information
about which documents have been visited. Have a database / Solr core which maintains
the list of IDs that have already been processed. Fetch records from Solr and, for each
record, check that DB to see whether the record has already been processed.
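As a rough sketch of the ordered-field / bookmark approach (the field name counter,
the batch size and the bookmark value are placeholders), each pass could pull the
next batch with something like:

q=*:*&fq=counter:[<last bookmark + 1> TO *]&sort=counter+asc&rows=1000&fl=id,counter

and then advance the bookmark to the highest counter value seen in that batch.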

Regards
Aditya
www.findbestopensource.com





On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com wrote:

 Basically, I was thinking about running a range query like Shawn suggested
 on the tstamp field, but unfortunately it was not indexed. Range queries
 only work on indexed fields, right?


 On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote:

  I've been thinking about the tstamp solution in the past few days. But too
  bad, the field is available but not indexed...
 
  I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
  counter value. If yes, that would be equivalent to an autoincrement id.
 I'm
  indexing from Nutch though; don't know how to feed in such counter...
 
 
  On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Why wouldn't a simple timestamp work for the ordering? Although
  I guess simple timestamp isn't really simple if the time settings
  change.
 
  So how about a simple counter field in your documents? Assuming
  you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
  Take the counter from the first document returned. Increment for
  each doc for the life of the indexing run. Now you've got, for all
 intents
  and purposes, an identity field albeit manually maintained.
 
  Then use your counter field as Shawn suggests for pulling all the
  data out.
 
  FWIW,
  Erick
 
  On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
  mcucchi...@apache.org wrote:
   In both cases, for better performance, first I'd load just all the
 IDs,
   after, during processing I'd load each document.
   For what concerns the incremental requirement, it should not be difficult to
   write a hash function which maps a non-numerical id to a value.
On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote:
  
   Dear list:
  
   I have an ever-growing solr repository, and I need to process every
  single
   document to extract statistics. What would be a reasonable process
 that
    satisfies the following properties:
  
   - Exhaustive: I have to traverse every single document
   - Incremental: in other words, it has to allow me to divide and
  conquer ---
   if I have processed the first 20k docs, next time I can start with
  20001.
  
   A simple *:* query would satisfy the 1st but not the 2nd property.
 In
   fact, given that the processing will take very long, and the
 repository
   keeps growing, it is not even clear that the exhaustiveness is
  achieved.
  
   I'm running solr 3.6.2 in a single-machine setting; no hadoop
  capability
   yet. But I guess the same issues still hold even if I have the solr
  cloud
   environment, right, say in each shard?
  
   Any help would be greatly appreciated.
  
   Joe
  
 
 
 



RAM Usage Debugging

2013-07-29 Thread Furkan KAMACI
When I look at my dashboard I see that 27.30 GB is available for the JVM, 24.77
GB is gray and 16.50 GB is black. I am not doing anything on my machine right
now. Did it cache documents, or is there a problem? How can I find out?


RE: new field type - enum field

2013-07-29 Thread Elran Dvir
Thanks, Erick.

I have tried it four times. It keeps failing.
The problem reoccurred today. 

Thanks.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, July 29, 2013 2:44 AM
To: solr-user@lucene.apache.org
Subject: Re: new field type - enum field

You should be able to attach a patch; I wonder if there was some temporary glitch
in JIRA. Is this persisting?

Let us know if this continues...

Erick

On Sun, Jul 28, 2013 at 12:11 PM, Elran Dvir elr...@checkpoint.com wrote:
 Hi,

 I have created an issue: 
 https://issues.apache.org/jira/browse/SOLR-5084
 I tried to attach my patch, but it failed: "Cannot attach file
 Solr-5084.patch: Unable to communicate with JIRA."
 What am I doing wrong?

 Thanks.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, July 25, 2013 3:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: new field type - enum field

 Start here: http://wiki.apache.org/solr/HowToContribute

 Then, when your patch is ready submit a JIRA and attach your patch. Then 
 nudge (gently) if none of the committers picks it up and applies it

 NOTE: It is _not_ necessary that the first version of your patch is 
 completely polished. I often put up partial/incomplete patches (comments with 
 //nocommit are explicitly caught by the ant precommit target for instance) 
 to see if anyone has any comments before polishing.

 Best
 Erick

 On Thu, Jul 25, 2013 at 5:04 AM, Elran Dvir elr...@checkpoint.com wrote:
 Hi,

 I have implemented like Chris described it:
 The field is indexed as numeric, but displayed as string, according to 
 configuration.
 It applies to facet, pivot, group and query.

 How do we proceed? How do I contribute it?

 Thanks.

 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Thursday, July 25, 2013 4:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: new field type - enum field


 : Doable at Lucene level by any chance?

 Given how well the Trie fields compress (ByteField and ShortField have been 
 deprecated in favor of TrieIntField for this reason) it probably just makes 
 sense to treat it as a numeric at the Lucene level.

 :  If there's positive feedback, I'll open an issue with a patch for the 
 functionality.

 I've typically dealt with this sort of thing at the client layer
 using a simple numeric field in Solr, or used an UpdateProcessor to
 convert the String->numeric mapping when indexing & client logic or a
 DocTransformer to handle the stored value at query time -- but having a
 built in FieldType that handles that for you automatically (and helps ensure
 the indexed values conform to the enum) would certainly be cool if you'd
 like to contribute it.


 -Hoss



Re: Two-steps queries with different sorting criteria

2013-07-29 Thread Otis Gospodnetic
Hi,

Not sure if this was already answered, but...

If the source of the problem are overly general queries, I would try to
eliminate or minimize that.  For example:
* offering query autocomplete functionality can have an effect on query
length and precision
* showing related searches (derived from query logs) and exposing queries
that are not as general could lead to people using that functionality after
the initial search without running into the issue you described

As for sorting by relevance and then sorting the top N of such hits by
something else - yes, you can write a custom SearchComponent and do this in
a single call from the client.  We've implemented similar functionality a
few times for a couple of Sematext clients and it worked well.
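The "do it in the application" option you mention below is also simple with SolrJ.
A minimal sketch (not production code - the field name "price", the window size of 100
and the URL are placeholders, and CommonsHttpSolrServer is the 3.x client class since
you are on Solr 3.4.0):

import java.util.Collections;
import java.util.Comparator;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class TwoStepSort {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Step 1: relevance-limited window (default sort is score desc)
    SolrQuery q = new SolrQuery("some generic user query");
    q.setRows(100);                     // N = size of the window to re-sort
    q.set("fl", "id,price,score");
    QueryResponse rsp = solr.query(q);
    SolrDocumentList top = rsp.getResults();

    // Step 2: re-sort that fixed window by the second criterion (here: price asc)
    Collections.sort(top, new Comparator<SolrDocument>() {
      public int compare(SolrDocument a, SolrDocument b) {
        Float pa = (Float) a.getFieldValue("price");   // assumes price is a stored float
        Float pb = (Float) b.getFieldValue("price");
        return pa.compareTo(pb);
      }
    });
    System.out.println("Re-sorted " + top.size() + " docs");
  }
}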

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html


On Thursday, July 18, 2013, Fabio Amato wrote:

 Hi all,
 I need to execute a Solr query in two steps, executing in the first step a
 generic limited-results query ordered by relevance, and in the second step
 the ordering of the results of the first step according to a given sorting
 criterion (different from relevance).

 This two-step query is meaningful when the query terms are so generic that
 the number of matched results exceeds the wanted number of results.

 In such circumstances, using single-step queries with different sorting
 criteria has a very confusing effect on the user experience, because at
 each change of sorting criterion the user gets different results even if
 the search query and the filtering conditions have not changed.

 On the contrary, using a two-step query where the sorting order of the
 first step is always relevance is more acceptable in case of a large
 number of matched results, because the result set would not change with the
 sorting criterion of the second step.

 I am wondering if such a two-step query is achievable with a single Solr
 query, or if I am obliged to execute the sorting step of my two-step query
 outside of Solr (i.e. in my application). Another possibility could be the
 development of a Solr plugin, but I am afraid of the possible effects on
 performance.

 I am using Solr 3.4.0

 Thanks in advance for your kind help.
 Fabio



-- 
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm


.lock file not created when making a backup snapshot

2013-07-29 Thread Artem Karpenko

Hi,

when making a backup snapshot using the /replication?command=backup call,
a snapshot directory is created and starts to be filled, but the appropriate
.lock file is not created, so it's impossible to check when the backup is
finished. I've taken a look at the code and it seems to me that the
lock.obtain() call is missing; there is:


public class SnapShooter {
...
void createSnapshot(final IndexCommit indexCommit, int numberToKeep, 
ReplicationHandler replicationHandler) {

...
lock = lockFactory.makeLock(directoryName + ".lock");
...
lock.release();

so the lock file is never actually created. This is Solr 4.3.1; the release notes
for Solr 4.4 do not mention this problem.


Should I raise a JIRA issue for this? Or maybe you could suggest more 
reliable way to make a backup?


Regards,
Artem.


AND Queries

2013-07-29 Thread Furkan KAMACI
I am searching for keywords like this:

lang:en AND url:book pencil cat

It returns results, however none of them includes all of the book, pencil and
cat keywords. How should I rewrite my query?

I tried this:

lang:en AND url:(book AND pencil AND cat)

and it looks OK. However this does not:


lang:en AND url:book AND pencil AND cat

why?


Re: AND Queries

2013-07-29 Thread Rafał Kuć
Hello!

Try turning on debugQuery and see what is happening. From what I see
you are searching for the en term in the lang field, the book term in the url
field and the pencil and cat terms in the default search field, but
from your second query I see that you would like to find the last two
terms in the url field.

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch

 I am searching for a keyword as like that:

 lang:en AND url:book pencil cat

 It returns me results however none of them includes both book, pencil and
 cat keywords. How should I rewrite my query?

 I tried this:

 lang:en AND url:(book AND pencil AND cat)

 and looks like OK. However this not:


 lang:en AND url:book AND pencil AND cat

 why?



Re: .lock file not created when making a backup snapshot

2013-07-29 Thread Mark Triggs
Hi Artem,

I noticed this recently too.  I created a JIRA issue here:

  https://issues.apache.org/jira/browse/SOLR-5040

Cheers,

Mark


Artem Karpenko a.karpe...@oxseed.com writes:

 Hi,

 when making a backup snapshot using /replication?command=backup
 call, a snapshot directory is created and starts to be filled, but
 appropriate .lock file is not created so it's impossible to check when
 backup is finished. I've taken a look at code and it seems to me that
 lock.obtain() call is missing: there is

 public class SnapShooter {
 ...
 void createSnapshot(final IndexCommit indexCommit, int numberToKeep,
 ReplicationHandler replicationHandler) {
 ...
  lock = lockFactory.makeLock(directoryName + ".lock");
 ...
 lock.release();

 so lock file is not actually created. This is Solr 4.3.1, release
 notes for Solr 4.4 do not include this problem.

 Should I raise a JIRA issue for this? Or maybe you could suggest more
 reliable way to make a backup?

 Regards,
 Artem.

-- 
Mark Triggs
m...@dishevelled.net


swap and GC

2013-07-29 Thread Bernd Fehling
Something interesting I have noticed today,
after running my huge single index (49 million records / 137 GB index) for
about a week and replicating today, I noticed that the heap usage after
replication did not go down as expected. Expected means: if Solr is started
I have a heap size between 4 and 5 GB, and during the week under heavy load
it might go up to 10 GB. But after replication in offline mode it recovers
to between 5 and 6 GB. But today it was not going under 8 GB, even with
forced GC from jvisualvm.
So I first dropped the caches and tried again, no success.
Next I turned off swap, which took quite a while, and turned it back on.
This forced all content from swap back into memory. After calling
"Perform GC" from jvisualvm the heap dropped below 5 GB. Bingo!

This leads me to the conclusion that java GC is not seeing or reaching
objects which are located in swap.

Anyone else seen this?

As I am not short on memory and don't have any other problems, I don't need a
solution,
but for users having memory problems with old objects in swap
I would suggest a cron job after replication with swapoff/swapon and a GC
afterwards.

Bernd


Re: AND Queries

2013-07-29 Thread Furkan KAMACI
When I send that query:

select?pf=url^10+title^8&fl=url,content,title&start=0&q=lang:en+AND+(cat+AND+dog+AND+pencil)&qf=content^5+url^8.0+title^6&wt=xml&debugQuery=on

It is debugged as:

+(+lang:en +(+(content:cat^5.0 | title:cat^6.0 | url:cat^8.0)
+(content:dog^5.0 | title:dog^6.0 | url:dog^8.0) +(content:pencil^5.0 |
title:pencil^6.0 | url:pencil^8.0))) (url:cat dog pencil^10.0)
(title:(cat dog pencil)^8.0)

Why is the default field not applied in this situation?



2013/7/29 fbrisbart fbrisb...@bestofmedia.com

 It's because when you don't specify any field, it's the default field
 which is used.

 So,
 lang:en AND url:book AND pencil AND cat

 is interpreted as :
  lang:en AND url:book AND default_field:pencil AND default_field:cat


 The default search field is defined in your schema.xml file
 (defaultSearchField)


 Franck Brisbart

 Le lundi 29 juillet 2013 à 12:06 +0300, Furkan KAMACI a écrit :
  I am searching for a keyword as like that:
 
  lang:en AND url:book pencil cat
 
  It returns me results however none of them includes both book, pencil and
  cat keywords. How should I rewrite my query?
 
  I tried this:
 
  lang:en AND url:(book AND pencil AND cat)
 
  and looks like OK. However this not:
 
 
  lang:en AND url:book AND pencil AND cat
 
  why?





Re: AND Queries

2013-07-29 Thread fbrisbart
Because you specified the search fields to use with 'qf' which overrides
the default search field.

Franck Brisbart


Le lundi 29 juillet 2013 à 13:01 +0300, Furkan KAMACI a écrit :
 When I send that query:
 
 select?pf=url^10+title^8&fl=url,content,title&start=0&q=lang:en+AND+(cat+AND+dog+AND+pencil)&qf=content^5+url^8.0+title^6&wt=xml&debugQuery=on
 
 It is debugged as:
 
 +(+lang:en +(+(content:cat^5.0 | title:cat^6.0 | url:cat^8.0)
 +(content:dog^5.0 | title:dog^6.0 | url:dog^8.0) +(content:pencil^5.0 |
 title:pencil^6.0 | url:pencil^8.0))) (url:cat dog pencil^10.0)
 (title:(cat dog pencil)^8.0)
 
 Why default field is not applied at this situation?
 
 
 
 2013/7/29 fbrisbart fbrisb...@bestofmedia.com
 
  It's because when you don't specify any field, it's the default field
  which is used.
 
  So,
  lang:en AND url:book AND pencil AND cat
 
  is interpreted as :
   lang:en AND url:book AND default_field:pencil AND default_field:cat
 
 
  The default search field is defined in your schema.xml file
  (defaultSearchField)
 
 
  Franck Brisbart
 
  Le lundi 29 juillet 2013 à 12:06 +0300, Furkan KAMACI a écrit :
   I am searching for a keyword as like that:
  
   lang:en AND url:book pencil cat
  
   It returns me results however none of them includes both book, pencil and
   cat keywords. How should I rewrite my query?
  
   I tried this:
  
   lang:en AND url:(book AND pencil AND cat)
  
   and looks like OK. However this not:
  
  
   lang:en AND url:book AND pencil AND cat
  
   why?
 
 
 




solr query range upper exclusive

2013-07-29 Thread alin1918
q=price_1_1:[197 TO 249] and q=*:*&fq=price_1_1:[197 TO 249] return 2
records,

but I have two records with price_1_1 = 249; it seems that the upper
range is exclusive and I can't figure out why. Can you help me?

<dynamicField name="price_*" type="tfloat" indexed="true"/>

<fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8"
omitNorms="true" positionIncrementGap="0"/>






Re: processing documents in solr

2013-07-29 Thread Erick Erickson
No, SolrJ doesn't provide this automatically. You'd be providing the
counter by inserting it into the document as you create new docs.

You could do this with any kind of document creation you are
using.
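A minimal SolrJ sketch of the idea (not tested - the counter field is hypothetical
and needs a matching entry in schema.xml, and on a restart you would seed the counter
from the current maximum already in the index):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CounterIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");
    long counter = 1;                              // seed from max(counter) on restart
    for (String text : new String[] {"doc one", "doc two"}) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + counter);
      doc.addField("counter", counter++);          // manually maintained identity field
      doc.addField("text", text);
      server.add(doc);
    }
    server.commit();
  }
}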

Best
Erick

On Mon, Jul 29, 2013 at 2:51 AM, Aditya findbestopensou...@gmail.com wrote:
 Hi,

 The easiest solution would be to have timestamp indexed. Is there any issue
 in doing re-indexing?
 If you want to process records in batch then you need a ordered list and a
 bookmark. You require a field to sort and maintain a counter / last id as
 bookmark. This is mandatory to solve your problem.

 If you don't want to re-index, then you need to maintain information
 related to visited nodes. Have a database / solr core which maintains list
 of IDs which already processed. Fetch record from Solr, For each record,
 check the new DB, if the record is already processed.

 Regards
 Aditya
 www.findbestopensource.com





 On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com wrote:

 Basically, I was thinking about running a range query like Shawn suggested
 on the tstamp field, but unfortunately it was not indexed. Range queries
 only work on indexed fields, right?


 On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com wrote:

   I've been thinking about the tstamp solution in the past few days. But too
   bad, the field is available but not indexed...
 
  I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
  counter value. If yes, that would be equivalent to an autoincrement id.
 I'm
  indexing from Nutch though; don't know how to feed in such counter...
 
 
  On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Why wouldn't a simple timestamp work for the ordering? Although
  I guess simple timestamp isn't really simple if the time settings
  change.
 
  So how about a simple counter field in your documents? Assuming
   you're indexing from SolrJ, your setup is to query q=*:*&sort=counter desc.
  Take the counter from the first document returned. Increment for
  each doc for the life of the indexing run. Now you've got, for all
 intents
  and purposes, an identity field albeit manually maintained.
 
  Then use your counter field as Shawn suggests for pulling all the
  data out.
 
  FWIW,
  Erick
 
  On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
  mcucchi...@apache.org wrote:
   In both cases, for better performance, first I'd load just all the
 IDs,
   after, during processing I'd load each document.
    For what concerns the incremental requirement, it should not be difficult to
    write a hash function which maps a non-numerical id to a value.
On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com wrote:
  
   Dear list:
  
   I have an ever-growing solr repository, and I need to process every
  single
   document to extract statistics. What would be a reasonable process
 that
    satisfies the following properties:
  
   - Exhaustive: I have to traverse every single document
   - Incremental: in other words, it has to allow me to divide and
  conquer ---
   if I have processed the first 20k docs, next time I can start with
  20001.
  
   A simple *:* query would satisfy the 1st but not the 2nd property.
 In
   fact, given that the processing will take very long, and the
 repository
   keeps growing, it is not even clear that the exhaustiveness is
  achieved.
  
   I'm running solr 3.6.2 in a single-machine setting; no hadoop
  capability
   yet. But I guess the same issues still hold even if I have the solr
  cloud
   environment, right, say in each shard?
  
   Any help would be greatly appreciated.
  
   Joe
  
 
 
 



Re: new field type - enum field

2013-07-29 Thread Erick Erickson
OK, if you can attach it to an e-mail, I'll attach it.

Just to check, though, make sure you're logged in. I've been fooled
once or twice by being automatically signed out...

Erick

On Mon, Jul 29, 2013 at 3:17 AM, Elran Dvir elr...@checkpoint.com wrote:
 Thanks, Erick.

 I have tried it four times. It keeps failing.
 The problem reoccurred today.

 Thanks.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Monday, July 29, 2013 2:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: new field type - enum field

 You should be able to attach a patch; I wonder if there was some temporary
 glitch in JIRA. Is this persisting?

 Let us know if this continues...

 Erick

 On Sun, Jul 28, 2013 at 12:11 PM, Elran Dvir elr...@checkpoint.com wrote:
 Hi,

 I have created an issue:
 https://issues.apache.org/jira/browse/SOLR-5084
  I tried to attach my patch, but it failed: "Cannot attach file
  Solr-5084.patch: Unable to communicate with JIRA."
 What am I doing wrong?

 Thanks.

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, July 25, 2013 3:25 PM
 To: solr-user@lucene.apache.org
 Subject: Re: new field type - enum field

 Start here: http://wiki.apache.org/solr/HowToContribute

 Then, when your patch is ready submit a JIRA and attach your patch. Then 
 nudge (gently) if none of the committers picks it up and applies it

 NOTE: It is _not_ necessary that the first version of your patch is 
 completely polished. I often put up partial/incomplete patches (comments 
 with //nocommit are explicitly caught by the ant precommit target for 
 instance) to see if anyone has any comments before polishing.

 Best
 Erick

 On Thu, Jul 25, 2013 at 5:04 AM, Elran Dvir elr...@checkpoint.com wrote:
 Hi,

 I have implemented like Chris described it:
 The field is indexed as numeric, but displayed as string, according to 
 configuration.
 It applies to facet, pivot, group and query.

 How do we proceed? How do I contribute it?

 Thanks.

 -Original Message-
 From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
 Sent: Thursday, July 25, 2013 4:40 AM
 To: solr-user@lucene.apache.org
 Subject: Re: new field type - enum field


 : Doable at Lucene level by any chance?

 Given how well the Trie fields compress (ByteField and ShortField have been 
 deprecated in favor of TrieIntField for this reason) it probably just makes 
 sense to treat it as a numeric at the Lucene level.

 :  If there's positive feedback, I'll open an issue with a patch for the 
 functionality.

 I've typically dealt with this sort of thing at the client layer
 using a simple numeric field in Solr, or used an UpdateProcessor to
 convert the String->numeric mapping when indexing & client logic or a
 DocTransformer to handle the stored value at query time -- but having a
 built in FieldType that handles that for you automatically (and helps
 ensure the indexed values conform to the enum) would certainly be cool if
 you'd like to contribute it.


 -Hoss



Re: Performance vs. maxBufferedAddsPerServer=10

2013-07-29 Thread Mark Miller
SOLR-4816 won't address this - it will just speed up *different* parts. There 
are other things that will need to be done to speed up that part.

- Mark

On Jul 26, 2013, at 3:53 PM, Erick Erickson erickerick...@gmail.com wrote:

 This is currently a hard-coded limit from what I've understood. From what
 I remember, Mark said Yonik said that there are reasons to make the
 packets that size. But whether this is empirically a Good Thing I don't know.
 
 SOLR-4816 will address this a different way by making SolrJ batch up
 the docs and send them to the right leader, which should pretty much remove
 any performance consideration here.
 
 There's some anecdotal evidence that changing that in the code might
 improve throughput, but I don't remember the details.
 
 FWIW
 Erick
 
 On Thu, Jul 25, 2013 at 7:09 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
 Hi,
 
 Context:
 * https://issues.apache.org/jira/browse/SOLR-4956
 * 
 http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer
 
 As you can see, maxBufferedAddsPerServer = 10.
 
 We have an app that sends 20K docs to SolrCloud using CloudSolrServer.
 We batch 20K docs for performance reasons. But then the receiving node
 ends up sending VERY small batches of just 10 docs around for indexing
 and we lose the benefit of batching those 20K docs in the first place.
 
 Our app is add only.
 
 Is there anything one can do to avoid performance loss associated with
 maxBufferedAddsPerServer=10?
 
 Thanks,
 Otis
 --
  Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Santanu8939967892
Hi,
   I have a huge volume of DB records, close to 250 million.
I am going to use DIH to index the data into Solr.
I need a good architecture to index and query the data in an efficient
manner.
I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4.


With Regards,
Santanu


Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Gora Mohanty
On 29 July 2013 17:30, Santanu8939967892 mishra.sant...@gmail.com wrote:
 Hi,
I have a huge volume of DB records, which is close to 250 millions.
 I am going to use DIH to index the data into Solr.
 I need a best architecture to index and query the data in an efficient
 manner.
[...]

This is difficult to answer without knowing details of your
particular use case. Your best bet would be to prototype
a system, and measure performance on at least a subset
of the data. If you search through earlier message on the
list, you should also come across some numbers for
performance, but it is best to test for your own needs.

Regards,
Gora


Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Jack Krupansky
The initial question is not how to index the data, but how you want to use 
or query the data. Use cases for query and data access should drive the data 
model that you will use to index the data.


So, what are some sample queries? How will users want to search and access 
the data? What data will they expect to see and in what form? Not so much 
from a UI perspective, but in terms of how the client app(s) will access 
data.


-- Jack Krupansky

-Original Message- 
From: Santanu8939967892

Sent: Monday, July 29, 2013 8:00 AM
To: solr-user@lucene.apache.org
Subject: DIH to index the data - 250 millions - Need a best architecture

Hi,
  I have a huge volume of DB records, which is close to 250 millions.
I am going to use DIH to index the data into Solr.
I need a best architecture to index and query the data in an efficient
manner.
I am using windows server 2008 with 16 GB RAM, zion processor and Solr 4.4.


With Regards,
Santanu 



Re: .lock file not created when making a backup snapshot

2013-07-29 Thread Artem Karpenko

Thanks Mark!

29.07.2013 12:32, Mark Triggs пишет:

Hi Artem,

I noticed this recently too.  I created a JIRA issue here:

   https://issues.apache.org/jira/browse/SOLR-5040

Cheers,

Mark


Artem Karpenko a.karpe...@oxseed.com writes:


Hi,

when making a backup snapshot using /replication?command=backup
call, a snapshot directory is created and starts to be filled, but
appropriate .lock file is not created so it's impossible to check when
backup is finished. I've taken a look at code and it seems to me that
lock.obtain() call is missing: there is

public class SnapShooter {
...
void createSnapshot(final IndexCommit indexCommit, int numberToKeep,
ReplicationHandler replicationHandler) {
...
lock = lockFactory.makeLock(directoryName + ".lock");
...
lock.release();

so lock file is not actually created. This is Solr 4.3.1, release
notes for Solr 4.4 do not include this problem.

Should I raise a JIRA issue for this? Or maybe you could suggest more
reliable way to make a backup?

Regards,
Artem.




Re: solr query range upper exclusive

2013-07-29 Thread Jack Krupansky
Square brackets are inclusive and curly braces are exclusive for range 
queries.
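
For example, price_1_1:[197 TO 249] includes both 197 and 249, while
price_1_1:{197 TO 249} excludes both endpoints (and recent Solr versions also
accept mixed forms such as [197 TO 249}).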


I tried a similar example with the standard Solr example and it works fine:

 curl "http://localhost:8983/solr/update?commit=true" \
 -H 'Content-type:application/json' -d '
 [{"id": "doc-1", "price_f": 249}]'

curl
"http://localhost:8983/solr/select/?q=price_f:%5b149+TO+249%5d&indent=true&wt=json"


Make sure that you don't have some other dynamic field pattern that is 
overriding or overlapping the one you showed us.


-- Jack Krupansky

-Original Message- 
From: alin1918

Sent: Monday, July 29, 2013 6:38 AM
To: solr-user@lucene.apache.org
Subject: solr query range upper exclusive

q=price_1_1:[197 TO 249] and q=*:*fq=price_1_1:[197 TO 249] returns 2
records

but I have two records with the price_1_1 = 249, it seams that the upper
range is exclusive and I can't figure out why, can you help me?

dynamicField name=price_*type=tfloat indexed=true/

fieldType name=tfloat class=solr.TrieFloatField precisionStep=8
omitNorms=true positionIncrementGap=0/







RE: swap and GC

2013-07-29 Thread Michael Ryan
This is interesting... How are you measuring the heap size?

-Michael

-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
Sent: Monday, July 29, 2013 5:34 AM
To: solr-user@lucene.apache.org
Subject: swap and GC

Something interesting I have noticed today, after running my huge single index 
(49 mio. records / 137 GB index) for about a week and replicating today I 
recognized that the heap usage after replication did not go down as expected. 
Expected means if solr is started I have a heap size between 4 to 5 GB and 
during the week under heavy load it might go up to 10 GB. But after replication 
in offline mode it recovers to between 5 to 6 GB. But today it was not going 
under 8 GB, even with forced GC from jvisualvm.
So I first dropped the caches and tried again, no success.
Next I turned off swap which took quite a while and turned it back on.
This forced all content from swap back into memory. After calling Perform GC 
from jvisualvm the heap dropped below 5 GB. Bingo!

This leads me to the conclusion that java GC is not seeing or reaching
objects which are located in swap.

Anyone else seen this?

As I am not short on memory or have any other problems I don't need any 
solution, but if there are some users having memory problems with old objects 
in swap I would suggest a cronjob after replication with swapoff/swapon and GC 
afterwards.

Bernd


Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Santanu8939967892
Hi Jack,
My sample query will be a keyword (text) and probably 2 to 3
filters.
There is a Java interface for displaying the data, which will consume a class,
and the class returns a data set object using SolrJ.
So for display we will use a list for binding; we may display 20 or 30
metadata fields.
I believe I have provided the information you asked for.

With Regards,
Santanu


On Mon, Jul 29, 2013 at 5:50 PM, Jack Krupansky j...@basetechnology.comwrote:

 The initial question is not how to index the data, but how you want to use
 or query the data. Use cases for query and data access should drive the
 data model that you will use to index the data.

 So, what are some sample queries? How will users want to search and access
 the data? What data will they expect to see and in what form? Not so much
 from a UI perspective, but in terms of how the client app(s) will access
 data.

 -- Jack Krupansky

 -Original Message- From: Santanu8939967892
 Sent: Monday, July 29, 2013 8:00 AM
 To: solr-user@lucene.apache.org
 Subject: DIH to index the data - 250 millions - Need a best architecture


 Hi,
   I have a huge volume of DB records, which is close to 250 millions.
 I am going to use DIH to index the data into Solr.
 I need a best architecture to index and query the data in an efficient
 manner.
 I am using windows server 2008 with 16 GB RAM, zion processor and Solr 4.4.


 With Regards,
 Santanu



Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Jack Krupansky
You neglected to provide information about the filters or the 20 or 30
metadata fields.


Did you mean to imply that you will not be querying against the metadata 
(only returning it)?


-- Jack Krupansky

-Original Message- 
From: Santanu8939967892

Sent: Monday, July 29, 2013 9:41 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH to index the data - 250 millions - Need a best architecture

Hi Jack,
   My sample query will be with a keyword (text) and probably 2 to 3
filters.
There is a java interface for display of data, which will consume a class,
and the class returns a data set object using SolrJ.
So for display we will use a list for binding. we may display 20 or 30 meta
data information.
I believe I have provided the information you have asked for.

With Regards,
Santanu


On Mon, Jul 29, 2013 at 5:50 PM, Jack Krupansky 
j...@basetechnology.comwrote:



The initial question is not how to index the data, but how you want to use
or query the data. Use cases for query and data access should drive the
data model that you will use to index the data.

So, what are some sample queries? How will users want to search and access
the data? What data will they expect to see and in what form? Not so much
from a UI perspective, but in terms of how the client app(s) will access
data.

-- Jack Krupansky

-Original Message- From: Santanu8939967892
Sent: Monday, July 29, 2013 8:00 AM
To: solr-user@lucene.apache.org
Subject: DIH to index the data - 250 millions - Need a best architecture


Hi,
  I have a huge volume of DB records, which is close to 250 millions.
I am going to use DIH to index the data into Solr.
I need a best architecture to index and query the data in an efficient
manner.
I am using windows server 2008 with 16 GB RAM, zion processor and Solr 
4.4.



With Regards,
Santanu





Re: solr query range upper exclusive

2013-07-29 Thread alin1918
what query parser should I use?  http://wiki.apache.org/solr/SolrQuerySyntax

Differences From Lucene Query Parser

Differences in the Solr Query Parser include

Range queries [a TO z], prefix queries a*, and wildcard queries a*b are
constant-scoring (all matching documents get an equal score). The scoring
factors tf, idf, index boost, and coord are not used. There is no limitation
on the number of terms that match (as there was in past versions of Lucene).

Lucene 2.1 has also switched to use ConstantScoreRangeQuery for its
range queries. 

A * may be used for either or both endpoints to specify an open-ended
range query.

field:[* TO 100] finds all field values less than or equal to 100

field:[100 TO *] finds all field values greater than or equal to 100

field:[* TO *] matches all documents with the field 
Pure negative queries (all clauses prohibited) are allowed.

-inStock:false finds all field values where inStock is not false

-field:[* TO *] finds all documents without a value for field 

A hook into FunctionQuery syntax. Quotes will be necessary to
encapsulate the function when it includes parentheses.

Example: _val_:myfield

Example: _val_:"recip(rord(myfield),1,2,3)"
Nested query support for any type of query parser (via QParserPlugin).
Quotes will often be necessary to encapsulate the nested query if it
contains reserved characters.

Example: _query_:"{!dismax qf=myfield}how now brown cow"



  






restricting a query by a set of field values

2013-07-29 Thread Benjamin Ryan
Hi,
   Is it possible to construct a query in Solr that is restricted
to only those documents that have a field value in a
particular set of values, similar to what would be done in Postgres with the SQL
query:

   SELECT date_deposited FROM stats
   WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 
23:59:00'
   AND collection_id IN ()

   In my SOLR schema.xml date_deposited is a TrieDateField and 
collection_id is an IntField

Regards,
   Ben

--
Dr Ben Ryan
Jorum Technical Manager

5.12 Roscoe Building
The University of Manchester
Oxford Road
Manchester
M13 9PL
Tel: 0160 275 6039
E-mail: benjamin.r...@manchester.ac.uk
--



The meaning of the doc= on the debugQuery output

2013-07-29 Thread Bruno René Santos
Hello

One line in the debugQuery output of a query is

2.1706323e-6 = score(doc=49578,freq=1.0 = termfreq=1.0), product of:

I wanted to know what doc= means. It seems to be something used in the
fieldWeight, but on the other hand it is the same for all fields in the
document, regardless of the query made or the fields searched...

Regards
Bruno

-- 
Bruno René Santos
Lisboa - Portugal


Re: restricting a query by a set of field values

2013-07-29 Thread Jason Hellman
Ben,

This could be constructed as so:

fl=date_deposited&fq=date:[2013-07-01T00:00:00Z TO
2013-07-31T23:59:00Z]&fq=collection_id:(1 2 n)&q.op=OR

The parentheses around the 1 2 n set indicate a boolean query, and we're
ensuring the terms are OR'ed together by the q.op parameter.

This should get you the result set you desire.  Please beware that a very large 
boolean set (your IN(…) parameter) may be expensive to run.
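
If you would rather not change q.op for the whole request, the OR can also be written
explicitly inside the filter, e.g. fq=collection_id:(1 OR 2 OR 17) - the numbers here
are just placeholders for your collection IDs.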

Jason

On Jul 29, 2013, at 7:33 AM, Benjamin Ryan benjamin.r...@manchester.ac.uk 
wrote:

 Hi,
   Is it possible to construct a query in SOLR to perform a query 
 that is restricted to only those documents that have a field value in a 
 particular set of values similar to what would be done in POstgres with the 
 SQL query:
 
   SELECT date_deposited FROM stats
   WHERE date BETWEEN '2013-07-01 00:00:00' AND '2013-07-31 
 23:59:00'
   AND collection_id IN ()
 
   In my SOLR schema.xml date_deposited is a TrieDateField and 
 collection_id is an IntField
 
 Regards,
   Ben
 
 --
 Dr Ben Ryan
 Jorum Technical Manager
 
 5.12 Roscoe Building
 The University of Manchester
 Oxford Road
 Manchester
 M13 9PL
 Tel: 0160 275 6039
 E-mail: benjamin.r...@manchester.ac.uk
 --
 



SolrCloud and Joins

2013-07-29 Thread David Larochelle
I'm setting up SolrCloud with around 600 million documents. The basic
structure of each document is:

stories_id: integer, media_id: integer, sentence: text_en

We have a number of stories from different media and we treat each sentence
as a separate document because we need to run sentence level analytics.

We also have a concept of groups or sets of sources. We've imported this
media source to media sets mapping into Solr using the following structure:

media_id_inner: integer, media_sets_id: integer

For the single node case, we're able to filter our sources by media_set_id
using a join query like the following:

http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1

However, this does not work correctly with SolrCloud. The problem is that
the join query is performed separately on each of the shards and no shard
has the complete media set to source mapping data. So SolrCloud returns
incomplete results.

Since the complete media set to source mapping data is comparatively small
(~50,000 rows), I would like to replicate it on every shard, so that the
results of the individual join queries on separate shards would be
equivalent to performing the same query on a single-shard system.

However, I can't figure out how to replicate documents on separate
shards. The compositeID router has the ability to colocate documents based
on a prefix in the document ID but this isn't what I need. What I would
like is some way to either have the media set to source data replicated on
every shard or to be able to explicitly upload this data to the individual
shards. (For the rest of the data I like the compositeID autorouting.)

Any suggestions?

--

Thanks,


David


Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Shawn Heisey
On 7/29/2013 6:00 AM, Santanu8939967892 wrote:
 Hi,
I have a huge volume of DB records, which is close to 250 millions.
 I am going to use DIH to index the data into Solr.
 I need a best architecture to index and query the data in an efficient
 manner.
 I am using windows server 2008 with 16 GB RAM, zion processor and Solr 4.4.

Gora and Jack have given you great information.  I would add that when
you are dealing with an index of this size, you need to be prepared to
spend some real money on hardware if you want maximum performance.

With 20-30 fields, I would imagine that each document is probably a few
KB in size.  Even if they will be much smaller than that, with 250
million of them, your index will be pretty large.

I'd be VERY surprised if the index is less than 100GB, and something
larger than 500GB is probably more likely.  For illustration purposes,
let's be conservative and say it's 200GB.

16GB of RAM isn't enough for an index that size.  An ideal round
memory size for a 200GB index would be 256GB - 200GB of RAM for the OS
disk cache and enough memory for whatever size java heap you might need.
 In truth, you probably don't need to cache the ENTIRE index ... most
searches will involve only certain parts of the index and won't touch
the entire thing.  A good enough memory size might be 128GB which
would keep the most relevant parts of the index in RAM at all times.

If you were to put a 200GB index onto a disk that's SSD, you could
probably get away with 64GB of RAM - 50GB or so for the OS disk cache
and the rest for the java heap.

If your index will be larger than 200GB, then the numbers I have given
you will go up.  These numbers also assume that you have your entire
index on one server, which is probably not a good idea.

http://wiki.apache.org/solr/SolrPerformanceProblems

SolrCloud would likely be the best architecture.  It would spread out
your system requirements and load across multiple machines.  If you had
20 machines, each with 16-32GB of RAM, you could do a SolrCloud
installation with 10 shards and a replicationFactor of 2, and there
wouldn't be any memory problems.  Each machine would have 25 million
records on it, and you'd have two complete copies of your index so you'd
be able to keep running if a machine completely failed -- which DOES happen.
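
(For reference, a layout like that could be created with the Collections API once the
nodes are running - the collection and config names below are placeholders:

http://host:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=10&replicationFactor=2&maxShardsPerNode=1&collection.configName=myconfig )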

The information I've given you is for an ideal setup.  You can go
smaller, and budget needs might indeed cause you to go smaller.  If you
don't need extremely good performance from Solr, then you don't need to
spend the money required for an architecture like I've described.

Thanks,
Shawn



Re: The meaning of the doc= on the debugQuery output

2013-07-29 Thread fbrisbart
Hi,

doc is the internal (Lucene) docId of the document in the index.

Each doc in the index has an internal id. It starts from 0 (the first doc
inserted in the index), 1 for the 2nd, ...



Franck Brisbart



Le lundi 29 juillet 2013 à 15:34 +0100, Bruno René Santos a écrit :
 Hello
 
 One line on my debugQuery of a query is
 
 2.1706323e-6 = score(doc=49578,freq=1.0 = termfreq=1.0), product of:
 
 I wanted to know what the doc= means. It seems to be something used on the
 fieldWeight but on the other hand it is the same for all fields on the
 document, regardless of the query made or fields searched...
 
 Regards
 Bruno
 




Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Nitin Agarwal
Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1,
running on apache tomcat 7.0.42 with external zookeeper 3.4.5.

When I query select?q=*:*

I only get the number of documents found, but no actual documents. When I
query with rows=0, I do get the correct count of documents in the index.
Faceting queries as well as group-by queries also work with rows=0.
However, when rows is not equal to 0 I do not get any documents.

When I query the index I see that a query is being sent to both shards, and
subsequently I see a query being sent with just ids, however, after that
query returns I do not see any documents back.

Not sure what do I need to change, please help.

Thanks,
Nitin


Solr Out Of Memory with Field Collapsing

2013-07-29 Thread tushar_k47
Hi,

We are using the field collapsing feature with multiple shards. We ran into
Out of Memory errors on one of the shards. We use field collapsing on a
particular field which has only one specific value on the shard that goes
out of memory. Interestingly, the Out of Memory error recurred multiple times
during the day (about 4 times in 24 hours) without any significant deviation
from normal traffic or in the nature of the queries being run. The max heap
size allocated to the shard is 8 GB.

Since then we have done the following and the problem seems to be arrested
for now -

1. Added more horizontal slaves; we have brought this up from 3 to 6.
2. We have increased the replication poll interval from 5 minutes to 20
minutes.
3. We have decreased the minimum heap allocation for this Tomcat to 1 GB.
Earlier this was 4 GB.

The typical size of index directory on the problem shard is around 1Gb,
about 1 million documents in all. The average requests served are about
10/second for this shard. 

We have tried replaying the entire logs for the day on a test environment
but somehow it never goes out of memory with the same heap settings. Now we
are not certain that this would not happen again. 

Can someone suggest what could be the problem here ? Any help would be
greatly appreciated.

Regards,
Tushar

 





Re: SolrCloud and Joins

2013-07-29 Thread Walter Underwood
Denormalize. Add media_set_id to each sentence document. Done.
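
That is, each sentence document carries its set membership directly - a sketch,
assuming media_sets_id is added as a multiValued integer field:

stories_id: 42, media_id: 7, media_sets_id: [1, 3], sentence: ...

and the filter then becomes a plain fq=media_sets_id:1 with no join.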

wunder

On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

 I'm setting up SolrCloud with around 600 million documents. The basic
 structure of each document is:
 
 stories_id: integer, media_id: integer, sentence: text_en
 
 We have a number of stories from different media and we treat each sentence
 as a separate document because we need to run sentence level analytics.
 
 We also have a concept of groups or sets of sources. We've imported this
 media source to media sets mapping into Solr using the following structure:
 
 media_id_inner: integer, media_sets_id: integer
 
 For the single node case, we're able to filter our sources by media_set_id
 using a join query like the following:
 
 http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1
 
 However, this does not work correctly with SolrCloud. The problem is that
 the join query is performed separately on each of the shards and no shard
 has the complete media set to source mapping data. So SolrCloud returns
 incomplete results.
 
 Since the complete media set to source mapping data is comparatively small
 (~50,000 rows), I would like to replicate it on every shard. So that the
 results of the individual join queries on separate shards would  be
 equivalent to performing the same query on a single shard system.
 
 However, I'm can't figure out how to replicate documents on separate
 shards. The compositeID router has the ability to colocate documents based
 on a prefix in the document ID but this isn't what I need. What I would
 like is some way to either have the media set to source data replicated on
 every shard or to be able to explicitly upload this data to the individual
 shards. (For the rest of the data I like the compositeID autorouting.)
 
 Any suggestions?
 
 --
 
 Thanks,
 
 
 David

--
Walter Underwood
wun...@wunderwood.org





solr - set fields as default search field

2013-07-29 Thread Mysurf Mail
The following query works well for me

http://[]:8983/solr/vault/select?q=VersionComments%3AWhite

It returns all the documents where VersionComments includes "White".

I try to omit the field name and set it as a default value instead, as follows. In the
solr config I write:

<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
 will be overridden by parameters in the request
  -->
 <lst name="defaults">
   <str name="echoParams">explicit</str>
   <int name="rows">10</int>
   <str name="df">PackageName</str>
   <str name="df">Tag</str>
   <str name="df">VersionComments</str>
   <str name="df">VersionTag</str>
   <str name="df">Description</str>
   <str name="df">SKU</str>
   <str name="df">SKUDesc</str>
 </lst>

I restart Solr and run a full import.
Then I try using:

 http://[]:8983/solr/vault/select?q=White

(Where

 http://[]:8983/solr/vault/select?q=VersionComments%3AWhite

still works)

But I don't get any document as an answer.
What am I doing wrong?


Re: solr - set fields as default search field

2013-07-29 Thread Ahmet Arslan
Hi,


df is a single valued parameter. Only one field can be a default field.

To query multiple fields use (e)dismax query parser : 
http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29
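
For example (using the field names and URL from your mail), something like

http://[]:8983/solr/vault/select?q=White&defType=edismax&qf=PackageName+Tag+VersionComments+VersionTag+Description+SKU+SKUDesc

would search all of those fields without needing a field prefix.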



 From: Mysurf Mail stammail...@gmail.com
To: solr-user@lucene.apache.org 
Sent: Monday, July 29, 2013 6:31 PM
Subject: solr - set fields as default search field
 

The following query works well for me

http://[]:8983/solr/vault/select?q=VersionComments%3AWhite

returns all the documents where version comments includes White

I try to omit the field name and put it as a default value as follows : In
solr config I write

requestHandler name=/select class=solr.SearchHandler
!-- default values for query parameters can be specified, these
     will be overridden by parameters in the request
  --
lst name=defaults
   str name=echoParamsexplicit/str
   int name=rows10/int
   str name=dfPackageName/str
   str name=dfTag/str
   str name=dfVersionComments/str
   str name=dfVersionTag/str
   str name=dfDescription/str
   str name=dfSKU/str
   str name=dfSKUDesc/str
/lst

I restart the solr and create a full import.
Then I try using

http://[]:8983/solr/vault/select?q=White

(Where

http://[]:8983/solr/vault/select?q=VersionComments%3AWhite

still works)

But I dont get the document any as answer.
What am I doing wrong?

Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Jason Hellman
Nitin,

You need to ensure the fields you wish to see are marked stored=true in your 
schema.xml file, and you should include fields in your fl= parameter 
(fl=*,score is a good place to start).

Jason

On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com wrote:

 Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1,
 running on apache tomcat 7.0.42 with external zookeeper 3.4.5.
 
 When I query select?q=*:*
 
 I only get the number of documents found, but no actual document. When I
 query with rows=0, I do get correct count of documents in the index.
 Faceting queries as well as group by queries also work with rows=0.
 However, when rows is not equal to 0 I do not get any documents.
 
 When I query the index I see that a query is being sent to both shards, and
 subsequently I see a query being sent with just ids, however, after that
 query returns I do not see any documents back.
 
 Not sure what do I need to change, please help.
 
 Thanks,
 Nitin



Re: solr - set fields as default search field

2013-07-29 Thread Jason Hellman
Or use the copyField technique to copy everything into a single searchable field and
set df= to that field.  The example schema does this with the field called text.
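
A sketch of what that might look like in schema.xml, using a couple of the field names
from the original mail (the catch-all text field is the one from the example schema):

  <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="VersionComments" dest="text"/>
  <copyField source="PackageName" dest="text"/>

and then in the request handler defaults a single <str name="df">text</str>.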

On Jul 29, 2013, at 8:35 AM, Ahmet Arslan iori...@yahoo.com wrote:

 Hi,
 
 
 df is a single valued parameter. Only one field can be a default field.
 
 To query multiple fields use (e)dismax query parser : 
 http://wiki.apache.org/solr/ExtendedDisMax#qf_.28Query_Fields.29
 
 
 
 From: Mysurf Mail stammail...@gmail.com
 To: solr-user@lucene.apache.org 
 Sent: Monday, July 29, 2013 6:31 PM
  Subject: solr - set fields as default search field
 
 
 The following query works well for me
 
 http://[]:8983/solr/vault/select?q=VersionComments%3AWhite
 
 returns all the documents where version comments includes White
 
 I try to omit the field name and put it as a default value as follows : In
 solr config I write
 
 requestHandler name=/select class=solr.SearchHandler
 !-- default values for query parameters can be specified, these
  will be overridden by parameters in the request
   --
 lst name=defaults
str name=echoParamsexplicit/str
int name=rows10/int
str name=dfPackageName/str
str name=dfTag/str
str name=dfVersionComments/str
str name=dfVersionTag/str
str name=dfDescription/str
str name=dfSKU/str
str name=dfSKUDesc/str
 /lst
 
 I restart the solr and create a full import.
 Then I try using
 
 http://[]:8983/solr/vault/select?q=White
 
 (Where
 
 http://[]:8983/solr/vault/select?q=VersionComments%3AWhite
 
 still works)
 
 But I dont get the document any as answer.
 What am I doing wrong?



Re: SolrCloud and Joins

2013-07-29 Thread David Larochelle
We'd like to be able to easily update the media set to source mapping. I'm
concerned that if we store the media_sets_id in the sentence documents, it
will be very difficult to add additional media set to source mapping. I
imagine that adding a new media set would either require reimporting all
600 million documents or writing complicated application logic to find out
which sentences to update. Hence joins seem like a cleaner solution.

--

David


On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood wun...@wunderwood.orgwrote:

 Denormalize. Add media_set_id to each sentence document. Done.

 wunder

 On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:

  I'm setting up SolrCloud with around 600 million documents. The basic
  structure of each document is:
 
  stories_id: integer, media_id: integer, sentence: text_en
 
  We have a number of stories from different media and we treat each
 sentence
  as a separate document because we need to run sentence level analytics.
 
  We also have a concept of groups or sets of sources. We've imported this
  media source to media sets mapping into Solr using the following
 structure:
 
  media_id_inner: integer, media_sets_id: integer
 
  For the single node case, we're able to filter our sources by
 media_set_id
  using a join query like the following:
 
 
 http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
 
 http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1
 
 
  However, this does not work correctly with SolrCloud. The problem is that
  the join query is performed separately on each of the shards and no shard
  has the complete media set to source mapping data. So SolrCloud returns
  incomplete results.
 
  Since the complete media set to source mapping data is comparatively
 small
  (~50,000 rows), I would like to replicate it on every shard. So that the
  results of the individual join queries on separate shards would  be
  equivalent to performing the same query on a single shard system.
 
  However, I can't figure out how to replicate documents on separate
  shards. The compositeID router has the ability to colocate documents
 based
  on a prefix in the document ID but this isn't what I need. What I would
  like is some way to either have the media set to source data replicated
 on
  every shard or to be able to explicitly upload this data to the
 individual
  shards. (For the rest of the data I like the compositeID autorouting.)
 
  Any suggestions?
 
  --
 
  Thanks,
 
 
  David

 --
 Walter Underwood
 wun...@wunderwood.org






Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Nitin Agarwal
Jason, all my fields are set with stored=true and indexed=true, and I used

select?q=*:*&fl=*,score

but still I get the same response

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">138</int>
    <lst name="params">
      <str name="fl">*,score</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="167906126" start="0" maxScore="1.0"/>
</response>

Here is what my schema looks like

<fields>
  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
  <field name="bill_account_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="bill_account_nbr" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="cust_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="tn_lookup_key_id" type="lowercase" indexed="true" stored="true" required="true" />
</fields>



Nitin


On Mon, Jul 29, 2013 at 9:38 AM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 Nitin,

 You need to ensure the fields you wish to see are marked stored=true in
 your schema.xml file, and you should include fields in your fl= parameter
 (fl=*,score is a good place to start).

 Jason

 On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com
 wrote:

  Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1,
  running on apache tomcat 7.0.42 with external zookeeper 3.4.5.
 
  When I query select?q=*:*
 
  I only get the number of documents found, but no actual document. When I
  query with rows=0, I do get correct count of documents in the index.
  Faceting queries as well as group by queries also work with rows=0.
  However, when rows is not equal to 0 I do not get any documents.
 
  When I query the index I see that a query is being sent to both shards,
 and
  subsequently I see a query being sent with just ids, however, after that
  query returns I do not see any documents back.
 
  Not sure what do I need to change, please help.
 
  Thanks,
  Nitin




Re: restricting a query by a set of field values

2013-07-29 Thread Chris Hostetter

: fl=date_deposited&fq=date[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]&fq=collection_id(1 2 n)&q.op=OR

typo -- the colon is missing...

fq=collection_id:(1 2 n)

if you don't want the q.op to apply globally to your request, you can also 
scope it only for that filter. likewise the field_name: and paren syntax 
can be replaced by using the df param...

   fq={!lucene q.op=OR df=collection_id}1 2 3 4 5


-Hoss
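
For illustration, a self-contained SolrJ version of that request could look like the
following sketch (the collection ids and date range are the example values from above):

import org.apache.solr.client.solrj.SolrQuery;

public class CollectionFilterExample {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        q.setFields("date_deposited");
        q.addFilterQuery("date:[2013-07-01T00:00:00Z TO 2013-07-31T23:59:00Z]");
        // q.op and df are scoped to this one filter via local params
        q.addFilterQuery("{!lucene q.op=OR df=collection_id}1 2 3 4 5");
        System.out.println(q);   // prints the encoded request parameters
    }
}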


Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Jack Krupansky
Check the /select request handler in solrconfig. See if it sets defaults for 
start or rows. start is the offset of the first document to return (the default 
is 0), and rows is the number of rows to actually return in the response (nothing 
to do with numFound). The internal Solr default is rows=10, but you can set it to 
20, 50, 100, or whatever, but DO NOT set it to 0 unless you just want the header 
without any actual documents.


-- Jack Krupansky

-Original Message- 
From: Nitin Agarwal

Sent: Monday, July 29, 2013 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 
shards, replication Factor 1


Jason, all my fields are set with stored=true and indexed=true, and I used

select?q=*:*&fl=*,score

but still I get the same response

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">138</int>
    <lst name="params">
      <str name="fl">*,score</str>
      <str name="q">*:*</str>
    </lst>
  </lst>
  <result name="response" numFound="167906126" start="0" maxScore="1.0"/>
</response>

Here is what my schema looks like

<fields>
  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />
  <field name="bill_account_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="bill_account_nbr" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="cust_name" type="lowercase" indexed="true" stored="true" required="false" />
  <field name="tn_lookup_key_id" type="lowercase" indexed="true" stored="true" required="true" />
</fields>



Nitin


On Mon, Jul 29, 2013 at 9:38 AM, Jason Hellman 
jhell...@innoventsolutions.com wrote:


Nitin,

You need to ensure the fields you wish to see are marked stored=true in
your schema.xml file, and you should include fields in your fl= parameter
(fl=*,score is a good place to start).

Jason

On Jul 29, 2013, at 8:08 AM, Nitin Agarwal 2nitinagar...@gmail.com
wrote:

 Hi, I am using Solr 4.3.1 with 2 Shards and replication factor of 1,
 running on apache tomcat 7.0.42 with external zookeeper 3.4.5.

 When I query select?q=*:*

 I only get the number of documents found, but no actual document. When I
 query with rows=0, I do get correct count of documents in the index.
 Faceting queries as well as group by queries also work with rows=0.
 However, when rows is not equal to 0 I do not get any documents.

 When I query the index I see that a query is being sent to both shards,
and
 subsequently I see a query being sent with just ids, however, after that
 query returns I do not see any documents back.

 Not sure what do I need to change, please help.

 Thanks,
 Nitin






Re: processing documents in solr

2013-07-29 Thread Joe Zhang
I'll try reindexing the timestamp.

The id-creation approach suggested by Erick sounds attractive, but the
nutch/solr integration seems rather tight. I don't know where to break in to
insert the id into solr.
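
For reference, a minimal SolrJ sketch of the counter/bookmark approach from the
quoted replies below might look like this (the counter field, the Solr URL, and the
processDoc() callback are placeholders; the same pattern works against an indexed
tstamp field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class BatchTraversal {

    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        long bookmark = 0L;          // highest counter value already processed; persist between runs
        final int batchSize = 1000;

        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("counter:[" + (bookmark + 1) + " TO *]"); // only docs not yet processed
            q.set("sort", "counter asc");                              // stable, repeatable ordering
            q.setRows(batchSize);

            SolrDocumentList docs = server.query(q).getResults();
            if (docs.isEmpty()) {
                break;                                                 // nothing new since the last run
            }
            for (SolrDocument doc : docs) {
                processDoc(doc);                                       // per-document statistics
                bookmark = ((Number) doc.getFieldValue("counter")).longValue();
            }
        }
    }

    private static void processDoc(SolrDocument doc) {
        // statistics extraction goes here
    }
}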


On Mon, Jul 29, 2013 at 4:11 AM, Erick Erickson erickerick...@gmail.comwrote:

  No, SolrJ doesn't provide this automatically. You'd be providing the
 counter by inserting it into the document as you created new docs.

 You could do this with any kind of document creation you are
 using.

 Best
 Erick

 On Mon, Jul 29, 2013 at 2:51 AM, Aditya findbestopensou...@gmail.com
 wrote:
  Hi,
 
  The easiest solution would be to have timestamp indexed. Is there any
 issue
  in doing re-indexing?
  If you want to process records in batch then you need a ordered list and
 a
  bookmark. You require a field to sort and maintain a counter / last id as
  bookmark. This is mandatory to solve your problem.
 
  If you don't want to re-index, then you need to maintain information
  related to visited nodes. Have a database / solr core which maintains
 list
  of IDs which already processed. Fetch record from Solr, For each record,
  check the new DB, if the record is already processed.
 
  Regards
  Aditya
  www.findbestopensource.com
 
 
 
 
 
  On Mon, Jul 29, 2013 at 10:26 AM, Joe Zhang smartag...@gmail.com
 wrote:
 
  Basically, I was thinking about running a range query like Shawn
 suggested
  on the tstamp field, but unfortunately it was not indexed. Range queries
  only work on indexed fields, right?
 
 
  On Sun, Jul 28, 2013 at 9:49 PM, Joe Zhang smartag...@gmail.com
 wrote:
 
   I've been thinking about tstamp solution int the past few days. but
 too
   bad, the field is avaialble but not indexed...
  
   I'm not familiar with SolrJ. Again, sounds like SolrJ is providing the
   counter value. If yes, that would be equivalent to an autoincrement
 id.
  I'm
   indexing from Nutch though; don't know how to feed in such counter...
  
  
   On Sun, Jul 28, 2013 at 7:03 AM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
  
   Why wouldn't a simple timestamp work for the ordering? Although
   I guess simple timestamp isn't really simple if the time settings
   change.
  
   So how about a simple counter field in your documents? Assuming
   you're indexing from SolrJ, your setup is to query q=*:*sort=counter
   desc.
   Take the counter from the first document returned. Increment for
   each doc for the life of the indexing run. Now you've got, for all
  intents
   and purposes, an identity field albeit manually maintained.
  
   Then use your counter field as Shawn suggests for pulling all the
   data out.
  
   FWIW,
   Erick
  
   On Sun, Jul 28, 2013 at 1:01 AM, Maurizio Cucchiara
   mcucchi...@apache.org wrote:
In both cases, for better performance, first I'd load just all the
  IDs,
after, during processing I'd load each document.
For what concern the incremental requirement, it should not be
   difficult to
write an hash function which maps a non-numerical I'd to a value.
 On Jul 27, 2013 7:03 AM, Joe Zhang smartag...@gmail.com
 wrote:
   
Dear list:
   
I have an ever-growing solr repository, and I need to process
 every
   single
document to extract statistics. What would be a reasonable process
  that
satifies the following properties:
   
- Exhaustive: I have to traverse every single document
- Incremental: in other words, it has to allow me to divide and
   conquer ---
if I have processed the first 20k docs, next time I can start with
   20001.
   
A simple *:* query would satisfy the 1st but not the 2nd
 property.
  In
fact, given that the processing will take very long, and the
  repository
keeps growing, it is not even clear that the exhaustiveness is
   achieved.
   
I'm running solr 3.6.2 in a single-machine setting; no hadoop
   capability
yet. But I guess the same issues still hold even if I have the
 solr
   cloud
environment, right, say in each shard?
   
Any help would be greatly appreciated.
   
Joe
   
  
  
  
 



Re: SolrCloud and Joins

2013-07-29 Thread Walter Underwood
A join may seem clean, but it will be slow and (currently) doesn't work in a 
cluster.

You find all the sentences in a media set by searching for that set id and 
requesting only the sentence_id (yes, you need that). Then you reindex them. 
With small documents like this, it is probably fairly fast.
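
For illustration, a rough SolrJ sketch of that reindex step could use atomic updates
so that only the set membership changes; it assumes sentence_id is the uniqueKey,
that a multivalued media_sets_id field has been added to the sentence documents, and
that all fields are stored (atomic updates require stored fields):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddMediaSet {

    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // hypothetical ZK ensemble
        server.setDefaultCollection("collection1");                                 // hypothetical collection name

        int mediaId = 42;        // hypothetical source being added to a set
        int mediaSetsId = 1;     // hypothetical set id

        SolrQuery q = new SolrQuery("media_id:" + mediaId);
        q.setFields("sentence_id");          // only the key is needed
        q.setRows(10000);                    // page through in chunks in a real run

        for (SolrDocument doc : server.query(q).getResults()) {
            SolrInputDocument update = new SolrInputDocument();
            update.addField("sentence_id", doc.getFieldValue("sentence_id"));
            Map<String, Object> add = new HashMap<String, Object>();
            add.put("add", mediaSetsId);     // atomic "add" to a multivalued field
            update.addField("media_sets_id", add);
            server.add(update);
        }
        server.commit();
    }
}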

If you can't estimate how often the media sets will change or the size of the 
changes, then you aren't ready to choose a design.

wunder

On Jul 29, 2013, at 8:41 AM, David Larochelle wrote:

 We'd like to be able to easily update the media set to source mapping. I'm
 concerned that if we store the media_sets_id in the sentence documents, it
 will be very difficult to add additional media set to source mapping. I
 imagine that adding a new media set would either require reimporting all
 600 million documents or writing complicated application logic to find out
 which sentences to update. Hence joins seem like a cleaner solution.
 
 --
 
 David
 
 
 On Mon, Jul 29, 2013 at 11:22 AM, Walter Underwood 
 wun...@wunderwood.orgwrote:
 
 Denormalize. Add media_set_id to each sentence document. Done.
 
 wunder
 
 On Jul 29, 2013, at 7:58 AM, David Larochelle wrote:
 
 I'm setting up SolrCloud with around 600 million documents. The basic
 structure of each document is:
 
 stories_id: integer, media_id: integer, sentence: text_en
 
 We have a number of stories from different media and we treat each
 sentence
 as a separate document because we need to run sentence level analytics.
 
 We also have a concept of groups or sets of sources. We've imported this
 media source to media sets mapping into Solr using the following
 structure:
 
 media_id_inner: integer, media_sets_id: integer
 
 For the single node case, we're able to filter our sources by
 media_set_id
 using a join query like the following:
 
 
 http://localhost:8983/solr/select?q={!join+from=media_id_inner+to=media_id}media_sets_id:1
 
 http://localhost:8983/solr/select?q=%7B!join+from=media_id_inner+to=media_id%7Dmedia_sets_id:1
 
 
 However, this does not work correctly with SolrCloud. The problem is that
 the join query is performed separately on each of the shards and no shard
 has the complete media set to source mapping data. So SolrCloud returns
 incomplete results.
 
 Since the complete media set to source mapping data is comparatively
 small
 (~50,000 rows), I would like to replicate it on every shard. So that the
 results of the individual join queries on separate shards would  be
 equivalent to performing the same query on a single shard system.
 
  However, I can't figure out how to replicate documents on separate
 shards. The compositeID router has the ability to colocate documents
 based
 on a prefix in the document ID but this isn't what I need. What I would
 like is some way to either have the media set to source data replicated
 on
 every shard or to be able to explicitly upload this data to the
 individual
 shards. (For the rest of the data I like the compositeID autorouting.)
 
 Any suggestions?
 
 --
 
 Thanks,
 
 
 David
 
 --
 Walter Underwood
 wun...@wunderwood.org
 
 
 
 

--
Walter Underwood
wun...@wunderwood.org





Re: SolrCloud shard down

2013-07-29 Thread Katie McCorkell
I am using Solr 4.3.1 . I did hard commit after indexing.

I think you're right that the node was still recovering. I didn't think so
since it didn't show up as yellow recovering on the visual display, but
after quite a while it went from Down to Active . Thanks!


On Fri, Jul 26, 2013 at 7:59 PM, Anshum Gupta ans...@anshumgupta.netwrote:

 Can you also let me know what version of Solr are you on?


 On Sat, Jul 27, 2013 at 8:26 AM, Anshum Gupta ans...@anshumgupta.net
 wrote:

  Hi Katie,
 
   1. First things first, I would strongly advise against manually updating/removing
   zk or any other info when you're running things in the SolrCloud mode,
   unless you are sure of what you're doing.
 
  2. Also, your node could be currently recovering from the transaction
   log (did you issue a hard commit after indexing?).
  The mailing list doesn't allow long texts inline so it'd be good if you
  could use something like http://pastebin.com/ to share the log in
 detail.
 
   3. If you had replicas, you wouldn't need to manually switch. It gets
  taken care of automatically.
 
 
  On Sat, Jul 27, 2013 at 4:16 AM, Katie McCorkell 
 katiemccork...@gmail.com
   wrote:
 
  Hello,
 
   I am using SolrCloud with a zookeeper ensemble like example C from
   the wiki, except with a total of 3 shards and no replicas (oops). After
   indexing a whole bunch of documents, shard 2 went down and I'm not sure
   why. I tried restarting it with the jar command and I tried deleting
   shard1's zoo_data folder and then restarting, but it is still down, and I'm not
   sure what to do.
 
   1) Is there any way to avoid reindexing all the data? It's no good to
  proceed without shard 2 because I don't know which documents are there
 vs.
  the other shards, and indexing and querying don't work when one shard is
  down.
 
  I can't exactly tell why restarting it is failing, all I can see is on
 the
  admin tool webpage the shard is yellow in the little cloud diagram. On
 the
  console is messages that I will copy and paste below. 2) How can I tell
  the
  exact problem?
 
  3) If I had had replicas, I could have just switched to shard 2's
 replica
  at this point, correct?
 
  Thanks!
  Katie
 
  Console message from start.jar
 
 
 ---
  2325 [coreLoadExecutor-4-thread-1] INFO
   org.apache.solr.cloud.ZkController
   – We are http://172.16.2.182:/solr/collection1/ and leader is
  http://172.16.2.182:/solr/collection1/
  12329 [recoveryExecutor-6-thread-1] WARN
  org.apache.solr.update.UpdateLog
   – Starting log replay
 
 
 tlog{file=/opt/solr-4.3.1/example/solr/collection1/data/tlog/tlog.0005179
  refcount=2} active=false starting pos=0
  12534 [recoveryExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
  –
  SolrDeletionPolicy.onInit: commits:num=1
 
  commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
  /opt/solr-4.3.1/example/solr/collection1/data/index
  lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f99ea3c;
  maxCacheMB=48.0
 
 
 maxMergeSizeMB=4.0),segFN=segments_404,generation=5188,filenames=[_1gqo.fdx,
  _1h1q.nvm, _1h8x.fdt, _1gmi_Lucene41_0.pos, _1gqo.fdt, _1h8s.nvd, _
  1gmi.si,
  _1h1q.nvd, _1h6l.fnm, _1h8q.nvm, _1h6l_Lucene41_0.tim,
  _1h6l_Lucene41_0.tip, _1h8o_Lucene41_0.tim, _1h8o_Lucene41_0.tip,
  _1aq9_67.del, _1gqo.nvm, _1aq9_Lucene41_0.pos, _1h8q.fdx, _1h1q.fdt,
  _1h8r.fdt, _1h8q.fdt, _1h8p_Lucene41_0.pos, _1h8s_Lucene41_0.pos,
  _1h8r.fdx, _1gqo.nvd, _1h8s.fdx, _1h8s.fdt, _1h8x_Lucene41_.
 
 
 
 
  --
 
  Anshum Gupta
  http://www.anshumgupta.net
 



 --

 Anshum Gupta
 http://www.anshumgupta.net



Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Chris Hostetter

: Here is what my schema looks like

what is your uniqueKey field?

I'm going to bet it's tn_lookup_key_id and i'm going to bet your 
lowercase fieldType has an interesting analyzer on it.

you are probably hitting a situation where the analyzer you have on your 
uniqueKey field is munging the values in such a way that when the 
coordinator node decides which N docs to include in the response, 
and then asks the various shards to give it those specific N 
docs, those subsequent field fetching queries fail because of an 
analysis mismatch.

you need to keep your uniqueKeyField simple -- i strongly recommend a 
basic StrField.  If you also want to do lowercase lookups on your key 
field, index it redundantly in a second field.


: *fields
:   field name=_version_ type=long indexed=true stored=true
: multiValued=false /
:   field name=bill_account_name type=lowercase indexed=true
: stored=true required=false /
:   field name=bill_account_nbr type=lowercase indexed=true
: stored=true required=false /
:   field name=cust_name type=lowercase indexed=true stored=true
: required=false /
: **field name=tn_lookup_key_id type=lowercase
: indexed=true stored=true required=true /
: /fields*

-Hoss


Re: SolrCloud shard down

2013-07-29 Thread Mark Miller

On Jul 29, 2013, at 12:49 PM, Katie McCorkell katiemccork...@gmail.com wrote:

 I didn't think so
 since it didn't show up as yellow recovering on the visual display, but
 after quite a while it went from Down to Active . Thanks!

Thanks, I think we should improve this! We should publish a recovery state when 
replaying the log on startup - right now it uses the down state and only 
advertises recovery when recovering from the leader. It would be useful to be 
able to tell when it's recovering from the log replay on startup as well though.

Feel free to create a JIRA issue - I'll try and get to it otherwise.

- Mark




Re: Performance vs. maxBufferedAddsPerServer=10

2013-07-29 Thread Erick Erickson
Why wouldn't it? Or are you saying that the routing to replicas
from the leader is also 10/packet? Hmmm, hadn't thought of that...

On Mon, Jul 29, 2013 at 7:58 AM, Mark Miller markrmil...@gmail.com wrote:
 SOLR-4816 won't address this - it will just speed up *different* parts. There 
 are other things that will need to be done to speed up that part.

 - Mark

 On Jul 26, 2013, at 3:53 PM, Erick Erickson erickerick...@gmail.com wrote:

  This is currently a hard-coded limit from what I've understood. From what
 I remember, Mark said Yonik said that there are reasons to make the
 packets that size. But whether this is empirically a Good Thing I don't know.

 SOLR-4816 will address this a different way by making SolrJ batch up
 the docs and send them to the right leader, which should pretty much remove
 any performance consideration here.

 There's some anecdotal evidence that changing that in the code might
 improve throughput, but I don't remember the details.

 FWIW
 Erick

 On Thu, Jul 25, 2013 at 7:09 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
 Hi,

 Context:
 * https://issues.apache.org/jira/browse/SOLR-4956
 * 
 http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer

 As you can see, maxBufferedAddsPerServer = 10.

 We have an app that sends 20K docs to SolrCloud using CloudSolrServer.
 We batch 20K docs for performance reasons. But then the receiving node
 ends up sending VERY small batches of just 10 docs around for indexing
 and we lose the benefit of batching those 20K docs in the first place.

 Our app is add only.

 Is there anything one can do to avoid performance loss associated with
 maxBufferedAddsPerServer=10?

 Thanks,
 Otis
 --
 Solr  ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-29 Thread Mikhail Khludnev
Mishra,
What if you set up DIH with a single SQLEntityProcessor without caching? Does
it work for you?


On Mon, Jul 29, 2013 at 4:00 PM, Santanu8939967892 mishra.sant...@gmail.com
 wrote:

 Hi,
     I have a huge volume of DB records, close to 250 million.
  I am going to use DIH to index the data into Solr.
  I need the best architecture to index and query the data in an efficient
  manner.
  I am using Windows Server 2008 with 16 GB RAM, a Xeon processor and Solr 4.4.


 With Regards,
 Santanu




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Pentaho Kettle vs DIH

2013-07-29 Thread Mikhail Khludnev
Hello,

Does anyone have experience with using Pentaho Kettle for processing
RDBMS data and pouring it into Solr? Isn't it some sort of replacement for
DIH?

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Solr 4.3.1 - query does not return documents, just numFounds, 2 shards, replication Factor 1

2013-07-29 Thread Nitin Agarwal
Erick, I had typed tn_lookup_key_id as lowercase and it was defined as

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Nitin



On Mon, Jul 29, 2013 at 1:23 PM, Erick Erickson erickerick...@gmail.comwrote:

 Nitin:

 What was your tn_lookup_key_id field definition when things didn't work?
 The stock lowercase is KeywordTokenizerFactory+LowerCaseFilterFactory
 and if this leads to mis-matches as Hoss outlined, it'd surprise me so I
 need
 to file it away in my list of things not to do.

 Thanks,
 Erick

 On Mon, Jul 29, 2013 at 3:01 PM, Nitin Agarwal 2nitinagar...@gmail.com
 wrote:
  Hoss, you rock!
 
  That was the issue, I changed tn_lookup_key_id, which was my unique key
  field, to string and reloaded the index and it works.
 
  Jason, Jack and Hoss, thanks for your help.
 
  Nitin
 
 
  On Mon, Jul 29, 2013 at 12:22 PM, Chris Hostetter
  hossman_luc...@fucit.orgwrote:
 
 
  : Here is what my schema looks like
 
  what is your uniqueKey field?
 
  I'm going to bet it's tn_lookup_key_id and i'm going to bet your
  lowercase fieldType has an interesting analyzer on it.
 
  you are probably hitting a situation where the analyzer you have on your
  uniqueKey field is munging the values in such a way that when the
  coordinator node decides which N docs to include in the response,
  and then asks the various shards to give it those specific N
  docs, those subsequent field fetching queries fail because of an
  analysis mismatch.
 
  you need to keep your uniqueKeyField simple -- i strongly recommend a
  basic StrField.  If you also want to do lowercase lookups on your key
  field, index it redundently in a second field.
 
 
  : *fields
  :   field name=_version_ type=long indexed=true stored=true
  : multiValued=false /
  :   field name=bill_account_name type=lowercase indexed=true
  : stored=true required=false /
  :   field name=bill_account_nbr type=lowercase indexed=true
  : stored=true required=false /
  :   field name=cust_name type=lowercase indexed=true
  stored=true
  : required=false /
  : **field name=tn_lookup_key_id type=lowercase
  : indexed=true stored=true required=true /
  : /fields*
 
  -Hoss
 



solr sizing

2013-07-29 Thread Torsten Albrecht
Hi all,

we have

- 70 to 100 million documents

and we want

- 800 requests per second


How many servers (Amazon EC2 or real hardware) do we need for this?

Solr 4.x with SolrCloud, or is it better to use shards with a load balancer?

Is there anyone here who can give me some information, or who operates a similar 
system themselves?


Regards,

Torsten


Re: Merged segment warmer Solr 4.4

2013-07-29 Thread Chris Hostetter
: I have a slow storage machine and non sufficient RAM for the whole index to
: store all the index. This causes the first queries (~5000) to be very slow
...
: Secondly I thought of initiating a new searcher event listener that queries
: on docs that were inserted since the last hard commit.

the first step in a situation like this should always be to configure at 
least some autowarming on your queryResultCache and filterCache -- this 
will not only ensure that some basic warming of your index is done, but 
will also prime the caches for your newSearcher with actual queries that 
your solr instance has alreayd recieved -- using a newSearcher listener on 
top of this can be useful for garunteeing that specific sorts or facets 
are fast against each new searcher (even if they haven't been queried on 
before) but i really wouldn't worry about htat until you are certain you 
have autowarming enabled.

: A new ability of solr 4.4 (solr 4761) is to configure a mergedSegmentWarmer
: - how does this component work and is it good for my usecase?

the new mergedSegmentWarmer option is extremely low level.  it may be 
useful, but it may also be redundent if you alreayd configure autowarming 
and/or newSearcher listener to execute basic queries -- it won't help with 
things like seeding your filterCache, queryResultCache, or FieldCaches.



-Hoss


Re: solr sizing

2013-07-29 Thread Shawn Heisey

On 7/29/2013 2:18 PM, Torsten Albrecht wrote:

we have

- 70 mio documents to 100 mio documents

and we want

- 800 requests per second


How many servers Amazon EC2/real hardware we Need for this?

Solr 4.x with solr cloud or better shards with loadbalancer?

Is anyone here who can give me some information, or who operates a similar 
system itself?


Your question is impossible to answer, aside from generalities that 
won't really help all that much.


I have a similarly sized system (82 million docs), but I don't have 
query volume anywhere near what yours is.  I've got less than 10 queries 
per second.  I have two copies of my index.  I use a load balancer with 
traditional sharding.


I don't do replication, my two index copies are completely independent. 
 I set it up this way long before SolrCloud was released.  Having two 
completely independent indexes lets me do a lot of experimentation that 
a typical SolrCloud setup won't let me do.


One copy of the index is running 3.5.0 and is about 142GB if you add up 
all the shards.  The other copy of the index is running 4.2.1 and is 
about 87GB on disk.  Each copy of the index runs on two servers, six 
large cold shards and one small hot shard.  Each of those servers has 
two quad-core processors (Xeon E5400 series, so fairly old now) and 64GB 
of RAM.  I can get away with multiple shards per host because my query 
volume is so low.


Here's a screenshot of a status servlet that I wrote for my index. 
There's tons of info here about my index stats:


https://dl.dropboxusercontent.com/u/97770508/statuspagescreenshot.png

If I needed to start over from scratch with your higher query volume, I 
would probably set up two independent SolrCloud installs, each with a 
replicationFactor of at least two, and I'd use 4-8 shards.  I would put 
a load balancer in front of it so that I could bring one cloud down and 
have everything still work, though with lower performance.  Because of 
the query volume, I'd only have one shard per host.  Depending on how 
big the index ended up being, I'd want 16-32GB (or possibly more) RAM 
per host.


You might not need the flexibility of two independent clouds, and it 
would require additional complexity in your indexing software.  If you 
only went with one cloud, you'd just need a higher replicationFactor.


I'd also want to have another set of servers (not as beefy) to have 
another independent SolrCloud with a replicationFactor of 1 or 2 for dev 
purposes.


That's a LOT of hardware, and it would NOT be cheap.  Can I be sure that 
you'd really need that much hardware?  Not really.  To be quite 
honest, you'll just have to set up a proof-of-concept system and be 
prepared to make it bigger.


Thanks,
Shawn



SOLR replication question?

2013-07-29 Thread SolrLover
I am currently using Solr 4.4 but not planning to use SolrCloud in the very near
future.

I have a 3 master / 3 slave setup. Each master is linked to its corresponding
slave. I have disabled auto polling.

We do both push (using MQ) and pull indexing using SOLRJ indexing program.

I have enabled soft commit in slave (to view the changes immediately pushed
by queue).

I am thinking of doing the batch indexing in master (optimize and hard
commit) and push indexing in both master / slave. 

I am trying to do more testing with my configuration but thought of getting
to know some answers before diving very deep...

Since the queue pushes the docs to both master and slave, there is a possibility of
the slave having more records than the master (when the master is busy doing batch
indexing). What would happen if the slave has additional segments compared
to the master? Will they be deleted when the replication happens?

If a message is pushed from a queue to both master and slave during
replication, will there be a latency in seeing that document even if we use
softcommit in slave?

We want to make sure that we are not missing any documents from queue (since
its updated via UI and we don't really store that data anywhere except in
index).









Solr Cloud - How to balance Batch and Queue indexing?

2013-07-29 Thread SolrLover
I need some advice on the best way to implement Batch indexing with soft
commit / Push indexing (via queue) with soft commit when using SolrCloud.

I am trying to figure out a way to:
1. Make the push indexing available almost real time (using soft commit)
without degrading the search / indexing performance (see the sketch after this
list).
2. Ability to not overwrite the existing document (based on listing_id, I
assume I can use overwrite=false flag to disable overwrite).
3. Not block the push indexing when delta indexing happens (push indexing
happens via UI, user should be able to search for the document pushed via UI
almost instantaneously). Delta processing might take more time to complete
indexing and I don't want the queue to wait until the batch processing is
complete.
4. Copy the updated collection for backup.
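
A sketch of what that push path might look like from SolrJ (the collection name,
field values, and the 5 second window are made up; commitWithin issues a soft commit
by default in Solr 4.x):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushIndexer {

    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181"); // hypothetical ZK ensemble
        server.setDefaultCollection("listings");                                    // hypothetical collection name

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("listing_id", "12345");          // uniqueKey from the UI payload (made-up value)
        doc.addField("title", "example listing");     // made-up field

        // commitWithin makes the document searchable within ~5 seconds via a soft
        // commit, without forcing a hard commit that would interfere with the
        // batch/delta indexing.
        server.add(doc, 5000);
    }
}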

More information on setup:
We have 100 million records (around 6 stored fields / 12 indexed fields).
We are planning to have 5 cores (each with 20 million documents) with 5
replicas.
We will be always doing delta batch indexing.






Re: SOLR replication question?

2013-07-29 Thread Shawn Heisey
 I am currently using SOLR 4.4. but not planning to use solrcloud in very
near
 future.
 I have 3 master / 3 slave setup. Each master is linked to its
 corresponding
 slave.. I have disabled auto polling..
 We do both push (using MQ) and pull indexing using SOLRJ indexing
program.
 I have enabled soft commit in slave (to view the changes immediately pushed
 by queue).
 I am thinking of doing the batch indexing in master (optimize and hard
commit) and push indexing in both master / slave.
 I am trying to do more testing with my configuration but thought of getting
 to know some answers before diving very deep...
 Since the queue pushes the docs in master / slave there is a possibility of
 slave having more record compared to master (when master is busy doing
batch
 indexing).. What would happen if the slave has additional segments compared
 to Master. will that be deleted when the replication happens.
 If a message is pushed from a queue to both master and slave during
replication, will there be a latency in seeing that document even if we
use
 softcommit in slave?
 We want to make sure that we are not missing any documents from queue
(since
 its updated via UI and we don't really store that data anywhere except
in
 index)

If you are doing replication, then all updates must go to the master
server. You cannot update the slave directly. When the replication happens, the
slave will be identical to the master... Any documents sent to only the
slave will be lost.

Replication will happen according to the interval you have configured, or
since you say you have disabled polling, according to whatever schedule
you manually trigger a replication.

SolrCloud would probably be a better fit for you. With a properly
configured SolrCloud you just index to any host in the cloud and documents
end up exactly where they need to go, and all replicas get updated.

Thanks,
Shawn




Re: Streaming Updates Using HttpSolrServer.add(Iterator) In Solr 4.3

2013-07-29 Thread Shawn Heisey
 I am indexing more than 300 million records, it takes less than 7 hours to
 index all the records..

 Send the documents in batches and also use CUSS
 (ConcurrentUpdateSolrServer)
 for multi threading support.

 Ex:

  ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(solrServerUrl, queueSize, threadCount);
  List<SolrInputDocument> solrDocList = new ArrayList<SolrInputDocument>();
  while (moreDocs) {                        // loop over the source records
      solrDocList.add(doc);                 // add the current document to the batch
      if (solrDocList.size() >= 100) {
          server.add(solrDocList);          // send documents to Solr in batches
          solrDocList.clear();              // start a new batch
      }
  }
  if (!solrDocList.isEmpty()) {
      server.add(solrDocList);              // flush the last partial batch
  }
  server.commit();                          // commit after adding all the documents

Using CUSS is only acceptable if you don't care about error handling. If
you shut down the Solr server, your program will only see an error on the
commit. It will think the update worked perfectly, even though the server
is down.
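
One way to keep CUSS's batching and threading while still noticing failures is to
override its handleError() hook; a minimal sketch (the error bookkeeping is just an
example):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

// A CUSS subclass that records failures instead of silently dropping them,
// so the indexing job can decide to retry or abort. Sketch only.
public class TrackingConcurrentUpdateSolrServer extends ConcurrentUpdateSolrServer {

    private volatile Throwable firstError;   // simple bookkeeping; a real job might collect all errors

    public TrackingConcurrentUpdateSolrServer(String solrUrl, int queueSize, int threadCount) {
        super(solrUrl, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
        if (firstError == null) {
            firstError = ex;                 // remember the first failure
        }
        super.handleError(ex);               // keep the default logging
    }

    public Throwable getFirstError() {
        return firstError;
    }
}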





Re: Performance question on Spatial Search

2013-07-29 Thread Bill Bell
Can you compare with the old geo handler as a baseline?

Bill Bell
Sent from mobile


On Jul 29, 2013, at 4:25 PM, Erick Erickson erickerick...@gmail.com wrote:

 This is very strange. I'd expect slow queries on
 the first few queries while these caches were
 warmed, but after that I'd expect things to
 be quite fast.
 
 For a 12G index and 256G RAM, you have on the
 surface a LOT of hardware to throw at this problem.
 You can _try_ giving the JVM, say, 18G but that
 really shouldn't be a big issue, your index files
 should be MMaped.
 
 Let's try the crude thing first and give the JVM
 more memory.
 
 FWIW
 Erick
 
 On Mon, Jul 29, 2013 at 4:45 PM, Steven Bower smb-apa...@alcyon.net wrote:
  I've been doing some performance analysis of a spatial search use case I'm
  implementing in Solr 4.3.0. Basically I'm seeing search times a lot higher
 than I'd like them to be and I'm hoping people may have some suggestions
 for how to optimize further.
 
 Here are the specs of what I'm doing now:
 
 Machine:
 - 16 cores @ 2.8ghz
 - 256gb RAM
 - 1TB (RAID 1+0 on 10 SSD)
 
 Content:
 - 45M docs (not very big only a few fields with no large textual content)
 - 1 geo field (using config below)
 - index is 12gb
 - 1 shard
 - Using MMapDirectory
 
 Field config:
 
  <fieldType name="geo" class="solr.SpatialRecursivePrefixTreeFieldType"
      distErrPct="0.025" maxDistErr="0.00045"
      spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
      units="degrees"/>

  <field name="geopoint" indexed="true" multiValued="false"
      required="false" stored="true" type="geo"/>
 
 
 What I've figured out so far:
 
 - Most of my time (98%) is being spent in
 java.nio.Bits.copyToByteArray(long,Object,long,long) which is being
 driven by BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock()
 which from what I gather is basically reading terms from the .tim file
 in blocks
 
 - I moved from Java 1.6 to 1.7 based upon what I read here:
 http://blog.vlad1.com/2011/10/05/looking-at-java-nio-buffer-performance/
 and it definitely had some positive impact (i haven't been able to
 measure this independantly yet)
 
 - I changed maxDistErr from 0.09 (which is 1m precision per docs)
 to 0.00045 (50m precision) ..
 
 - It looks to me that the .tim file are being memory mapped fully (ie
 they show up in pmap output) the virtual size of the jvm is ~18gb
 (heap is 6gb)
 
 - I've optimized the index but this doesn't have a dramatic impact on
 performance
 
 Changing the precision and the JVM upgrade yielded a drop from ~18s
 avg query time to ~9s avg query time.. This is fantastic but I want to
 get this down into the 1-2 second range.
 
 At this point it seems that basically i am bottle-necked on basically
 copying memory out of the mapped .tim file which leads me to think
 that the only solution to my problem would be to read less data or
 somehow read it more efficiently..
 
 If anyone has any suggestions of where to go with this I'd love to know
 
 
 thanks,
 
 steve


Re: Performance vs. maxBufferedAddsPerServer=10

2013-07-29 Thread Mark Miller
Yes, the internal document forwarding path is different and does not use the 
CloudSolrServer. It currently works with a buffer of 10.

- Mark

On Jul 29, 2013, at 3:10 PM, Erick Erickson erickerick...@gmail.com wrote:

 Why wouldn't it? Or are you saying that the routing to replicas
 from the leader also 10/packet? Hmmm, hadn't thought of that...
 
 On Mon, Jul 29, 2013 at 7:58 AM, Mark Miller markrmil...@gmail.com wrote:
 SOLR-4816 won't address this - it will just speed up *different* parts. 
 There are other things that will need to be done to speed up that part.
 
 - Mark
 
 On Jul 26, 2013, at 3:53 PM, Erick Erickson erickerick...@gmail.com wrote:
 
 This is current a hard-coded limit from what I've understood. From what
 I remember, Mark said Yonik said that there are reasons to make the
 packets that size. But whether this is empirically a Good Thing I don't 
 know.
 
 SOLR-4816 will address this a different way by making SolrJ batch up
 the docs and send them to the right leader, which should pretty much remove
 any performance consideration here.
 
 There's some anecdotal evidence that changing that in the code might
 improve throughput, but I don't remember the details.
 
 FWIW
 Erick
 
 On Thu, Jul 25, 2013 at 7:09 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
 Hi,
 
 Context:
 * https://issues.apache.org/jira/browse/SOLR-4956
 * 
 http://search-lucene.com/c/Solr:/core/src/java/org/apache/solr/update/SolrCmdDistributor.java%7C%7CmaxBufferedAddsPerServer
 
 As you can see, maxBufferedAddsPerServer = 10.
 
 We have an app that sends 20K docs to SolrCloud using CloudSolrServer.
 We batch 20K docs for performance reasons. But then the receiving node
 ends up sending VERY small batches of just 10 docs around for indexing
 and we lose the benefit of batching those 20K docs in the first place.
 
 Our app is add only.
 
 Is there anything one can do to avoid performance loss associated with
 maxBufferedAddsPerServer=10?
 
 Thanks,
 Otis
 --
 Solr  ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm