Re: How to identify documents failed in a batch request?

2016-12-17 Thread David Smiley
If you enable the "TolerantUpdateProcessor" Solr-side, you can add
documents in bulk, allow some of them to fail, and know which ones did:

http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/TolerantUpdateProcessorFactory.html
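
For example, a minimal sketch of reading the per-document errors back with
SolrJ. This assumes a chain that includes solr.TolerantUpdateProcessorFactory
has been registered in solrconfig.xml under the name "tolerant-chain", and that
the error entries carry "id" and "message" keys -- check the javadoc above for
the exact shape in your version:

final UpdateRequest request = new UpdateRequest();
request.setParam("update.chain", "tolerant-chain"); // route the add through the tolerant chain
request.add(docsList);
UpdateResponse response = request.process(solrClient);

// With the tolerant processor, rejected documents are reported in the response
// header instead of failing the whole batch.
NamedList<Object> header = response.getResponseHeader();
List<?> errors = (List<?>) header.get("errors"); // key name per the factory's javadoc
if (errors != null) {
  for (Object e : errors) {
    NamedList<?> err = (NamedList<?>) e;
    System.out.println("failed id=" + err.get("id") + ": " + err.get("message"));
  }
}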

On Sat, Dec 17, 2016 at 5:05 PM S G  wrote:

> Hi,
>
> I am using the following code to send documents to Solr:
>
> final UpdateRequest request = new UpdateRequest();
> request.setAction(UpdateRequest.ACTION.COMMIT, false, false);
> request.add(docsList);
> UpdateResponse response = request.process(solrClient);
>
> The response returned from the last line does not seem to be very helpful
> in determining how I can identify the documents that failed in a batch request.
>
> Does anyone know how this can be done?
>
> Thanks
> SG
>
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Separating Search and Indexing in SolrCloud

2016-12-17 Thread Erick Erickson
Yes, indexing adds stress. No, you can't separate
the two in SolrCloud. End of story, so why beat it to death?
You'll have to figure out the sharding strategy that
meets your indexing and querying needs and live
within that framework. I'd advise setting up a small
cluster, driving it to its tipping point, and extrapolating
from there. Here's the long version of "the sizing exercise":

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

My point is that while indexing to Solr/Lucene there is
additional pressure. That pressure has a fixed upper
limit that doesn't grow with the number of docs. That's not
true for searching: as you add more docs per node, the
pressure (especially memory) increases. Concentrate
your efforts there, IMO.

Best
Erick



On Sat, Dec 17, 2016 at 12:54 PM, Jaroslaw Rozanski
 wrote:
> Hi Erick,
>
> So what does this buffer represent? What does it actually store? The raw
> update request or the analyzed document?
>
> The documentation suggests that it stores the actual update requests.
>
> Obviously an analyzed document can and will occupy much more space than a raw
> one. Also, analysis will create a lot of new allocations and subsequent
> GC work.
>
> Yes, you are probably right that search puts more stress and is the main
> memory user, but the combination of:
> - non-trivial analysis,
> - high volume of updates and
> - search on the same node
>
> seems to add fuel to the fire.
>
> From the previous response by Pushkar, it is clear that separation is not
> achievable with the existing SolrCloud mechanism.
>
> Thanks
>
>
> On 17/12/16 20:24, Erick Erickson wrote:
>> bq: I am more concerned with indexing memory requirements at volume
>>
>> By and large this isn't much of a problem. RAMBufferSizeMB in
>> solrconfig.xml governs how much memory is consumed in Solr for
>> indexing. When that limit is exceeded, the buffer is flushed to disk.
>> I've rarely heard of indexing being a memory issue. Anecdotally I
>> haven't seen throughput benefit with buffer sizes over 128M.
>>
>> You're correct in that master/slave style replication would use less
>> memory on the slave, although there are other costs. I.e. rather than
>> the data for document X being sent to the replicas once as in
>> SolrCloud, that data is re-sent to the slave every time it's merged
>> into a new segment.
>>
>> That said, memory issues are _far_ more prevalent on the search side
>> of things so unless this is a proven issue in your environment I would
>> fight other fires.
>>
>> Best,
>> Erick
>>
>> On Fri, Dec 16, 2016 at 1:06 PM, Jaroslaw Rozanski  
>> wrote:
>>> Thanks, that issue looks interesting!
>>>
>>> On 16/12/16 16:38, Pushkar Raste wrote:
 This kind of separation is not supported yet. There is, however, some work
 going on; you can read about it at
 https://issues.apache.org/jira/browse/SOLR-9835

 This unfortunately would not support soft commits and hence would not be a
 good solution for near-real-time indexing.

 On Dec 16, 2016 7:44 AM, "Jaroslaw Rozanski"  
 wrote:

> Sorry, not what I meant.
>
> The leader is responsible for distributing update requests to the replicas, so
> eventually all replicas have the same state as the leader. Not a problem.
>
> It is more about the performance of this. If I gather correctly, normal
> replication happens via standard update requests, not by, say, segment copy.
>
> Which means an update on the leader is as "expensive" as on a replica.
>
> Hence, if my understanding is correct, sending search requests to replicas
> only, in an index-heavy environment, would bring no benefit.
>
> So the question is: is there a mechanism in SolrCloud (not the legacy
> master/slave set-up) to make one node take the load of indexing while the
> other nodes focus on searching?
>
> This is not a question about SolrClient, because it is clear how to direct
> search requests to specific nodes. This is more about index optimization,
> so that certain nodes (i.e. replicas) could suffer less from high-volume
> indexing while serving search requests.
>
>
>
>
> On 16/12/16 12:35, Dorian Hoxha wrote:
>> The leader is the source of truth. You expect to make the replica the
>> source of truth or something??? Doesn't make sense?
>> What people do is send writes to the leader/master and reads to
> replicas/slaves
>> in other Solr/other DBs.
>>
>> On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski 
>> >
>> wrote:
>>
>>> Hi all,
>>>
>>> According to documentation, in normal operation (not recovery) in Solr
>>> Cloud configuration the leader sends updates it receives to all the
>>> replicas.
>>>
>>> This means that all nodes in the shard perform the same effort to index a
>>> single document. Correct?
>>>
>>> Is there then 

How to identify documents failed in a batch request?

2016-12-17 Thread S G
Hi,

I am using the following code to send documents to Solr:

final UpdateRequest request = new UpdateRequest();
request.setAction(UpdateRequest.ACTION.COMMIT, false, false);
request.add(docsList);
UpdateResponse response = request.process(solrClient);

The response returned from the last line does not seem to be very helpful
in determining how I can identify the documents that failed in a batch request.

Does anyone know how this can be done?

Thanks
SG


Re: Separating Search and Indexing in SolrCloud

2016-12-17 Thread Jaroslaw Rozanski
Hi Erick,

So what does this buffer represent? What does it actually store? The raw
update request or the analyzed document?

The documentation suggests that it stores the actual update requests.

Obviously an analyzed document can and will occupy much more space than a raw
one. Also, analysis will create a lot of new allocations and subsequent
GC work.

Yes, you are probably right that search puts more stress and is the main
memory user, but the combination of:
- non-trivial analysis,
- high volume of updates and
- search on the same node

seems to add fuel to the fire.

From the previous response by Pushkar, it is clear that separation is not
achievable with the existing SolrCloud mechanism.

Thanks


On 17/12/16 20:24, Erick Erickson wrote:
> bq: I am more concerned with indexing memory requirements at volume
> 
> By and large this isn't much of a problem. RAMBufferSizeMB in
> solrconfig.xml governs how much memory is consumed in Solr for
> indexing. When that limit is exceeded, the buffer is flushed to disk.
> I've rarely heard of indexing being a memory issue. Anecdotally I
> haven't seen throughput benefit with buffer sizes over 128M.
> 
> You're correct in that master/slave style replication would use less
> memory on the slave, although there are other costs. I.e. rather than
> the data for document X being sent to the replicas once as in
> SolrCloud, that data is re-sent to the slave every time it's merged
> into a new segment.
> 
> That said, memory issues are _far_ more prevalent on the search side
> of things so unless this is a proven issue in your environment I would
> fight other fires.
> 
> Best,
> Erick
> 
> On Fri, Dec 16, 2016 at 1:06 PM, Jaroslaw Rozanski  
> wrote:
>> Thanks, that issue looks interesting!
>>
>> On 16/12/16 16:38, Pushkar Raste wrote:
>>> This kind of separation is not supported yet. There is, however, some work
>>> going on; you can read about it at
>>> https://issues.apache.org/jira/browse/SOLR-9835
>>>
>>> This unfortunately would not support soft commits and hence would not be a
>>> good solution for near-real-time indexing.
>>>
>>> On Dec 16, 2016 7:44 AM, "Jaroslaw Rozanski"  wrote:
>>>
 Sorry, not what I meant.

 The leader is responsible for distributing update requests to the replicas, so
 eventually all replicas have the same state as the leader. Not a problem.

 It is more about the performance of this. If I gather correctly, normal
 replication happens via standard update requests, not by, say, segment copy.

 Which means an update on the leader is as "expensive" as on a replica.

 Hence, if my understanding is correct, sending search requests to replicas
 only, in an index-heavy environment, would bring no benefit.

 So the question is: is there a mechanism in SolrCloud (not the legacy
 master/slave set-up) to make one node take the load of indexing while the
 other nodes focus on searching?

 This is not a question about SolrClient, because it is clear how to direct
 search requests to specific nodes. This is more about index optimization,
 so that certain nodes (i.e. replicas) could suffer less from high-volume
 indexing while serving search requests.




 On 16/12/16 12:35, Dorian Hoxha wrote:
> The leader is the source of truth. You expect to make the replica the
> source of truth or something??? Doesn't make sense?
> What people do is send writes to the leader/master and reads to
 replicas/slaves
> in other Solr/other DBs.
>
> On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski 
> wrote:
>
>> Hi all,
>>
>> According to documentation, in normal operation (not recovery) in Solr
>> Cloud configuration the leader sends updates it receives to all the
>> replicas.
>>
>> This means that all nodes in the shard perform the same effort to index a
>> single document. Correct?
>>
>> Is there then a benefit to *not* send search requests to the leader, but
>> only to replicas?
>>
>> Given index & search heavy Solr Cloud system, is it possible to separate
>> search from indexing nodes?
>>
>>
>> RE: Solr 5.5.0
>>
>> --
>> Jaroslaw Rozanski | e: m...@jarekrozanski.com
>> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>>
>>
>

 --
 Jaroslaw Rozanski | e: m...@jarekrozanski.com
 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D


>>>
>>
>> --
>> Jaroslaw Rozanski | e: m...@jarekrozanski.com
>> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>>

-- 
Jaroslaw Rozanski | e: m...@jarekrozanski.com
695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D





Re: Separating Search and Indexing in SolrCloud

2016-12-17 Thread Erick Erickson
bq: I am more concerned with indexing memory requirements at volume

By and large this isn't much of a problem. RAMBufferSizeMB in
solrconfig.xml governs how much memory is consumed in Solr for
indexing. When that limit is exceeded, the buffer is flushed to disk.
I've rarely heard of indexing being a memory issue. Anecdotally I
haven't seen throughput benefit with buffer sizes over 128M.
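
For reference, a hedged sketch of the Lucene-level knob that the Solr setting
maps to (it uses org.apache.lucene.index.IndexWriter/IndexWriterConfig and
org.apache.lucene.store.FSDirectory; the path and the 128M value are just
illustrative):

IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
// Same limit that ramBufferSizeMB in solrconfig.xml controls: added documents
// are buffered in RAM and flushed to disk as a new segment once the buffer fills.
iwc.setRAMBufferSizeMB(128.0);
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/tmp/index")), iwc);
// ... add documents; indexing memory stays bounded by the buffer, independent of index size ...
writer.close();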

You're correct in that master/slave style replication would use less
memory on the slave, although there are other costs. I.e. rather than
the data for document X being sent to the replicas once as in
SolrCloud, that data is re-sent to the slave every time it's merged
into a new segment.

That said, memory issues are _far_ more prevalent on the search side
of things so unless this is a proven issue in your environment I would
fight other fires.

Best,
Erick

On Fri, Dec 16, 2016 at 1:06 PM, Jaroslaw Rozanski  
wrote:
> Thanks, that issue looks interesting!
>
> On 16/12/16 16:38, Pushkar Raste wrote:
>> This kind of separation is not supported yet. There is, however, some work
>> going on; you can read about it at
>> https://issues.apache.org/jira/browse/SOLR-9835
>>
>> This unfortunately would not support soft commits and hence would not be a
>> good solution for near-real-time indexing.
>>
>> On Dec 16, 2016 7:44 AM, "Jaroslaw Rozanski"  wrote:
>>
>>> Sorry, not what I meant.
>>>
>>> The leader is responsible for distributing update requests to the replicas, so
>>> eventually all replicas have the same state as the leader. Not a problem.
>>>
>>> It is more about the performance of this. If I gather correctly, normal
>>> replication happens via standard update requests, not by, say, segment copy.
>>>
>>> Which means an update on the leader is as "expensive" as on a replica.
>>>
>>> Hence, if my understanding is correct, sending search requests to replicas
>>> only, in an index-heavy environment, would bring no benefit.
>>>
>>> So the question is: is there a mechanism in SolrCloud (not the legacy
>>> master/slave set-up) to make one node take the load of indexing while the
>>> other nodes focus on searching?
>>>
>>> This is not a question about SolrClient, because it is clear how to direct
>>> search requests to specific nodes. This is more about index optimization,
>>> so that certain nodes (i.e. replicas) could suffer less from high-volume
>>> indexing while serving search requests.
>>>
>>>
>>>
>>>
>>> On 16/12/16 12:35, Dorian Hoxha wrote:
 The leader is the source of truth. You expect to make the replica the
 source of truth or something??? Doesn't make sense?
 What people do is send writes to the leader/master and reads to
>>> replicas/slaves
 in other Solr/other DBs.

On Fri, Dec 16, 2016 at 1:31 PM, Jaroslaw Rozanski  wrote:
> Hi all,
>
> According to documentation, in normal operation (not recovery) in Solr
> Cloud configuration the leader sends updates it receives to all the
> replicas.
>
> This means that all nodes in the shard perform the same effort to index a
> single document. Correct?
>
> Is there then a benefit to *not* send search requests to the leader, but
> only to replicas?
>
> Given index & search heavy Solr Cloud system, is it possible to separate
> search from indexing nodes?
>
>
> RE: Solr 5.5.0
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>
>

>>>
>>> --
>>> Jaroslaw Rozanski | e: m...@jarekrozanski.com
>>> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>>>
>>>
>>
>
> --
> Jaroslaw Rozanski | e: m...@jarekrozanski.com
> 695E 436F A176 4961 7793  5C70 AFDF FB5E 682C 4D3D
>


Confusing debug=timing parameter

2016-12-17 Thread S G
Hi,

I am using Solr 4.10 and its response time for the clients is not very good.
Even though Solr's plugin/stats page shows less than 200 milliseconds,
clients report response times of several seconds.

So I tried using the debug=timing parameter from the Solr UI and this is what I
got.
Note how the QTime is 2978 while the time in the debug timing section is 19320.

What does this mean?
How can Solr return a result in 3 seconds when the time taken between two
points in the same path is 20 seconds?

{
  "responseHeader": {
"status": 0,
"QTime": 2978,
"params": {
  "q": "*:*",
  "debug": "timing",
  "indent": "true",
  "wt": "json",
  "_": "1481992653008"
}
  },
  "response": {
"numFound": 1565135270,
"start": 0,
"maxScore": 1,
"docs": [
  
]
  },
  "debug": {
"timing": {
  "time": 19320,
  "prepare": {
"time": 4,
"query": {
  "time": 3
},
"facet": {
  "time": 0
},
"mlt": {
  "time": 0
},
"highlight": {
  "time": 0
},
"stats": {
  "time": 0
},
"expand": {
  "time": 0
},
"debug": {
  "time": 0
}
  },
  "process": {
"time": 19315,
"query": {
  "time": 19309
},
"facet": {
  "time": 0
},
"mlt": {
  "time": 1
},
"highlight": {
  "time": 0
},
"stats": {
  "time": 0
},
"expand": {
  "time": 0
},
"debug": {
  "time": 5
}
  }
}
  }
}
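
One way to see how much of the client-observed time QTime actually covers is
to time the request from the client itself; a hedged SolrJ sketch (it assumes
a SolrClient named solrClient pointed at the same collection):

SolrQuery query = new SolrQuery("*:*");
query.set("debug", "timing");

long start = System.nanoTime();
QueryResponse rsp = solrClient.query(query);
long wallClockMs = (System.nanoTime() - start) / 1_000_000;

System.out.println("QTime (server-side): " + rsp.getQTime() + " ms");
System.out.println("Elapsed (SolrJ):     " + rsp.getElapsedTime() + " ms");
System.out.println("Wall clock (client): " + wallClockMs + " ms");
// QTime does not include writing the response, network transfer or client-side
// parsing, so the client-observed numbers can be much larger than QTime.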


Re: Solr on HDFS: Streaming API performance tuning

2016-12-17 Thread Chetas Joshi
Here is the stack trace.

java.lang.NullPointerException
        at org.apache.solr.client.solrj.io.comp.FieldComparator$2.compare(FieldComparator.java:85)
        at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:92)
        at org.apache.solr.client.solrj.io.comp.FieldComparator.compare(FieldComparator.java:30)
        at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:45)
        at org.apache.solr.client.solrj.io.comp.MultiComp.compare(MultiComp.java:33)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:396)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.compareTo(CloudSolrStream.java:381)
        at java.util.TreeMap.put(TreeMap.java:560)
        at java.util.TreeSet.add(TreeSet.java:255)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:366)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)
        at *.*.*.*.SolrStreamResultIterator$$anon$1.run(SolrStreamResultIterator.scala:101)
        at java.lang.Thread.run(Thread.java:745)

16/11/17 13:04:31 *ERROR* SolrStreamResultIterator:missing exponent number:
char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'

org.noggit.JSONParser$ParseException: missing exponent number: char=A,position=106596
BEFORE='p":1477189323},{"uuid":"//699/UzOPQx6thu","timestamp": 6EA'
AFTER='E 1476861439},{"uuid":"//699/vG8k4Tj'
        at org.noggit.JSONParser.err(JSONParser.java:356)
        at org.noggit.JSONParser.readExp(JSONParser.java:513)
        at org.noggit.JSONParser.readNumber(JSONParser.java:419)
        at org.noggit.JSONParser.next(JSONParser.java:845)
        at org.noggit.JSONParser.nextEvent(JSONParser.java:951)
        at org.noggit.ObjectBuilder.getObject(ObjectBuilder.java:127)
        at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:57)
        at org.noggit.ObjectBuilder.getVal(ObjectBuilder.java:37)
        at org.apache.solr.client.solrj.io.stream.JSONTupleStream.next(JSONTupleStream.java:84)
        at org.apache.solr.client.solrj.io.stream.SolrStream.read(SolrStream.java:147)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream$TupleWrapper.next(CloudSolrStream.java:413)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream._read(CloudSolrStream.java:365)
        at org.apache.solr.client.solrj.io.stream.CloudSolrStream.read(CloudSolrStream.java:353)
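
For reference, a minimal sketch of the consuming side of this pattern (hedged:
based on the SolrJ streaming classes seen in the trace above; the exact
constructor signature varies between Solr 5.x and 6.x, and zkHost, the
collection name and the field names are illustrative). One thing worth
double-checking with this pattern is that every field named in "sort" is also
in "fl" and populated on every document, since the comparators in the stack
above operate on those tuple values:

Map<String, String> props = new HashMap<>();
props.put("q", "*:*");
props.put("fl", "uuid,timestamp");
props.put("sort", "uuid asc");   // every sort field must also be in fl
props.put("qt", "/export");      // stream the full result set via the export handler

CloudSolrStream stream = new CloudSolrStream(zkHost, "collection1", props);
try {
  stream.open();
  while (true) {
    Tuple tuple = stream.read();
    if (tuple.EOF) {
      break;
    }
    String uuid = tuple.getString("uuid");
    // process the tuple ...
  }
} finally {
  stream.close();
}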


Thanks!

On Fri, Dec 16, 2016 at 11:45 PM, Reth RM  wrote:

> If you could provide the JSON parse exception stack trace, it might help to
> predict the issue there.
>
>
> On Fri, Dec 16, 2016 at 5:52 PM, Chetas Joshi 
> wrote:
>
> > Hi Joel,
> >
> > The only NON alpha-numeric characters I have in my data are '+' and '/'.
> I
> > don't have any backslashes.
> >
> > If the special characters were the issue, I should get the JSON parsing
> > exceptions every time irrespective of the index size and irrespective of
> > the available memory on the machine. That is not the case here. The
> > streaming API successfully returns all the documents when the index size
> is
> > small and fits in the available memory. That's the reason I am confused.
> >
> > Thanks!
> >
> > On Fri, Dec 16, 2016 at 5:43 PM, Joel Bernstein 
> > wrote:
> >
> > > The Streaming API may have been throwing exceptions because the JSON
> > > special characters were not escaped. This was fixed in Solr 6.0.
> > >
> > >
> > >
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Fri, Dec 16, 2016 at 4:34 PM, Chetas Joshi 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I am running Solr 5.5.0.
> > > > It is a SolrCloud of 50 nodes and I have the following config for all
> > the
> > > collections.
> > > > maxShardsPerNode: 1
> > > > replicationFactor: 1
> > > >
> > > > I was using Streaming API to get back results from Solr. It worked
> fine
> > > for
> > > > a while until the index data size reached beyond 40 GB per shard
> (i.e.
> > > per
> > > > node). It started throwing JSON parsing exceptions while reading the
> > > > TupleStream data. FYI: I have other services (Yarn, Spark) deployed
> on
> > > the
> > > > same boxes on which Solr shards are running. Spark jobs also use a
> lot
> > of
> > > > disk cache. So, the free available disk cache on the boxes vary a
> > > > lot depending upon what else is running on the box.
> > > >
> > > > Due to this issue, I moved to using the cursor approach and it works
> > fine
> > > > but as we all know it is way slower than the streaming approach.
> > > >
> > > > Currently the index size per shard is 80GB (The machine has 512 GB of
> > RAM
> > > > and being used by different services/programs: heap/off-heap 

Re: Caching multiple entities

2016-12-17 Thread William Bell
I am not sure, but it looks like your XML is invalid.

last_modified > XYZ

You need to switch to &gt; or use something like a database view so that
the > and other < characters will not cause problems.


On Sat, Dec 17, 2016 at 7:01 AM, Per Newgro  wrote:

> Hello,
>
> we are implementing a questionnaire tool for companies. I would like to
> import the data using the DIH.
>
> To increase performance I would like to use some caching. But my solution
> is not working. The score of my
>
> questionnaire is empty. But there is a value in the database. I've checked
> that.
>
> We can mark questionnaires for special purposes. I need to import the
> special mpc score. The mpc questionnaire
>
> is not changing while importing. So I thought I can cache this value
> for use in mpc_score queries.
>
> Can you please help me find out what I'm doing wrong here?
>
> Thanks
>
> Per
>
> <document>
>   <entity name="company"
>           query="SELECT id as ID,
>                  customer_number as CUSTOMER_NUMBER
>                  FROM companies
>                  WHERE '${dataimporter.request.clean}' != 'false'
>                  OR last_modified > '${dataimporter.last_index_time}'">
>     <entity name="mpc"
>             processor="SqlEntityProcessor"
>             cacheImpl="SortedMapBackedCache"
>             query="select qp.questionnaire AS ID
>                    from questionnaire_purposes qp
>                    join purposes p ON qp.id = p.id
>                    where p.name = 'mpc';">
>       <entity name="mpc_score"
>               query="select c.score as SCORE
>                      FROM basfcdi.census c
>                      where c.company=${company.ID}
>                      and c.questionnaire = ${mpc.ID};">
>       </entity>
>     </entity>
>   </entity>
> </document>
>
>


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Caching multiple entities

2016-12-17 Thread Per Newgro

Hello,

we are implementing a questionnaire tool for companies. I would like to
import the data using the DIH.

To increase performance I would like to use some caching. But my
solution is not working. The score of my
questionnaire is empty. But there is a value in the database. I've
checked that.

We can mark questionnaires for special purposes. I need to import the
special mpc score. The mpc questionnaire
is not changing while importing. So I thought I can cache this value
for use in mpc_score queries.

Can you please help me find out what I'm doing wrong here?

Thanks

Per




<document>
  <entity name="company"
          query="SELECT id as ID,
                 customer_number as CUSTOMER_NUMBER
                 FROM companies
                 WHERE '${dataimporter.request.clean}' != 'false'
                 OR last_modified > '${dataimporter.last_index_time}'">
    <entity name="mpc"
            processor="SqlEntityProcessor"
            cacheImpl="SortedMapBackedCache"
            query="select qp.questionnaire AS ID
                   from questionnaire_purposes qp
                   join purposes p ON qp.id = p.id
                   where p.name = 'mpc';">
      <entity name="mpc_score"
              query="select c.score as SCORE
                     FROM basfcdi.census c
                     where c.company=${company.ID}
                     and c.questionnaire = ${mpc.ID};">
      </entity>
    </entity>
  </entity>
</document>





Re: ttl on merge-time possible somehow?

2016-12-17 Thread Dorian Hoxha
On Sat, Dec 17, 2016 at 12:04 AM, Chris Hostetter 
wrote:

>
> : > lucene, something has to "mark" the segments as deleted in order for
> them
> ...
> : Note, it doesn't mark the "segment", it marks the "document".
>
> correct, typo on my part -- sorry.
>
> : > The dissatisfaction you expressed with this approach confuses me...
> : >
> : Really ?
> : If you have many expiring docs
>
> ...you didn't seem to finish that thought so I'm still not really sure
> what your suggestion is in terms of why an alternative would be more
> efficient.
>
Sorry about that. The reason why (I think/thought) it won't be as
efficient is that in some cases, like mine, all docs will expire
rather fast (30 minutes in my case), so there will be a large number of
"deletes", which I thought were expensive.

So, if RocksDB did it this way, it would have to keep one index on the
ttl-timestamp and then issue two deletes (to delete the index entry and the
original row). In Lucene, because the storage is different, this is ~just a
deleted_bitmap[x]=1, which, if you disable the translog fsync (only for the
ttl-delete), should be really fast and non-blocking (my issue).

So, the other way this can be made better, in my opinion (if the
optimization is not already there),
is to make the 'delete-query' on ttl-documents not force an fsync of the
translog to disk (so it is still written to the translog, but with no fsync).
When another index/delete happens, it will also fsync the translog of the
previous 'delete ttl query'.
If the server crashes, meaning we lost those deletes because the translog
wasn't fsynced to disk, then a thread can run on startup to recheck
ttl-deletes.
This will make the delete-query come "free" in terms of disk fsyncs on the
translog.
Makes sense?


>
> : "For example, with the configuration below the
> : DocExpirationUpdateProcessorFactory will create a timer thread that
> wakes
> : up every 30 seconds. When the timer triggers, it will execute a
> : *deleteByQuery* command to *remove any documents* with a value in the
> : press_release_expiration_date field value that is in the past "
>
> that document is describing a *logical* deletion as I mentioned before --
> the documents are "removed" in the sense that they are flagged "not alive" and
> won't be included in future searches, but the data still lives in the
> segments on disk until a future merge.  (That is end-user documentation,
> focusing on the effects as perceived by clients -- the concept of "delete"
> from a low-level storage implementation is a much more involved concept
> that affects any discussion of "deleting" documents in Solr, not just
> TTL-based deletes)
>
> : > 1) nothing would ensure that docs *ever* get removed during periods
> when
> : > docs aren't being added (thus no new segments, thus no merging)
> : >
> : This can be done with a periodic/smart thread that wakes up every 'ttl'
> and
> : checks min-max (or histogram) of timestamps on segments. If there are a
> : lot, do merge (or just delete the whole dead segment). At least that's
> how
> : those systems do it.
>
> OK -- with lucene/solr today we have the ConcurrentMergeScheduler which
> will watch for segments that have many (logically deleted) documents
> flaged "not alive" and will proactively merge those segments when the
> number of docs is above some configured/default threshold -- but to
> automatically flag those documents as "deleted" you need something like
> what solr is doing today.
>
I knew it checks "should we be merging". This would just be another clause.

>
>
> Again: i really feel like the only disconnect here is terminology.
>
> You're describing a background thread that wakes up periodically, scans
> the docs in each segment to see if they have an expire field > $now, and
> based on the size of the set of matches merges some segments and expunges
> the docs that were in that set.  For segments that aren't merged, docs
> stay put and are excluded from queries only by filters specified at
> request time.
>
> What Solr/Lucene has are 2 background threads: one wakes up periodically,
> scans the docs in the index to see if the expire field > $now and if so
> flags them as being "not alive" so they don't match queries at request
> time. A second thread checks each segment to see how many docs are marked
> "not alive" -- either by the previous thread or by some other form of
> (logical) deletion -- and merges some of those segments, expunging the
> docs that were marked "not alive".  For segments that aren't merged, the
> "not alive" docs are still in the segment, but the "not alive" flag
> automatically excludes them from queries.
>
Yes, I knew it functions that way.
The ~whole~ misunderstanding is that the delete is more efficient than I
thought. The whole reason why the other storage engines did it "the other
way" is the efficiency of the delete on those engines.

>
>
>
> -Hoss
> http://www.lucidworks.com/
>