[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-26 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592994#comment-16592994
 ] 

Shawn Heisey commented on SOLR-12691:
-

I did an experiment.  Downloaded 7.4.0.  Created core1 through core25, all 
transient and not loading on startup.  Set transientCacheSize to 5.

Accessed some cores in this order with the admin UI dropdown:

core1
core10
core11
core12
core13
core14

When core14 was accessed, Solr unloaded core1, exactly as expected.

With the admin UI, I did some queries on core10, and submitted an update on 
core11.  Then I accessed core17 from the admin UI dropdown.

If the core unloading were LRU in the way that I think it should work, it would 
have been core12 that got unloaded when core17 was accessed.  It was core10 
that was unloaded, because it was the oldest entry in the LinkedHashMap.  The 
fact that I had made queries on core10 did not make any difference, and I think 
it should have.
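The two eviction orders can be compared outside Solr with a small simulation. This is a hypothetical sketch, not Solr's transient cache code. The class and method names are made up, the cache size of 5 mirrors the experiment, and it assumes that a query or an update on a loaded core results in a get() on the backing LinkedHashMap. Under insertion order the second eviction is core10, matching what was observed; under access order it is core12, matching what was expected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the experiment above; not Solr's transient cache class.
class EvictionReplay {

    /** Bounded map that records evicted keys instead of closing real cores. */
    static final class BoundedCache extends LinkedHashMap<String, String> {
        private final int maxSize;
        final List<String> evicted = new ArrayList<>();

        BoundedCache(int maxSize, boolean accessOrder) {
            super(16, 0.75f, accessOrder);
            this.maxSize = maxSize;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            if (size() > maxSize) {
                evicted.add(eldest.getKey());
                return true;
            }
            return false;
        }
    }

    /** Replays the access sequence from the comment and returns evicted cores in order. */
    static List<String> replay(boolean accessOrder) {
        BoundedCache cache = new BoundedCache(5, accessOrder);
        // Selecting each core in the admin UI dropdown loads it into the cache.
        for (String core : new String[] {"core1", "core10", "core11", "core12", "core13", "core14"}) {
            cache.put(core, core);
        }
        // Assumption: the query on core10 and the update on core11 each do a get().
        cache.get("core10");
        cache.get("core11");
        cache.put("core17", "core17");
        return cache.evicted;
    }

    public static void main(String[] args) {
        System.out.println("insertion order: " + replay(false)); // [core1, core10]
        System.out.println("access order:    " + replay(true));  // [core1, core12]
    }
}
```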


> Index corruption when sending updates to multiple cores, if those cores can 
> get unloaded by LotsOfCores
> ---
>
> Key: SOLR-12691
> URL: https://issues.apache.org/jira/browse/SOLR-12691
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.4
>Reporter: Shawn Heisey
>Priority: Minor
> Attachments: SOLR-12691-test.patch
>
>
> When the LotsOfCores setting 'transientCacheSize' results in the unloading of 
> cores that are getting updates, the indexes in one or more of those cores can 
> get corrupted.
> How to reproduce:
>  * Set the "transientCacheSize" to 1
>  * Create two cores that are both set to transient.
>  * Ingest high load to core1 only (no issue at this time)
>  * Continue ingest high load to core1 and start ingest load to core2 
> simultaneously (core2 immediately corrupted)
> Error with stacktrace:
> {noformat}
> 2018-08-16 23:02:31.212 ERROR (qtp225472281-4098) [   x:aggregator-core-be43376de27b1675562841f64c498] o.a.s.u.SolrIndexWriter Error closing IndexWriter
> java.nio.file.NoSuchFileException: /opt/solr/volumes/data1/4cf838d4b9e4675-core-897/index/_2_Lucene50_0.pos
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:1.8.0_162]
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_162]
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_162]
> at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:1.8.0_162]
> at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144) ~[?:1.8.0_162]
> at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99) ~[?:1.8.0_162]
> at java.nio.file.Files.readAttributes(Files.java:1737) ~[?:1.8.0_162]
> at java.nio.file.Files.size(Files.java:2332) ~[?:1.8.0_162]
> at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.store.NRTCachingDirectory.fileLength(NRTCachingDirectory.java:128) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.SegmentCommitInfo.sizeInBytes(SegmentCommitInfo.java:217) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.MergePolicy.size(MergePolicy.java:558) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.TieredMergePolicy.getSegmentSizes(TieredMergePolicy.java:279) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.TieredMergePolicy.findMerges(TieredMergePolicy.java:300) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2199) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2162) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3571) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1028) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1071) ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:51:45]
> at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:286) [solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz - 2018-06-18 16:55:13]

[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-25 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592776#comment-16592776
 ] 

Erick Erickson commented on SOLR-12691:
---

[~elyograg]

It _is_ an LRU cache. The LinkedHashMap is created as:

LinkedHashMap(Math.min(cacheSize, 1000), 0.75f, true)

From the Javadocs:
{quote}public LinkedHashMap(int initialCapacity, float loadFactor, boolean 
accessOrder)
 Constructs an empty {{LinkedHashMap}} instance with the specified initial 
capacity, load factor and ordering mode.
 Parameters:
 {{initialCapacity}} - the initial capacity
 {{loadFactor}} - the load factor
 {{accessOrder}} - the ordering mode - {{true}} for access-order, {{false}} for 
insertion-order
{quote}
The tests don't show that explicitly, though. Here's a slight change to 
TestLazyCores, showing that the transient cache is LRU, that should be 
incorporated into any fix here.
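The TestLazyCores change itself is not reproduced in this message. As a stand-alone illustration (not that patch), the check below shows which operations count as an "access" for an access-ordered LinkedHashMap: get() moves the entry to the end of the iteration order, while containsKey() and iterating a collection view such as values() leave the order unchanged. So access order only refreshes a core when the code path touching it actually calls one of the access methods. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical check of LinkedHashMap access-order semantics; not the TestLazyCores patch.
class AccessOrderCheck {

    /** Returns the key order of an access-ordered map after touching "a" with the given operation. */
    static List<String> orderAfter(String operation) {
        // Same constructor shape as quoted above, with cacheSize assumed to be 3.
        Map<String, Integer> map = new LinkedHashMap<>(Math.min(3, 1000), 0.75f, true);
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        switch (operation) {
            case "get":
                map.get("a"); // counts as an access: "a" moves to the end
                break;
            case "containsKey":
                map.containsKey("a"); // not an access: order unchanged
                break;
            case "values":
                for (Integer ignored : map.values()) {
                    // iterating a collection view does not reorder entries
                }
                break;
            default:
                throw new IllegalArgumentException(operation);
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(orderAfter("get"));         // [b, c, a]
        System.out.println(orderAfter("containsKey")); // [a, b, c]
        System.out.println(orderAfter("values"));      // [a, b, c]
    }
}
```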
  

 


[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-25 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592767#comment-16592767
 ] 

Shawn Heisey commented on SOLR-12691:
-

An additional thought:  Even if the problem can be found and fixed so the 
two-core reproduction scenario works perfectly, I can tell you that the 
performance will be *awful* as Solr continually unloads and loads cores.  The 
same thing might happen in the real-world scenario.  That would likely be 
preferable to index corruption, though.

[~erickerickson], should we open another issue to improve how Solr chooses which 
transient core to unload?  Do it on an LRU basis, instead of load order?  I was 
looking into request handler code.  A number of request handlers implement 
SolrCoreAware, but it's done on an individual handler basis, not on 
RequestHandlerBase.  If the base class were to implement SolrCoreAware and 
handle updating the timestamp, I think the handler code would be overall 
cleaner, and we might be in a better position to make LRU unloading happen.
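A minimal sketch of the timestamp idea, assuming a single shared hook (such as a common request handler base class) calls touch() on every request, queries and updates alike. CoreUsageTracker, touch(), remove(), and leastRecentlyUsed() are hypothetical names, not Solr APIs. A monotonically increasing counter is used instead of wall-clock time so that two requests in the same millisecond cannot tie.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch: records a "last used" stamp per core and picks the
 * least recently used core as the unload candidate. Not a Solr API.
 */
class CoreUsageTracker {
    // Logical clock: strictly increasing, so no two touches share a stamp.
    private final AtomicLong clock = new AtomicLong();
    private final Map<String, Long> lastUsed = new ConcurrentHashMap<>();

    /** Would be called once per request, from whatever hook all handlers share. */
    void touch(String coreName) {
        lastUsed.put(coreName, clock.incrementAndGet());
    }

    /** Forget a core once it has actually been unloaded. */
    void remove(String coreName) {
        lastUsed.remove(coreName);
    }

    /** The core with the oldest stamp, or empty if nothing has been recorded. */
    Optional<String> leastRecentlyUsed() {
        return lastUsed.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        CoreUsageTracker tracker = new CoreUsageTracker();
        for (String core : new String[] {"core10", "core11", "core12", "core13", "core14"}) {
            tracker.touch(core); // cores loaded in this order
        }
        tracker.touch("core10"); // query on core10
        tracker.touch("core11"); // update on core11
        System.out.println(tracker.leastRecentlyUsed().orElse("none")); // core12
    }
}
```

The per-request cost is a single concurrent map write; the linear scan only happens when an unload candidate is needed.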


[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-24 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592199#comment-16592199
 ] 

Erick Erickson commented on SOLR-12691:
---

Shawn is largely correct here; this is the first time I've heard of this issue.

If you want to dig into the code and see if you can find a way to fix it, I'd 
be happy to review any patches you'd care to attach to this JIRA.


[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-24 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591917#comment-16591917
 ] 

Shawn Heisey commented on SOLR-12691:
-

The person who found the problem asked this on the mailing list:
{quote}
I see it's marked as minor. Can we bump up the priority please ?
{quote}

Here is my response:

I don't consider it a high priority item.  It's going to take a lot of effort 
to actually find the cause and fix it.  So far you're the only person that I've 
heard of that's tried it and had a failure.  There may of course be others who 
never said anything, but it is certainly not a common use-case.

[~erickerickson] is the person who wrote the code here, and this was not a 
use-case that was ever considered.  So as far as I can tell, in the eyes of the 
person who created the transient core feature, you're using it incorrectly.  
This is another reason for the priority choice.

If [~erickerickson] disagrees with my view, he is free to change the priority 
and I will not complain.



[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-22 Thread Erick Erickson (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589591#comment-16589591
 ] 

Erick Erickson commented on SOLR-12691:
---

{quote}Queries very likely update a timestamp within the core that marks it as 
most-recently-used, but do updates do the same?
{quote}
Yes. The key is that the transient cache is a LinkedHashMap with 
removeEldestEntry overridden. Any access to that structure moves the accessed 
core to the end of the list, so it's now the last one to be aged out.
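A minimal sketch of the structure described above, using hypothetical names (TransientCacheSketch, onEvict); it is not Solr's transient cache class. The map is built like `LinkedHashMap(Math.min(cacheSize, 1000), 0.75f, true)`, removeEldestEntry() hands the evicted value to a callback (where a real implementation would close the core), and both the query and update paths are assumed to go through the same get(). The methods are synchronized because, in an access-ordered LinkedHashMap, get() is itself a structural modification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/**
 * Hypothetical transient cache: an access-ordered LinkedHashMap whose
 * removeEldestEntry override passes the evicted value to a callback.
 * Not Solr's actual implementation.
 */
class TransientCacheSketch<V> {
    private final Map<String, V> cache;

    TransientCacheSketch(int cacheSize, Consumer<V> onEvict) {
        this.cache = new LinkedHashMap<String, V>(Math.min(cacheSize, 1000), 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                if (size() > cacheSize) {
                    onEvict.accept(eldest.getValue()); // e.g. close the evicted core
                    return true;
                }
                return false;
            }
        };
    }

    /** Assumed shared lookup: queries and updates both call this, so both refresh recency. */
    synchronized V get(String name) {
        return cache.get(name);
    }

    synchronized void put(String name, V value) {
        cache.put(name, value);
    }

    public static void main(String[] args) {
        TransientCacheSketch<String> cache =
                new TransientCacheSketch<>(1, core -> System.out.println("closing " + core));
        cache.put("core1", "core1");
        cache.put("core2", "core2"); // prints "closing core1"
    }
}
```

With a cache size of 1, as in the reproduction steps, loading a second core evicts the first immediately. The sketch makes no attempt to coordinate eviction with requests still in flight on the evicted core.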


[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-22 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589589#comment-16589589
 ] 

Shawn Heisey commented on SOLR-12691:
-

Looking through the code, I can't see anything that tracks or uses a 
most-recently-used timestamp.  So I'm wondering how LotsOfCores actually 
decides which core(s) to close, and whether that might need some work.


[jira] [Commented] (SOLR-12691) Index corruption when sending updates to multiple cores, if those cores can get unloaded by LotsOfCores

2018-08-22 Thread Shawn Heisey (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589582#comment-16589582
 ] 

Shawn Heisey commented on SOLR-12691:
-

Thoughts:

This scenario was not considered when the LotsOfCores functionality was 
built.  If system resources are too scarce to keep every core that is receiving 
updates loaded at the same time, this problem is likely.

Queries very likely update a timestamp within the core that marks it as 
most-recently-used, but do updates do the same?  If not, fixing that might make 
the issue much less likely in the production setup of the user who reported 
this problem on the mailing list.  That user has 75 cores (a number that will 
grow) and a transientCacheSize of 32, and has indicated that most of those 
cores are unlikely to see simultaneous updates.  Making sure that update 
requests refresh a core's most-recently-used information might be an easy win 
for real-world usage.
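The idea of having an update refresh a core's recency exactly as a query does can be sketched with java.util.LinkedHashMap in access-order mode.  This is a minimal hypothetical sketch, not Solr's actual transient core cache: the class TransientCoreLru, its touch() method, and the use of plain core names in place of SolrCore objects are all assumptions made for illustration.  It only models the eviction order; it does nothing to prevent a core from being closed while updates are still in flight, which is where the corruption happens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical LRU model of a transient core cache. Core names stand in for
// SolrCore objects; this is NOT Solr's real implementation.
public class TransientCoreLru {
    private final int transientCacheSize;
    private final List<String> unloaded = new ArrayList<>();
    private final LinkedHashMap<String, String> cores;

    public TransientCoreLru(int transientCacheSize) {
        this.transientCacheSize = transientCacheSize;
        // accessOrder = true: get() moves an entry to the most-recently-used
        // end; the default (false) keeps pure insertion order.
        this.cores = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() > TransientCoreLru.this.transientCacheSize) {
                    // A real cache would close the core here, ideally only
                    // after in-flight updates finish; we just record it.
                    unloaded.add(eldest.getKey());
                    return true;
                }
                return false;
            }
        };
    }

    // Called for BOTH queries and updates, so either kind of request
    // marks the core as most-recently-used.
    public synchronized void touch(String coreName) {
        // get() (unlike containsKey()) counts as an access and reorders.
        if (cores.get(coreName) == null) {
            cores.put(coreName, coreName); // "load" the core
        }
    }

    public synchronized boolean isLoaded(String coreName) {
        return cores.containsKey(coreName);
    }

    public synchronized List<String> unloaded() {
        return new ArrayList<>(unloaded);
    }

    public static void main(String[] args) {
        TransientCoreLru lru = new TransientCoreLru(2);
        lru.touch("core1"); // load core1
        lru.touch("core2"); // load core2
        lru.touch("core1"); // an update to core1 refreshes its recency
        lru.touch("core3"); // cache is full: least-recently-used core2 goes
        System.out.println(lru.unloaded()); // prints [core2]
    }
}
```

Because get() on an access-ordered LinkedHashMap moves the entry to the most-recently-used end, the update to core1 in main() saves it and core2 is unloaded instead.  In the default insertion-order mode, core1 would have been evicted even though it had just been used.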

Finding the cause of this problem might prove to be very difficult.  Depending 
on what's found, fixing it might be even harder.  It might never be possible to 
fix the two-core setup that reproduces this problem.

