Re: Merging of index in Solr
Hi, I found that in IndexMergeTool.java there is this line, which sets maxNumSegments to 1: writer.forceMerge(1); Does this mean that there will always be only 1 segment after the merging? Is there any way to allow the merge to produce multiple segments, with each segment of a certain size? For example, if we want each segment to be 20GB? Regards, Edwin On 23 November 2017 at 20:35, Zheng Lin Edwin Yeo wrote: > Hi Shawn, > > Thanks for the info. We will most likely be doing sharding when we migrate > to Solr 7.1.0, and re-index the data. > > But as Solr 7.1.0 is still not ready to index EML files yet due to this > JIRA, https://issues.apache.org/jira/browse/SOLR-11622, we have to make > do with our current Solr 6.5.1 first, which was already created without > sharding from the start. > > Regards, > Edwin > > On 23 November 2017 at 12:50, Shawn Heisey wrote: > >> On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote: >> >>> I'm doing the merging on the SSD drive, the speed should be ok? >>> >> >> The speed of virtually all modern disks will have almost no influence on >> the speed of the merge. The bottleneck isn't disk transfer speed, it's the >> operation of the merge code in Lucene. >> >> As I said earlier in this thread, a merge is **NOT** just a copy. Lucene >> must completely rebuild the data structures of the index to incorporate all >> of the segments of the source indexes into a single segment in the target >> index, while simultaneously *excluding* information from documents that >> have been deleted. >> >> The best speed I have ever personally seen for a merge is 30 megabytes >> per second. This is far below the sustained transfer rate of a typical >> modern SATA disk. SSD is capable of far faster data transfer ...but it >> will NOT make merges go any faster.
>> >> We need to merge because the data are indexed in two different >>> collections, >>> and we need them to be under the same collection, so that we can do >>> things >>> like faceting more accurately. >>> Will sharding alone achieve this? Or do we have to merge first before we >>> do >>> the sharding? >>> >> >> If you want the final index to be sharded, it's typically best to index >> from scratch into a new empty collection that has the number of shards you >> want. The merging tool you're using isn't aware of concepts like shards. >> It combines everything into a single index. >> >> It's not entirely clear what you're asking with the question about >> sharding alone. Making a guess: I have never heard of facet accuracy >> being affected by whether or not the index is sharded. If that *is* >> possible, then I would expect an index that is NOT sharded to have better >> accuracy. >> >> Thanks, >> Shawn >> >> >
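[Editor's note on the segment-size question above: IndexMergeTool hard-codes writer.forceMerge(1), so it always asks Lucene for a single segment; a modified copy could call writer.forceMerge(n) to allow up to n segments, but forceMerge ignores any maximum-segment-size setting. For ordinary (non-forced) merging, the segment size cap comes from TieredMergePolicy, whose default is about 5GB. A hedged sketch of a solrconfig.xml fragment targeting ~20GB segments, using the Solr 6.x mergePolicyFactory syntax; the value is illustrative:]

```xml
<!-- Sketch: cap naturally merged segments at roughly 20 GB (20480 MB). -->
<!-- Note: forceMerge/optimize does NOT respect this cap. -->
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <double name="maxMergedSegmentMB">20480</double>
  </mergePolicyFactory>
</indexConfig>
```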
Re: Merging of index in Solr
Hi Shawn, Thanks for the info. We will most likely be doing sharding when we migrate to Solr 7.1.0, and re-index the data. But as Solr 7.1.0 is still not ready to index EML files yet due to this JIRA, https://issues.apache.org/jira/browse/SOLR-11622, we have to make do with our current Solr 6.5.1 first, which was already created without sharding from the start. Regards, Edwin On 23 November 2017 at 12:50, Shawn Heisey wrote: > On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote: > >> I'm doing the merging on the SSD drive, the speed should be ok? >> > > The speed of virtually all modern disks will have almost no influence on > the speed of the merge. The bottleneck isn't disk transfer speed, it's the > operation of the merge code in Lucene. > > As I said earlier in this thread, a merge is **NOT** just a copy. Lucene > must completely rebuild the data structures of the index to incorporate all > of the segments of the source indexes into a single segment in the target > index, while simultaneously *excluding* information from documents that > have been deleted. > > The best speed I have ever personally seen for a merge is 30 megabytes per > second. This is far below the sustained transfer rate of a typical modern > SATA disk. SSD is capable of far faster data transfer ...but it will NOT > make merges go any faster. > > We need to merge because the data are indexed in two different collections, >> and we need them to be under the same collection, so that we can do things >> like faceting more accurately. >> Will sharding alone achieve this? Or do we have to merge first before we >> do >> the sharding? >> > > If you want the final index to be sharded, it's typically best to index > from scratch into a new empty collection that has the number of shards you > want. The merging tool you're using isn't aware of concepts like shards. > It combines everything into a single index. > > It's not entirely clear what you're asking with the question about > sharding alone.
Making a guess: I have never heard of facet accuracy > being affected by whether or not the index is sharded. If that *is* > possible, then I would expect an index that is NOT sharded to have better > accuracy. > > Thanks, > Shawn > >
Re: Merging of index in Solr
On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote: I'm doing the merging on the SSD drive, the speed should be ok? The speed of virtually all modern disks will have almost no influence on the speed of the merge. The bottleneck isn't disk transfer speed, it's the operation of the merge code in Lucene. As I said earlier in this thread, a merge is **NOT** just a copy. Lucene must completely rebuild the data structures of the index to incorporate all of the segments of the source indexes into a single segment in the target index, while simultaneously *excluding* information from documents that have been deleted. The best speed I have ever personally seen for a merge is 30 megabytes per second. This is far below the sustained transfer rate of a typical modern SATA disk. SSD is capable of far faster data transfer ...but it will NOT make merges go any faster. We need to merge because the data are indexed in two different collections, and we need them to be under the same collection, so that we can do things like faceting more accurately. Will sharding alone achieve this? Or do we have to merge first before we do the sharding? If you want the final index to be sharded, it's typically best to index from scratch into a new empty collection that has the number of shards you want. The merging tool you're using isn't aware of concepts like shards. It combines everything into a single index. It's not entirely clear what you're asking with the question about sharding alone. Making a guess: I have never heard of facet accuracy being affected by whether or not the index is sharded. If that *is* possible, then I would expect an index that is NOT sharded to have better accuracy. Thanks, Shawn
Re: Merging of index in Solr
Hi Erick, Yes, we are planning to do sharding when we upgrade to the newer Solr 7.1.0, and will probably re-index everything. But currently we are waiting for certain issues with indexing EML files in Solr 7.1.0 to be addressed first, like this JIRA, https://issues.apache.org/jira/browse/SOLR-11622, which currently gives the following error when indexing EML files: java.lang.NoClassDefFoundError: org/apache/james/mime4j/stream/MimeConfig$Builder Meanwhile, as we are still on Solr 6.5.1, we plan to just merge the index, so that customers can continue to access the current index. The re-indexing will likely take 3 to 4 weeks too, given the size of the data. Also, is there any way to do sharding for our current index size of 3.5TB, or is re-indexing the only way? Regards, Edwin On 23 November 2017 at 09:31, Erick Erickson wrote: > Sure, sharding can give you accurate faceting, although do note there > are nuances, JSON faceting can occasionally be not exact, although > there are JIRAs being worked on to correct this. > > "traditional" faceting has a refinement phase that gets accurate counts. > > But the net-net is that I believe your merging is just the first of > many problems you'll encounter with indexes this size and starting > over with a reasonable sharding strategy is likely the fastest path to > what you want. > > Merging indexes isn't going to work for you though, you'll have to > create a new collection and reindex everything. As a straw-man > recommendation, I'd put no more than 200G on each shard in terms of > index size. > > Best, > Erick > > On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo > wrote: > > I'm doing the merging on the SSD drive, the speed should be ok? > > > > We need to merge because the data are indexed in two different > collections, > > and we need them to be under the same collection, so that we can do > things > > like faceting more accurately. > > Will sharding alone achieve this?
Or do we have to merge first before we > do > > the sharding? > > > > Regards, > > Edwin > > > > On 23 November 2017 at 01:32, Erick Erickson > > wrote: > > > >> Really, let's back up here though. This sure seems like an XY problem. > >> You're merging indexes that will eventually be something on the order > >> of 3.5TB. I claim that an index of that size is very difficult to work > >> with effectively. _Why_ do you want to do this? Do you have any > >> evidence that you'll be able to effectively use it? > >> > >> And Shawn tells you that the result will be one large segment. If you > >> replace documents in that index, it will consist of around 3.4975T > >> wasted space before the segment is merged, see: > >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/. > >> > >> You already know that merging is extremely painful. This sure seems > >> like a case where the evidence is mounting that you would be far > >> better off sharding and _not_ merging. > >> > >> FWIW, > >> Erick > >> > >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey > wrote: > >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: > >> >> I am using the IndexMergeTool from Solr, from the command below: > >> >> > >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar > >> >> org.apache.lucene.misc.IndexMergeTool > >> >> > >> >> The heap size is 32GB. There are more than 20 million documents in > the > >> two > >> >> cores. > >> > > >> > I have looked at IndexMergeTool, and confirmed that it does its job in > >> > exactly the same way that Solr does an optimize, so I would still > expect > >> > a rate of 20 to 30 MB per second, unless it's running on REALLY old > >> > hardware that can't transfer data that quickly. > >> > > >> > Thanks, > >> > Shawn > >> > > >> >
Re: Merging of index in Solr
Sure, sharding can give you accurate faceting, although do note there are nuances, JSON faceting can occasionally be not exact, although there are JIRAs being worked on to correct this. "traditional" faceting has a refinement phase that gets accurate counts. But the net-net is that I believe your merging is just the first of many problems you'll encounter with indexes this size and starting over with a reasonable sharding strategy is likely the fastest path to what you want. Merging indexes isn't going to work for you though, you'll have to create a new collection and reindex everything. As a straw-man recommendation, I'd put no more than 200G on each shard in terms of index size. Best, Erick On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo wrote: > I'm doing the merging on the SSD drive, the speed should be ok? > > We need to merge because the data are indexed in two different collections, > and we need them to be under the same collection, so that we can do things > like faceting more accurately. > Will sharding alone achieve this? Or do we have to merge first before we do > the sharding? > > Regards, > Edwin > > On 23 November 2017 at 01:32, Erick Erickson > wrote: > >> Really, let's back up here though. This sure seems like an XY problem. >> You're merging indexes that will eventually be something on the order >> of 3.5TB. I claim that an index of that size is very difficult to work >> with effectively. _Why_ do you want to do this? Do you have any >> evidence that you'll be able to effectively use it? >> >> And Shawn tells you that the result will be one large segment. If you >> replace documents in that index, it will consist of around 3.4975T >> wasted space before the segment is merged, see: >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/. >> >> You already know that merging is extremely painful.
This sure seems >> like a case where the evidence is mounting that you would be far >> better off sharding and _not_ merging. >> >> FWIW, >> Erick >> >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey wrote: >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: >> >> I am using the IndexMergeTool from Solr, from the command below: >> >> >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar >> >> org.apache.lucene.misc.IndexMergeTool >> >> >> >> The heap size is 32GB. There are more than 20 million documents in the >> two >> >> cores. >> > >> > I have looked at IndexMergeTool, and confirmed that it does its job in >> > exactly the same way that Solr does an optimize, so I would still expect >> > a rate of 20 to 30 MB per second, unless it's running on REALLY old >> > hardware that can't transfer data that quickly. >> > >> > Thanks, >> > Shawn >> > >>
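[Editor's note: Erick's 200G-per-shard straw-man works out to roughly 18 shards for a 3.5TB index (3500 / 200 = 17.5, rounded up). A hedged sketch of creating such a collection with the bin/solr tool shipped with Solr 6.x; the collection name "emails" and configset name "myconf" are made up:]

```
# Sketch: create an 18-shard collection to reindex into.
# "emails" and "myconf" are hypothetical names; adjust -replicationFactor as needed.
bin/solr create_collection -c emails -n myconf -shards 18 -replicationFactor 1
```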
Re: Merging of index in Solr
I'm doing the merging on the SSD drive, the speed should be ok? We need to merge because the data are indexed in two different collections, and we need them to be under the same collection, so that we can do things like faceting more accurately. Will sharding alone achieve this? Or do we have to merge first before we do the sharding? Regards, Edwin On 23 November 2017 at 01:32, Erick Erickson wrote: > Really, let's back up here though. This sure seems like an XY problem. > You're merging indexes that will eventually be something on the order > of 3.5TB. I claim that an index of that size is very difficult to work > with effectively. _Why_ do you want to do this? Do you have any > evidence that you'll be able to effectively use it? > > And Shawn tells you that the result will be one large segment. If you > replace documents in that index, it will consist of around 3.4975T > wasted space before the segment is merged, see: > https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/. > > You already know that merging is extremely painful. This sure seems > like a case where the evidence is mounting that you would be far > better off sharding and _not_ merging. > > FWIW, > Erick > > On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey wrote: > > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: > >> I am using the IndexMergeTool from Solr, from the command below: > >> > >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar > >> org.apache.lucene.misc.IndexMergeTool > >> > >> The heap size is 32GB. There are more than 20 million documents in the > two > >> cores. > > > > I have looked at IndexMergeTool, and confirmed that it does its job in > > exactly the same way that Solr does an optimize, so I would still expect > > a rate of 20 to 30 MB per second, unless it's running on REALLY old > > hardware that can't transfer data that quickly. > > > > Thanks, > > Shawn > > >
Re: Merging of index in Solr
Really, let's back up here though. This sure seems like an XY problem. You're merging indexes that will eventually be something on the order of 3.5TB. I claim that an index of that size is very difficult to work with effectively. _Why_ do you want to do this? Do you have any evidence that you'll be able to effectively use it? And Shawn tells you that the result will be one large segment. If you replace documents in that index, it will consist of around 3.4975T wasted space before the segment is merged, see: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/. You already know that merging is extremely painful. This sure seems like a case where the evidence is mounting that you would be far better off sharding and _not_ merging. FWIW, Erick On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey wrote: > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: >> I am using the IndexMergeTool from Solr, from the command below: >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar >> org.apache.lucene.misc.IndexMergeTool >> >> The heap size is 32GB. There are more than 20 million documents in the two >> cores. > > I have looked at IndexMergeTool, and confirmed that it does its job in > exactly the same way that Solr does an optimize, so I would still expect > a rate of 20 to 30 MB per second, unless it's running on REALLY old > hardware that can't transfer data that quickly. > > Thanks, > Shawn >
Re: Merging of index in Solr
On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote: > I am using the IndexMergeTool from Solr, from the command below: > > java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar > org.apache.lucene.misc.IndexMergeTool > > The heap size is 32GB. There are more than 20 million documents in the two > cores. I have looked at IndexMergeTool, and confirmed that it does its job in exactly the same way that Solr does an optimize, so I would still expect a rate of 20 to 30 MB per second, unless it's running on REALLY old hardware that can't transfer data that quickly. Thanks, Shawn
Re: Merging of index in Solr
Hi Emir, Yes, I am running the merging on a Windows machine. The disk is an SSD with the NTFS file system. Regards, Edwin On 22 November 2017 at 16:50, Emir Arnautović wrote: > Hi Edwin, > Quick googling suggests that this is an NTFS issue related to a large > number of file fragments, caused by a large number of files in one directory > or by huge files. Are you running this merging on a Windows machine? > > Emir > -- > Monitoring - Log Management - Alerting - Anomaly Detection > Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo > wrote: > > > > Hi, > > > > I have encountered this error during the merging of the 3.5TB of index. > > What could be the cause that led to this? > > > > Exception in thread "main" Exception in thread "Lucene Merge Thread #8" > > java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1] > > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931) > > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871) > > at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57) > > Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation > > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > > at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) > > at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) > > at sun.nio.ch.IOUtil.write(Unknown Source) > > at sun.nio.ch.FileChannelImpl.write(Unknown Source) > > at java.nio.channels.Channels.writeFullyImpl(Unknown Source) > > at java.nio.channels.Channels.writeFully(Unknown Source) > > at java.nio.channels.Channels.access$000(Unknown Source) > > at java.nio.channels.Channels$1.write(Unknown Source) > > at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419) > > at java.util.zip.CheckedOutputStream.write(Unknown Source) > > at java.io.BufferedOutputStream.flushBuffer(Unknown Source) > > at java.io.BufferedOutputStream.write(Unknown Source) > > at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) > > at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73) > > at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52) > > at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175) > > at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286) > > at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156) > > at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866) > > at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) > > at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) > > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164) > > at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216) > > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101) > > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353) > > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928) > > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) > > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation > > at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703) > > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683) > > Caused by:
Re: Merging of index in Solr
Hi Edwin, Quick googling suggests that this is an NTFS issue related to a large number of file fragments, caused by a large number of files in one directory or by huge files. Are you running this merging on a Windows machine? Emir -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo wrote: > > Hi, > > I have encountered this error during the merging of the 3.5TB of index. > What could be the cause that led to this? > > Exception in thread "main" Exception in thread "Lucene Merge Thread #8" java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1] > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931) > at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871) > at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57) > Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) > at sun.nio.ch.IOUtil.write(Unknown Source) > at sun.nio.ch.FileChannelImpl.write(Unknown Source) > at java.nio.channels.Channels.writeFullyImpl(Unknown Source) > at java.nio.channels.Channels.writeFully(Unknown Source) > at java.nio.channels.Channels.access$000(Unknown Source) > at java.nio.channels.Channels$1.write(Unknown Source) > at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419) > at java.util.zip.CheckedOutputStream.write(Unknown Source) > at java.io.BufferedOutputStream.flushBuffer(Unknown Source) > at java.io.BufferedOutputStream.write(Unknown Source) > at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) > at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73) > at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52) > at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175) > at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286) > at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156) > at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866) > at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) > at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) > at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164) > at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216) > at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101) > at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353) > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928) > at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation > at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703) > at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683) > Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) > at sun.nio.ch.IOUtil.write(Unknown Source) > at sun.nio.ch.FileChannelImpl.write(Unknown Source) > at java.nio.channels.Channels.writeFullyImpl(Unknown
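[Editor's note: on NTFS, "The requested operation could not be completed due to a file system limitation" during large writes is commonly caused by a heavily fragmented file exhausting the file record's attribute-list space. Volumes formatted with large file record segments (the /L flag, available on Windows 8 / Server 2012 and later) raise that limit. A hedged sketch of checking and reformatting; the drive letter D: is hypothetical, and format erases the volume:]

```
:: Check the current file record size (1024 bytes is the default; 4096 with /L).
fsutil fsinfo ntfsinfo D:

:: Reformat the volume with large file record segments (DESTROYS ALL DATA on D:).
format D: /FS:NTFS /L /Q
```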
Re: Merging of index in Solr
Hi, I have encountered this error during the merging of the 3.5TB of index. What could be the cause that led to this? Exception in thread "main" Exception in thread "Lucene Merge Thread #8" java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757 _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1] at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931) at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871) at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57) Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) at sun.nio.ch.IOUtil.write(Unknown Source) at sun.nio.ch.FileChannelImpl.write(Unknown Source) at java.nio.channels.Channels.writeFullyImpl(Unknown Source) at java.nio.channels.Channels.writeFully(Unknown Source) at java.nio.channels.Channels.access$000(Unknown Source) at java.nio.channels.Channels$1.write(Unknown Source) at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419) at java.util.zip.CheckedOutputStream.write(Unknown Source) at java.io.BufferedOutputStream.flushBuffer(Unknown Source) at java.io.BufferedOutputStream.write(Unknown Source) at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73) at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52) at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175) at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286) at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164) at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The requested operation could not be completed due to a file system limitation at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683) Caused by: java.io.IOException: The requested operation could not be completed due to a file system limitation at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.FileDispatcherImpl.write(Unknown Source) at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source) at sun.nio.ch.IOUtil.write(Unknown Source) at sun.nio.ch.FileChannelImpl.write(Unknown Source) at java.nio.channels.Channels.writeFullyImpl(Unknown Source) at java.nio.channels.Channels.writeFully(Unknown Source) at java.nio.channels.Channels.access$000(Unknown Source) at java.nio.channels.Channels$1.write(Unknown Source) Regards, Edwin On 22 November 2017 at 00:10, Zheng Lin Edwin Yeo wrote: > I am using the IndexMergeTool from Solr, from the command below: > > java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar > org.apache.lucene.misc.IndexMergeTool > > The heap size is 32GB. There are more than 20 million documents in the two > cores. > > Regards, > Edwin > > > > On 21 November 2017 at 21:54, Shawn Heisey wrote: > >> On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote: >> >>> Does anyone know how long
Re: Merging of index in Solr
I am using the IndexMergeTool from Solr, from the command below: java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar org.apache.lucene.misc.IndexMergeTool The heap size is 32GB. There are more than 20 million documents in the two cores. Regards, Edwin On 21 November 2017 at 21:54, Shawn Heisey wrote: > On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote: > >> Does anyone know how long the merging in Solr will usually take? >> >> I am currently merging about 3.5TB of data, and it has been running for >> more than 28 hours and it is not completed yet. The merging is running on >> SSD disk. >> > > The following will apply if you mean Solr's "optimize" feature when you > say "merging". > > In my experience, merging proceeds at about 20 to 30 megabytes per second > -- even if the disks are capable of far faster data transfer. Merging is > not just copying the data. Lucene is completely rebuilding very large data > structures, and *not* including data from deleted documents as it does so. > It takes a lot of CPU power and time. > > If we average the data rates I've seen to 25, then that would indicate > that an optimize on a 3.5TB index is going to take about 39 hours, and might take > as long as 48 hours. And if you're running SolrCloud with multiple > replicas, multiply that by the number of copies of the 3.5TB index. An > optimize on a SolrCloud collection handles one shard replica at a time and > works its way through the entire collection. > > If you are merging different indexes *together*, which a later message > seems to state, then the actual Lucene operation is probably nearly > identical, but I'm not really familiar with it, so I cannot say for sure. > > Thanks, > Shawn > >
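[Editor's note: the command in the message above omits its arguments; IndexMergeTool takes the target index directory followed by two or more source index directories. A hedged sketch of a full Windows invocation — the paths are hypothetical, and the target directory must not be a live Solr core:]

```
java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar ^
  org.apache.lucene.misc.IndexMergeTool ^
  E:\merged\index E:\collection1\data\index E:\collection2\data\index
```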
Re: Merging of index in Solr
On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote:
> Does anyone know how long the merging in Solr will usually take?
>
> I am currently merging about 3.5TB of data, and it has been running for
> more than 28 hours and it is not completed yet. The merging is running on
> SSD disk.

The following will apply if you mean Solr's "optimize" feature when you
say "merging".

In my experience, merging proceeds at about 20 to 30 megabytes per second
-- even if the disks are capable of far faster data transfer. Merging is
not just copying the data. Lucene is completely rebuilding very large data
structures, and *not* including data from deleted documents as it does so.
It takes a lot of CPU power and time.

If we average the data rates I've seen to 25, then that would indicate
that an optimize on a 3.5TB index is going to take about 39 hours, and
might take as long as 48 hours. And if you're running SolrCloud with
multiple replicas, multiply that by the number of copies of the 3.5TB
index. An optimize on a SolrCloud collection handles one shard replica at
a time and works its way through the entire collection.

If you are merging different indexes *together*, which a later message
seems to state, then the actual Lucene operation is probably nearly
identical, but I'm not really familiar with it, so I cannot say for sure.

Thanks,
Shawn
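Shawn's back-of-envelope numbers check out; a quick sketch of the arithmetic behind the 39-to-48-hour estimate:

```python
# Sanity check of the estimate: 3.5 TB merged at the observed
# 20-30 MB/s throughput (25 MB/s as the average case).
index_mb = 3.5 * 1_000_000  # 3.5 TB expressed in megabytes (decimal units)

# hours of merging at each throughput
estimates = {rate: index_mb / rate / 3600 for rate in (20, 25, 30)}
for rate, hours in sorted(estimates.items()):
    print(f"{rate} MB/s -> {hours:.1f} hours")
```

At 25 MB/s this gives roughly 39 hours, and the slow end of the range (20 MB/s) lands near Shawn's 48-hour upper bound, so the 28-hours-and-counting report is not by itself alarming.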
Re: Merging of index in Solr
Hi Edwin,
I'll let somebody with more knowledge about merging comment on the merge
aspects. What do you use to merge those cores - the merge tool, or do you
run it using Solr's core API? What is the heap size? How many documents are
in those two cores?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 21 Nov 2017, at 14:20, Zheng Lin Edwin Yeo wrote:
>
> Hi Emir,
>
> Thanks for your reply.
>
> There is only 1 host, 1 node and 1 shard for these 3.5TB.
> The merging has already written an additional 3.5TB to another segment.
> However, it is still not a single segment, and the size of the folder
> where the merged index is supposed to be is now 4.6TB. This excludes the
> original 3.5TB, meaning it is already using up 8.1TB of space, but the
> merging is still going on.
>
> The index is currently update-free. We have only indexed the data in 2
> different collections, and we now need to merge them into a single
> collection.
>
> Regards,
> Edwin
>
> On 21 November 2017 at 16:52, Emir Arnautović wrote:
>
>> Hi Edwin,
>> How many hosts/nodes/shards are those 3.5TB? I am not familiar with the
>> merge code, but I am trying to think what it might include, so don't
>> take any of the following as ground truth.
>> Merging will certainly include a segment rewrite, so you had better have
>> an additional 3.5TB if you are merging it to a single segment. But that
>> should not last days on SSD. My guess is that you are running on the
>> edge of your heap and doing a lot of GCs, and maybe you will OOM at some
>> point. I would guess that merging is a memory-intensive operation, and
>> even if it does not hold large structures in memory, it will probably
>> create a lot of garbage. Merging requires a lot of comparisons, so it is
>> also possible that you are exhausting CPU resources.
>> Bottom line - without more details and some monitoring tool, it is hard
>> to tell why it is taking that long.
>> And there is also the question of whether merging is a good choice in
>> your case - is the index static/update-free?
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>>> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo wrote:
>>>
>>> Hi,
>>>
>>> Does anyone know how long the merging in Solr will usually take?
>>>
>>> I am currently merging about 3.5TB of data, and it has been running for
>>> more than 28 hours and it is not completed yet. The merging is running
>>> on SSD disk.
>>>
>>> I am using Solr 6.5.1.
>>>
>>> Regards,
>>> Edwin
Re: Merging of index in Solr
Hi Emir,

Thanks for your reply.

There is only 1 host, 1 node and 1 shard for these 3.5TB.
The merging has already written an additional 3.5TB to another segment.
However, it is still not a single segment, and the size of the folder where
the merged index is supposed to be is now 4.6TB. This excludes the original
3.5TB, meaning it is already using up 8.1TB of space, but the merging is
still going on.

The index is currently update-free. We have only indexed the data in 2
different collections, and we now need to merge them into a single
collection.

Regards,
Edwin

On 21 November 2017 at 16:52, Emir Arnautović wrote:
> Hi Edwin,
> How many hosts/nodes/shards are those 3.5TB? I am not familiar with the
> merge code, but I am trying to think what it might include, so don't take
> any of the following as ground truth.
> Merging will certainly include a segment rewrite, so you had better have
> an additional 3.5TB if you are merging it to a single segment. But that
> should not last days on SSD. My guess is that you are running on the edge
> of your heap and doing a lot of GCs, and maybe you will OOM at some point.
> I would guess that merging is a memory-intensive operation, and even if it
> does not hold large structures in memory, it will probably create a lot of
> garbage. Merging requires a lot of comparisons, so it is also possible
> that you are exhausting CPU resources.
> Bottom line - without more details and some monitoring tool, it is hard to
> tell why it is taking that long.
> And there is also the question of whether merging is a good choice in your
> case - is the index static/update-free?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo wrote:
> >
> > Hi,
> >
> > Does anyone know how long the merging in Solr will usually take?
> >
> > I am currently merging about 3.5TB of data, and it has been running for
> > more than 28 hours and it is not completed yet. The merging is running
> > on SSD disk.
> >
> > I am using Solr 6.5.1.
> >
> > Regards,
> > Edwin
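The disk arithmetic in this exchange can be sketched quickly (the sizes are the ones reported in this thread; the "no deleted documents" worst case is an assumption):

```python
# Rough peak disk budget when merging two collections into one segment:
# the source indexes stay on disk until the merge completes, so peak usage
# is roughly sources + fully rewritten target. Intermediate merge passes
# can push the target side above the final size for a while (Edwin observed
# 4.6 TB written against 3.5 TB of sources, 8.1 TB in total).
source_tb = 3.5   # combined size of the two source indexes
target_tb = 3.5   # worst case: no deleted documents to drop
peak_tb = source_tb + target_tb
print(f"budget at least ~{peak_tb} TB of free space")
```

If many documents have been deleted, the rewritten target comes out smaller than the sources, but free space should still be budgeted for the worst case.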
Re: Merging of index in Solr
Hi Edwin,
How many hosts/nodes/shards are those 3.5TB? I am not familiar with the
merge code, but I am trying to think what it might include, so don't take
any of the following as ground truth.

Merging will certainly include a segment rewrite, so you had better have an
additional 3.5TB if you are merging it to a single segment. But that should
not last days on SSD. My guess is that you are running on the edge of your
heap and doing a lot of GCs, and maybe you will OOM at some point. I would
guess that merging is a memory-intensive operation, and even if it does not
hold large structures in memory, it will probably create a lot of garbage.
Merging requires a lot of comparisons, so it is also possible that you are
exhausting CPU resources.

Bottom line - without more details and some monitoring tool, it is hard to
tell why it is taking that long.

And there is also the question of whether merging is a good choice in your
case - is the index static/update-free?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo wrote:
>
> Hi,
>
> Does anyone know how long the merging in Solr will usually take?
>
> I am currently merging about 3.5TB of data, and it has been running for
> more than 28 hours and it is not completed yet. The merging is running on
> SSD disk.
>
> I am using Solr 6.5.1.
>
> Regards,
> Edwin
Merging of index in Solr
Hi,

Does anyone know how long the merging in Solr will usually take?

I am currently merging about 3.5TB of data, and it has been running for
more than 28 hours and it is not completed yet. The merging is running on
SSD disk.

I am using Solr 6.5.1.

Regards,
Edwin