Re: Merging of index in Solr

2017-11-27 Thread Zheng Lin Edwin Yeo
Hi,

I found that in IndexMergeTool.java there is this line, which sets
maxNumSegments to 1:

writer.forceMerge(1);


Does this mean that there will always be only 1 segment after the
merging?

Is there any way to allow the merge to produce multiple segments, with
each segment of a certain size? For example, if we want each segment to be
20GB?
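
A minimal sketch of one possible approach (assuming a locally modified merge
tool - the class name and cap below are illustrative, not an official
IndexMergeTool option): skip forceMerge(1) and let TieredMergePolicy merge
naturally under setMaxMergedSegmentMB, since in 6.5.1 that cap applies to
natural merges but not to forceMerge.

import java.nio.file.Paths;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical variant of IndexMergeTool: args[0] is the target index
// directory, the remaining args are source index directories.
public class SizeCappedMergeTool {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setMaxMergedSegmentMB(20 * 1024);  // aim for ~20GB segments

    IndexWriterConfig cfg = new IndexWriterConfig(null)  // no analyzer needed
        .setOpenMode(OpenMode.CREATE)
        .setMergePolicy(mp);

    try (Directory target = FSDirectory.open(Paths.get(args[0]));
         IndexWriter writer = new IndexWriter(target, cfg)) {
      Directory[] sources = new Directory[args.length - 1];
      for (int i = 1; i < args.length; i++) {
        sources[i - 1] = FSDirectory.open(Paths.get(args[i]));
      }
      writer.addIndexes(sources);  // copy the source segments into the target
      writer.maybeMerge();         // natural merges now honor the size cap
      writer.commit();
    }
  }
}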

Regards,
Edwin


On 23 November 2017 at 20:35, Zheng Lin Edwin Yeo 
wrote:

> Hi Shawn,
>
> Thanks for the info. We will most likely be doing sharding when we migrate
> to Solr 7.1.0, and re-index the data.
>
> But as Solr 7.1.0 is not yet able to index EML files due to this
> JIRA, https://issues.apache.org/jira/browse/SOLR-11622, we have to make do
> with our current Solr 6.5.1 first, which was already created without
> sharding from the start.
>
> Regards,
> Edwin
>
> On 23 November 2017 at 12:50, Shawn Heisey  wrote:
>
>> On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote:
>>
>>> I'm doing the merging on an SSD drive, so the speed should be OK?
>>>
>>
>> The speed of virtually all modern disks will have almost no influence on
>> the speed of the merge.  The bottleneck isn't disk transfer speed, it's the
>> operation of the merge code in Lucene.
>>
>> As I said earlier in this thread, a merge is **NOT** just a copy. Lucene
>> must completely rebuild the data structures of the index to incorporate all
>> of the segments of the source indexes into a single segment in the target
>> index, while simultaneously *excluding* information from documents that
>> have been deleted.
>>
>> The best speed I have ever personally seen for a merge is 30 megabytes
>> per second.  This is far below the sustained transfer rate of a typical
>> modern SATA disk.  SSD is capable of far faster data transfer ...but it
>> will NOT make merges go any faster.
>>
>>> We need to merge because the data are indexed in two different
>>> collections, and we need them to be under the same collection, so that
>>> we can do things like faceting more accurately.
>>> Will sharding alone achieve this? Or do we have to merge first before
>>> we do the sharding?
>>>
>>
>> If you want the final index to be sharded, it's typically best to index
>> from scratch into a new empty collection that has the number of shards you
>> want.  The merging tool you're using isn't aware of concepts like shards.
>> It combines everything into a single index.
>>
>> It's not entirely clear what you're asking with the question about
>> sharding alone.  Making a guess:  I have never heard of facet accuracy
>> being affected by whether or not the index is sharded.  If that *is*
>> possible, then I would expect an index that is NOT sharded to have better
>> accuracy.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: Merging of index in Solr

2017-11-23 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for the info. We will most likely be doing sharding when we migrate
to Solr 7.1.0, and re-index the data.

But as Solr 7.1.0 is not yet able to index EML files due to this
JIRA, https://issues.apache.org/jira/browse/SOLR-11622, we have to make do
with our current Solr 6.5.1 first, which was already created without
sharding from the start.

Regards,
Edwin

On 23 November 2017 at 12:50, Shawn Heisey  wrote:

> On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote:
>
>> I'm doing the merging on an SSD drive, so the speed should be OK?
>>
>
> The speed of virtually all modern disks will have almost no influence on
> the speed of the merge.  The bottleneck isn't disk transfer speed, it's the
> operation of the merge code in Lucene.
>
> As I said earlier in this thread, a merge is **NOT** just a copy. Lucene
> must completely rebuild the data structures of the index to incorporate all
> of the segments of the source indexes into a single segment in the target
> index, while simultaneously *excluding* information from documents that
> have been deleted.
>
> The best speed I have ever personally seen for a merge is 30 megabytes per
> second.  This is far below the sustained transfer rate of a typical modern
> SATA disk.  SSD is capable of far faster data transfer ...but it will NOT
> make merges go any faster.
>
>> We need to merge because the data are indexed in two different
>> collections, and we need them to be under the same collection, so that we
>> can do things like faceting more accurately.
>> Will sharding alone achieve this? Or do we have to merge first before we
>> do the sharding?
>>
>
> If you want the final index to be sharded, it's typically best to index
> from scratch into a new empty collection that has the number of shards you
> want.  The merging tool you're using isn't aware of concepts like shards.
> It combines everything into a single index.
>
> It's not entirely clear what you're asking with the question about
> sharding alone.  Making a guess:  I have never heard of facet accuracy
> being affected by whether or not the index is sharded.  If that *is*
> possible, then I would expect an index that is NOT sharded to have better
> accuracy.
>
> Thanks,
> Shawn
>
>


Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey

On 11/22/2017 6:19 PM, Zheng Lin Edwin Yeo wrote:

I'm doing the merging on an SSD drive, so the speed should be OK?


The speed of virtually all modern disks will have almost no influence on 
the speed of the merge.  The bottleneck isn't disk transfer speed, it's 
the operation of the merge code in Lucene.


As I said earlier in this thread, a merge is **NOT** just a copy. Lucene 
must completely rebuild the data structures of the index to incorporate 
all of the segments of the source indexes into a single segment in the 
target index, while simultaneously *excluding* information from 
documents that have been deleted.


The best speed I have ever personally seen for a merge is 30 megabytes 
per second.  This is far below the sustained transfer rate of a typical 
modern SATA disk.  SSD is capable of far faster data transfer ...but it 
will NOT make merges go any faster.



We need to merge because the data are indexed in two different collections,
and we need them to be under the same collection, so that we can do things
like faceting more accurately.
Will sharding alone achieve this? Or do we have to merge first before we do
the sharding?


If you want the final index to be sharded, it's typically best to index 
from scratch into a new empty collection that has the number of shards 
you want.  The merging tool you're using isn't aware of concepts like 
shards.  It combines everything into a single index.


It's not entirely clear what you're asking with the question about 
sharding alone.  Making a guess:  I have never heard of facet accuracy 
being affected by whether or not the index is sharded.  If that *is* 
possible, then I would expect an index that is NOT sharded to have 
better accuracy.


Thanks,
Shawn



Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Erick,

Yes, we are planning to do sharding when we upgrade to the newer Solr
7.1.0, and probably will re-index everything. But currently we are waiting
for certain issues with indexing EML files in Solr 7.1.0 to be addressed
first, like this JIRA, https://issues.apache.org/jira/browse/SOLR-11622,
which currently gives the following error when indexing EML files.

java.lang.NoClassDefFoundError:
org/apache/james/mime4j/stream/MimeConfig$Builder


Meanwhile, as we are still on Solr 6.5.1, we plan to just merge the index,
so that the customer can continue to access the current index. The
re-indexing will likely take 3 to 4 weeks too, given the size of the data.
Also, is there any way to do sharding for our current index size of 3.5TB,
or is re-indexing the only way?

Regards,
Edwin


On 23 November 2017 at 09:31, Erick Erickson 
wrote:

> Sure, sharding can give you accurate faceting, although do note there
> are nuances: JSON faceting can occasionally be inexact, although
> there are JIRAs being worked on to correct this.
>
> "traditional" faceting has a refinement phase that gets accurate counts.
>
> But the net-net is that I believe your merging is just the first of
> many problems you'll encounter with indexes this size, and starting
> over with a reasonable sharding strategy is likely the fastest path to
> what you want.
>
> Merging indexes isn't going to work for you though, you'll have to
> create a new collection and reindex everything. As a straw-man
> recommendation, I'd put no more than 200G on each shard in terms of
> index size.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
>  wrote:
> > I'm doing the merging on an SSD drive, so the speed should be OK?
> >
> > We need to merge because the data are indexed in two different
> > collections, and we need them to be under the same collection, so that
> > we can do things like faceting more accurately.
> > Will sharding alone achieve this? Or do we have to merge first before
> > we do the sharding?
> >
> > Regards,
> > Edwin
> >
> > On 23 November 2017 at 01:32, Erick Erickson 
> > wrote:
> >
> >> Really, let's back up here though. This sure seems like an XY problem.
> >> You're merging indexes that will eventually be something on the order
> >> of 3.5TB. I claim that an index of that size is very difficult to work
> >> with effectively. _Why_ do you want to do this? Do you have any
> >> evidence that you'll be able to effectively use it?
> >>
> >> And Shawn tells you that the result will be one large segment. If you
> >> replace documents in that index, it will contain around 3.4975TB of
> >> wasted space before the segment is merged, see:
> >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
> >>
> >> You already know that merging is extremely painful. This sure seems
> >> like a case where the evidence is mounting that you would be far
> >> better off sharding and _not_ merging.
> >>
> >> FWIW,
> >> Erick
> >>
> >> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey 
> wrote:
> >> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> >> I am using the IndexMergeTool from Solr, from the command below:
> >> >>
> >> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> >> org.apache.lucene.misc.IndexMergeTool
> >> >>
> >> >> The heap size is 32GB. There are more than 20 million documents in
> the
> >> two
> >> >> cores.
> >> >
> >> > I have looked at IndexMergeTool, and confirmed that it does its job in
> >> > exactly the same way that Solr does an optimize, so I would still
> expect
> >> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
> >> > hardware that can't transfer data that quickly.
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>


Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Sure, sharding can give you accurate faceting, although do note there
are nuances: JSON faceting can occasionally be inexact, although
there are JIRAs being worked on to correct this.

"traditional" faceting has a refinement phase that gets accurate counts.

But the net-net is that I believe your merging is just the first of
many problems you'll encounter with indexes this size, and starting
over with a reasonable sharding strategy is likely the fastest path to
what you want.

Merging indexes isn't going to work for you though, you'll have to
create a new collection and reindex everything. As a straw-man
recommendation, I'd put no more than 200G on each shard in terms of
index size.
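
As a rough illustration (the collection name and config name below are made
up; 18 = ceil(3.5TB / 200GB)), such a collection could be created up front
with the Collections API:

http://localhost:8983/solr/admin/collections?action=CREATE&name=newcollection&numShards=18&replicationFactor=1&collection.configName=myconfig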

Best,
Erick

On Wed, Nov 22, 2017 at 5:19 PM, Zheng Lin Edwin Yeo
 wrote:
> I'm doing the merging on an SSD drive, so the speed should be OK?
>
> We need to merge because the data are indexed in two different collections,
> and we need them to be under the same collection, so that we can do things
> like faceting more accurately.
> Will sharding alone achieve this? Or do we have to merge first before we do
> the sharding?
>
> Regards,
> Edwin
>
> On 23 November 2017 at 01:32, Erick Erickson 
> wrote:
>
>> Really, let's back up here though. This sure seems like an XY problem.
>> You're merging indexes that will eventually be something on the order
>> of 3.5TB. I claim that an index of that size is very difficult to work
>> with effectively. _Why_ do you want to do this? Do you have any
>> evidence that you'll be able to effectively use it?
>>
>> And Shawn tells you that the result will be one large segment. If you
>> replace documents in that index, it will contain around 3.4975TB of
>> wasted space before the segment is merged, see:
>> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
>>
>> You already know that merging is extremely painful. This sure seems
>> like a case where the evidence is mounting that you would be far
>> better off sharding and _not_ merging.
>>
>> FWIW,
>> Erick
>>
>> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
>> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
>> >> I am using the IndexMergeTool from Solr, from the command below:
>> >>
>> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
>> >> org.apache.lucene.misc.IndexMergeTool
>> >>
>> >> The heap size is 32GB. There are more than 20 million documents in the
>> two
>> >> cores.
>> >
>> > I have looked at IndexMergeTool, and confirmed that it does its job in
>> > exactly the same way that Solr does an optimize, so I would still expect
>> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
>> > hardware that can't transfer data that quickly.
>> >
>> > Thanks,
>> > Shawn
>> >
>>


Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
I'm doing the merging on an SSD drive, so the speed should be OK?

We need to merge because the data are indexed in two different collections,
and we need them to be under the same collection, so that we can do things
like faceting more accurately.
Will sharding alone achieve this? Or do we have to merge first before we do
the sharding?

Regards,
Edwin

On 23 November 2017 at 01:32, Erick Erickson 
wrote:

> Really, let's back up here though. This sure seems like an XY problem.
> You're merging indexes that will eventually be something on the order
> of 3.5TB. I claim that an index of that size is very difficult to work
> with effectively. _Why_ do you want to do this? Do you have any
> evidence that you'll be able to effectively use it?
>
> And Shawn tells you that the result will be one large segment. If you
> replace documents in that index, it will contain around 3.4975TB of
> wasted space before the segment is merged, see:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.
>
> You already know that merging is extremely painful. This sure seems
> like a case where the evidence is mounting that you would be far
> better off sharding and _not_ merging.
>
> FWIW,
> Erick
>
> On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
> > On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> >> I am using the IndexMergeTool from Solr, from the command below:
> >>
> >> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> >> org.apache.lucene.misc.IndexMergeTool
> >>
> >> The heap size is 32GB. There are more than 20 million documents in the
> two
> >> cores.
> >
> > I have looked at IndexMergeTool, and confirmed that it does its job in
> > exactly the same way that Solr does an optimize, so I would still expect
> > a rate of 20 to 30 MB per second, unless it's running on REALLY old
> > hardware that can't transfer data that quickly.
> >
> > Thanks,
> > Shawn
> >
>


Re: Merging of index in Solr

2017-11-22 Thread Erick Erickson
Really, let's back up here though. This sure seems like an XY problem.
You're merging indexes that will eventually be something on the order
of 3.5TB. I claim that an index of that size is very difficult to work
with effectively. _Why_ do you want to do this? Do you have any
evidence that you'll be able to effectively use it?

And Shawn tells you that the result will be one large segment. If you
replace documents in that index, it will contain around 3.4975TB of
wasted space before the segment is merged, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/.

You already know that merging is extremely painful. This sure seems
like a case where the evidence is mounting that you would be far
better off sharding and _not_ merging.

FWIW,
Erick

On Wed, Nov 22, 2017 at 8:45 AM, Shawn Heisey  wrote:
> On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
>> I am using the IndexMergeTool from Solr, from the command below:
>>
>> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
>> org.apache.lucene.misc.IndexMergeTool
>>
>> The heap size is 32GB. There are more than 20 million documents in the two
>> cores.
>
> I have looked at IndexMergeTool, and confirmed that it does its job in
> exactly the same way that Solr does an optimize, so I would still expect
> a rate of 20 to 30 MB per second, unless it's running on REALLY old
> hardware that can't transfer data that quickly.
>
> Thanks,
> Shawn
>


Re: Merging of index in Solr

2017-11-22 Thread Shawn Heisey
On 11/21/2017 9:10 AM, Zheng Lin Edwin Yeo wrote:
> I am using the IndexMergeTool from Solr, from the command below:
>
> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> org.apache.lucene.misc.IndexMergeTool
>
> The heap size is 32GB. There are more than 20 million documents in the two
> cores.

I have looked at IndexMergeTool, and confirmed that it does its job in
exactly the same way that Solr does an optimize, so I would still expect
a rate of 20 to 30 MB per second, unless it's running on REALLY old
hardware that can't transfer data that quickly.

Thanks,
Shawn



Re: Merging of index in Solr

2017-11-22 Thread Zheng Lin Edwin Yeo
Hi Emir,

Yes, I am running the merging on a Windows machine.
The hard disk is an SSD with the NTFS file system.
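
If it helps to narrow this down, the volume's NTFS parameters - including the
bytes per file record segment, which matters for fragmentation limits - can
be checked with (drive letter illustrative):

fsutil fsinfo ntfsinfo D: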

Regards,
Edwin

On 22 November 2017 at 16:50, Emir Arnautović 
wrote:

> Hi Edwin,
> Quick googling suggests that this is an NTFS issue related to a large
> number of file fragments, caused by a large number of files in one
> directory or by huge files. Are you running this merging on a Windows
> machine?
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > I have encountered this error during the merging of the 3.5TB of index.
> > What could be the cause that led to this?
> >
> > Exception in thread "main" Exception in thread "Lucene Merge Thread #8"
> > java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757
> > _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594
> > _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879
> > _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100
> > _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31
> > _68(6.5.1):c66 into _6g [maxNumSegments=1]
> >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
> >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
> >         at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
> > Caused by: java.io.IOException: The requested operation could not be
> > completed due to a file system limitation
> >         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> >         at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
> >         at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
> >         at sun.nio.ch.IOUtil.write(Unknown Source)
> >         at sun.nio.ch.FileChannelImpl.write(Unknown Source)
> >         at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
> >         at java.nio.channels.Channels.writeFully(Unknown Source)
> >         at java.nio.channels.Channels.access$000(Unknown Source)
> >         at java.nio.channels.Channels$1.write(Unknown Source)
> >         at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
> >         at java.util.zip.CheckedOutputStream.write(Unknown Source)
> >         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
> >         at java.io.BufferedOutputStream.write(Unknown Source)
> >         at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
> >         at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
> >         at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
> >         at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
> >         at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
> >         at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
> >         at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
> >         at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
> >         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
> >         at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
> >         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
> >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
> >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
> >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
> >         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
> >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> > The requested operation could not be completed due to a file system limitation
> >         at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
> >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
> > Caused by:

Re: Merging of index in Solr

2017-11-22 Thread Emir Arnautović
Hi Edwin,
Quick googling suggests that this is an NTFS issue related to a large number
of file fragments, caused by a large number of files in one directory or by
huge files. Are you running this merging on a Windows machine?
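
If that is indeed the cause, Microsoft's guidance for this exact error
message with very large, heavily fragmented NTFS files is - as far as I
know - to defragment the volume, and/or to reformat it with large file
record segments (note that formatting wipes the volume; drive letter
illustrative):

format D: /FS:NTFS /L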

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Nov 2017, at 02:33, Zheng Lin Edwin Yeo  wrote:
> 
> Hi,
> 
> I have encountered this error during the merging of the 3.5TB of index.
> What could be the cause that led to this?
> 
> Exception in thread "main" Exception in thread "Lucene Merge Thread #8"
> java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757
> _6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594
> _6b(6.5.1):C1015431 _6a(6.5.1):C1050220 _69(6.5.1):c273879
> _28(6.4.1):c79011/84:delGen=84 _26(6.4.1):c44960/8149:delGen=100
> _29(6.4.1):c73855/68:delGen=68 _5(6.4.1):C46672/31:delGen=31
> _68(6.5.1):c66 into _6g [maxNumSegments=1]
>         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
>         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
>         at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
> Caused by: java.io.IOException: The requested operation could not be
> completed due to a file system limitation
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.write(Unknown Source)
>         at sun.nio.ch.FileChannelImpl.write(Unknown Source)
>         at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
>         at java.nio.channels.Channels.writeFully(Unknown Source)
>         at java.nio.channels.Channels.access$000(Unknown Source)
>         at java.nio.channels.Channels$1.write(Unknown Source)
>         at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
>         at java.util.zip.CheckedOutputStream.write(Unknown Source)
>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>         at java.io.BufferedOutputStream.write(Unknown Source)
>         at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
>         at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
>         at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
>         at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
>         at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
>         at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
>         at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
>         at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
>         at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> The requested operation could not be completed due to a file system limitation
>         at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
> Caused by: java.io.IOException: The requested operation could not be
> completed due to a file system limitation
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>         at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
>         at sun.nio.ch.IOUtil.write(Unknown Source)
>         at sun.nio.ch.FileChannelImpl.write(Unknown Source)
>         at java.nio.channels.Channels.writeFullyImpl(Unknown

Re: Merging of index in Solr

2017-11-21 Thread Zheng Lin Edwin Yeo
Hi,

I have encountered this error during the merging of the 3.5TB of index.
What could be the cause that led to this?

Exception in thread "main" Exception in thread "Lucene Merge Thread #8"
java.io.IOException: background merge hit exception: _6f(6.5.1):C7256757
_6e(6.5.1):C6462072 _6d(6.5.1):C3750777 _6c(6.5.1):C2243594 _6b(6.5.1):C1015431
_6a(6.5.1):C1050220 _69(6.5.1):c273879 _28(6.4.1):c79011/84:delGen=84
_26(6.4.1):c44960/8149:delGen=100 _29(6.4.1):c73855/68:delGen=68
_5(6.4.1):C46672/31:delGen=31 _68(6.5.1):c66 into _6g [maxNumSegments=1]
        at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
        at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1871)
        at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:57)
Caused by: java.io.IOException: The requested operation could not be completed
due to a file system limitation
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.write(Unknown Source)
        at sun.nio.ch.FileChannelImpl.write(Unknown Source)
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
        at java.nio.channels.Channels.writeFully(Unknown Source)
        at java.nio.channels.Channels.access$000(Unknown Source)
        at java.nio.channels.Channels$1.write(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419)
        at java.util.zip.CheckedOutputStream.write(Unknown Source)
        at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
        at java.io.BufferedOutputStream.write(Unknown Source)
        at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53)
        at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73)
        at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52)
        at org.apache.lucene.codecs.lucene50.ForUtil.writeBlock(ForUtil.java:175)
        at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.addPosition(Lucene50PostingsWriter.java:286)
        at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:156)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:866)
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344)
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4353)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3928)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: The
requested operation could not be completed due to a file system limitation
        at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
Caused by: java.io.IOException: The requested operation could not be completed
due to a file system limitation
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.write(Unknown Source)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(Unknown Source)
        at sun.nio.ch.IOUtil.write(Unknown Source)
        at sun.nio.ch.FileChannelImpl.write(Unknown Source)
        at java.nio.channels.Channels.writeFullyImpl(Unknown Source)
        at java.nio.channels.Channels.writeFully(Unknown Source)
        at java.nio.channels.Channels.access$000(Unknown Source)
        at java.nio.channels.Channels$1.write(Unknown Source)

Regards,
Edwin

On 22 November 2017 at 00:10, Zheng Lin Edwin Yeo 
wrote:

> I am using the IndexMergeTool from Solr, from the command below:
>
> java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
> org.apache.lucene.misc.IndexMergeTool
>
> The heap size is 32GB. There are more than 20 million documents in the two
> cores.
>
> Regards,
> Edwin
>
>
>
> On 21 November 2017 at 21:54, Shawn Heisey  wrote:
>
>> On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote:
>>
>>> Does anyone know how long 

Re: Merging of index in Solr

2017-11-21 Thread Zheng Lin Edwin Yeo
I am using the IndexMergeTool from Solr, from the command below:

java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
org.apache.lucene.misc.IndexMergeTool
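
(For completeness, the tool takes the target index directory followed by the
source index directories as arguments; the paths here are illustrative:

java -classpath lucene-core-6.5.1.jar;lucene-misc-6.5.1.jar
org.apache.lucene.misc.IndexMergeTool D:\merged\index D:\core1\data\index
D:\core2\data\index)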

The heap size is 32GB. There are more than 20 million documents in the two
cores.

Regards,
Edwin



On 21 November 2017 at 21:54, Shawn Heisey  wrote:

> On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote:
>
>> Does anyone know how long merging in Solr usually takes?
>>
>> I am currently merging about 3.5TB of data, and it has been running for
>> more than 28 hours and it is not completed yet. The merging is running on
>> SSD disk.
>>
>
> The following will apply if you mean Solr's "optimize" feature when you
> say "merging".
>
> In my experience, merging proceeds at about 20 to 30 megabytes per second
> -- even if the disks are capable of far faster data transfer.  Merging is
> not just copying the data. Lucene is completely rebuilding very large data
> structures, and *not* including data from deleted documents as it does so.
> It takes a lot of CPU power and time.
>
> If we average the data rates I've seen to 25, then that would indicate
> that an optimize on a 3.5TB index is going to take about 39 hours, and might take
> as long as 48 hours.  And if you're running SolrCloud with multiple
> replicas, multiply that by the number of copies of the 3.5TB index.  An
> optimize on a SolrCloud collection handles one shard replica at a time and
> works its way through the entire collection.
>
> If you are merging different indexes *together*, which a later message
> seems to state, then the actual Lucene operation is probably nearly
> identical, but I'm not really familiar with it, so I cannot say for sure.
>
> Thanks,
> Shawn
>
>


Re: Merging of index in Solr

2017-11-21 Thread Shawn Heisey

On 11/20/2017 9:35 AM, Zheng Lin Edwin Yeo wrote:

Does anyone know how long merging in Solr usually takes?

I am currently merging about 3.5TB of data, and it has been running for
more than 28 hours and it is not completed yet. The merging is running on
SSD disk.


The following will apply if you mean Solr's "optimize" feature when you 
say "merging".


In my experience, merging proceeds at about 20 to 30 megabytes per 
second -- even if the disks are capable of far faster data transfer.  
Merging is not just copying the data. Lucene is completely rebuilding 
very large data structures, and *not* including data from deleted 
documents as it does so.  It takes a lot of CPU power and time.


If we average the data rates I've seen to 25, then that would indicate 
that an optimize on a 3.5TB index is going to take about 39 hours, and might 
take as long as 48 hours.  And if you're running SolrCloud with multiple 
replicas, multiply that by the number of copies of the 3.5TB index.  An 
optimize on a SolrCloud collection handles one shard replica at a time 
and works its way through the entire collection.


If you are merging different indexes *together*, which a later message 
seems to state, then the actual Lucene operation is probably nearly 
identical, but I'm not really familiar with it, so I cannot say for sure.
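
As a back-of-envelope check of those figures (a sketch only, taking 3.5TB as
3.5 million MB):

// Estimated merge time at the 20-30 MB/s throughput described above.
public class MergeEta {
  public static void main(String[] args) {
    double indexMB = 3.5e6;  // ~3.5 TB
    for (double mbPerSec : new double[] {20, 25, 30}) {
      System.out.printf("%.0f MB/s -> %.1f hours%n",
          mbPerSec, indexMB / mbPerSec / 3600.0);
    }
  }
}

which prints roughly 48.6, 38.9 and 32.4 hours respectively.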


Thanks,
Shawn



Re: Merging of index in Solr

2017-11-21 Thread Emir Arnautović
Hi Edwin,
I’ll let somebody with more knowledge about merge to comment merge aspects.
What do you use to merge those cores - merge tool or you run it using Solr’s 
core API? What is the heap size? How many documents are in those two cores?
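
(By "core API" I mean the CoreAdmin MERGEINDEXES action, roughly along these
lines - core name and paths illustrative:

http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=merged&indexDir=/data/core1/index&indexDir=/data/core2/index)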

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Nov 2017, at 14:20, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Emir,
> 
> Thanks for your reply.
> 
> There is only 1 host, 1 node and 1 shard for these 3.5TB.
> The merging has already written the additional 3.5TB to another segment.
> However, it is still not a single segment, and the size of the folder where
> the merged index is supposed to be is now 4.6TB. This excludes the original
> 3.5TB, meaning it is already using up 8.1TB of space, but the merging is
> still going on.
> 
> The index is currently update-free. We have only indexed the data in 2
> different collections, and we now need to merge them into a single
> collection.
> 
> Regards,
> Edwin
> 
> On 21 November 2017 at 16:52, Emir Arnautović 
> wrote:
> 
>> Hi Edwin,
>> How many hosts/nodes/shards are those 3.5TB spread across? I am not
>> familiar with the merge code, but I am trying to think about what it might
>> involve, so don't take any of the following as ground truth.
>> Merging will certainly involve rewriting segments, so you had better have
>> an additional 3.5TB free if you are merging to a single segment. But that
>> should not take days on SSD. My guess is that you are running on the edge
>> of your heap, doing a lot of GCs, and may OOM at some point. I would guess
>> that merging is a memory-intensive operation, and even if it does not hold
>> large structures in memory, it will probably create a lot of garbage.
>> Merging requires a lot of comparisons, so it is also possible that you are
>> exhausting CPU resources.
>> Bottom line - without more details and some monitoring tool, it is hard to
>> tell why it is taking this long.
>> And there is also the question of whether merging is a good choice in your
>> case - is the index static/update-free?
>> 
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo 
>> wrote:
>>> 
>>> Hi,
>>> 
>>> Does anyone know how long merging in Solr usually takes?
>>> 
>>> I am currently merging about 3.5TB of data, and it has been running for
>>> more than 28 hours and it is not completed yet. The merging is running on
>>> SSD disk.
>>> 
>>> I am using Solr 6.5.1.
>>> 
>>> Regards,
>>> Edwin
>> 
>> 



Re: Merging of index in Solr

2017-11-21 Thread Zheng Lin Edwin Yeo
Hi Emir,

Thanks for your reply.

There is only 1 host, 1 node and 1 shard for these 3.5TB.
The merging has already written the additional 3.5TB to another segment.
However, it is still not a single segment, and the size of the folder where
the merged index is supposed to be is now 4.6TB. This excludes the original
3.5TB, meaning it is already using up 8.1TB of space, but the merging is
still going on.

The index is currently update-free. We have only indexed the data in 2
different collections, and we now need to merge them into a single
collection.

Regards,
Edwin

On 21 November 2017 at 16:52, Emir Arnautović 
wrote:

> Hi Edwin,
> How many hosts/nodes/shards are those 3.5TB spread across? I am not
> familiar with the merge code, but I am trying to think about what it might
> involve, so don't take any of the following as ground truth.
> Merging will certainly involve rewriting segments, so you had better have an
> additional 3.5TB free if you are merging to a single segment. But that
> should not take days on SSD. My guess is that you are running on the edge of
> your heap, doing a lot of GCs, and may OOM at some point. I would guess that
> merging is a memory-intensive operation, and even if it does not hold large
> structures in memory, it will probably create a lot of garbage. Merging
> requires a lot of comparisons, so it is also possible that you are
> exhausting CPU resources.
> Bottom line - without more details and some monitoring tool, it is hard to
> tell why it is taking this long.
> And there is also the question of whether merging is a good choice in your
> case - is the index static/update-free?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > Does anyone know how long merging in Solr usually takes?
> >
> > I am currently merging about 3.5TB of data, and it has been running for
> > more than 28 hours and it is not completed yet. The merging is running on
> > SSD disk.
> >
> > I am using Solr 6.5.1.
> >
> > Regards,
> > Edwin
>
>


Re: Merging of index in Solr

2017-11-21 Thread Emir Arnautović
Hi Edwin,
How many hosts/nodes/shards are those 3.5TB spread across? I am not familiar
with the merge code, but I am trying to think about what it might involve, so
don't take any of the following as ground truth.
Merging will certainly involve rewriting segments, so you had better have an
additional 3.5TB free if you are merging to a single segment. But that should
not take days on SSD. My guess is that you are running on the edge of your
heap, doing a lot of GCs, and may OOM at some point. I would guess that
merging is a memory-intensive operation, and even if it does not hold large
structures in memory, it will probably create a lot of garbage. Merging
requires a lot of comparisons, so it is also possible that you are exhausting
CPU resources.
Bottom line - without more details and some monitoring tool, it is hard to
tell why it is taking this long.
And there is also the question of whether merging is a good choice in your
case - is the index static/update-free?
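
One quick way to test the GC theory while the merge is running (assuming a
JDK with jstat available; <pid> is the merging JVM's process id):

jstat -gcutil <pid> 5000

If the old-generation column (O) stays near 100% and the full GC count (FGC)
keeps climbing, the heap is likely the bottleneck.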

Regards,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 20 Nov 2017, at 17:35, Zheng Lin Edwin Yeo  wrote:
> 
> Hi,
> 
> Does anyone know how long merging in Solr usually takes?
> 
> I am currently merging about 3.5TB of data, and it has been running for
> more than 28 hours and it is not completed yet. The merging is running on
> SSD disk.
> 
> I am using Solr 6.5.1.
> 
> Regards,
> Edwin



Merging of index in Solr

2017-11-20 Thread Zheng Lin Edwin Yeo
Hi,

Does anyone know how long merging in Solr usually takes?

I am currently merging about 3.5TB of data, and it has been running for
more than 28 hours and it is not completed yet. The merging is running on
SSD disk.

I am using Solr 6.5.1.

Regards,
Edwin