Re: Optimization and Corruption Issues

2009-10-01 Thread Erick Erickson
Would it work to copy your entire index to a new directory, perhaps on a
different machine, and optimize it *there*? Then copy it back to your app. Of
course, any updates made in the meantime would be lost...

But taking a week to optimize a 20 GB index seems just plain wrong. Have you
tried playing with the various options to see if you can get better
performance? And/or allocating more memory to the JVM?

Of course, I'm not very familiar with 2.0 performance, so I can't say for sure.

Best
Erick

On Thu, Oct 1, 2009 at 11:40 AM, lowfreq wrote:

>
> I have a Lucene index that is very large.
> It was created with a pre-2.1 version of Lucene.Net (2.0.0.4).
>
> The index is currently almost 20 GB and has almost 7000 segment files.
> The problem is that I need to optimize it, and I can't do this
> without the search functionality of my app being down for a week.
>
> I used the Luke tool from getopt.org and it worked flawlessly, optimizing
> the index in just over 2 hours. The problem is that my search cannot use
> the result: it either fails with "Unknown Format Version" errors or just
> plain finds nothing.
>
> I understand that a version of Lucene newer than the one the index was
> built and is searched with can cause problems.
>
> What can I do to make this work? I have tried older versions of Luke; 0.7
> was the oldest I could lay hands on, but even it uses a newer version of
> Lucene.
>
> My index version shows as 633103800023469045. After optimizing with
> Luke 0.7, the index is written with version 633103800023469057.
>
> Any help here would be awesome!
>
> Thank you,
>
> Hugh
>
> --
> View this message in context:
> http://www.nabble.com/Optimization-and-Corruption-Issues-tp25697034p25697034.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: Optimization and Corruption Issues

2009-10-01 Thread Mark Miller
2.0 predates Mike's fabulous indexing updates, which for one thing means
a single thread doing the merging rather than several. I'm sure it's
much slower overall.

But you can't take advantage of the newer, faster code without updating
Lucene in your app.

Your best bet is to put it on another machine, take a week, and put it
back - but you're still down for a brief period while swapping in the index.
So it seems just as good to update Lucene: swap in the update real quick
(after fixing your code offline), and then do the faster optimize. You
can still serve search requests during the 2 hour optimize, though
performance will be affected.


-- 
- Mark

http://www.lucidimagination.com







RE: Optimization and Corruption Issues

2009-10-01 Thread Uwe Schindler
The problem is that if you optimize the index with a newer Luke version, it
rewrites the index in a later Lucene file format. To read it with your current
app, you also have to update your application to at least the version of
Lucene that Luke uses.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de




Re: Optimization and Corruption Issues

2009-10-01 Thread Andrzej Bialecki

lowfreq wrote:
> I have a Lucene index that is very large in size.
> It was created using a pre 2.1 version of Lucene.net 2.0.0.4.
>
> The index is currently almost 20 GB, and has almost 7000 segment files.
> The problem I am having is that I need to optimize it, and can't do this
> without the search functionality of my app being down for a week.
>
> I used the Luke tool from getopt.org and it worked flawlessly, optimizing
> the index in just over 2 hours. Problem is that my search cannot use it, and
> the error states Unknown Format Version errors, or just plain nothing found.

You should be careful when using Lucene Java to modify Lucene.Net
indexes. I know for a fact that deflated data in Lucene Java is
incompatible with the deflater implementation in .Net, so it's easy to
create an incompatible index even when you use a supposedly compatible
version of Lucene Java. Perhaps versions around 2.0 still worked ok, but
no guarantees.

> I understand that versions of Lucene that are newer than what the index was
> built and is searched with can cause problems.
>
> What can I do to make this work? I have tried older versions of Luke, 0.7
> was the oldest I could lay hands on, but even it uses a newer version of
> Lucene.

Here are links to older versions of Luke:

http://www.getopt.org/luke/luke-0.1.zip
http://www.getopt.org/luke/luke-0.2.zip
http://www.getopt.org/luke/luke-0.3.zip
http://www.getopt.org/luke/luke-0.4.zip
http://www.getopt.org/luke/luke-0.5/luke-0.5.jar
http://www.getopt.org/luke/luke-0.5/luke-src-0.5.zip
http://www.getopt.org/luke/luke-0.6/lukeall-0.6.jar
http://www.getopt.org/luke/luke-0.6/luke-src-0.6.zip

> My index version shows as 633103800023469045. The version the index is
> written as after optimizing with Luke 0.7 is 633103800023469057.

This is just a timestamp, so it doesn't say what version of Lucene
created the index. If you open the index with Luke, the Overview tab
has a line that shows the index format version.
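
To illustrate the timestamp point: the magnitude of the two numbers is
consistent with a .NET DateTime.Ticks value (100-nanosecond units counted
from 0001-01-01). That reading is an assumption, not something stated in
this thread. Under it, a minimal Python sketch decodes the first number to
a date in late March 2007, and the gap of 12 between the two values looks
like a counter bumped on each change to the index rather than a format
identifier. The name ticks_to_datetime is made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: the "index version" is a .NET DateTime.Ticks value,
# i.e. 100-nanosecond intervals counted from 0001-01-01 00:00:00.
TICKS_PER_MICROSECOND = 10
DOTNET_EPOCH = datetime(1, 1, 1)


def ticks_to_datetime(ticks):
    """Interpret an integer as .NET ticks and return the matching datetime."""
    return DOTNET_EPOCH + timedelta(microseconds=ticks // TICKS_PER_MICROSECOND)


before = 633103800023469045  # version reported for the original index
after = 633103800023469057   # version reported after optimizing with Luke

print(ticks_to_datetime(before))  # 2007-03-25 00:40:02.346904
print(after - before)             # 12
```

Either way, the number cannot tell you which file format the index uses;
the format version line in Luke's Overview tab is the thing to check.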



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Re: Optimization and Corruption Issues

2009-10-01 Thread Earwin Burrfoot
> 2.0 is pre Mike's fabulous indexing updates - which just for one means
> one thread doing the merging rather than multiple. I'm sure overall it's
> much slower.

If you're doing a full optimize, you're still using a single thread. Am I wrong?


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785




Re: Optimization and Corruption Issues

2009-10-01 Thread Michael McCandless
On Thu, Oct 1, 2009 at 12:49 PM, Earwin Burrfoot wrote:

> If you're doing a full optimize, you're still using a single thread. Am I
> wrong?

That depends on how many merges are required and on the merge scheduler. In
this case (with 7000 segments, which is normally way too many!), assuming
ConcurrentMergeScheduler, multiple threads will be used, since many merges
will be pending.

When it gets down to the last (enormous) merge, it's only one thread.
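
A rough way to see this is to count the merges in each round of a cascade.
The sketch below is a simplified model, not Lucene's actual merge policy: it
assumes every round merges groups of up to merge_factor segments (10 is
used here, matching Lucene's default mergeFactor) and repeats until one
segment is left. The function name merge_rounds is hypothetical.

```python
def merge_rounds(segments, merge_factor):
    """Return how many merges happen in each round of a simplified cascade.

    Model: each round merges groups of up to `merge_factor` segments into
    one segment apiece, and rounds repeat until a single segment remains.
    """
    rounds = []
    while segments > 1:
        merges = -(-segments // merge_factor)  # ceiling division
        rounds.append(merges)
        segments = merges
    return rounds


print(merge_rounds(7000, 10))  # [700, 70, 7, 1]
```

In this model the early rounds offer hundreds of independent merges that a
concurrent scheduler could spread over threads, while the final round is a
single merge. It also shows the cost of cascading: each document is copied
once per round, so four times here.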

Mike




Re: Optimization and Corruption Issues

2009-10-01 Thread Earwin Burrfoot
>> If you're doing a full optimize, you're still using a single thread. Am I 
>> wrong?
>
> Depends on how many merges are required, and, the merge scheduler.  In
> this case (w/ 7000 segments, which is way too many, normally!),
> assuming ConcurrentMergeScheduler, multiple threads will be used since
> many merges will be pending.
>
> When it gets down to the last (enormous) merge, it's only one thread.

I'm speaking about a full optimize. Is there any way to do it more
efficiently than running a single, last (enormous) merge?
If you try to parallelize, you're merging some documents several times
(more work) and killing your disks, as merges are mostly IO-bound.





Re: Optimization and Corruption Issues

2009-10-01 Thread lowfreq

Thank you very much for the detailed information everyone!
I will try to use the information to make my code better.

I have split the optimization bits out into a command-line app that runs the
optimize on another box. It's messy, but effective in keeping downtime to a
minimum. This will get the large number of segment files under control for
now. Too bad it takes a week or more. Hopefully I will not have to reindex
anytime soon.

For the future, I think the best way around this is a transaction/agent-based
approach. That way, I can keep a read-only copy for searching.
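
One possible shape for the read-only copy idea is to build or optimize each
new copy in its own directory and then repoint a "live" symlink at it in a
single rename, so searchers always open a complete index. This is only a
sketch under several assumptions: a POSIX filesystem with symlinks (on
Windows the mechanism would differ), searchers that reopen via the link,
and made-up names such as publish_index and live_link. It does not touch
Lucene itself.

```python
import os


def publish_index(staging_dir, live_link):
    """Repoint the `live_link` symlink at a finished index directory.

    A temporary link is created first and then renamed over the live one,
    so readers see either the old target or the new one, never a half state.
    """
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(os.path.abspath(staging_dir), tmp_link)
    os.replace(tmp_link, live_link)  # rename replaces the link itself
```

The old directory is left in place, which keeps a searcher that still has
its files open working until it reopens on the new copy.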

My app currently uses two services, one for writes and one for reads.
I suspect that this may be what is causing the corruption.

Has anyone had experience with this type of setup, and seen or confirmed
that it can corrupt a Lucene index?

I have heard that having more than one service attached to the index at a
time causes the problem I am seeing.

Thanks for the links to the old Luke distros, and thanks for all the quick
responses!

Hugh



-- 
View this message in context: 
http://www.nabble.com/Optimization-and-Corruption-Issues-tp25697034p25705907.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.





Re: Optimization and Corruption Issues

2009-10-01 Thread Michael McCandless
On Thu, Oct 1, 2009 at 2:56 PM, Earwin Burrfoot wrote:
>>> If you're doing a full optimize, you're still using a single thread. Am I
>>> wrong?
>>
>> Depends on how many merges are required, and, the merge scheduler.  In
>> this case (w/ 7000 segments, which is way too many, normally!),
>> assuming ConcurrentMergeScheduler, multiple threads will be used since
>> many merges will be pending.
>>
>> When it gets down to the last (enormous) merge, it's only one thread.
> I'm speaking about full optimize. Is there any way to do it more
> efficiently than running a single, last (enormous) merge?

I guess we could merge different parts of the index w/ different
threads, if we wanted to push concurrency down into a single merge.

> If you try to parallelize, you're merging some documents several times
> (more work) and killing your disks, as merges are mostly IO-bound.

Actually, I've found merging of the postings to be CPU-bound. I think
the priority queue and the decoding/encoding of vInts are the big CPU costs.
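
To make those two costs concrete, here is a small Python sketch of the
general technique, not Lucene's actual merge code. It assumes a simplified
postings layout (doc ids stored as vInt-encoded gaps, no frequencies or
positions) and uses a heap as the priority queue that picks the next term
across segments. All function names and the segment tuple layout are
hypothetical. The vInt format itself follows the documented scheme: 7 data
bits per byte, low-order bits first, high bit set when more bytes follow.

```python
import heapq


def encode_vint(value):
    """Encode a non-negative int as a vInt (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("vInt encodes non-negative integers only")
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # high bit: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_postings(doc_ids):
    """Store a sorted doc-id list as vInt-encoded gaps between neighbours."""
    out, previous = bytearray(), 0
    for doc_id in doc_ids:
        out += encode_vint(doc_id - previous)
        previous = doc_id
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: vInt gaps back to absolute doc ids."""
    doc_ids, previous, current, shift = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += current
            doc_ids.append(previous)
            current, shift = 0, 0
    return doc_ids


def merge_segments(segments):
    """Merge segments given as (doc_count, [(term, encoded_postings), ...]).

    Terms inside each segment must be sorted. Doc ids are renumbered by
    adding each segment's base (the doc counts of the segments before it).
    """
    bases, total = [], 0
    for doc_count, _ in segments:
        bases.append(total)
        total += doc_count
    # Heap entries are (term, segment index, position in that term list).
    heap = [(terms[0][0], i, 0) for i, (_, terms) in enumerate(segments) if terms]
    heapq.heapify(heap)
    merged = []
    while heap:
        term, seg, pos = heapq.heappop(heap)
        terms = segments[seg][1]
        doc_ids = [bases[seg] + d for d in decode_postings(terms[pos][1])]
        if merged and merged[-1][0] == term:
            merged[-1][1].extend(doc_ids)  # same term from a later segment
        else:
            merged.append((term, doc_ids))
        if pos + 1 < len(terms):
            heapq.heappush(heap, (terms[pos + 1][0], seg, pos + 1))
    return total, [(term, encode_postings(ids)) for term, ids in merged]
```

Every posting passes through a decode, an add, and an encode, and every
term through a heap pop and push, which is where the CPU time goes in this
toy version.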

Mike




RE: Optimization and Corruption Issues

2009-10-28 Thread George Aroush
Sorry, I'm just catching up with my mailing list inbox, ...

Andrzej Bialecki wrote: 
> > 
> > I used the Luke tool from getopt.org and it worked flawlessly, optimizing
> > the index in just over 2 hours. Problem is that my search cannot use it, and
> > the error states Unknown Format Version errors, or just plain nothing 
> > found. 
>
> You should be careful when using Lucene Java to modify Lucene.Net 
> indexes. I know for a fact that deflated data in Lucene Java is 
> incompatible with the deflater implementation in .Net, so it's easy to 
> create an incompatible index even when you use a supposedly compatible 
> version of Lucene Java. Perhaps versions around 2.0 still worked ok, but 
> no guarantees.

Can you please elaborate on this? There was a known issue with pre-2.0.0
Lucene.Net where, in some instances, the index was not compatible with Java
Lucene (sorry, I can't find the JIRA issue, but search for
"PRE_LUCENE_NET_2_0_0_COMPATIBLE" in Lucene.Net's code base for details).

Other than that, there should NOT be any issues using Java or .NET Lucene to
read / write / optimize the index. The same warnings that apply to Java Lucene
when moving from version to version also apply to Lucene.Net. This is a test
case that I always run as part of a port. Also, a while back (and I think
it's still in production), I helped write a solution in which the index is
accessed concurrently by both Java and .NET Lucene 2.1.

If you are aware of issues, please bring them to the Lucene.Net mailing list
for discussion.

Thanks.

-- George

