TooManyClauses by wildcard queries

2009-09-10 Thread Patricio Galeas

Hi all,

I get the TooManyClauses exception for some wildcard queries like:
(a) de*
(b) country AND de*
(c) ma?s* AND de*

I'm not sure how to apply the solution proposed in LuceneFAQ for the 
case of WildcardQueries like the examples above.


Can you confirm whether the following is the right procedure?

1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery.
2. Break up the query to identify the wildcard query part.
3. Create a custom Filter for the wildcard query
4. Create the final query using the custom filter.

If item 2 is right, can you suggest an optimal way to do that?
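
To make item 1 concrete, this is roughly what I have in mind - a minimal
sketch against the 2.4 API (the class name is just made up, and the
wildcard case below is only a placeholder for the custom Filter of step 3,
e.g. one that enumerates matching terms with WildcardTermEnum into an
OpenBitSet and is wrapped in a ConstantScoreQuery):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.PrefixFilter;
import org.apache.lucene.search.Query;

class ConstantScoreQueryParser extends QueryParser {
    ConstantScoreQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
    }

    // Trailing-* patterns like "de*" are routed here by QueryParser.
    // PrefixFilter never expands terms, so TooManyClauses cannot happen.
    // (Unlike the default, this skips the lowercase-expanded-terms handling.)
    protected Query getPrefixQuery(String field, String termStr) throws ParseException {
        return new ConstantScoreQuery(new PrefixFilter(new Term(field, termStr)));
    }

    // General patterns like "ma?s*" arrive here and still need the custom
    // Filter of step 3; the default expansion is kept as a placeholder.
    protected Query getWildcardQuery(String field, String termStr) throws ParseException {
        return super.getWildcardQuery(field, termStr);
    }
}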

Thank you
Patricio




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexReader.isCurrent for cached indexes

2009-09-10 Thread Ian Lea
isCurrent() will only return false if there have been committed changes
to the index.  Maybe for some reason your index update job hasn't
committed or closed the index.

Probably not relevant to this problem, but your reopen code snippet
doesn't close the old reader.  It should.  See the javadocs.
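
For reference, a sketch of that pattern (2.4 API; the helper wrapper is just
illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

class ReaderRefresher {
    // reopen() returns the same instance when nothing has been committed,
    // so only close the old reader and swap searchers when a new one comes back.
    static IndexSearcher refresh(IndexSearcher searcher) throws IOException {
        IndexReader oldReader = searcher.getIndexReader();
        IndexReader newReader = oldReader.reopen();
        if (newReader == oldReader) {
            return searcher;            // index unchanged, keep the cached searcher
        }
        oldReader.close();              // release the stale reader
        return new IndexSearcher(newReader);
    }
}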

What version of lucene are you running?



--
Ian.


On Wed, Sep 9, 2009 at 10:33 PM, Nick Bailey
 wrote:
> Looking for some help figuring out a problem with the IndexReader.isCurrent() 
> method and cached indexes.
>
> We have a number of lucene indexes that we attempt to keep in memory after an 
> initial query is performed.  In order to prevent the indexes from becoming 
> stale, we check for changes about every minute by calling isCurrent().  If 
> the index has changed, we will then reopen it.
>
> From our logs it appears that in some cases isCurrent() will return true even 
> though the index has changed since the last time the reader was opened.
>
> The code to refresh the index is basically this:
>
> // Checked every minute
> if(!reader.isCurrent()){
>   // reopen the existing reader
>   reader = this.searcher.getIndexReader();
>   reader = reader.reopen();
> }
>
> This is an example of the problem from the logs:
>
> 2009-08-29 17:50:51,387 Indexed 0 documents and deleted 1 documents from 
> index 'example' in 0 ms
> 2009-08-30 03:11:58,410 Indexed 0 documents and deleted 5 documents from 
> index 'example' in 0 ms
> 2009-08-30 16:30:03,466 Using cached reader  lastRefresh=81415526>
> // numbers indicate milliseconds since opened or refreshed aka age = 24.6hrs, 
> lastRefresh = 22.6hrs
>
> The logs indicate we deleted documents from the index at about 5:50 on August 
> 29th, and then again on the 30th at 3:11.  Then at 4:30 that afternoon we attempted to 
> query the index.  We found the cached reader and used it, however, the last 
> time the cache was refreshed was about 22 hours previously, coinciding with 
> the first delete.  The index should have been reopened after the second 
> delete.
>
> I have checked, and the code to refresh the indexes is definitely being run 
> every 60 seconds.  All I can see is that the problem might be with the 
> isCurrent() method.
>
> Could it be due to holding the reader open for so long? Any other ideas?
>
> Thanks a lot,
> Nick Bailey
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Michael McCandless
First, you need to limit the size of segments initially created by
IndexWriter due to newly added documents.  Probably the simplest way
is to call IndexWriter.commit() frequently enough.  You might want to
use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
consumed by IndexWriter's buffer to determine when to commit.  But it
won't be an exact science, ie, the segment size will be different from
the RAM buffer size.  So, experiment w/ it...

Second, you need to prevent merging from creating a segment that's too
large.  For this I would use the setMaxMergeMB method of the
LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
But note that this max size applies to the *input* segments, so you'd
roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
factor = 10), but probably make it smaller to be sure things stay
small enough.

Note that with this approach, if your index is large enough, you'll
wind up with many segments and search performance will suffer when
compared to an index that doesn't have this max 10.0 MB file size
restriction.
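
Something along these lines (a rough sketch against the 2.4 API; the 5 MB
commit threshold and the 1.0 MB max-merge size are just illustrative
starting points, not tuned values):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.FSDirectory;

class SmallSegmentIndexer {
    static void index(String path, Iterable<Document> docs) throws IOException {
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(path),
                new StandardAnalyzer(), IndexWriter.MaxFieldLength.LIMITED);

        LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
        policy.setMaxMergeMB(1.0);          // cap the *input* segments fed to a merge
        writer.setMergePolicy(policy);

        for (Document doc : docs) {
            writer.addDocument(doc);
            // Flush before the in-RAM segment grows too large; the resulting
            // on-disk segment size won't match this exactly, so experiment.
            if (writer.ramSizeInBytes() > 5 * 1024 * 1024) {
                writer.commit();
            }
        }
        writer.close();
    }
}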

Mike

On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:
>
> Hello again,
>
> Can someone please comment on whether what I'm looking for is possible or
> not?
>
>
> Dvora wrote:
>>
>> Hello,
>>
>> I'm using Lucene 2.4. I'm developing a web application that uses Lucene
>> (via Compass) to do the searches.
>> I'm intending to deploy the application in Google App Engine
>> (http://code.google.com/appengine/), which limits file sizes to be
>> smaller than 10MB. I've read about the various policies supported by
>> Lucene to limit the file sizes, but no matter which policy and
>> parameters I used, the index files still grew to a lot more than 10MB.
>> Looking at the code, I've managed to limit the cfs files (predicting the
>> file size in CompoundFileWriter before closing the file) - I guess that
>> will degrade performance, but it's OK for now. But now the FDT files are
>> becoming huge (about 60MB) and I can't identify a way to limit those
>> files.
>>
>> Is there some built-in and correct way to limit these files' length? If
>> not, can someone please tell me how I should tweak the source code to
>> achieve that?
>>
>> Thanks for any help.
>>
>
> --
> View this message in context: 
> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: TooManyClauses by wildcard queries

2009-09-10 Thread Uwe Schindler
Or use Lucene 2.9, it automatically uses constant score mode in wild card
queries, if needed.
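
If you ever want to force the constant-score filter mode explicitly rather
than rely on the automatic mode, a sketch against the 2.9 API looks like this:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

class Wildcard29 {
    static Query build(String field, String pattern) {
        WildcardQuery wq = new WildcardQuery(new Term(field, pattern));
        // Rewrites to a constant-score filter, never a BooleanQuery,
        // so TooManyClauses cannot be thrown.
        wq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
        return wq;
    }
}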

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Patricio Galeas [mailto:gal...@prometa.de]
> Sent: Thursday, September 10, 2009 10:41 AM
> To: java-user@lucene.apache.org
> Subject: TooManyClauses by wildcard queries
> 
> Hi all,
> 
> I get the TooManyClauses exception by some wildcard queries like :
> (a) de*
> (b) country AND de*
> (c) ma?s* AND de*
> 
> I'm not sure how to apply the solution proposed in LuceneFAQ for the
> case of WildcardQueries like the examples above.
> 
> Can you confirm if it is the right procedure?
> 
> 1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery.
> 2. Break up the query to identify the wildcard query part.
> 3. Create a custom Filter for the wildcard query
> 4. Create the final query using the custom filter.
> 
> If the item 2. is right, can you suggest me an optimal way to do that?
> 
> Thank you
> Patricio
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: New "Stream closed" exception with Java 6

2009-09-10 Thread Chris Bamford
Hi Hoss,

I have been thinking more about what you said (below) - could you please expand 
on the indented part of this sentence:

"it's possibly you just have a simple bug where you are  closing the reader 
before you pass it to Lucene, 

  or maybe you are mistakenly adding the same field twice

(or in two different documents)"

Are you saying that if I were attempting to delete a doc and then add it again 
(e.g. update), but for some reason the delete didn't work, I would get a 
"Stream closed" exception?

Thanks 

- Chris

- Original Message -
From: Chris Hostetter 
Sent: Tue, 8/9/2009 7:57pm
To: java-user@lucene.apache.org
Subject: RE: New "Stream closed" exception with Java 6


: I'm coming to the same conclusion - there must be >1 threads accessing this 
index at the same time.  Better go figure it out  ...  :-)

careful about your assumptions ... you could get this same type of 
exception even with only one thread, the stream that's being closed isn't 
internal to Lucene, it's the InputStreamReader you supplied as the value 
of some Field.  it's possibly you just have a simple bug where you are 
closing the reader before you pass it to Lucene, or maybe you are 
mistakenly adding the same field twice (or in two different documents)


-Hoss


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Dvora

Hi,

Thanks a lot for that, I will perform the experiments and publish the results.
I'm aware of the risk of performance degradation, but for the pilot I'm
trying to run I think it's acceptable.

Thanks again!



Michael McCandless-2 wrote:
> 
> First, you need to limit the size of segments initially created by
> IndexWriter due to newly added documents.  Probably the simplest way
> is to call IndexWriter.commit() frequently enough.  You might want to
> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
> consumed by IndexWriter's buffer to determine when to commit.  But it
> won't be an exact science, ie, the segment size will be different from
> the RAM buffer size.  So, experiment w/ it...
> 
> Second, you need to prevent merging from creating a segment that's too
> large.  For this I would use the setMaxMergeMB method of the
> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
> But note that this max size applies to the *input* segments, so you'd
> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
> factor = 10), but probably make it smaller to be sure things stay
> small enough.
> 
> Note that with this approach, if your index is large enough, you'll
> wind up with many segments and search performance will suffer when
> compared to an index that doesn't have this max 10.0 MB file size
> restriction.
> 
> Mike
> 
> On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:
>>
>> Hello again,
>>
>> Can someone please comment on that, whether what I'm looking is possible
>> or
>> not?
>>
>>
>> Dvora wrote:
>>>
>>> Hello,
>>>
>>> I'm using Lucene2.4. I'm developing a web application that using Lucene
>>> (via compass) to do the searches.
>>> I'm intending to deploy the application in Google App Engine
>>> (http://code.google.com/appengine/), which limits files length to be
>>> smaller than 10MB. I've read about the various policies supported by
>>> Lucene to limit the file sizes, but on matter which policy I used and
>>> which parameters, the index files still grew to be lot more the 10MB.
>>> Looking at the code, I've managed to limit the cfs files (predicting the
>>> file size in CompoundFileWriter before closing the file) - I guess that
>>> will degrade performance, but it's OK for now. But now the FDT files are
>>> becoming huge (about 60MB) and I cant identifiy a way to limit those
>>> files.
>>>
>>> Is there some built-in and correct way to limit these files length? If
>>> no,
>>> can someone direct me please how should I tweak the source code to
>>> achieve
>>> that?
>>>
>>> Thanks for any help.
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Michael McCandless
You're welcome!

Another, bottom-up option would be to make a custom Directory impl
that simply splits up files above a certain size.  That'd be more
generic and more reliable...

Mike

On Thu, Sep 10, 2009 at 5:26 AM, Dvora  wrote:
>
> Hi,
>
> Thanks a lot for that, will peforms the experiments and publish the results.
> I'm aware to the risk of peformance degredation, but for the pilot I'm
> trying to run I think it's acceptable.
>
> Thanks again!
>
>
>
> Michael McCandless-2 wrote:
>>
>> First, you need to limit the size of segments initially created by
>> IndexWriter due to newly added documents.  Probably the simplest way
>> is to call IndexWriter.commit() frequently enough.  You might want to
>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
>> consumed by IndexWriter's buffer to determine when to commit.  But it
>> won't be an exact science, ie, the segment size will be different from
>> the RAM buffer size.  So, experiment w/ it...
>>
>> Second, you need to prevent merging from creating a segment that's too
>> large.  For this I would use the setMaxMergeMB method of the
>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
>> But note that this max size applies to the *input* segments, so you'd
>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
>> factor = 10), but probably make it smaller to be sure things stay
>> small enough.
>>
>> Note that with this approach, if your index is large enough, you'll
>> wind up with many segments and search performance will suffer when
>> compared to an index that doesn't have this max 10.0 MB file size
>> restriction.
>>
>> Mike
>>
>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:
>>>
>>> Hello again,
>>>
>>> Can someone please comment on that, whether what I'm looking is possible
>>> or
>>> not?
>>>
>>>
>>> Dvora wrote:

 Hello,

 I'm using Lucene2.4. I'm developing a web application that using Lucene
 (via compass) to do the searches.
 I'm intending to deploy the application in Google App Engine
 (http://code.google.com/appengine/), which limits files length to be
 smaller than 10MB. I've read about the various policies supported by
 Lucene to limit the file sizes, but on matter which policy I used and
 which parameters, the index files still grew to be lot more the 10MB.
 Looking at the code, I've managed to limit the cfs files (predicting the
 file size in CompoundFileWriter before closing the file) - I guess that
 will degrade performance, but it's OK for now. But now the FDT files are
 becoming huge (about 60MB) and I cant identifiy a way to limit those
 files.

 Is there some built-in and correct way to limit these files length? If
 no,
 can someone direct me please how should I tweak the source code to
 achieve
 that?

 Thanks for any help.

>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Problem in lucene query

2009-09-10 Thread vibhuti
Hello 

 

I am new to Lucene and facing a problem while performing searches. I am
using lucene 2.2.0.

My application indexes documents with a "keyword" field which contains integer
values. If the value is negative, the query does not return correct results.

 

Following is my lucene query:

 

(keyword: \-1)

 

I also tried: 

(keyword: "-1")

 

 

But none of them returns correct results. It seems that Lucene ignores '-'.
My purpose is to search documents with index value "-1".

 

Any ideas??

 

Thanks



Re: support for PayloadTermQuery in MoreLikeThis

2009-09-10 Thread Grant Ingersoll


On Sep 9, 2009, at 4:39 PM, Bill Au wrote:


Has anyone done anything regarding the support of PayloadTermQuery in
MoreLikeThis?


Not yet!  Sounds interesting



I took a quick look at the code and it seems to be simply a matter of
swapping TermQuery with PayloadTermQuery.  I guess a generic solution would
be to add an enable method to enable PayloadTermQuery, keeping TermQuery as
the default for backwards compatibility.  The call signature of the same
enable method would also include the PayloadFunction to use for the
PayloadTermQuery.

Any comments/thoughts?



Hmm, this could work, but I think we should try to be generic if we  
can and have it be overridable.  Today's PTQ is likely to segue to  
tomorrow's AttributeTermQuery and I wouldn't want to preclude them.


-Grant

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Dvora

Hi again,

Can you add some details and guidelines on how to implement that? Different
file types have different structures; is such splitting doable without
knowing Lucene internals?


Michael McCandless-2 wrote:
> 
> You're welcome!
> 
> Another, bottoms-up option would be to make a custom Directory impl
> that simply splits up files above a certain size.  That'd be more
> generic and more reliable...
> 
> Mike
> 
> On Thu, Sep 10, 2009 at 5:26 AM, Dvora  wrote:
>>
>> Hi,
>>
>> Thanks a lot for that, will peforms the experiments and publish the
>> results.
>> I'm aware to the risk of peformance degredation, but for the pilot I'm
>> trying to run I think it's acceptable.
>>
>> Thanks again!
>>
>>
>>
>> Michael McCandless-2 wrote:
>>>
>>> First, you need to limit the size of segments initially created by
>>> IndexWriter due to newly added documents.  Probably the simplest way
>>> is to call IndexWriter.commit() frequently enough.  You might want to
>>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
>>> consumed by IndexWriter's buffer to determine when to commit.  But it
>>> won't be an exact science, ie, the segment size will be different from
>>> the RAM buffer size.  So, experiment w/ it...
>>>
>>> Second, you need to prevent merging from creating a segment that's too
>>> large.  For this I would use the setMaxMergeMB method of the
>>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
>>> But note that this max size applies to the *input* segments, so you'd
>>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
>>> factor = 10), but probably make it smaller to be sure things stay
>>> small enough.
>>>
>>> Note that with this approach, if your index is large enough, you'll
>>> wind up with many segments and search performance will suffer when
>>> compared to an index that doesn't have this max 10.0 MB file size
>>> restriction.
>>>
>>> Mike
>>>
>>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:

 Hello again,

 Can someone please comment on that, whether what I'm looking is
 possible
 or
 not?


 Dvora wrote:
>
> Hello,
>
> I'm using Lucene2.4. I'm developing a web application that using
> Lucene
> (via compass) to do the searches.
> I'm intending to deploy the application in Google App Engine
> (http://code.google.com/appengine/), which limits files length to be
> smaller than 10MB. I've read about the various policies supported by
> Lucene to limit the file sizes, but on matter which policy I used and
> which parameters, the index files still grew to be lot more the 10MB.
> Looking at the code, I've managed to limit the cfs files (predicting
> the
> file size in CompoundFileWriter before closing the file) - I guess
> that
> will degrade performance, but it's OK for now. But now the FDT files
> are
> becoming huge (about 60MB) and I cant identifiy a way to limit those
> files.
>
> Is there some built-in and correct way to limit these files length? If
> no,
> can someone direct me please how should I tweak the source code to
> achieve
> that?
>
> Thanks for any help.
>

 --
 View this message in context:
 http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25378056.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25380052.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25381489.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to avoid huge index files

2009-09-10 Thread Uwe Schindler
The idea is just to put a layer on top of the abstract file system functions
supplied by Directory. Whenever somebody wants to create a file and write
data to it, the methods create more than one file and switch to another file
e.g. after 10 megabytes. E.g. look into MMapDirectory, which uses mmap to map
files into address space. Because MappedByteBuffer only supports 32-bit
offsets, different mappings are created for the same file (the file is split
up into parts of 2 gigabytes). You could use similar code here and just use
another file if somebody seeks or writes above the 10 MiB limit. Just
"virtualize" the files.
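
As a rough illustration of the bookkeeping such "virtualized" files need (the
10 MiB chunk size and the ".0"/".1" naming scheme here are just assumptions
for illustration, not anything Lucene defines):

class ChunkedFileLayout {
    static final long CHUNK_SIZE = 10L * 1024 * 1024;   // 10 MiB per physical file

    // Which physical chunk a logical file position falls into.
    static int chunkIndex(long logicalPos) {
        return (int) (logicalPos / CHUNK_SIZE);
    }

    // Offset of that position inside its chunk.
    static long offsetInChunk(long logicalPos) {
        return logicalPos % CHUNK_SIZE;
    }

    // Name of the physical file backing that position, e.g. "_0.fdt.3".
    static String chunkName(String logicalName, long logicalPos) {
        return logicalName + "." + chunkIndex(logicalPos);
    }
}

A seek above the limit then just means opening chunkName(...) and seeking to
offsetInChunk(...); a write that crosses the boundary rolls over to the next
chunk.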

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> From: Dvora [mailto:barak.ya...@gmail.com]
> Sent: Thursday, September 10, 2009 1:23 PM
> To: java-user@lucene.apache.org
> Subject: Re: How to avoid huge index files
> 
> 
> Hi again,
> 
> Can you add some details and guidelines how to implement that? Different
> files types have different structure, is such spliting doable without
> knowing Lucene internals?
> 
> 
> Michael McCandless-2 wrote:
> >
> > You're welcome!
> >
> > Another, bottoms-up option would be to make a custom Directory impl
> > that simply splits up files above a certain size.  That'd be more
> > generic and more reliable...
> >
> > Mike
> >
> > On Thu, Sep 10, 2009 at 5:26 AM, Dvora  wrote:
> >>
> >> Hi,
> >>
> >> Thanks a lot for that, will peforms the experiments and publish the
> >> results.
> >> I'm aware to the risk of peformance degredation, but for the pilot I'm
> >> trying to run I think it's acceptable.
> >>
> >> Thanks again!
> >>
> >>
> >>
> >> Michael McCandless-2 wrote:
> >>>
> >>> First, you need to limit the size of segments initially created by
> >>> IndexWriter due to newly added documents.  Probably the simplest way
> >>> is to call IndexWriter.commit() frequently enough.  You might want to
> >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
> >>> consumed by IndexWriter's buffer to determine when to commit.  But it
> >>> won't be an exact science, ie, the segment size will be different from
> >>> the RAM buffer size.  So, experiment w/ it...
> >>>
> >>> Second, you need to prevent merging from creating a segment that's too
> >>> large.  For this I would use the setMaxMergeMB method of the
> >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
> >>> But note that this max size applies to the *input* segments, so you'd
> >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
> >>> factor = 10), but probably make it smaller to be sure things stay
> >>> small enough.
> >>>
> >>> Note that with this approach, if your index is large enough, you'll
> >>> wind up with many segments and search performance will suffer when
> >>> compared to an index that doesn't have this max 10.0 MB file size
> >>> restriction.
> >>>
> >>> Mike
> >>>
> >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:
> 
>  Hello again,
> 
>  Can someone please comment on that, whether what I'm looking is
>  possible
>  or
>  not?
> 
> 
>  Dvora wrote:
> >
> > Hello,
> >
> > I'm using Lucene2.4. I'm developing a web application that using
> > Lucene
> > (via compass) to do the searches.
> > I'm intending to deploy the application in Google App Engine
> > (http://code.google.com/appengine/), which limits files length to be
> > smaller than 10MB. I've read about the various policies supported by
> > Lucene to limit the file sizes, but on matter which policy I used
> and
> > which parameters, the index files still grew to be lot more the
> 10MB.
> > Looking at the code, I've managed to limit the cfs files (predicting
> > the
> > file size in CompoundFileWriter before closing the file) - I guess
> > that
> > will degrade performance, but it's OK for now. But now the FDT files
> > are
> > becoming huge (about 60MB) and I cant identifiy a way to limit those
> > files.
> >
> > Is there some built-in and correct way to limit these files length?
> If
> > no,
> > can someone direct me please how should I tweak the source code to
> > achieve
> > that?
> >
> > Thanks for any help.
> >
> 
>  --
>  View this message in context:
>  http://www.nabble.com/How-to-avoid-huge-index-files-
> tp25347505p25378056.html
>  Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
>  -
>  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>  For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> >>>
> >>> -
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java

September 2009 Hadoop/Lucene/Solr/UIMA/katta/Mahout Get Together Berlin

2009-09-10 Thread Uwe Schindler
Hi,

I cross-post this here, Isabel Drost is managing the meetup. This time it is
more about Hadoop, but there is also a talk about the new Lucene 2.9 release
(presented by me). As far as I know, Simon Willnauer will also be there:

---
I would like to announce the September-2009 Hadoop Get Together in
newthinking store Berlin.

When: 29. September 2009 at 5:00pm
Where: newthinking store, Tucholskystr. 48, Berlin, Germany

As always there will be slots of 20min each for talks on your Hadoop topic.
After each talk there will be a lot of time to discuss. You can order drinks
directly at the bar in the newthinking store. If you like, you can order
pizza. There are quite a few good restaurants nearby, so we can go there
after the official part.

Talks scheduled so far:
Thorsten Schuett, Solving Puzzles with MapReduce: MapReduce is most often
used for data mining and filtering large datasets. In this talk we will show
that it is also useful for a completely different problem domain: solving
puzzles. Based on MapReduce, we can implement massively parallel
breadth-first and heuristic search. MapReduce will take care of the hard
problems, like parallelization, disk and error handling, while we can
concentrate on the puzzle. Throughout the talk we will use the sliding
puzzle (http://en.wikipedia.org/wiki/Sliding_puzzle) as our example.

Thilo Götz, Text analytics on jaql: Jaql (JSON query language) is a query
language for Javascript Object Notation that runs on top of Apache Hadoop.
It was primarily designed for large scale analysis of semi-structured data.
I will give an introduction to jaql and describe our experiences using it
for text analytics tasks. Jaql is open source and available from
http://code.google.com/p/jaql.

Uwe Schindler, Lucene 2.9 Developments: Numeric Search, Per-Segment- and
Near-Real-Time Search, new TokenStream API: Uwe Schindler presents some new
additions to Lucene 2.9. In the first half he will talk about fast numerical
and date range queries (NumericRangeQuery, formerly TrieRangeQuery) and
their usage in geospatial search applications like the Publishing Network
for Geoscientific & Environmental Data (PANGAEA). In the second half of his
talk, Uwe will highlight various improvements to the internal search
implementation for near-real-time search. Finally, he will present the new
TokenStream API, based on AttributeSource/Attributes that make indexing more
pluggable. Future
developments in the Flexible Indexing Area will make use of it. Uwe will
show a Tokenizer that uses custom attributes to index XML files into various
document fields based on XML element names as a possible use-case.

We would like to invite you, the visitor, to also tell your Hadoop story; if
you like, you can bring slides - there will be a beamer.

A big Thanks goes to the newthinking store for providing a room in the
center of Berlin for us.

See the Upcoming page: http://upcoming.yahoo.com/event/4314020/

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: September 2009 Hadoop/Lucene/Solr/UIMA/katta/Mahout Get Together Berlin

2009-09-10 Thread Uwe Schindler
Hi again,

By the way, if any of the other involved developers want to provide me with
some PPT slides about the other new features in Lucene 2.9 (NRT, future
Flexible Indexing), I would be happy!

Uwe

> Uwe Schindler, Lucene 2.9 Developments: Numeric Search, Per-Segment- and
> Near-Real-Time Search, new TokenStream API: Uwe Schindler presents some
> new
> additions to Lucene 2.9. In the first half he will talk about fast
> numerical
> and date range queries (NumericRangeQuery, formerly TrieRangeQuery) and
> their usage in geospatial search applications like the Publishing Network
> for Geoscientific & Environmental Data (PANGAEA). In the second half of
> his
> talk, Uwe will highlight various improvements to the internal search
> implementation for near-real-time search. Finally, he will present the new
> TokenStream API, based on AttributeSource/Attributes that make indexing
> more
> pluggable. Future
> developments in the Flexible Indexing Area will make use of it. Uwe will
> show a Tokenizer that uses custom attributes to index XML files into
> various
> document fields based on XML element names as a possible use-case.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: TooManyClauses by wildcard queries

2009-09-10 Thread Patricio Galeas

Hi Uwe

But if I don't use Lucene 2.9, is this procedure (items 1-4) the right
way to avoid the TooManyClauses exception? Or is there a more efficient
procedure to do that?

Thanks
Patricio

Uwe Schindler wrote:

Or use Lucene 2.9, it automatically uses constant score mode in wild card
queries, if needed.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

  

-Original Message-
From: Patricio Galeas [mailto:gal...@prometa.de]
Sent: Thursday, September 10, 2009 10:41 AM
To: java-user@lucene.apache.org
Subject: TooManyClauses by wildcard queries

Hi all,

I get the TooManyClauses exception by some wildcard queries like :
(a) de*
(b) country AND de*
(c) ma?s* AND de*

I'm not sure how to apply the solution proposed in LuceneFAQ for the
case of WildcardQueries like the examples above.

Can you confirm if it is the right procedure?

1. Override QueryParser.getWildcardQuery() to return a ConstantScoreQuery.
2. Break up the query to identify the wildcard query part.
3. Create a custom Filter for the wildcard query
4. Create the final query using the custom filter.

If the item 2. is right, can you suggest me an optimal way to do that?

Thank you
Patricio




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

  



--
P a t r i c i o   G a l e a s
ProMeta Team
--
I n f o t r a X  G m b H

Fon  +49 (0)271 30 30 888
Fax  +49 (0)271 74124-77
Mob  +49 (0)177 2962611

Adresse:
Friedrichstraße 81
D-57072 Siegen 


Geschäftsführerin
Dipl.-Wi.-Inf. Stephanie Sarach 


Handelsregister
HRB 8877 Amtsgericht Siegen 


http://www.prometa.de
http://www.infotrax.de
--


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to avoid huge index files

2009-09-10 Thread Dvora

Me again :-)

I'm looking at the code of FSDirectory and MMapDirectory, and found it
somewhat difficult to understand how I should subclass FSDirectory and
adjust it to my needs. If I understand correctly, MMapDirectory overrides the
openInput() method and returns a MultiMMapIndexInput if the file size exceeds
the threshold. What I don't understand is how the new implementation should
keep track of the generated files (or shouldn't it?..), so that when
searching, Lucene will know in which file to search - I'm confused :-)

Can I bother you to supply some kind of pseudocode illustrating how the
implementation should look?

Thanks again for your huge help!


Uwe Schindler wrote:
> 
> The idea is just to put a layer on top of the abstract file system
> function
> supplied by directory. Whenever somebody wants to create a file and write
> data to it, the methods create more than one file and switch e.g. after 10
> Megabytes to another file. E.g. look into MMapDirectory that uses MMap to
> map files into address space. Because MappedByteBuffer only supports 32
> bit
> offsets, there will be created different mappings for the same file (the
> file is splitted up into parts of 2 Gigabytes). You could use similar code
> here and just use another file, if somebody seeks or writes above the 10
> MiB
> limit. Just "virtualize" the files.
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
>> From: Dvora [mailto:barak.ya...@gmail.com]
>> Sent: Thursday, September 10, 2009 1:23 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: How to avoid huge index files
>> 
>> 
>> Hi again,
>> 
>> Can you add some details and guidelines how to implement that? Different
>> files types have different structure, is such spliting doable without
>> knowing Lucene internals?
>> 
>> 
>> Michael McCandless-2 wrote:
>> >
>> > You're welcome!
>> >
>> > Another, bottoms-up option would be to make a custom Directory impl
>> > that simply splits up files above a certain size.  That'd be more
>> > generic and more reliable...
>> >
>> > Mike
>> >
>> > On Thu, Sep 10, 2009 at 5:26 AM, Dvora  wrote:
>> >>
>> >> Hi,
>> >>
>> >> Thanks a lot for that, will peforms the experiments and publish the
>> >> results.
>> >> I'm aware to the risk of peformance degredation, but for the pilot I'm
>> >> trying to run I think it's acceptable.
>> >>
>> >> Thanks again!
>> >>
>> >>
>> >>
>> >> Michael McCandless-2 wrote:
>> >>>
>> >>> First, you need to limit the size of segments initially created by
>> >>> IndexWriter due to newly added documents.  Probably the simplest way
>> >>> is to call IndexWriter.commit() frequently enough.  You might want to
>> >>> use IndexWriter.ramSizeInBytes() to gauge how much RAM is currently
>> >>> consumed by IndexWriter's buffer to determine when to commit.  But it
>> >>> won't be an exact science, ie, the segment size will be different
>> from
>> >>> the RAM buffer size.  So, experiment w/ it...
>> >>>
>> >>> Second, you need to prevent merging from creating a segment that's
>> too
>> >>> large.  For this I would use the setMaxMergeMB method of the
>> >>> LogByteSizeMergePolicy (which is IndexWriter's default merge policy).
>> >>> But note that this max size applies to the *input* segments, so you'd
>> >>> roughly want that to be 1.0 MB (your 10.0 MB divided by the merge
>> >>> factor = 10), but probably make it smaller to be sure things stay
>> >>> small enough.
>> >>>
>> >>> Note that with this approach, if your index is large enough, you'll
>> >>> wind up with many segments and search performance will suffer when
>> >>> compared to an index that doesn't have this max 10.0 MB file size
>> >>> restriction.
>> >>>
>> >>> Mike
>> >>>
>> >>> On Thu, Sep 10, 2009 at 2:32 AM, Dvora  wrote:
>> 
>>  Hello again,
>> 
>>  Can someone please comment on that, whether what I'm looking is
>>  possible
>>  or
>>  not?
>> 
>> 
>>  Dvora wrote:
>> >
>> > Hello,
>> >
>> > I'm using Lucene2.4. I'm developing a web application that using
>> > Lucene
>> > (via compass) to do the searches.
>> > I'm intending to deploy the application in Google App Engine
>> > (http://code.google.com/appengine/), which limits files length to
>> be
>> > smaller than 10MB. I've read about the various policies supported
>> by
>> > Lucene to limit the file sizes, but on matter which policy I used
>> and
>> > which parameters, the index files still grew to be lot more the
>> 10MB.
>> > Looking at the code, I've managed to limit the cfs files
>> (predicting
>> > the
>> > file size in CompoundFileWriter before closing the file) - I guess
>> > that
>> > will degrade performance, but it's OK for now. But now the FDT
>> files
>> > are
>> > becoming huge (about 60MB) and I cant identifiy a way to limit
>> those
>> > files.
>> >
>> > Is there some built-in and correct way to limit

Re: Problem in lucene query

2009-09-10 Thread AHMET ARSLAN
> I am new to Lucene and facing a problem while performing
> searches. I am using lucene 2.2.0.
> 
> My application indexes documents on "keyword" field which
> contains integer values. 

Which analyzer/tokenizer are you using on that field? I am assuming it is a 
tokenized field.

>If the value is negative the query does not return
> correct results.

Is it returning 1's as well as -1's?

'-' is a special character, so you have to escape it when querying.
So keyword:\-1 is correct. But the problem is that StandardTokenizer tokenizes
-1 to 1. If you use it, all -1's and 1's are treated the same. Use
WhitespaceAnalyzer instead.
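
For example (a minimal sketch; the same wrapper must be used at both index
and query time):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

class KeywordFieldAnalyzer {
    // StandardAnalyzer for ordinary text, WhitespaceAnalyzer for the
    // "keyword" field so a term like "-1" survives tokenization intact.
    static Analyzer build() {
        PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("keyword", new WhitespaceAnalyzer());
        return analyzer;
    }
}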

Hope this helps.



  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problem in lucene query

2009-09-10 Thread Anshum
Hi Vibhuti,
Not related to your query, but I'd advise you to move to a more recent
Lucene release - something like 2.4.1, or at least 2.3.1 (considering
it's already time for 2.9).

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw


On Thu, Sep 10, 2009 at 4:17 PM, vibhuti  wrote:

> Hello
>
>
>
> I am new to Lucene and facing a problem while performing searches. I am
> using lucene 2.2.0.
>
> My application indexes documents on "keyword" field which contains integer
> values. If the value is negative the query does not return correct results.
>
>
>
> Following is my lucene query:
>
>
>
> (keyword: \-1)
>
>
>
> I also tried:
>
> (keyword: "-1")
>
>
>
>
>
> But none of them returns correct results. It seems that Lucene ignores '-'.
> My purpose is to search documents with index value "-1".
>
>
>
> Any ideas??
>
>
>
> Thanks
>
>


MultiSearcherThread.hits(ParallelMultiSearcher.java:280) nullPointerException

2009-09-10 Thread maryam ma'danipour
Hello everyone,
I have a problem with MultiSearcherThread.hits() in ParallelMultiSearcher.java.
Sometimes when I want to search via ParallelMultiSearcher, the method
MultiSearcherThread.hits() throws a NullPointerException. This is because
docs has somehow become null.
But why is this field null? I've checked the Lucene code: this field never
becomes null except in ParallelMultiSearcher, when Lucene wants to
aggregate all results (in line 79) and the msta[i].join() instruction throws
InterruptedException; then ioe will be null, and because msta[i] hasn't
finished its work yet, docs will be null.
Is this right? Or is it possible that msta[i] is interrupted in this part of
the code?

the exception is :
java.lang.NullPointerException
at
org.apache.lucene.search.MultiSearcherThread.hits(ParallelMultiSearcher.java:280)
at
org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:83)

Best Regards


Re: Problem in lucene query

2009-09-10 Thread Erick Erickson
Also, get a copy of Luke and examine your index; that'll tell you what is
actually in there *and* it will let you see how queries parse under
various analyzers.

Best
Erick

On Thu, Sep 10, 2009 at 6:47 AM, vibhuti  wrote:

> Hello
>
>
>
> I am new to Lucene and facing a problem while performing searches. I am
> using lucene 2.2.0.
>
> My application indexes documents on "keyword" field which contains integer
> values. If the value is negative the query does not return correct results.
>
>
>
> Following is my lucene query:
>
>
>
> (keyword: \-1)
>
>
>
> I also tried:
>
> (keyword: "-1")
>
>
>
>
>
> But none of them returns correct results. It seems that Lucene ignores '-'.
> My purpose is to search documents with index value "-1".
>
>
>
> Any ideas??
>
>
>
> Thanks
>
>


Re: IndexReader.isCurrent for cached indexes

2009-09-10 Thread Nick Bailey
Our commit code will close the IndexWriter after adding the documents and 
before we see the log message indicating the documents have been added and 
deleted, so I don't believe that is the problem.

Thanks for the tip about reopen.  I actually noticed that when researching this 
problem but didn't think it was related.

We are running 2.4.1


-Original Message-
From: "Ian Lea" 
Sent: Thursday, September 10, 2009 5:05am
To: java-user@lucene.apache.org
Subject: Re: IndexReader.isCurrent for cached indexes

isCurrent() will only return false if there have been committed changes
to the index.  Maybe for some reason your index update job hasn't
committed or closed the index.

Probably not relevant to this problem, but your reopen code snippet
doesn't close the old reader.  It should.  See the javadocs.

What version of lucene are you running?



--
Ian.


On Wed, Sep 9, 2009 at 10:33 PM, Nick Bailey
 wrote:
> Looking for some help figuring out a problem with the IndexReader.isCurrent() 
> method and cached indexes.
>
> We have a number of lucene indexes that we attempt to keep in memory after an 
> initial query is performed.  In order to prevent the indexes from becoming 
> stale, we check for changes about every minute by calling isCurrent().  If 
> the index has changed, we will then reopen it.
>
> From our logs it appears that in some cases isCurrent() will return true even 
> though the index has changed since the last time the reader was opened.
>
> The code to refresh the index is basically this:
>
> // Checked every minute
> if(!reader.isCurrent()){
>   // reopen the existing reader
>   reader = this.searcher.getIndexReader();
>   reader = reader.reopen();
> }
>
> This is an example of the problem from the logs:
>
> 2009-08-29 17:50:51,387 Indexed 0 documents and deleted 1 documents from 
> index 'example' in 0 ms
> 2009-08-30 03:11:58,410 Indexed 0 documents and deleted 5 documents from 
> index 'example' in 0 ms
> 2009-08-30 16:30:03,466 Using cached reader  lastRefresh=81415526>
> // numbers indicate milliseconds since opened or refreshed aka age = 24.6hrs, 
> lastRefresh = 22.6hrs
>
> The logs indicate we deleted documents from the index at about 5:50 on August 
> 29th, and then again on the 30th at 3:11.  Then at 4:30 on we attempted to 
> query the index.  We found the cached reader and used it, however, the last 
> time the cache was refreshed was about 22 hours previously, coinciding with 
> the first delete.  The index should have been reopened after the second 
> delete.
>
> I have checked, and the code to refresh the indexes is definitely being run 
> every 60 seconds.  All I can see is that the problem might be with the 
> isCurrent() method.
>
> Could it be due to holding the reader open for so long? Any other ideas?
>
> Thanks a lot,
> Nick Bailey
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Ted Stockwell
Another alternative is storing the indexes in the Google Datastore; I think 
Compass already supports that (though I have not used it).

Also, I have successfully run Lucene on GAE using GaeVFS 
(http://code.google.com/p/gaevfs/) to store the index in the Datastore.
(I developed a Lucene Directory implementation on top of GaeVFS that's 
available at http://sf.net/contrail).



> Dvora wrote:
> > 
> > Hello,
> > 
> > I'm using Lucene2.4. I'm developing a web application that using Lucene
> > (via compass) to do the searches.
> > I'm intending to deploy the application in Google App Engine
> > (http://code.google.com/appengine/), which limits files length to be
> > smaller than 10MB. 



  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extending Sort/FieldCache

2009-09-10 Thread Jason Rutherglen
I think CSF hasn't been implemented because it's only marginally
useful yet requires fairly significant rewrites of core code
(i.e. SegmentMerger), so no one's picked it up, including myself.
An interim solution that fulfills the same function (quickly
loading field cache values) using what works reliably today
(i.e. payloads) is a good, simple, forward-moving step.

Shai, feel free to open an issue and post your code. I'll
check it out and help where possible.

On Tue, Sep 8, 2009 at 8:46 PM, Shai Erera  wrote:
> I didn't say we won't need CSF, but that at least conceptually, CSF and my
> sort-by-payload are the same. If however it turns out that CSF performs
> better, then I'll definitely switch my sort-by-payload package to use it. I
> thought that CSF is going to be implemented using payloads, but perhaps I'm
> wrong.
>
> Shai
>
> On Wed, Sep 9, 2009 at 1:39 AM, Yonik Seeley 
> wrote:
>
>> On Sun, Sep 6, 2009 at 4:42 AM, Shai Erera wrote:
>> >> I've resisted using payloads for this purpose in Solr because it felt
>> >> like an interim hack until CSF is implemented.
>> >
>> > I don't see it as a hack, but as a proper use of a great feature in
>> Lucene.
>>
>> It's proper use for an application perhaps, but not for core Lucene.
>> Applications are pretty much required to work with what's given in
>> Lucene... but Lucene developers can make better choices.  Hence if at
>> all possible, work should be put into implementing CSF rather than
>> sorting by payloads.
>>
>> > CSF and this are essentially the same.
>>
>> In which case we wouldn't need CSF?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Chinese Japanese Korean Indexing issue Version 2.4

2009-09-10 Thread asitag

Hi,

We are trying to index HTML files which have Japanese / Korean / Chinese
content using the CJK analyzer. But while indexing we are getting a lexical
parse error ("Encountered unknown character"). We tried setting the string
encoding to UTF-8 but it does not help.

Can anyone please help. Any pointers will be highly appreciated. 

Thanks
-- 
View this message in context: 
http://www.nabble.com/Chinese-Japanese-Korean-Indexing-issue-Version-2.4-tp25388003p25388003.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Chinese Japanese Korean Indexing issue Version 2.4

2009-09-10 Thread asitag

To add some more context - I am able to index English and Western European
languages.


asitag wrote:
> 
> Hi,
> 
> We are trying to index html files which have japanese /  korean / chinese
> content using the CJK analyser. But while indexing we are getting Lexical
> parse error. Encountered unkown character. We tried setting the string
> encoding to UTF 8 but it does not help.
> 
> Can anyone please help. Any pointers will be highly appreciated. 
> 
> Thanks
> 

-- 
View this message in context: 
http://www.nabble.com/Chinese-Japanese-Korean-Indexing-issue-Version-2.4-tp25388003p25388078.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Dvora

Is it possible to upload an already existing index to GAE? My index is data
I've been collecting for a long time, and I'd prefer not to give it up.



ted stockwell wrote:
> 
> Another alternative is storing the indexes in the Google Datastore, I
> think Compass already supports that (though I have not used it).
> 
> Also, I have successfully run Lucene on GAE using GaeVFS
> (http://code.google.com/p/gaevfs/) to store the index in the Datastore.
> (I developed a Lucene Directory implementation on top of GaeVFS that's
> available at http://sf.net/contrail).
> 
> 
> 
>> Dvora wrote:
>> > 
>> > Hello,
>> > 
>> > I'm using Lucene2.4. I'm developing a web application that using Lucene
>> > (via compass) to do the searches.
>> > I'm intending to deploy the application in Google App Engine
>> > (http://code.google.com/appengine/), which limits files length to be
>> > smaller than 10MB. 
> 
> 
> 
>   
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/How-to-avoid-huge-index-files-tp25347505p25389394.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to avoid huge index files

2009-09-10 Thread Ted Stockwell
Not at the moment.
Actually, I'm already working on a remote copy utility for gaevfs that will 
upload large files and folders but the first cut is about a week away.



- Original Message 
> From: Dvora 
> To: java-user@lucene.apache.org
> Sent: Thursday, September 10, 2009 2:18:35 PM
> Subject: Re: How to avoid huge index files
> 
> 
> Is it possible to upload to GAE an already exist index? My index is data I'm
> collecting for long time, and I prefer not to give it up.
> 
> 


  

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
I'm seeing a strange exception when indexing using the latest Solr rev on EC2.

org.apache.solr.client.solrj.SolrServerException:
org.apache.solr.client.solrj.SolrServerException:
java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
vs 298404 length in bytes of _0.fdx
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
at 
org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268)
at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
at 
org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239)
Caused by: org.apache.solr.client.solrj.SolrServerException:
java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
vs 298404 length in bytes of _0.fdx
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
... 3 more
Caused by: java.lang.RuntimeException: after flush: fdx size mismatch:
468 docs vs 298404 length in bytes of _0.fdx
at 
org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95)
at 
org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
at 
org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380)
at 
org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
at 
org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101)
at 
org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071)
at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035)
at 
org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215)
at 
org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
at 
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
... 3 more

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index docstore flush problem

2009-09-10 Thread Michael McCandless
That's an odd exception.  It means IndexWriter thinks 468 docs have
been written to the stored fields file, which should mean the fdx file
size is 3748 (= 4 + 468*8), yet the file size is far larger than that
(298404).
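
In other words, the expected length is just (a sketch of the arithmetic, not
the actual Lucene code):

class FdxCheck {
    // .fdx is a 4-byte header followed by one 8-byte pointer per stored doc.
    static long expectedFdxLength(int numDocs) {
        return 4L + 8L * numDocs;   // 468 docs -> 3748 bytes, not 298404
    }
}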

How repeatable is it?  Can you turn on infoStream, get the exception
to happen, then post the resulting output?

Mike

On Thu, Sep 10, 2009 at 7:19 PM, Jason Rutherglen
 wrote:
> I'm seeing a strange exception when indexing using the latest Solr rev on EC2.
>
> org.apache.solr.client.solrj.SolrServerException:
> org.apache.solr.client.solrj.SolrServerException:
> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
> vs 298404 length in bytes of _0.fdx
>        at 
> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
>        at 
> org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268)
>        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
>        at 
> org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239)
> Caused by: org.apache.solr.client.solrj.SolrServerException:
> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
> vs 298404 length in bytes of _0.fdx
>        at 
> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
>        ... 3 more
> Caused by: java.lang.RuntimeException: after flush: fdx size mismatch:
> 468 docs vs 298404 length in bytes of _0.fdx
>        at 
> org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95)
>        at 
> org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
>        at 
> org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380)
>        at 
> org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
>        at 
> org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212)
>        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110)
>        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101)
>        at 
> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108)
>        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071)
>        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035)
>        at 
> org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215)
>        at 
> org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180)
>        at 
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404)
>        at 
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
>        at 
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105)
>        at 
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48)
>        at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
>        at 
> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
>        ... 3 more
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
Index locking was off; there was a bug higher up clobbering the
index.  Sorry and thanks!

On Thu, Sep 10, 2009 at 4:49 PM, Michael McCandless
 wrote:
> That's an odd exception.  It means IndexWriter thinks 468 docs have
> been written to the stored fields file, which should mean the fdx file
> size is 3748 (= 4 + 468*8), yet the file size is far larger than that
> (298404).
>
> How repeatable is it?  Can you turn on infoStream, get the exception
> to happen, then post the resulting output?
>
> Mike
>
> On Thu, Sep 10, 2009 at 7:19 PM, Jason Rutherglen
>  wrote:
>> I'm seeing a strange exception when indexing using the latest Solr rev on 
>> EC2.
>>
>> org.apache.solr.client.solrj.SolrServerException:
>> org.apache.solr.client.solrj.SolrServerException:
>> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
>> vs 298404 length in bytes of _0.fdx
>>        at 
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
>>        at 
>> org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:268)
>>        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
>>        at 
>> org.apache.solr.hadoop.SolrRecordWriter$1.run(SolrRecordWriter.java:239)
>> Caused by: org.apache.solr.client.solrj.SolrServerException:
>> java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs
>> vs 298404 length in bytes of _0.fdx
>>        at 
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141)
>>        ... 3 more
>> Caused by: java.lang.RuntimeException: after flush: fdx size mismatch:
>> 468 docs vs 298404 length in bytes of _0.fdx
>>        at 
>> org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:95)
>>        at 
>> org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
>>        at 
>> org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:380)
>>        at 
>> org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:574)
>>        at 
>> org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4212)
>>        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4110)
>>        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4101)
>>        at 
>> org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2108)
>>        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2071)
>>        at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2035)
>>        at 
>> org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:215)
>>        at 
>> org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:180)
>>        at 
>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:404)
>>        at 
>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:85)
>>        at 
>> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:105)
>>        at 
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:48)
>>        at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
>>        at 
>> org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139)
>>        ... 3 more
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



quick survey on schema less database usage

2009-09-10 Thread rr04

I am an MIT student doing a project on schema-less database usage and would
greatly appreciate it if you could fill out a quick survey on this (it should
take < 5 mins):

http://bit.ly/nosqldb
-- 
View this message in context: 
http://www.nabble.com/quick-survey-on-schema-less-database-usage-tp25394429p25394429.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org