Re: Using Lucene to model ownership of documents

2016-06-16 Thread Denis Bazhenov
The speed for a and b should be the same, at least from a conceptual point of
view. The number of terms generated in each scenario is equal, so the index
size and vocabulary size should be the same.

I’m wondering why there is a difference. It seems there is some penalty for
writing/reading terms across many different fields, but I can’t elaborate on
that. Could you provide the index sizes for scenarios a and b?

Scenario c could be the fastest in terms of search and indexing speed, but it’s
far more complex and makes sense only if you need to scale your system, which
implies you can’t solve the problem on a single box.

So, if there is no need for scaling, I’d go with b for its simplicity.
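
For reference, a minimal sketch of option b (assuming a Lucene 6.x-era API;
the field names and analyzer are illustrative). The ownership term is attached
to the user's query as a non-scoring FILTER clause, so access control never
distorts relevance:

import java.io.IOException;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class OwnershipSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = new RAMDirectory();
    IndexWriter writer =
        new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));

    // One document, with a multi-valued "customers" field listing its owners.
    Document doc = new Document();
    doc.add(new TextField("body", "the searchable text", Field.Store.NO));
    doc.add(new StringField("customers", "abc", Field.Store.NO));
    doc.add(new StringField("customers", "abd", Field.Store.NO));
    writer.addDocument(doc);
    writer.close();

    // At search time the user's query scores; the ownership term only filters.
    Query q = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "text")), Occur.MUST)
        .add(new TermQuery(new Term("customers", "abc")), Occur.FILTER)
        .build();
    System.out.println(q);
  }
}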

> On Jun 15, 2016, at 23:25, Geebee Coder  wrote:
> 
> Hi there,
> I would like to use Lucene to solve the following problem:
>
> 1. We have about 100k customers and 25 million documents.
>
> 2. When a customer performs a text search on the document space, we want to
> return only documents that the customer has access to.
>
> 3. The # of documents a customer owns varies a lot: some own close to 23
> million, some close to 10k, and some own a third of the documents, etc.
>
> What is an efficient way to use Lucene in this scenario in terms of
> performance and indexing?
> We have tried a number of solutions such as
>
> a) 100k boolean fields per document that indicate whether a customer has
> access to the document.
> b) A single text field that lists the customers who own the document,
> e.g. (customers field : "abc abd cfx...")
> c) the above option with shards by customers
>
> The search performance for a was bad; b and c performed better for search
> but lengthened indexing time and increased index size.
> We are also thinking about using a custom filter but we are concerned about
> the memory requirements.
>
> Any ideas/suggestions would be really appreciated.

---
Denis Bazhenov



RE: LockFactory issue observed in lucene while getting instance of indexWriter

2016-06-16 Thread Mukul Ranjan
Hi Mike,

Yes, we are getting the IndexReader instance from the active Directory. We are
using a MultiReader to obtain the IndexSearcher instance.
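
For reference, roughly like this (a sketch, assuming the Lucene 6.x API; the
index paths are illustrative):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class MultiReaderSketch {
  public static void main(String[] args) throws IOException {
    // One reader per index directory.
    IndexReader r1 = DirectoryReader.open(FSDirectory.open(Paths.get("index1")));
    IndexReader r2 = DirectoryReader.open(FSDirectory.open(Paths.get("index2")));

    // MultiReader presents the sub-readers as one logical index.
    MultiReader multi = new MultiReader(r1, r2);
    IndexSearcher searcher = new IndexSearcher(multi);

    // Closing the MultiReader also closes the sub-readers.
    multi.close();
  }
}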

Thanks,
Mukul

From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Friday, June 17, 2016 12:56 AM
To: Mukul Ranjan 
Cc: Lucene Users 
Subject: Re: LockFactory issue observed in lucene while getting instance of 
indexWriter

But do you open any near-real-time readers from this writer?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 16, 2016 at 1:01 PM, Mukul Ranjan wrote:
Hi Michael,

Thanks for your reply.
I’m running it on Windows. I have checked my code; I’m closing the IndexWriter
after adding documents to it.
We are not always getting this issue, but its frequency is high in our
application. Can you please provide your suggestions?

Thanks,
Mukul

From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Thursday, June 16, 2016 10:22 PM
To: Lucene Users; Mukul Ranjan
Subject: Re: LockFactory issue observed in lucene while getting instance of 
indexWriter

Are you running on Windows?

This is not a LockFactory issue ... it's likely because you closed the
IndexWriter and then opened a new one before closing the NRT readers you had
opened from the first writer?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 16, 2016 at 6:19 AM, Mukul Ranjan wrote:
Hi,

I'm observing below exception while getting instance of indexWriter-

java.lang.IllegalArgumentException: Directory MMapDirectory@"directoryName"
lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ec79746 still has
pending deleted files; cannot initialize IndexWriter

Is it related to the default NativeFSLockFactory? Should I use
SimpleFSLockFactory to avoid this type of issue? Please advise, as I'm getting
the above exception in my application.

Thanks,
Mukul




Re: LockFactory issue observed in lucene while getting instance of indexWriter

2016-06-16 Thread Michael McCandless
Are you running on Windows?

This is not a LockFactory issue ... it's likely because you closed the
IndexWriter and then opened a new one before closing the NRT readers you had
opened from the first writer?
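
In code, the ordering that matters looks roughly like this (a sketch against
the 6.x API; the path and analyzer are illustrative):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NrtCloseOrderSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = FSDirectory.open(Paths.get("index"));
    IndexWriter writer =
        new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    // An NRT reader pins index files that the writer may later delete.
    DirectoryReader nrtReader = DirectoryReader.open(writer);

    // Closing this writer and opening a new one while nrtReader is still
    // open can leave "pending deleted files" behind, i.e. the exception
    // above. Release the readers first, then the writer:
    nrtReader.close();
    writer.close();
  }
}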

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 16, 2016 at 6:19 AM, Mukul Ranjan  wrote:

> Hi,
>
> I'm observing below exception while getting instance of indexWriter-
>
> java.lang.IllegalArgumentException: Directory MMapDirectory@"directoryName"
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ec79746 still
> has pending deleted files; cannot initialize IndexWriter
>
> Is it related to the default NativeFSLockFactory? Should I use
> SimpleFSLockFactory to avoid this type of issue? Please advise, as I'm
> getting the above exception in my application.
>
> Thanks,
> Mukul


Re: Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread dr

Thank you so much, Steve. Your reply is very helpful.

At 2016-06-16 23:01:18, "Steve Rowe"  wrote:
>Hi dr,
>
>Unicode’s character property model is described here: 
>.
>
>Wikipedia has a description of Unicode character properties: 
>
>
>JFlex allows you to refer to the set of characters that have a given Unicode 
>property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:
>
>  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] 
> [\p{WB:Format}\p{WB:Extend}]*
>
>This matches a Hangul script character (\p{Script:Hangul})[1] that also either 
>has the Word-Break property “ALetter” or “Hebrew_Letter”, followed by zero or 
>more characters that have either the “Format” or “Extend” Word-Break 
>properties[2].  
>
>Some helpful resources:
>
>* Character code charts organized by Unicode block: 
>
>* UnicodeSet utility:  - 
>note that this utility supports a different regex syntax from JFlex - click on 
>the “help” link for more info.
>
>[1] All characters matching \p{Script:Hangul}: 
>
>[2] Word-Break properties, which in JFlex can be referred to with the 
>abbreviation “WB:” in \p{WB:property-name}, are described in the table at 
>.
>
>--
>Steve
>www.lucidworks.com
>
>
>> On Jun 16, 2016, at 7:01 AM, dr  wrote:
>> 
>> Hi guys
>> Currently, I'm looking into the rules of StandardTokenizer, but have met some
>> problems.
>> As the docs say, StandardTokenizer implements the Word Break rules from
>> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
>> Annex #29. It is generated by JFlex, a lexer/scanner generator.
>> 
>> In StandardTokenizerImpl.jflex, the regular expressions are expressed as
>> follows
>> "
>>HangulEx= 
>> [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] 
>> [\p{WB:Format}\p{WB:Extend}]*
>> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]
>>[\p{WB:Format}\p{WB:Extend}]*
>> NumericEx   = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] 
>>[\p{WB:Format}\p{WB:Extend}]*
>> KatakanaEx  = \p{WB:Katakana}
>>[\p{WB:Format}\p{WB:Extend}]* 
>> MidLetterEx = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]   
>>[\p{WB:Format}\p{WB:Extend}]* 
>> ..
>> "
>> What do they mean, like HangulEx or NumericEx?
>> In ClassicTokenizerImpl.jflex, for num, it is expressed like this
>> "
>> P   = ("_"|"-"|"/"|"."|",")
>> NUM= ({ALPHANUM} {P} {HAS_DIGIT}
>>   | {HAS_DIGIT} {P} {ALPHANUM}
>>   | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>>   | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>   | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>   | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>> "
>> This is easy to understand: '29', '29.3', '29-3', and '29_3' will all be
>> tokenized as NUMBERS.
>> 
>> 
>> 
>> I read Unicode Standard Annex #29 (Unicode Text Segmentation), Annex #18
>> (Unicode Regular Expressions), and Annex #44 (Unicode Character Database),
>> but they include too much information and are hard to understand.
>> Does anyone have a reference for these kinds of regular expressions, or can
>> you tell me where to find their meanings?
>> 
>> 
>> Thanks.
>
>
>


Re: Using Lucene to model ownership of documents

2016-06-16 Thread Geebee Coder
Thank you all.
Michael, do you mean grouping customers by categories (e.g. customer A has
premium access and so does customer B, so they will have access to the same
set of documents)? If that's the case, unfortunately we don't have such
categories of customers; their access rights are over specific documents, not
tiers.


On Thu, Jun 16, 2016 at 9:37 AM, Michael Wilkowski wrote:

> Definitely b). I would also suggest using groups, and expanding a user's
> groups at sign-in time.
>
> MW
>
> On Thu, Jun 16, 2016 at 12:36 PM, Ian Lea wrote:
>
> > I'd definitely go for b).  The index will of course be larger for every
> > extra bit of data you store but it doesn't sound like this would make much
> > difference.  Likewise for speed of indexing.
> >
> > --
> > Ian.
> >
> > On Wed, Jun 15, 2016 at 2:25 PM, Geebee Coder wrote:
> >
> > > Hi there,
> > > I would like to use Lucene to solve the following problem:
> > >
> > > 1. We have about 100k customers and 25 million documents.
> > >
> > > 2. When a customer performs a text search on the document space, we want
> > > to return only documents that the customer has access to.
> > >
> > > 3. The # of documents a customer owns varies a lot: some own close to 23
> > > million, some close to 10k, and some own a third of the documents, etc.
> > >
> > > What is an efficient way to use Lucene in this scenario in terms of
> > > performance and indexing?
> > > We have tried a number of solutions such as
> > >
> > > a) 100k boolean fields per document that indicate whether a customer has
> > > access to the document.
> > > b) A single text field that lists the customers who own the document,
> > > e.g. (customers field : "abc abd cfx...")
> > > c) the above option with shards by customers
> > >
> > > The search performance for a was bad; b and c performed better for search
> > > but lengthened indexing time and increased index size.
> > > We are also thinking about using a custom filter but we are concerned
> > > about the memory requirements.
> > >
> > > Any ideas/suggestions would be really appreciated.


Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread Steve Rowe
Hi dr,

Unicode’s character property model is described here: 
.

Wikipedia has a description of Unicode character properties: 


JFlex allows you to refer to the set of characters that have a given Unicode 
property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:

  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] 
[\p{WB:Format}\p{WB:Extend}]*

This matches a Hangul script character (\p{Script:Hangul})[1] that also either 
has the Word-Break property “ALetter” or “Hebrew_Letter”, followed by zero or 
more characters that have either the “Format” or “Extend” Word-Break 
properties[2].  
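
If you want to experiment, a rough Java-regex analogue of the macro might look
like this (an approximation only: Java regexes expose Script and
General_Category but not the Word_Break property, so the WB:ALetter and
WB:Format/WB:Extend parts are stood in for by letters and combining/format
marks):

import java.util.regex.Pattern;

public class HangulExDemo {
  public static void main(String[] args) {
    // A Hangul-script letter followed by zero or more combining/format marks.
    Pattern hangulEx =
        Pattern.compile("[\\p{IsHangul}&&[\\p{L}]][\\p{Mn}\\p{Cf}]*");
    System.out.println(hangulEx.matcher("한").matches()); // true
    System.out.println(hangulEx.matcher("a").matches());  // false
  }
}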

Some helpful resources:

* Character code charts organized by Unicode block: 

* UnicodeSet utility:  - 
note that this utility supports a different regex syntax from JFlex - click on 
the “help” link for more info.

[1] All characters matching \p{Script:Hangul}: 

[2] Word-Break properties, which in JFlex can be referred to with the 
abbreviation “WB:” in \p{WB:property-name}, are described in the table at 
.

--
Steve
www.lucidworks.com


> On Jun 16, 2016, at 7:01 AM, dr  wrote:
> 
> Hi guys
> Currently, I'm looking into the rules of StandardTokenizer, but have met some
> problems.
> As the docs say, StandardTokenizer implements the Word Break rules from
> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
> Annex #29. It is generated by JFlex, a lexer/scanner generator.
> 
> In StandardTokenizerImpl.jflex, the regular expressions are expressed as
> follows
> "
>HangulEx= 
> [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] 
> [\p{WB:Format}\p{WB:Extend}]*
> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}] 
>   [\p{WB:Format}\p{WB:Extend}]*
> NumericEx   = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]  
>   [\p{WB:Format}\p{WB:Extend}]*
> KatakanaEx  = \p{WB:Katakana} 
>   [\p{WB:Format}\p{WB:Extend}]* 
> MidLetterEx = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]
>   [\p{WB:Format}\p{WB:Extend}]* 
> ..
> "
> What do they mean, like HangulEx or NumericEx?
> In ClassicTokenizerImpl.jflex, for num, it is expressed like this
> "
> P   = ("_"|"-"|"/"|"."|",")
> NUM= ({ALPHANUM} {P} {HAS_DIGIT}
>   | {HAS_DIGIT} {P} {ALPHANUM}
>   | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>   | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>   | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>   | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
> "
> This is easy to understand: '29', '29.3', '29-3', and '29_3' will all be
> tokenized as NUMBERS.
> 
> 
> 
> I read Unicode Standard Annex #29 (Unicode Text Segmentation), Annex #18
> (Unicode Regular Expressions), and Annex #44 (Unicode Character Database),
> but they include too much information and are hard to understand.
> Does anyone have a reference for these kinds of regular expressions, or can
> you tell me where to find their meanings?
> 
> 
> Thanks.





Re: Using Lucene to model ownership of documents

2016-06-16 Thread Michael Wilkowski
Definitely b). I would also suggest using groups, and expanding a user's
groups at sign-in time.
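
Something like this, roughly (a sketch only; lookupGroupsFor is a hypothetical
helper, and the index is assumed to store group ids in the same ownership
field as customer ids). Expanding groups once at sign-in keeps the per-query
clause count down to the user's group count:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class GroupAclSketch {
  // Hypothetical: resolve the user's groups once, at sign-in time.
  static String[] lookupGroupsFor(String user) {
    return new String[] { user, "premium", "emea" };
  }

  public static void main(String[] args) {
    // OR together the user's id and group ids into one ACL filter...
    BooleanQuery.Builder acl = new BooleanQuery.Builder();
    for (String g : lookupGroupsFor("abc")) {
      acl.add(new TermQuery(new Term("customers", g)), Occur.SHOULD);
    }

    // ...and attach it to the real query as a non-scoring FILTER clause.
    Query q = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "text")), Occur.MUST)
        .add(acl.build(), Occur.FILTER)
        .build();
    System.out.println(q);
  }
}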

MW

On Thu, Jun 16, 2016 at 12:36 PM, Ian Lea  wrote:

> I'd definitely go for b).  The index will of course be larger for every
> extra bit of data you store but it doesn't sound like this would make much
> difference.  Likewise for speed of indexing.
>
>
> --
> Ian.
>
>
> On Wed, Jun 15, 2016 at 2:25 PM, Geebee Coder  wrote:
>
> > Hi there,
> > I would like to use Lucene to solve the following problem:
> >
> > 1. We have about 100k customers and 25 million documents.
> >
> > 2. When a customer performs a text search on the document space, we want
> > to return only documents that the customer has access to.
> >
> > 3. The # of documents a customer owns varies a lot: some own close to 23
> > million, some close to 10k, and some own a third of the documents, etc.
> >
> > What is an efficient way to use Lucene in this scenario in terms of
> > performance and indexing?
> > We have tried a number of solutions such as
> >
> > a) 100k boolean fields per document that indicate whether a customer has
> > access to the document.
> > b) A single text field that lists the customers who own the document,
> > e.g. (customers field : "abc abd cfx...")
> > c) the above option with shards by customers
> >
> > The search performance for a was bad; b and c performed better for search
> > but lengthened indexing time and increased index size.
> > We are also thinking about using a custom filter but we are concerned
> > about the memory requirements.
> >
> > Any ideas/suggestions would be really appreciated.


IndexWriterConfig.readerPooling option...

2016-06-16 Thread Ravikumar Govindarajan
Came across a JIRA filed for pooling IndexReaders

https://issues.apache.org/jira/browse/LUCENE-2297


For every commit/delete/update cycle, IndexWriter opens a bunch of
SegmentReaders, does the job, and closes them.

Does the JIRA aim to re-use the SegmentReaders across all commit cycles, until
they are finally closed (after a merge, rollback, iw.close(), etc.)?

Also, we use the DirectoryReader.open(dir) construct for searching. Are the
SegmentReader instances associated with this different from the IndexWriter's,
and do they thus have a different life-cycle?
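
For concreteness, the two ways of obtaining readers (a sketch, assuming the
6.x API; the path and analyzer are illustrative):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReaderPoolingSketch {
  public static void main(String[] args) throws IOException {
    Directory dir = FSDirectory.open(Paths.get("index"));
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
        .setReaderPooling(true); // let the writer cache SegmentReaders
    IndexWriter writer = new IndexWriter(dir, iwc);
    writer.commit();

    // NRT reader: borrows the writer's (pooled) SegmentReaders and sees
    // uncommitted changes; its life-cycle is tied to the writer's.
    DirectoryReader fromWriter = DirectoryReader.open(writer);

    // Standalone reader: opens its own SegmentReaders from the last commit
    // point on disk, independently of the writer.
    DirectoryReader fromDir = DirectoryReader.open(dir);

    fromWriter.close();
    fromDir.close();
    writer.close();
  }
}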

Could someone please help in understanding this readerPooling option?

--
Ravi


Some questions about StandardTokenizer and UNICODE Regular Expressions

2016-06-16 Thread dr
Hi guys
Currently, I'm looking into the rules of StandardTokenizer, but have met some
problems.
As the docs say, StandardTokenizer implements the Word Break rules from the
Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex
#29. It is generated by JFlex, a lexer/scanner generator.
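
(To see the rules in action, here is a minimal driver; a sketch, assuming the
6.x API:)

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WordBreakDemo {
  public static void main(String[] args) throws IOException {
    // StandardTokenizer applies the JFlex-generated UAX#29 word-break rules.
    StandardTokenizer ts = new StandardTokenizer();
    ts.setReader(new StringReader("29.3 한글 text"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term); // prints: 29.3 / 한글 / text
    }
    ts.end();
    ts.close();
  }
}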

In StandardTokenizerImpl.jflex, the regular expressions are expressed as
follows:
"
HangulEx          = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
HebrewOrALetterEx = [\p{WB:HebrewLetter}\p{WB:ALetter}] [\p{WB:Format}\p{WB:Extend}]*
NumericEx         = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] [\p{WB:Format}\p{WB:Extend}]*
KatakanaEx        = \p{WB:Katakana} [\p{WB:Format}\p{WB:Extend}]*
MidLetterEx       = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
...
"
What do they mean, like HangulEx or NumericEx?
In ClassicTokenizerImpl.jflex, for num, it is expressed like this
"
P   = ("_"|"-"|"/"|"."|",")
NUM = ({ALPHANUM} {P} {HAS_DIGIT}
      | {HAS_DIGIT} {P} {ALPHANUM}
      | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
      | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
      | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
      | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
"
This is easy to understand: '29', '29.3', '29-3', and '29_3' will all be
tokenized as NUMBERS.

I read Unicode Standard Annex #29 (Unicode Text Segmentation), Annex #18
(Unicode Regular Expressions), and Annex #44 (Unicode Character Database), but
they include too much information and are hard to understand.
Does anyone have a reference for these kinds of regular expressions, or can you
tell me where to find their meanings?


Thanks.

Re: LockFactory issue observed in lucene while getting instance of indexWriter

2016-06-16 Thread Ian Lea
Sounds to me like it's related to the index not having been closed properly
or still being updated or something.  I'd worry about that.

--
Ian.


On Thu, Jun 16, 2016 at 11:19 AM, Mukul Ranjan  wrote:

> Hi,
>
> I'm observing below exception while getting instance of indexWriter-
>
> java.lang.IllegalArgumentException: Directory MMapDirectory@"directoryName"
> lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ec79746 still
> has pending deleted files; cannot initialize IndexWriter
>
> Is it related to the default NativeFSLockFactory? Should I use
> SimpleFSLockFactory to avoid this type of issue? Please advise, as I'm
> getting the above exception in my application.
>
> Thanks,
> Mukul


Re: Using Lucene to model ownership of documents

2016-06-16 Thread Ian Lea
I'd definitely go for b).  The index will of course be larger for every
extra bit of data you store but it doesn't sound like this would make much
difference.  Likewise for speed of indexing.


--
Ian.


On Wed, Jun 15, 2016 at 2:25 PM, Geebee Coder  wrote:

> Hi there,
> I would like to use Lucene to solve the following problem:
>
> 1. We have about 100k customers and 25 million documents.
>
> 2. When a customer performs a text search on the document space, we want to
> return only documents that the customer has access to.
>
> 3. The # of documents a customer owns varies a lot: some own close to 23
> million, some close to 10k, and some own a third of the documents, etc.
>
> What is an efficient way to use Lucene in this scenario in terms of
> performance and indexing?
> We have tried a number of solutions such as
>
> a) 100k boolean fields per document that indicate whether a customer has
> access to the document.
> b) A single text field that lists the customers who own the document,
> e.g. (customers field : "abc abd cfx...")
> c) the above option with shards by customers
>
> The search performance for a was bad; b and c performed better for search
> but lengthened indexing time and increased index size.
> We are also thinking about using a custom filter but we are concerned about
> the memory requirements.
>
> Any ideas/suggestions would be really appreciated.


LockFactory issue observed in lucene while getting instance of indexWriter

2016-06-16 Thread Mukul Ranjan
Hi,

I'm observing below exception while getting instance of indexWriter-

java.lang.IllegalArgumentException: Directory MMapDirectory@"directoryName" 
lockFactory=org.apache.lucene.store.NativeFSLockFactory@1ec79746 still has 
pending deleted files; cannot initialize IndexWriter

Is it related to the default NativeFSLockFactory? Should I use
SimpleFSLockFactory to avoid this type of issue? Please advise, as I'm getting
the above exception in my application.

Thanks,
Mukul