help camelcase tokenizer

2016-11-16 Thread Andres Fernando Wilches Riano
Hello

I am indexing Java source code files. I need to know how to index or
tokenize camelCase words in identifiers, method names, classes, etc., e.g.
getSystemRequirements.

I am using Lucene 3.0.1.
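(For context: core Lucene 3.0.1 has no built-in camel-case tokenizer; Solr's WordDelimiterFilter handles this case. The splitting rule such a filter applies can be sketched in plain Java, with no Lucene dependency; the class and method names below are made up for illustration:)

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the splitting rule a camel-case token filter would apply:
// start a new sub-token at each lower-to-upper case transition.
// Class and method names are illustrative, not a Lucene API.
public class CamelCaseSplitter {
    public static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < identifier.length(); i++) {
            if (Character.isUpperCase(identifier.charAt(i))
                    && !Character.isUpperCase(identifier.charAt(i - 1))) {
                parts.add(identifier.substring(start, i));
                start = i;
            }
        }
        parts.add(identifier.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // prints [get, System, Requirements]
        System.out.println(split("getSystemRequirements"));
    }
}
```

Inside a real TokenFilter you would emit each part as its own token, optionally also keeping the original whole identifier, which is essentially what WordDelimiterFilter's split-on-case-change option does.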

Thank you,

-- 
Sincerely,


*Andrés Fernando Wilches Riaño*
Systems and Computing Engineer
Master's student in Systems and Computing Engineering
Universidad Nacional de Colombia


Re: StartsWith on DrillDown?

2016-11-16 Thread Matt Hicks
My situation is that I simply don't understand how I'm supposed to pass a
`Query` into it.  Passing the result of a `new QueryParser(facetName,
standardAnalyzer)` into `drillDown.add(facetName, queryParser.parse("valid
query"))` always returns zero items.  I'm guessing it has to do with
needing a specific type of query?

On Wed, Nov 16, 2016 at 4:05 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Can you post a test case showing the unexpected behavior?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Wed, Nov 16, 2016 at 1:55 PM, Matt Hicks  wrote:
> > Is this simply not possible to accomplish or does nobody on this list
> know?
> >
> > On Mon, Nov 14, 2016 at 2:39 PM Matt Hicks  wrote:
> >
> >> I'm trying to add a sub-query to my DrillDownQuery but I keep ending up
> >> with no results when I use add(String dim, Query subQuery).  I'm trying
> to
> >> query the tags that start with a specific String.
> >>
> >> Any suggestions of how to do this would be greatly appreciated. I am
> using
> >> Lucene Core 6.3.0.
> >>
> >> Thank you
> >>
>


Re: StartsWith on DrillDown?

2016-11-16 Thread Michael McCandless
Can you post a test case showing the unexpected behavior?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Nov 16, 2016 at 1:55 PM, Matt Hicks  wrote:
> Is this simply not possible to accomplish or does nobody on this list know?
>
> On Mon, Nov 14, 2016 at 2:39 PM Matt Hicks  wrote:
>
>> I'm trying to add a sub-query to my DrillDownQuery but I keep ending up
>> with no results when I use add(String dim, Query subQuery).  I'm trying to
>> query the tags that start with a specific String.
>>
>> Any suggestions of how to do this would be greatly appreciated. I am using
>> Lucene Core 6.3.0.
>>
>> Thank you
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: StartsWith on DrillDown?

2016-11-16 Thread Matt Hicks
Is this simply not possible to accomplish or does nobody on this list know?

On Mon, Nov 14, 2016 at 2:39 PM Matt Hicks  wrote:

> I'm trying to add a sub-query to my DrillDownQuery but I keep ending up
> with no results when I use add(String dim, Query subQuery).  I'm trying to
> query the tags that start with a specific String.
>
> Any suggestions of how to do this would be greatly appreciated. I am using
> Lucene Core 6.3.0.
>
> Thank you
>


Re: How exclude empty fields?

2016-11-16 Thread Chris Hostetter
: The issue I have is that some promotions are permanent so they don't have
: an endDate set.
: 
: I tried doing:
: 
: ( +Promotion.endDate:[210100 TO variable containing yesterday's date]
: || -Promotion.endDate:* )

1) mixing prefix ops with "||" like this is most certainly not doing what 
you think...

https://lucidworks.com/blog/why-not-and-or-and-not/

2) combine that with Ahmet's point about needing a "MatchAllDocsQuery" to 
"select all docs" from which you can then "exclude docs with an endDate" 
to give you the final results of "docs w/o an endDate" ...

BooleanQuery(
  Should(NumericRangeQuery("endDate:[X TO Y]"))
  Should(BooleanQuery(
    Must(MatchAllDocsQuery())
    MustNot(FieldValueQuery("endDate"))
  ))
)

...either that, or index a new boolean field "permanent" and then simplify 
your query to basically just be "endDate:[X TO Y] OR permanent:true"
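(Editorially, the matching logic above, a doc matches if its endDate falls in [X, Y] or if it has no endDate at all, can be sanity-checked independent of the Lucene API in plain Java; class and field names are illustrative, and null stands in for a document missing the field:)

```java
// Plain-Java model of the query logic above; not Lucene code.
// A promotion matches if its endDate lies in [x, y] (the SHOULD range
// clause), OR if it has no endDate at all (MatchAll MUST_NOT "field
// exists"). A null endDate stands in for a doc missing the field.
public class PromotionMatch {
    public static boolean matches(Long endDate, long x, long y) {
        boolean inRange = endDate != null && endDate >= x && endDate <= y;
        boolean permanent = endDate == null;
        return inRange || permanent;
    }

    public static void main(String[] args) {
        System.out.println(matches(20161115L, 20160101L, 20161116L)); // expired in range: matches
        System.out.println(matches(null, 20160101L, 20161116L));      // permanent: matches
        System.out.println(matches(20170101L, 20160101L, 20161116L)); // still running: no match
    }
}
```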







-Hoss
http://www.lucidworks.com/




Sort merge strategy?

2016-11-16 Thread Kevin Burton
What's the current status of the sort merge strategy?

I want to sort an index by a given field and keep it in that order on disk.

It seems to have evolved over the years, and I can't easily figure out the
current status from the Javadoc in 6.x.

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-16 Thread Michael McCandless
You store dimension + string (a single value path, since it's not
hierarchical) into SSDVFF so that you can compute facet counts, either
ordinary drill down counts or the drill sideways counts.
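(As a toy model, not the Lucene implementation: the flat dimension + value strings SSDVFF stores reduce facet counting to a group-by over the matching documents' "dim/value" labels:)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of flat (non-hierarchical) facet counting over
// "dimension/value" labels, to picture what SSDVFF stores per doc.
// Not Lucene code; Lucene maps labels to ordinals and counts those.
public class FlatFacetCounts {
    public static Map<String, Integer> count(List<List<String>> matchingDocs, String dim) {
        Map<String, Integer> counts = new HashMap<>();
        String prefix = dim + "/";
        for (List<String> docLabels : matchingDocs) {
            for (String label : docLabels) {
                if (label.startsWith(prefix)) {
                    counts.merge(label.substring(prefix.length()), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("author/Mike", "project/lucene"),
                List.of("author/Chitra", "project/lucene"));
        System.out.println(count(docs, "project")); // prints {lucene=2}
    }
}
```

Drill sideways then just re-runs this counting for each dimension over a slightly different set of matching docs (the match set ignoring that dimension's own drill-down filter).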

You can see examples of drill sideways at
http://jirasearch.mikemccandless.com, e.g. drill down on any of those
fields on the left and you don't lose the previous facet counts for
that field.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 16, 2016 at 8:51 AM, Chitra R  wrote:
> Hi,
>
> Lucene-Drill sideways
>
> jira_issue:LUCENE-4748
>
>  Is this the reason( ie Drill sideways makes
> a very nice faceted search UI because we
> don't "lose" the facet counts after drilling in) behind storing path and
> dimension for the given SSDVF field? Else anything?
>
> Regards,
> Chitra
>
>
>  Hey, thank you so much for the fast response, I agree NRT refresh is
> somewhat costly operations and this is the major pitfall, suppose we use doc
> value faceting.
>
>
>  While indexing SortedSetDocValuesFacetField , it stores
> path and dimension of the given field internally. So Can we achieve
> hierarchical facets using DrillDownQuery? Hope, purpose of storing path and
> dimension is to achieve hierarchical facets. If yes (ie we can achieve
> hierarchy in SSDVFF) , so what is the need to move over taxonomy?
>  Else I missed anything?
>
>
>  What is the real purpose to store path and dimension in
> SSDVF field?
>
>
> Kindly post your suggestions.
>
> Regards,
> Chitra
>
>
>
> On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless
>  wrote:
>>
>> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R  wrote:
>>
>> > i)Hope, when opening SortedSetDocValuesReaderState , we are
>> > calculating ordinals( this will be used to calculate facet count ) for
>> > doc
>> > values field and this only made the state instance somewhat costly.
>> >   Am I right or any other reason behind that?
>>
>> That's correct.  It adds some latency to an NRT refresh, and some heap
>> used to hold the ordinal mappings.
>>
>> >  ii) During indexing, we are providing facet ordinals in each
>> > doc
>> > and I think it will be useful in search side, to calculate facet counts
>> > only for matching docs.  otherwise, it carries any other benefits?
>>
>> Well, compared to the taxonomy facets, SSDV facets don't require a
>> separate index.
>>
>> But they add latency/heap usage, and they cannot do hierarchical
>> facets yet (though this could be fixed if someone just built it).
>>
>> >  iii) Is SortedSetDocValuesReaderState thread-safe (ie) multiple
>> > threads can call this method concurrently?
>>
>> Yes.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>
>




Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-16 Thread Chitra R
Hi,

Lucene drill sideways: jira_issue:LUCENE-4748

     Is this the reason behind storing path and dimension for the given
SSDVFF field (i.e., that drill sideways makes a very nice faceted search UI
because we don't "lose" the facet counts after drilling in)? Or is there
anything else?

Regards,
Chitra

 Hey, thank you so much for the fast response. I agree the NRT refresh is a
somewhat costly operation, and this is the major pitfall if we use
doc-value faceting.


             While indexing a SortedSetDocValuesFacetField, it stores the
path and dimension of the given field internally. So can we achieve
hierarchical facets using DrillDownQuery? I assume the purpose of storing
path and dimension is to achieve hierarchical facets. If yes (i.e., we can
achieve hierarchy in SSDVFF), what is the need to move over to taxonomy?
             Or have I missed anything?


             What is the real purpose of storing path and dimension in the
SSDVF field?


Kindly post your suggestions.

Regards,
Chitra



On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R  wrote:
>
> > i)Hope, when opening SortedSetDocValuesReaderState , we are
> > calculating ordinals( this will be used to calculate facet count ) for
> doc
> > values field and this only made the state instance somewhat costly.
> >   Am I right or any other reason behind that?
>
> That's correct.  It adds some latency to an NRT refresh, and some heap
> used to hold the ordinal mappings.
>
> >  ii) During indexing, we are providing facet ordinals in each doc
> > and I think it will be useful in search side, to calculate facet counts
> > only for matching docs.  otherwise, it carries any other benefits?
>
> Well, compared to the taxonomy facets, SSDV facets don't require a
> separate index.
>
> But they add latency/heap usage, and they cannot do hierarchical
> facets yet (though this could be fixed if someone just built it).
>
> >  iii) Is SortedSetDocValuesReaderState thread-safe (ie) multiple
> > threads can call this method concurrently?
>
> Yes.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-16 Thread Michael McCandless
No, SSDVFF does not do hierarchical faceting today, but this is just a
limitation of the current implementation, and with some changes
(patches welcome!), it could do so.
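(To picture what the missing hierarchy would add: with hierarchical paths, each document's label contributes a count to every ancestor prefix, not just the leaf. A toy sketch, not Lucene code; the taxonomy index does this via per-path ordinals:)

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of hierarchical facet counting: a path like "2016/11/16"
// increments the count of every ancestor prefix ("2016", "2016/11", ...),
// which is what flat SSDVFF counting cannot do today. Not Lucene code.
public class HierarchicalCounts {
    public static Map<String, Integer> count(List<String> paths) {
        Map<String, Integer> counts = new HashMap<>();
        for (String path : paths) {
            StringBuilder prefix = new StringBuilder();
            for (String part : path.split("/")) {
                if (prefix.length() > 0) prefix.append('/');
                prefix.append(part);
                counts.merge(prefix.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = count(List.of("2016/11/16", "2016/11/14"));
        System.out.println(c.get("2016/11")); // prints 2
    }
}
```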

Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 14, 2016 at 1:38 AM, Chitra R  wrote:
>
>  Hey, thank you so much for the fast response, I agree NRT refresh is
> somewhat costly operations and this is the major pitfall, suppose we use doc
> value faceting.
>
>
>  While indexing SortedSetDocValuesFacetField , it stores
> path and dimension of the given field internally. So Can we achieve
> hierarchical facets using DrillDownQuery? Hope, purpose of storing path and
> dimension is to achieve hierarchical facets. If yes (ie we can achieve
> hierarchy in SSDVFF) , so what is the need to move over taxonomy?
>  Else I missed anything?
>
>
>  What is the real purpose to store path and dimension in
> SSDVF field?
>
>
> Kindly post your suggestions.
>
> Regards,
> Chitra
>
>
>
> On Sat, Nov 12, 2016 at 4:03 AM, Michael McCandless
>  wrote:
>>
>> On Fri, Nov 11, 2016 at 5:21 AM, Chitra R  wrote:
>>
>> > i)Hope, when opening SortedSetDocValuesReaderState , we are
>> > calculating ordinals( this will be used to calculate facet count ) for
>> > doc
>> > values field and this only made the state instance somewhat costly.
>> >   Am I right or any other reason behind that?
>>
>> That's correct.  It adds some latency to an NRT refresh, and some heap
>> used to hold the ordinal mappings.
>>
>> >  ii) During indexing, we are providing facet ordinals in each
>> > doc
>> > and I think it will be useful in search side, to calculate facet counts
>> > only for matching docs.  otherwise, it carries any other benefits?
>>
>> Well, compared to the taxonomy facets, SSDV facets don't require a
>> separate index.
>>
>> But they add latency/heap usage, and they cannot do hierarchical
>> facets yet (though this could be fixed if someone just built it).
>>
>> >  iii) Is SortedSetDocValuesReaderState thread-safe (ie) multiple
>> > threads can call this method concurrently?
>>
>> Yes.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>
>




Re: Possible to cause documents to be contiguous after forceMerge?

2016-11-16 Thread Tommaso Teofili
Improved locality of "near" documents could be used to avoid loading some
segments during the retrieval phase for certain use cases (e.g. spatial
search).


On Wed, Nov 16, 2016 at 9:45 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

On Wed, Nov 16, 2016 at 11:15 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Can IndexSort help here?
> --
> From: Erick Erickson 
> Sent: ‎11/‎16/‎2016 9:29
> To: java-user 
> Subject: Re: Possible to cause documents to be contiguous after
> forceMerge?
>
> Well, codecs are pluggable so if you can show that you'd get
> an improvement (however you measure them) and that whatever
> you have in mind wouldn't penalize the general case you could
> submit it as a proposal/patch.
>
> Best,
> Erick
>
> On Tue, Nov 15, 2016 at 6:21 PM, Kevin Burton  wrote:
> > On Tue, Nov 15, 2016 at 6:16 PM, Erick Erickson  >
> > wrote:
> >
> >> You can make no assumptions about locality in terms of where separate
> >> documents land on disk. I suppose if you have the whole corpus at index
> >> time you
> >> could index these "similar" documents contiguously. T
> >>
> >
> > Wow.. that's shockingly frightening. There are a ton of optimizations if
> > you can trick the underlying content store into performing locality.
> >
> > Not trying to be overly negative so another way to phrase it is that at
> > least there's room for improvement !
> >
> >
> >> My base question is why you'd care about compressing 500G. Disk space
> >> is so cheap that the expense of trying to control this dwarfs any
> >> imaginable
> >> $avings, unless you're talking about a lot of 500G indexes. In other
> words
> >> this seems like an
> >> XY problem, you're asking about compressing when you are really
> concerned
> >> with something else.
> >>
> >
> > 500GB per day... additionally, disk is cheap, but IOPS are not. The more
> we
> > can keep in ram and on SSD the better.
> >
> > And we're trying to get as much in RAM then SSD as possible... plus we
> have
> > about 2 years of content.  It adds up ;)
> >
> > Kevin
> >
> > --
> >
> > We’re hiring if you know of any awesome Java Devops or Linux Operations
> > Engineers!
> >
> > Founder/CEO Spinn3r.com
> > Location: *San Francisco, CA*
> > blog: http://burtonator.wordpress.com
> > … or check out my Google+ profile
> > 
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Possible to cause documents to be contiguous after forceMerge?

2016-11-16 Thread Ishan Chattopadhyaya
http://shaierera.blogspot.com/2013/04/index-sorting-with-lucene.html

On Wed, Nov 16, 2016 at 11:15 AM, Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Can IndexSort help here?
> --
> From: Erick Erickson 
> Sent: ‎11/‎16/‎2016 9:29
> To: java-user 
> Subject: Re: Possible to cause documents to be contiguous after
> forceMerge?
>
> Well, codecs are pluggable so if you can show that you'd get
> an improvement (however you measure them) and that whatever
> you have in mind wouldn't penalize the general case you could
> submit it as a proposal/patch.
>
> Best,
> Erick
>
> On Tue, Nov 15, 2016 at 6:21 PM, Kevin Burton  wrote:
> > On Tue, Nov 15, 2016 at 6:16 PM, Erick Erickson  >
> > wrote:
> >
> >> You can make no assumptions about locality in terms of where separate
> >> documents land on disk. I suppose if you have the whole corpus at index
> >> time you
> >> could index these "similar" documents contiguously. T
> >>
> >
> > Wow.. that's shockingly frightening. There are a ton of optimizations if
> > you can trick the underlying content store into performing locality.
> >
> > Not trying to be overly negative so another way to phrase it is that at
> > least there's room for improvement !
> >
> >
> >> My base question is why you'd care about compressing 500G. Disk space
> >> is so cheap that the expense of trying to control this dwarfs any
> >> imaginable
> >> $avings, unless you're talking about a lot of 500G indexes. In other
> words
> >> this seems like an
> >> XY problem, you're asking about compressing when you are really
> concerned
> >> with something else.
> >>
> >
> > 500GB per day... additionally, disk is cheap, but IOPS are not. The more
> we
> > can keep in ram and on SSD the better.
> >
> > And we're trying to get as much in RAM then SSD as possible... plus we
> have
> > about 2 years of content.  It adds up ;)
> >
> > Kevin
> >
> > --
> >
> > We’re hiring if you know of any awesome Java Devops or Linux Operations
> > Engineers!
> >
> > Founder/CEO Spinn3r.com
> > Location: *San Francisco, CA*
> > blog: http://burtonator.wordpress.com
> > … or check out my Google+ profile
> > 
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>