Split / Concatenation of search term

2016-11-30 Thread hariram ravichandran
Is it possible to handle the split and concatenation of words when a
space was inserted in a word or removed between two words?

For example, "entert ainment"will match with "entertainment"  and
"smartwatch" will match with "smart watch".


Save the date: ApacheCon Miami, May 15-19, 2017

2016-11-30 Thread Rich Bowen
Dear Apache enthusiast,

ApacheCon and Apache Big Data will be held at the Intercontinental in
Miami, Florida, May 16-18, 2017. Submit your talks, and register, at
http://apachecon.com/. Talks aimed at the Big Data section of the event
should go to
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp
while other talks should go to
http://events.linuxfoundation.org/events/apachecon-north-america/program/cfp


ApacheCon is the best place to meet the people who develop the software
that you use and rely on. It’s also a great opportunity to deepen your
involvement in the project, and perhaps make the leap to contributing.
And we find that user case studies, showcasing how you use Apache
projects to solve real world problems, are very popular at this event.
So, do consider whether you have a use case that might make a good
presentation.

ApacheCon will have many different ways that you can participate:

Technical Content: We’ll have three days of technical sessions covering
many of the projects at the ASF. We’ll be publishing a schedule of talks
on March 9th, so that you can plan what you’ll be attending.

BarCamp: The Apache BarCamp is a standard feature of ApacheCon - an
un-conference style event, where the schedule is determined on-site by
the attendees, and anything is fair game.

Lightning Talks: Even if you don’t give a full-length talk, you can
present a Lightning Talk: a five-minute presentation on any topic related
to the ASF, open to any attendee. If there’s something you’re passionate
about, consider giving a Lightning Talk.

Sponsor: It costs money to put on a conference, and this is a great
opportunity for companies involved in Apache projects, or who benefit
from Apache code - your employers - to get their name and products in
front of the community. Sponsors can start at any monetary level, and
can sponsor everything from the conference badge lanyard up to larger
items such as video recordings and evening events. For more information
on sponsoring ApacheCon, see http://apachecon.com/sponsor/

So, get your tickets today at http://apachecon.com/ and submit your
talks. ApacheCon Miami is going to be our best ApacheCon yet, and you,
and your project, can’t afford to miss it.

-- 
Rich Bowen - rbo...@apache.org
VP, Conferences
http://apachecon.com
@apachecon





Re: commit frequency guideline?

2016-11-30 Thread Rob Audenaerde
Thanks for the quick reply!

>What do you mean by "Lucene complain about too-many uncommitted docs"?

--> Good question; I was thoughtlessly echoing my colleague's words. I
asked him, and he said it was about commits taking very long and about
memory issues. So maybe this wasn't the best opening statement :)

For the other part of the question: we need users to see the changed
documents immediately, but I think we have this covered by using NRT
Readers and the SearcherManager.

Am I correct to conclude that calling commit() is not necessary for
finding recently changed documents?

I think we can then switch to a time-based commit() where we just call
commit() every 5 minutes, in effect losing at most 5 minutes of work
(which we can mitigate in another way) if the server somehow stops working.
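
For reference, a minimal sketch of that setup: NRT visibility through
SearcherManager, plus a coarse time-based commit purely for durability. The
class name and intervals are just illustrative, and SearcherManager constructor
variants differ a little between Lucene versions.

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

/** Sketch: near-real-time visibility via SearcherManager, durability via periodic commit(). */
public class NrtWithPeriodicCommit {

  private final IndexWriter writer;
  private final SearcherManager searcherManager;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public NrtWithPeriodicCommit(IndexWriter writer) throws IOException {
    this.writer = writer;
    this.searcherManager = new SearcherManager(writer, new SearcherFactory());

    // Refresh often so recently indexed docs become searchable without a commit.
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        searcherManager.maybeRefresh();
      } catch (IOException e) {
        // log and continue; refresh failures are usually transient
      }
    }, 1, 1, TimeUnit.SECONDS);

    // Commit on a coarse schedule purely for durability (here: every 5 minutes).
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        writer.commit();
      } catch (IOException e) {
        // log; at most ~5 minutes of updates would need replaying after a crash
      }
    }, 5, 5, TimeUnit.MINUTES);
  }

  /** Acquire/release pattern for searching the latest refreshed view. */
  public void search() throws IOException {
    IndexSearcher searcher = searcherManager.acquire();
    try {
      // run queries with 'searcher'
    } finally {
      searcherManager.release(searcher);
    }
  }
}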

Thank you,
-Rob




On Wed, Nov 30, 2016 at 3:17 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> What do you mean by "Lucene complain about too-many uncommitted docs"?
>  Lucene does not really care how frequently you commit...
>
> How frequently you commit is really your choice, i.e. what risk you
> see of power loss / OS crash vs the cost (not just in CPU/IO work for
> the computer, but in the users not seeing the recently indexed
> documents for a while) of replaying those documents since the last
> commit when power comes back.
>
> Pushing durability back into the queue/channel can be a nice option
> too, e.g. Kafka, so that your application doesn't need to keep track
> of which docs were not yet committed.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Nov 30, 2016 at 8:50 AM, Rob Audenaerde
>  wrote:
> > Hi all,
> >
> > Currently we call commit() many times on our index (about 5M docs, where
> > some 10.000-100.000 modifications during the day). The commit times
> > typically get more expensive when the index grows, up to several seconds,
> > so we want to reduce the number of calls.
> >
> > (Historically, we had Lucene complain about too-many uncommitted docs
> > sometimes, so we went with the commit often approach.)
> >
> > What is a good strategy for calling commit? Fixed frequency? After X
> docs?
> > Combination?
> >
> > I'm curious what is considered 'industry-standard'. Can you share some of
> > your experience?
> >
> > Thanks!
> >
> > -Rob
>


Re: commit frequency guideline?

2016-11-30 Thread Michael McCandless
What do you mean by "Lucene complain about too-many uncommitted docs"?
 Lucene does not really care how frequently you commit...

How frequently you commit is really your choice, i.e. what risk you
see of power loss / OS crash vs the cost (not just in CPU/IO work for
the computer, but in the users not seeing the recently indexed
documents for a while) of replaying those documents since the last
commit when power comes back.

Pushing durability back into the queue/channel can be a nice option
too, e.g. Kafka, so that your application doesn't need to keep track
of which docs were not yet committed.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 30, 2016 at 8:50 AM, Rob Audenaerde
 wrote:
> Hi all,
>
> Currently we call commit() many times on our index (about 5M docs, where
> some 10.000-100.000 modifications during the day). The commit times
> typically get more expensive when the index grows, up to several seconds,
> so we want to reduce the number of calls.
>
> (Historically, we had Lucene complain about too-many uncommitted docs
> sometimes, so we went with the commit often approach.)
>
> What is a good strategy for calling commit? Fixed frequency? After X docs?
> Combination?
>
> I'm curious what is considered 'industry-standard'. Can you share some of
> your experience?
>
> Thanks!
>
> -Rob




commit frequency guideline?

2016-11-30 Thread Rob Audenaerde
Hi all,

Currently we call commit() many times on our index (about 5M docs, with
some 10,000-100,000 modifications during the day). Commits typically get
more expensive as the index grows, up to several seconds, so we want to
reduce the number of calls.

(Historically, we sometimes had Lucene complain about too many uncommitted
docs, so we went with the commit-often approach.)

What is a good strategy for calling commit()? Fixed frequency? After X docs?
A combination?

I'm curious what is considered 'industry standard'. Can you share some of
your experience?

Thanks!

-Rob


Re: Query expansion

2016-11-30 Thread hariram ravichandran
I am overriding getFieldQuery(String field, String fieldText, boolean
quoted). In the case of a phrase query,
getFieldQuery(String field, String queryText, int slop) will be called instead.


Prefix queries are not my use case, so we can ignore them.


Assume this is my only case: a sequence of words (apple orange mango) as
input, and I need results for (apple~ orange~ mango~).

I use AND as the default conjunction operator
(parser.setDefaultOperator(QueryParser.Operator.AND)) to get better
relevance in the results.



That method works as I expected. Are there any drawbacks to using it?


And is there a better way to expand the query like this?
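
For comparison, a minimal sketch of the newTermQuery route Mike suggests below,
against the Lucene 6.x classic query parser. The class name is made up, and note
that FuzzyQuery in recent Lucene versions takes a maximum edit distance of at
most 2 rather than the old similarity float:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

// Sketch: every plain term becomes fuzzy; phrase and prefix handling stay untouched.
public class FuzzyTermsQueryParser extends MultiFieldQueryParser {

  public FuzzyTermsQueryParser(String[] fields, Analyzer analyzer) {
    super(fields, analyzer);
  }

  @Override
  protected Query newTermQuery(Term term) {
    return new FuzzyQuery(term, 2); // apple -> apple~2, orange -> orange~2, ...
  }
}

setDefaultOperator(QueryParser.Operator.AND) can still be applied to the parser
instance as before; as Mike notes, making every term fuzzy can hurt both
performance and precision.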







On Wed, Nov 30, 2016 at 4:37 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> This is likely tricky to do correctly.
>
> E.g., MultiFieldQueryParser.getFieldQuery is invoked on whole chunks
> of text.  If you search for:
>
>   apple orange
>
> I suspect it won't do what you want, since the whole string "apple
> orange" is passed to getFieldQuery.
>
> How do you want to handle e.g. a phrase query (user types "apple
> orange", with the double quotes)?  Or a prefix query (app*)?
>
> Maybe you could instead override newTermQuery?  In the example above
> it would be invoked twice, once for apple and once for orange.
>
> Finally, all this being said, making everything fuzzy is likely a big
> performance hit and often poor results (massive recall, poor
> precision) to the user!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Nov 28, 2016 at 6:24 AM, hariram ravichandran
>  wrote:
> > I need to perform fuzzy search for the whole search term. I
> > extended MultiFieldQueryParser and overrode getFieldQuery():
> >
> > protected Query getFieldQuery(String field, String fieldText, boolean quoted)
> >     throws ParseException {
> >   return super.getFuzzyQuery(field, fieldText, 3.0f); // constructing a fuzzy query
> > }
> >
> > For example, if I give the search term "(apple AND orange) OR (mango)",
> > the query should be expanded to "(apple~ AND orange~) OR (mango~)".
> >
> > I need to search in multiple fields, and I need to implement this
> > without affecting any of the Lucene features. Is there any other
> > simple way?
>


Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-30 Thread Chitra R
Thank you so much, Shai...

Chitra

On Wed, Nov 30, 2016 at 2:17 PM, Shai Erera  wrote:

> This feature is not available in Lucene currently, but it shouldn't be hard
> to add it. See Mike's comment here:
> http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html?showComment=1412777154420#c363162440067733144
>
> One more tricky (yet nicer) feature would be to have it all in one go, i.e.
> you'd say something like "facet on field price" and you'd get "interesting"
> buckets, per the variance in the results.
>
> But before that, we could have a StatsFacets in Lucene which provide some
> statistics about a numeric field (min/max/avg etc.).
>
> On Wed, Nov 30, 2016 at 7:50 AM Chitra R  wrote:
>
> > Thank you so much, mike... Hope, gained a lot of stuff on Doc
> > Values faceting and also clarified all my doubts. Thanks..!!
> >
> >
> > *Another use case:*
> >
> > After getting matching documents for the given query, is there any way to
> > calculate min and max values on a NumericDocValuesField (say, a date field)?
> >
> >
> > I would like to implement it in numeric range faceting by splitting the
> > numeric values (getting from resulted documents) into ranges.
> >
> >
> > Chitra
> >
> >
> > On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> > > Doc values fields are never loaded into memory; at most some small
> > > index structures are.
> > >
> > > When you use those fields, the bytes (for just the one doc values
> > > field you are using) are pulled from disk, and the OS will cache them
> > > in memory if available.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Mon, Nov 28, 2016 at 6:01 AM, Chitra R 
> wrote:
> > > > Hi,
> > > >  When opening SortedSetDocValuesReaderState at search time,
> > > whether
> > > > the whole doc value files (.dvd & .dvm) information are loaded in
> > memory
> > > or
> > > > specified field information(say $facets field) alone load in memory?
> > > >
> > > >
> > > >
> > > >
> > > > Any help is much appreciated.
> > > >
> > > >
> > > > Regards,
> > > > Chitra
> > > >
> > > > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R 
> > wrote:
> > > >>
> > > >>
> > > >> Kindly post your suggestions.
> > > >>
> > > >> Regards,
> > > >> Chitra
> > > >>
> > > >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R 
> > > wrote:
> > > >>>
> > > >>> Hey, I got it clearly. Thank you so much. Could you please help us
> to
> > > >>> implement it in our use case?
> > > >>>
> > > >>>
> > > >>> In our case, we are having dynamic index and it is variable depth
> > too.
> > > So
> > > >>> flat facet is enough. No need of hierarchical facets.
> > > >>>
> > > >>> What I think is,
> > > >>>
> > > >>> Index my facet field as normal doc value field, so that no special
> > > >>> operation (like taxonomy and sorted set doc values facet field)
> will
> > > be done
> > > >>> at index time and only doc value field stores its ordinals in their
> > > >>> respective field.
> > > >>> At search time, I will pass query (user search query) , filter
> (path
> > > >>> traversed list)  and collect the matching documents in
> > Facetscollector.
> > > >>> To compute facet count for the specific field, I will gather those
> > > >>> resulted docs, then move through each segment for collecting the
> > > matching
> > > >>> ordinals using AtomicReader.
> > > >>>
> > > >>>
> > > >>> And know when I use this means, can't calculate facet count for
> more
> > > than
> > > >>> one field(facet) in a search.
> > > >>>
> > > >>> Instead of loading all the dimensions in DocValuesReaderState (will
> > > take
> > > >>> more time and memory) at search time, loading specific fields will
> > > take less
> > > >>> time and memory, hope so. Kindly help to solve.
> > > >>>
> > > >>>
> > > >>> It will do it in a minimal index and search cost, I think. And hope
> > > this
> > > >>> won't put overload at index time, also at search time this will be
> > > better.
> > > >>>
> > > >>>
> > > >>> Kindly post your suggestions.
> > > >>>
> > > >>>
> > > >>> Regards,
> > > >>> Chitra
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless
> > > >>>  wrote:
> > > 
> > >  I think you've summed up exactly the differences!
> > > 
> > >  And, yes, it would be possible to emulate hierarchical facets on
> top
> > >  of flat facets, if the hierarchy is fixed depth like
> year/month/day.
> > > 
> > >  But if it's variable depth, it's trickier (but I think still
> > >  possible).  See e.g. the Committed Paths drill-down on the left, on
> > >  our dog-food server
> > >  http://jirasearch.mikemccandless.com/search.py?index=jira

Re: Faceting : what are the limitations of Taxonomy (Separate index and hierarchical facets) and SortedSetDocValuesFacetField ( flat facets and no sidecar index) ?

2016-11-30 Thread Shai Erera
This feature is not available in Lucene currently, but it shouldn't be hard
to add it. See Mike's comment here:
http://blog.mikemccandless.com/2013/05/dynamic-faceting-with-lucene.html?showComment=1412777154420#c363162440067733144

One more tricky (yet nicer) feature would be to have it all in one go, i.e.
you'd say something like "facet on field price" and you'd get "interesting"
buckets, per the variance in the results.

But before that, we could have a StatsFacets in Lucene which provides some
statistics about a numeric field (min/max/avg etc.).
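
On the min/max question Chitra asks below: until something like StatsFacets
exists, one rough sketch (assuming the Lucene 6.x random-access doc values API,
a hypothetical helper class name, and a NumericDocValuesField such as the date
field mentioned below) is to reuse the FacetsCollector's matching docs and scan
the per-segment NumericDocValues:

import java.io.IOException;

import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsCollector.MatchingDocs;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class NumericFieldStats {

  /** Returns {min, max} of the given NumericDocValuesField over the docs matching 'query'. */
  public static long[] minMax(IndexSearcher searcher, Query query, String field)
      throws IOException {
    FacetsCollector fc = new FacetsCollector();
    searcher.search(query, fc); // collects matching docs per segment

    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (MatchingDocs hits : fc.getMatchingDocs()) {
      NumericDocValues values = hits.context.reader().getNumericDocValues(field);
      DocIdSetIterator docs = (values == null) ? null : hits.bits.iterator();
      if (docs == null) {
        continue; // this segment has no values or no hits for the field
      }
      for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
        long v = values.get(doc); // Lucene 6.x random-access doc values API
        min = Math.min(min, v);
        max = Math.max(max, v);
      }
    }
    return new long[] { min, max };
  }
}

The resulting min/max could then be used to build LongRange buckets for range
faceting (e.g. with LongRangeFacetCounts) over the same FacetsCollector.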

On Wed, Nov 30, 2016 at 7:50 AM Chitra R  wrote:

> Thank you so much, mike... Hope, gained a lot of stuff on Doc
> Values faceting and also clarified all my doubts. Thanks..!!
>
>
> *Another use case:*
>
> After getting matching documents for the given query, is there any way to
> calculate min and max values on a NumericDocValuesField (say, a date field)?
>
>
> I would like to implement it in numeric range faceting by splitting the
> numeric values (getting from resulted documents) into ranges.
>
>
> Chitra
>
>
> On Wed, Nov 30, 2016 at 3:51 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > Doc values fields are never loaded into memory; at most some small
> > index structures are.
> >
> > When you use those fields, the bytes (for just the one doc values
> > field you are using) are pulled from disk, and the OS will cache them
> > in memory if available.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Mon, Nov 28, 2016 at 6:01 AM, Chitra R  wrote:
> > > Hi,
> > >  When opening SortedSetDocValuesReaderState at search time,
> > whether
> > > the whole doc value files (.dvd & .dvm) information are loaded in
> memory
> > or
> > > specified field information(say $facets field) alone load in memory?
> > >
> > >
> > >
> > >
> > > Any help is much appreciated.
> > >
> > >
> > > Regards,
> > > Chitra
> > >
> > > On Tue, Nov 22, 2016 at 5:47 PM, Chitra R 
> wrote:
> > >>
> > >>
> > >> Kindly post your suggestions.
> > >>
> > >> Regards,
> > >> Chitra
> > >>
> > >>
> > >> On Sat, Nov 19, 2016 at 1:38 PM, Chitra R 
> > wrote:
> > >>>
> > >>> Hey, I got it clearly. Thank you so much. Could you please help us to
> > >>> implement it in our use case?
> > >>>
> > >>>
> > >>> In our case, we are having dynamic index and it is variable depth
> too.
> > So
> > >>> flat facet is enough. No need of hierarchical facets.
> > >>>
> > >>> What I think is,
> > >>>
> > >>> Index my facet field as normal doc value field, so that no special
> > >>> operation (like taxonomy and sorted set doc values facet field) will
> > be done
> > >>> at index time and only doc value field stores its ordinals in their
> > >>> respective field.
> > >>> At search time, I will pass query (user search query) , filter (path
> > >>> traversed list)  and collect the matching documents in
> Facetscollector.
> > >>> To compute facet count for the specific field, I will gather those
> > >>> resulted docs, then move through each segment for collecting the
> > matching
> > >>> ordinals using AtomicReader.
> > >>>
> > >>>
> > >>> And know when I use this means, can't calculate facet count for more
> > than
> > >>> one field(facet) in a search.
> > >>>
> > >>> Instead of loading all the dimensions in DocValuesReaderState (will
> > take
> > >>> more time and memory) at search time, loading specific fields will
> > take less
> > >>> time and memory, hope so. Kindly help to solve.
> > >>>
> > >>>
> > >>> It will do it in a minimal index and search cost, I think. And hope
> > this
> > >>> won't put overload at index time, also at search time this will be
> > better.
> > >>>
> > >>>
> > >>> Kindly post your suggestions.
> > >>>
> > >>>
> > >>> Regards,
> > >>> Chitra
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Fri, Nov 18, 2016 at 7:15 PM, Michael McCandless
> > >>>  wrote:
> > 
> >  I think you've summed up exactly the differences!
> > 
> >  And, yes, it would be possible to emulate hierarchical facets on top
> >  of flat facets, if the hierarchy is fixed depth like year/month/day.
> > 
> >  But if it's variable depth, it's trickier (but I think still
> >  possible).  See e.g. the Committed Paths drill-down on the left, on
> >  our dog-food server
> >  http://jirasearch.mikemccandless.com/search.py?index=jira
> > 
> >  Mike McCandless
> > 
> >  http://blog.mikemccandless.com
> > 
> > 
> >  On Fri, Nov 18, 2016 at 1:43 AM, Chitra R 
> > wrote:
> >  > case 1:
> >  > In taxonomy, for each indexed document, examines facet
> > label ,
> >  > computes their