Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-05 Thread Trejkaz
On Wed, Feb 5, 2014 at 4:16 AM, Earl Hood  wrote:
> Our current solution is to do highlighting on the client-side.  When
> search happens, the search results from the server includes the parsed
> query terms so the client has an idea of which terms to highlight vs
> trying to reimplement a complete query string parser in the client.
>
> A problem is that Lucene (we are still on v3.0.3) does not provide a
> robust mechanism for extracting the terms of a query.  The following is
> the utility method that the server uses to get the terms needed to
> support client-side highlighting:
>
>   public static Set extractTermsFromQuery(
>   Query q,
>   IndexReader r,
>   Set terms
>   )
[ ... ]

This is very similar to what we're doing now, actually.

It does avoid the mess with having to double-parse the query, but the
catch is we still have to double-parse the text (and the text is
nearly always larger.)

Then there are all the special cases for fuzzy queries, regex queries,
phrase queries and the like: having to dig inside queries to pull out
filters, and sometimes having to dig inside filters to pull out queries
(I had to modify the Lucene API here and there to make more of it
public, as I recall!)

I just thought it would be nice to be able to find all the matches,
pull just those bits of the text somehow and display them without
reading the rest of the text. At least, without reading the rest of
the text all the time. I think I would have to store something about
where the lines wrap in the database in order to really avoid reading
all the text. :/
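
A minimal sketch of the "store where the lines wrap" idea, assuming the line-start offsets and the match offsets were recorded at index time. `SnippetExtractor` and `lineContaining` are hypothetical names, not part of Lucene, and the text is held in a String only to keep the example short; a real version would seek into a file or blob instead.

```java
import java.util.Arrays;

/** Hypothetical helper, not a Lucene class. */
final class SnippetExtractor {
    private SnippetExtractor() {}

    /**
     * Returns the line that contains the character at matchStart.
     * lineStarts holds the offset of the first character of every line,
     * in ascending order, with lineStarts[0] == 0 (assumed to have been
     * stored when the text was indexed).
     */
    static String lineContaining(String text, int[] lineStarts, int matchStart) {
        int idx = Arrays.binarySearch(lineStarts, matchStart);
        if (idx < 0) {
            // Not an exact line start: take the line that begins just before it.
            idx = -idx - 2;
        }
        if (idx < 0) {
            throw new IllegalArgumentException("match offset before first line");
        }
        int from = lineStarts[idx];
        int to = (idx + 1 < lineStarts.length) ? lineStarts[idx + 1] : text.length();
        return text.substring(from, to);
    }
}
```

With offsets like these, only the lines that actually contain a match need to be read for display.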

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-05 Thread Olivier Binda

On 02/05/2014 01:05 AM, Michael Sokolov wrote:

On 2/4/2014 2:50 PM, Earl Hood wrote:

On Tue, Feb 4, 2014 at 1:16 PM, Michael Sokolov wrote:

You might be interested in looking at Lux, which layers XML services like
XQuery on top of Lucene and Solr, and includes an XML-aware highlighter:
https://github.com/msokolov/lux/blob/master/src/main/java/lux/search/highlight/XmlHighlighter.java


I am aware of Lux, but moving to use it would be a major redesign effort
for the project I am on, something that likely would not get management
approval.

BTW, just within the scope of the class you cite, doing a quick look at
it, it looks like I may have to modify highlighting code behavior to
support how the project I am on transforms the XML data.  Example: we deal
with attribute data that gets transformed to render content in the HTML
served to the client, and the highlighting code cited does not appear to
handle XML attributes.

There are other technical challenges also due to the nature of the
project.  There may be ways to deal with the challenges, but any further
analysis is not worth it if there is never any approval for me to pursue
a redesign for the project.

Thanks for the feedback.  I think it's difficult to know what to do 
about attribute value highlighting in the general case - do you have 
any suggestions?




Would it be possible/interesting to have an interface that lets the
caller decide what to do with the attribute?


This has nothing to do with XML highlighting but, on Android, I had to
hack the highlighter class a bit (nothing comparable to what is suggested
here) to enable it to directly produce an Android Spanned String
(basically a String with spans attached to it) instead of producing a
String with HTML content (that must then be parsed into an Android
Spanned String)... The Formatter class is nice, but maybe it could be
improved a bit to be more flexible.
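
To make the idea concrete, here is a rough sketch of what a more flexible output hook could look like: the renderer walks the match spans and a caller-supplied sink decides what to build. All names here (`MatchSpan`, `SpanSink`, `HtmlSink`, `SpanRenderer`) are hypothetical and are not Lucene's Formatter API; an Android sink would build a Spanned String in `match` instead of appending tags.

```java
import java.util.List;

/** Hypothetical match span: a [start, end) character range in the text. */
final class MatchSpan {
    final int start;
    final int end;
    MatchSpan(int start, int end) { this.start = start; this.end = end; }
}

/** Hypothetical callback: the caller decides what highlighted output is. */
interface SpanSink<T> {
    void text(String plain);     // text outside any match
    void match(String matched);  // text inside a match
    T result();
}

/** One possible sink: plain HTML output (no escaping, for brevity). */
final class HtmlSink implements SpanSink<String> {
    private final StringBuilder sb = new StringBuilder();
    public void text(String plain) { sb.append(plain); }
    public void match(String matched) { sb.append("<b>").append(matched).append("</b>"); }
    public String result() { return sb.toString(); }
}

final class SpanRenderer {
    private SpanRenderer() {}

    /** Spans are assumed to be sorted by start and non-overlapping. */
    static <T> T render(String text, List<MatchSpan> spans, SpanSink<T> sink) {
        int pos = 0;
        for (MatchSpan s : spans) {
            if (s.start > pos) {
                sink.text(text.substring(pos, s.start));
            }
            sink.match(text.substring(s.start, s.end));
            pos = s.end;
        }
        if (pos < text.length()) {
            sink.text(text.substring(pos));
        }
        return sink.result();
    }
}
```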



Olivier


-Mike




Re: question about using lucene on large documents

2014-02-05 Thread Michael Sokolov
No, not really.  What would you do if you had a match contained entirely 
within the overlapping region? You'd probably need a way to distinguish 
that from a term that matched in two adjacent chunks, but *not* in the 
overlap.  Sounds very tricky to me.
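
To illustrate the distinction on plain strings (not on a Lucene index), here is a minimal sketch that assumes each hit can be mapped back to a global character offset. `ChunkedSearch` is a hypothetical name. A hit seen in both chunks of an overlap collapses to one global offset, while two genuine hits in adjacent chunks keep distinct offsets. Getting equivalent offsets out of separately indexed chunks is the tricky part and is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration on plain strings, not a Lucene class. */
final class ChunkedSearch {
    private ChunkedSearch() {}

    /** Searches chunk by chunk and returns the distinct global start offsets of term. */
    static List<Integer> findAll(String text, String term, int chunkSize, int overlap) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("term must not be empty");
        }
        if (chunkSize <= overlap || overlap < term.length() - 1) {
            // The overlap must be able to hold a hit that straddles a chunk boundary.
            throw new IllegalArgumentException("bad chunkSize/overlap");
        }
        TreeSet<Integer> hits = new TreeSet<Integer>();
        int step = chunkSize - overlap;
        for (int chunkStart = 0; chunkStart < text.length(); chunkStart += step) {
            int chunkEnd = Math.min(text.length(), chunkStart + chunkSize);
            String chunk = text.substring(chunkStart, chunkEnd);
            for (int i = chunk.indexOf(term); i >= 0; i = chunk.indexOf(term, i + 1)) {
                // Global offset: the same hit found in two chunks is stored once.
                hits.add(chunkStart + i);
            }
            if (chunkEnd == text.length()) {
                break;
            }
        }
        return new ArrayList<Integer>(hits);
    }
}
```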


-Mike

On 2/5/2014 2:21 AM, mrodent wrote:

Thanks, gives me food for thought.  So no { N, N+1 } ideas specifically...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/question-about-using-lucene-on-large-documents-tp4115343p4115465.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Help retrieving BinaryDocValues

2014-02-05 Thread Xavier Sanchez Loro

Hi,
I have some problems working with BinaryDocValues. The code below works
well with a few thousand documents, but with more than 65000 documents
it does not return the correct BinaryDocValues once the docId (with
docBase rebasing) reaches a certain id. From that point on, it cycles,
returning the BinaryDocValues of the first docs. I'm working with
Lucene/Solr 4.3.


I tested this code indexing 10 documents, each with a
"binary_ids_campaigns" value equal to docId. After docId 65500 approx., it
returns the BinaryDocValues corresponding to the first doc ids. I have
followed the API instructions on how to rebase the docId, but I guess I'm
missing something. If someone could point me in the right direction, I
would really appreciate it.
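
For readers following along, here is a small plain-Java illustration of the rebasing rule, with no Lucene types involved. In Lucene 4.x the id passed to `collect(int doc)` is relative to the current segment, and `setNextReader` supplies that segment's `docBase`. Values read from a top-level view are addressed by `docBase + doc`, while values read from a single segment are addressed by `doc` alone. `DocIdRebase` is a hypothetical name, and nothing here establishes that rebasing causes the problem described. The last method shows a separate Java detail worth checking in any length-prefixed encoding: a `byte` is signed, so a prefix above 127 comes out negative unless it is masked.

```java
import java.util.Arrays;

/** Hypothetical illustration of doc id rebasing, not a Lucene class. */
final class DocIdRebase {
    private DocIdRebase() {}

    /**
     * docBases[i] is the global id of the first document of segment i,
     * in ascending order, with docBases[0] == 0.
     */
    static int segmentOf(int[] docBases, int globalDoc) {
        int idx = Arrays.binarySearch(docBases, globalDoc);
        return idx >= 0 ? idx : -idx - 2;
    }

    /** Global (top-level) id to segment-local id. */
    static int toLocal(int[] docBases, int globalDoc) {
        return globalDoc - docBases[segmentOf(docBases, globalDoc)];
    }

    /** Segment-local id to global (top-level) id. */
    static int toGlobal(int docBase, int localDoc) {
        return docBase + localDoc;
    }

    /** Reads a one-byte length prefix as an unsigned value in 0..255. */
    static int unsignedLength(byte b) {
        return b & 0xFF;
    }
}
```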


Best regards,
Xavier

  public void computeVals(ResponseBuilder rb, SolrCore core,
      final CampaignObserver observer) {

    RefCounted<SolrIndexSearcher> searchHolder = null;
    try {
      searchHolder = core.getNewestSearcher(false);
      AtomicReader reader = searchHolder.get().getAtomicReader();
      SolrIndexSearcher searcher = searchHolder.get();
      idsCampaigns = reader.getBinaryDocValues("binary_ids_campaigns");
      final float[] topscore = new float[]{Float.NEGATIVE_INFINITY};
      CpcCollector delegate = new CpcCollector(reader, topscore,
          observer, maxCpc, idsCampaigns, maxDocCpc);

      DocSet filter = null;
      // Only filter in ppc, not for search; in search only apply sorting
      SolrIndexSearcher.ProcessedFilter pf =
          searcher.getProcessedFilter(filter, rb.getFilters());

      // Check for existing filters, apply them
      if (pf != null && pf.filter != null) {
        searcher.search(rb.getQuery(), pf.filter, delegate);
      } else {
        searcher.search(rb.getQuery(), delegate);
      }
      float[] collectedTopscore = delegate.getTopscore();
      maxOrganicScore = collectedTopscore[0];
      maxCpc = delegate.getMaxCpc();
      if (core.getName().indexOf("ppc") > -1) {
        filter = delegate.getDocSet();
        List<Query> filters = rb.getFilters();
        if (filters == null) {
          filters = new ArrayList<Query>();
        }
        filters.add(new FilteredQuery(rb.getQuery(),
            filter.getTopFilter()));

        rb.setFilters(filters);
      }
    } catch (Exception e) {
      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
          "Error loading FieldCache.Ints for idcampaigns field", e);
    } finally {
      if (searchHolder != null) {
        searchHolder.decref();
      }
    }
  }

  --- Collector code ---

  public class CpcCollector extends Collector {
  private static Logger log = LoggerFactory.getLogger(CpcCollector.class);
  private SortedIntDocSet docSet = null;
  Scorer scorer;
  private final float[] topscore;
  private CampaignObserver observer;
  private float maxCpc;
  private com.carrotsearch.hppc.IntArrayList idDocs;
  private BinaryDocValues values;
  // Maximum cpc per document
  private com.carrotsearch.hppc.IntFloatOpenHashMap maxDocCpc;

  private int docBase = 0;

  /**
   *
   * @param reader
   * @param topscore
   * @param observer
   * @param maxCpc
   * @param values
   * @param maxDocCpc
   */
  public CpcCollector(IndexReader reader, final float[] topscore,
      CampaignObserver observer, float maxCpc, BinaryDocValues values,
      com.carrotsearch.hppc.IntFloatOpenHashMap maxDocCpc) {

    this.topscore = topscore;
    this.observer = observer;
    this.maxCpc = maxCpc;
    idDocs = new com.carrotsearch.hppc.IntArrayList();
    this.maxDocCpc = maxDocCpc;
    this.values = values;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    this.scorer = scorer;
  }

  @Override
  public void collect(int doc) throws IOException {
    float score = scorer.score();
    if (score > getTopscore()[0]) {
      topscore[0] = score;
    }
    BytesRef term = new BytesRef();
    values.get(doc + docBase, term);
    int size = (int) term.bytes[term.offset] * 4 + 1;
    byte[] docValues = new byte[size];
    ByteBuffer.wrap(term.bytes, term.offset, size).get(docValues, 0, size);
    int[] campIds = observer.parseBinaryIdsOldSkoolWayArray(docValues);
    if (campIds != null) {
      float cpc = observer.getMaxActiveCpc(campIds);
      getMaxDocCpc().put(doc + docBase, cpc);
      if (cpc > 0) {
        if (cpc > getMaxCpc()) {
          maxCpc = cpc;
        }
        // active campaign
        idDocs.add(doc + docBase);
      }
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // could be true
  }

  @Override
  public void setNextReader(AtomicReaderContext context)
      throws IOException {
    this.docBase = context.docBase;
  }

  /**
   * @return the topscore
   */
  public float[] getTopscore() {
    return topscore;
  }

  /**
   * @return the maxCpc
   */
  public float getMaxCpc() {
    return maxCpc;
  }

  /**
   * @return the docSet
   */
  public SortedIntDocSet getDocSet() {
    docSet = new SortedIntDocSet(idDocs.toArray());
    return d

[REMINDER] ApacheCon NA 2014 Travel Assistance Applications Due Feb 7

2014-02-05 Thread Chris Hostetter


(NOTE: cross posted, if you feel the need to reply, please keep it on 
general@lucene)


As a reminder, Travel Assistance Applications for ApacheCon NA 2014 are 
due on Feb 7th (about 48 hours from now)


Details are below. Please note that if you have any questions about this
program or the application, they should be addressed to
travel-assista...@apache.org


-Hoss
http://www.lucidworks.com/


-- Forwarded message --
Date: Wed, Jan 15, 2014 at 4:41 PM
Subject: ApacheCon NA 2014 Travel Assistance Applications now open!
Reply-To: travel-assista...@apache.org


The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon North America 2014 are now open! This
announcement is meant for you (pmcs@) to let members of your community know
about both ApacheCon NA 2014 and the TAC assistance available to attend.
Could you please forward this announcement to your community, along with,
if possible, information on how your project is involved in ApacheCon
this year?

ApacheConNA will be held in Denver, Colorado, April 7-9, 2014.

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons. For more info on this year's
applications and qualifying criteria please visit the TAC website at <
http://www.apache.org/travel/ >.   Applications are already open, so don't
delay!

*The important date*...

   - Friday February 7th 2014 - TAC applications close.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as required
to efficiently and accurately process your request). This will enable TAC
to announce successful awards shortly afterwards.

As usual TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.

We look forward to greeting everyone in Denver, Colorado in April.

Kind Regards

Lewis

(On behalf of the Travel Assistance Committee)





Wildcard searches

2014-02-05 Thread raghavendra.k.rao
Hi,

Can Lucene support wildcard searches such as the ones shown below?

Indexed value is "XYZ CORPORATION LIMITED".

XYZ CORPORATION LIMI*
XYZ CORPORATION *MIT*
XYZ *PORAT* LIMI*
*YZ CORPO* LIMITE*

In other words, the flexibility for the user to provide a wild card at any 
position, in a situation where they aren't sure about the exact value. Ignoring 
the performance aspect, please suggest if it is even possible. If yes, please 
provide further inputs on how to approach it such as Analyzer / Tokenizer to 
consider, whether PhraseQueries can be formed etc.

Any input is greatly appreciated.

Regards,
Raghu


___

This message is for information purposes only, it is not a recommendation, 
advice, offer or solicitation to buy or sell a product or service nor an 
official confirmation of any transaction. It is directed at persons who are 
professionals and is not intended for retail customer use. Intended for 
recipient only. This message is subject to the terms at: 
www.barclays.com/emaildisclaimer.

For important disclosures, please see: 
www.barclays.com/salesandtradingdisclaimer regarding market commentary from 
Barclays Sales and/or Trading, who are active market participants; and in 
respect of Barclays Research, including disclosures relating to specific 
issuers, please see http://publicresearch.barclays.com.

___


Re: Wildcard searches

2014-02-05 Thread Jack Krupansky

Take a look at the complex phrase query parser.

See:
http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html

See also:
https://issues.apache.org/jira/browse/LUCENE-1486

-- Jack Krupansky

-----Original Message-----
From: raghavendra.k@barclays.com

Sent: Wednesday, February 5, 2014 6:30 PM
To: java-user@lucene.apache.org
Subject: Wildcard searches

Hi,

Can Lucene support wildcard searches such as the ones shown below?

Indexed value is "XYZ CORPORATION LIMITED".

XYZ CORPORATION LIMI*
XYZ CORPORATION *MIT*
XYZ *PORAT* LIMI*
*YZ CORPO* LIMITE*

In other words, the flexibility for the user to provide a wild card at any 
position, in a situation where they aren't sure about the exact value. 
Ignoring the performance aspect, please suggest if it is even possible. If 
yes, please provide further inputs on how to approach it such as Analyzer / 
Tokenizer to consider, whether PhraseQueries can be formed etc.


Any input is greatly appreciated.

Regards,
Raghu





Re: Wildcard searches

2014-02-05 Thread Michael Sokolov

On 2/5/2014 6:30 PM, raghavendra.k@barclays.com wrote:

Hi,

Can Lucene support wildcard searches such as the ones shown below?

Indexed value is "XYZ CORPORATION LIMITED".

If you index the value as a single token (KeywordTokenizer), there is 
nothing really special about the examples you gave, except for the 
leading * which people optimize by reversing (ReverseStringFilter) and 
then reversing the query to turn it into a trailing wildcard.
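
A toy sketch of why the reversal helps, using a sorted set as a stand-in for a term dictionary. `ReversedTermIndex` is a hypothetical name and this is not how Lucene's ReverseStringFilter is wired up; it only shows that a leading wildcard such as *LIMITED becomes a cheap prefix scan once both the stored terms and the query are reversed. Patterns with a wildcard at both ends (such as *MIT*) do not get this benefit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical stand-in for a term dictionary holding reversed terms. */
final class ReversedTermIndex {
    private final TreeSet<String> reversedTerms = new TreeSet<String>();

    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    /** Answers a leading-wildcard query "*suffix" with a prefix scan. */
    List<String> endsWith(String suffix) {
        String prefix = reverse(suffix);
        List<String> out = new ArrayList<String>();
        for (String r : reversedTerms.tailSet(prefix, true)) {
            if (!r.startsWith(prefix)) {
                break; // sorted order: past the contiguous run of matches
            }
            out.add(reverse(r));
        }
        return out;
    }

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
```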


-Mike




Re: Highlighting text, do I seriously have to reimplement this from scratch?

2014-02-05 Thread Earl Hood
On Tue, Feb 4, 2014 at 6:05 PM, Michael Sokolov wrote:

> Thanks for the feedback.  I think it's difficult to know what to do about
> attribute value highlighting in the general case - do you have any
> suggestions?

That is a challenging one since one has to know how attribute data will
be transformed for rendering purposes.

I do not know the workings of Lux, so I cannot provide any specific
suggestions on what Lux can do.  I would need time to dive into it.

However, one solution is to work around the limitation by preprocessing
the data into a form that is friendly to Lux (or at least the highlighter).
For example, if I have attribute data I know will be transformed into
renderable content, I would transform it into element-style content,
which should be more friendly for indexing and highlighting purposes.
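
As a rough sketch of that kind of preprocessing, using only the JDK's DOM and transformer APIs: the named attribute on each element is moved into a new first child element of the same name. `AttributeLifter` is a hypothetical name, and this assumes the attribute name is also a valid, non-conflicting element name, which will not hold for every schema.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical preprocessing step, not part of Lux or Lucene. */
final class AttributeLifter {
    private AttributeLifter() {}

    /** Moves attribute attrName of every element into a first child element. */
    static String liftAttribute(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList all = doc.getElementsByTagName("*");
            // Snapshot first: the NodeList is live and elements are added below.
            Element[] elements = new Element[all.getLength()];
            for (int i = 0; i < elements.length; i++) {
                elements[i] = (Element) all.item(i);
            }
            for (Element e : elements) {
                if (!e.hasAttribute(attrName)) {
                    continue;
                }
                Element child = doc.createElement(attrName);
                child.setTextContent(e.getAttribute(attrName));
                e.removeAttribute(attrName);
                e.insertBefore(child, e.getFirstChild());
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("could not transform XML", ex);
        }
    }
}
```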

--ewh




Lucene 4.0 chokes on multiple requests

2014-02-05 Thread saisantoshi
We recently upgraded to Lucene 4.0 and found performance issues when
searching. Upon some analysis, we found that it chokes when multiple
search requests come in concurrently.

User1 -> search
User2 -> search
User3 -> search


The search request from User1 is still waiting to finish while those from
User2 and User3 have finished. Is anyone else facing performance issues
on Lucene 4.0?

Thanks,
Sai.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-4-0-chokes-on-multiple-requests-tp4115776.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
