custom token filter generates empty tokens

2014-10-09 Thread G.Long

Hi :)

I wrote a custom token filter which removes special characters. 
Sometimes, all characters of a token are removed, so the filter 
produces an empty token. I would like to remove this token from the 
token stream, but I'm not sure how to do that.


Is there something missing in my custom token filter or do I need to 
chain another custom token filter to remove empty tokens?


Regards :)

ps:

This is the code of my custom filter:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SpecialCharFilter extends TokenFilter {

    private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);

    protected SpecialCharFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final char[] buffer = termAtt.buffer();
        final int length = termAtt.length();
        final char[] newBuffer = new char[length];

        // Copy everything except the filtered characters.
        int newIndex = 0;
        for (int i = 0; i < length; i++) {
            if (!isFilteredChar(buffer[i])) {
                newBuffer[newIndex++] = buffer[i];
            }
        }

        // Use only the filled part of the buffer; new String(newBuffer)
        // would also include the trailing '\0' padding.
        final String term = new String(newBuffer, 0, newIndex).trim();
        final char[] characters = term.toCharArray();
        termAtt.setEmpty();
        termAtt.copyBuffer(characters, 0, characters.length);

        // Note: this still returns true even when the term is empty --
        // which is exactly the problem described above.
        return true;
    }

    // Placeholder: the original implementation was not shown. Here any
    // non-alphanumeric character is treated as "special".
    private boolean isFilteredChar(char c) {
        return !Character.isLetterOrDigit(c);
    }
}

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: custom token filter generates empty tokens

2014-10-09 Thread Jose Fernandez
When you return true from incrementToken(), you tell Lucene to add the current 
token to the token stream. So at the end of incrementToken(), check for an empty 
term; if it is empty, call incrementToken() again (a loop is safer than recursion 
for long runs of empty tokens) to process the next token instead of emitting it. 
This will affect your positions, so if you're doing phrase search you will need to 
adjust the position increment attribute to account for the now-removed token.
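Jose's loop-and-skip advice can be illustrated without any Lucene dependency. This is a minimal sketch: the `Token` record and `skipEmpties` helper are illustrative stand-ins for the token stream, not Lucene API, and the position bookkeeping mirrors what you would do with `PositionIncrementAttribute`:

```java
import java.util.ArrayList;
import java.util.List;

public class SkipEmptySketch {

    // Stand-in for a (term, positionIncrement) pair on the token stream.
    record Token(String term, int posInc) {}

    // Drop empty terms; each dropped token's position increment is folded
    // into the next surviving token, so phrase positions stay correct.
    static List<Token> skipEmpties(List<Token> in) {
        List<Token> out = new ArrayList<>();
        int carried = 0;
        for (Token t : in) {
            if (t.term().isEmpty()) {
                carried += t.posInc();  // remember the hole we left behind
            } else {
                out.add(new Token(t.term(), t.posInc() + carried));
                carried = 0;
            }
        }
        return out;
    }
}
```

In a real TokenFilter the same logic lives inside incrementToken(): loop over input.incrementToken(), accumulate the skipped increments, and set them on the next non-empty token before returning true.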

-Original Message-
From: G.Long [mailto:jde...@gmail.com] 
Sent: Thursday, October 09, 2014 7:54 AM
To: java-user@lucene.apache.org
Subject: custom token filter generates empty tokens




Re: custom token filter generates empty tokens

2014-10-09 Thread Ahmet Arslan
Hi G.Long,

You can use TrimFilter+LengthFilter to remove empty/whitespace tokens.


Ahmet
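Ahmet's chain boils down to "trim, then drop what became empty". A plain-Java sketch of that behavior (wiring the real filters would look roughly like `new LengthFilter(new TrimFilter(stream), 1, Integer.MAX_VALUE)`, though the constructor signatures vary across Lucene versions, so check yours):

```java
import java.util.List;
import java.util.stream.Collectors;

public class TrimThenLength {

    // Mirrors TrimFilter (strip surrounding whitespace) followed by
    // LengthFilter with min=1 (drop tokens that ended up empty).
    static List<String> trimThenDropEmpty(List<String> tokens) {
        return tokens.stream()
                .map(String::trim)             // TrimFilter
                .filter(t -> !t.isEmpty())     // LengthFilter(1, MAX)
                .collect(Collectors.toList());
    }
}
```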

On Thursday, October 9, 2014 5:54 PM, G.Long wrote:



Re: Delete / Update facets from taxonomy index

2014-10-09 Thread Shai Erera
Hi

You cannot remove facets from the taxonomy index, but you can reindex a
single document and update its facets. This will add new facets to the
taxonomy index (if they do not already exist). You do that just like you
reindex any document, by calling IndexWriter.updateDocument(). Just make
sure to rebuild the document with FacetsConfig.

Shai
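Shai's point is that the taxonomy only ever grows, while a document's own facets can be replaced wholesale. A toy model of that behavior (in real code you would call IndexWriter.updateDocument(idTerm, facetsConfig.build(taxoWriter, doc)); the classes below are illustrative stand-ins only):

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class TaxonomyModel {

    final Set<String> taxonomy = new LinkedHashSet<>(); // grows, never shrinks
    final Map<String, Set<String>> docFacets = new HashMap<>();

    // "Reindexing" a document: its facets are replaced on the document
    // side, and any new categories are added to the taxonomy, but old
    // categories are never removed from it.
    void updateDocument(String docId, Set<String> facets) {
        taxonomy.addAll(facets);
        docFacets.put(docId, facets);
    }
}
```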

On Tue, Oct 7, 2014 at 12:42 AM, wesli  wrote:

> I'm using Lucene for a full-text search on an online store.
> I've built an indexer program which creates a Lucene index and a taxonomy
> index. The taxonomy index contains facets with categories and article
> features (like color, brand, etc.).
> Is it possible to re-add or update single document facets? E.g. the shop
> owner changes the category of an article or some feature (like the color).
> As I read in the documentation, the taxonomy index can be rebuilt, but it
> is not possible to re-add (delete and add) facets.
> I don't want to rebuild the whole taxonomy index each time a single
> article's (document's) facets change.
> Is there another solution to update the taxonomy index?
> I'm using Lucene 4.10.
>
> Regards
>


Re: topdocs per facet

2014-10-09 Thread Shai Erera
The facets translation should be done at the application level. So if you
index the dimension A with two facets A/A1 and A/A2, where A1 should also be
translated to B1 and A2 translated to B2, there are several options:

Index the dimensions A and B with their respective facets, and count the
relevant dimension based on the user's locale. Then the user can drill-down
on any of the returned facets easily. I'd say that if your index and/or
taxonomy aren't big, this is the easiest solution and most straightforward
to implement.

Another way is to index the facets Root/R1 and Root/R2, which are
language-independent. At the application level you translate Root/R1 to
either A/A1 or B/B1 based on the user's locale. You then also do the reverse
translation when the user drills down: e.g. if the user clicked A/A1,
you translate that to Root/R1 and drill down on that. If your application
is UI based, you can probably return e.g. a JSON construct which contains
the labels to display plus the facet values to drill down by, and then you
don't need to do any reverse translation.

As for retrieving a document's facets, you can either index them as
separate StoredFields (easy), or use DocValuesOrdinalsReader to traverse
the facets list along with the MatchingDocs, read the facet ordinals and
translate them. If it sounds complex, just use StoredFields :).

Shai
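The second option (neutral facet paths plus application-level translation) can be sketched like this; the label tables and category names are made up for illustration:

```java
import java.util.Map;

public class FacetI18n {

    // Hypothetical label tables: locale -> (neutral facet path -> label).
    static final Map<String, Map<String, String>> LABELS = Map.of(
            "en", Map.of("Root/R1", "Fiction", "Root/R2", "Non-fiction"),
            "de", Map.of("Root/R1", "Belletristik", "Root/R2", "Sachbuch"));

    // Forward translation: what the user sees in the facet list.
    static String display(String locale, String path) {
        return LABELS.get(locale).get(path);
    }

    // Reverse translation for drill-down: localized label -> neutral path.
    static String drillDownPath(String locale, String label) {
        return LABELS.get(locale).entrySet().stream()
                .filter(e -> e.getValue().equals(label))
                .map(Map.Entry::getKey)
                .findFirst().orElseThrow();
    }
}
```

If the UI returns both the label and the neutral path (e.g. in a JSON payload), the reverse lookup is unnecessary, as Shai notes.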

On Mon, Sep 29, 2014 at 7:15 PM, Jürgen Albert 
wrote:

> Hi,
>
> I'm currently implementing Lucene facets in version 4.8.1, and two
> questions remain for me:
>
> 1. Is there an easy way to have translations for the facets? If we use e.g.
> the books example, the user should see the translation. But if he clicks on
> a link, the English value should be used for the search. Thus I have to
> return both the facet translation and the actual value with the search.
> 2. Is there a possibility to get the docs per facet?
>
> As an example I have e.g. a DrillDownQuery returning 5 docs and 2
> dimensions with 2 facets each. I guess the solution is somewhere in the
> MatchingDocs. If I try:
>
> List<MatchingDocs> matchingDocs = facetsCollector.getMatchingDocs();
>
> for (MatchingDocs doc : matchingDocs) {
>     DocIdSet docSet = doc.bits;
>     DocIdSetIterator iterator = docSet.iterator();
>     int docId = iterator.nextDoc();
>     while (docId != DocIdSetIterator.NO_MORE_DOCS) {
>         Document document = doc.context.reader().document(docId);
>         System.out.println(document.toString());
>         docId = iterator.nextDoc();
>     }
> }
>
> result:
>
> A list with as many MatchingDocs as dimensions, but only one of the
> MatchingDocs gives me my docs at all. I can't see how to get the docs per
> facet, nor how to get the facets of a doc.
>
> What do I miss?
>
> Thx,
>
> Jürgen Albert.
>
> --
> Jürgen Albert
> Geschäftsführer
>
> Data In Motion UG (haftungsbeschränkt)
>
> Kahlaische Str. 4
> 07745 Jena
>
> Mobil:  0157-72521634
> E-Mail: j.alb...@datainmotion.de
> Web: www.datainmotion.de
>
> XING:   https://www.xing.com/profile/Juergen_Albert5
>
> Rechtliches
>
> Jena HBR 507027
> USt-IdNr: DE274553639
> St.Nr.: 162/107/04586
>
>


Re: Exception from FastTaxonomyFacetCounts

2014-10-09 Thread Shai Erera
This usually means that your IndexReader and TaxonomyReader are out of
sync. That is, the IndexReader sees category ordinals that the
TaxonomyReader does not yet see.

Do you use SearcherTaxonomyManager in your application? It ensures that the
two are always in sync, i.e. reopened together and that your application
always sees a consistent view of the two.

Shai
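The guarantee SearcherTaxonomyManager gives can be modeled as publishing both views in one atomic swap, so a reader can never see a new index paired with an old taxonomy. This is not the real API (which hands back an IndexSearcher/TaxonomyReader pair from acquire()); it is only a sketch of the invariant:

```java
import java.util.concurrent.atomic.AtomicReference;

public class SearcherAndTaxonomy {

    // Immutable pair: both "generations" always travel together.
    record Pair(int indexGen, int taxoGen) {}

    private final AtomicReference<Pair> current =
            new AtomicReference<>(new Pair(0, 0));

    // Like maybeRefresh(): both views advance in a single swap.
    void refresh() {
        Pair p = current.get();
        current.set(new Pair(p.indexGen() + 1, p.taxoGen() + 1));
    }

    // A consistent snapshot of index + taxonomy.
    Pair acquire() { return current.get(); }
}
```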

On Tue, Oct 7, 2014 at 10:03 AM, Jigar Shah  wrote:

> Intermittently while searching I am getting this exception on a huge index.
> (The FacetsConfig used while indexing and searching is the same.)
>
> java.lang.ArrayIndexOutOfBoundsException: 252554
> 06:28:37,954 ERROR [stderr] at
> org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.count(FastTaxonomyFacetCounts.java:73)
> 06:28:37,954 ERROR [stderr] at
> org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:49)
> 06:28:37,954 ERROR [stderr] at
> org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts.<init>(FastTaxonomyFacetCounts.java:39)
> 06:28:37,954 ERROR [stderr] at
> com.company.search.CustomDrillSideways.buildFacetsResult(LuceneDrillSideways.java:41)
> 06:28:37,954 ERROR [stderr] at
> org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:146)
> 06:28:37,955 ERROR [stderr] at
> org.apache.lucene.facet.DrillSideways.search(DrillSideways.java:203)
>
> Thanks,
> Jigar Shah
>


Re: Getting min/max of numeric doc-values facets

2014-10-09 Thread Chris Hostetter

: Is there some way, when a faceted search is executed, to retrieve the
: possible min/max values of a numeric doc-values field with the supplied
: custom ranges (LongRangeFacetCounts), or some other way to do it?
: 
: As I believe this can give the application a hint, so the next search
: request can be much smarter, e.g. custom ranges can be more specific?

You can use the StatsComponent to find out the min/max values of a field 
(constrained by your query, or *:* if you want the min/max across the 
entire index) and then you can use those values in your subsequent 
queries...

https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
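For reference, a StatsComponent request of the shape Hoss describes might look like this (the collection and field names here are hypothetical):

```
http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&stats=true&stats.field=price
```

The response's stats section reports min and max for the field (among other statistics), which the application can feed into the facet.range.start/facet.range.end of the next request.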

It's not currently possible to get the "actual" min/max *within* each 
range bucket of a facet.range (which is what you seem to be asking for, 
although I may be misunderstanding), but it's something being actively 
investigated as part of a larger objective to better integrate stats & 
facets...

https://issues.apache.org/jira/browse/SOLR-6352
https://issues.apache.org/jira/browse/SOLR-6348


-Hoss
http://www.lucidworks.com/



