RE: Search differences between solr 1.4.0 and 3.6.1

2012-11-28 Thread Frederico Azeiteiro
Also, I'm having issues with searching for "RoC". It returns thousands of matches
on 3.6.1 against just a few on Solr 1.4.0.
Looking at the Analysis page I see no differences...

Should I add "RoC" to the protected keywords, or can I tweak something in the
schema to achieve exact "RoC" matches?
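For what it's worth, splitOnCaseChange=1 alone can explain a flood of "RoC" matches: the token is split at the lower-to-upper case boundary, so the query can end up matching documents that only contain "ro". A toy sketch of the two behaviours in question, with hypothetical helper names (not the real Lucene filter):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model: splitOnCaseChange=1 breaks "RoC" at the lower->upper
// boundary, while a protected word (protwords.txt) bypasses splitting.
public class CaseChangeSplit {
    static List<String> split(String token, Set<String> protectedWords) {
        if (protectedWords.contains(token)) {
            return List.of(token);  // protected words are emitted unchanged
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < token.length(); i++) {
            if (Character.isLowerCase(token.charAt(i - 1))
                    && Character.isUpperCase(token.charAt(i))) {
                parts.add(token.substring(start, i));  // case-change boundary
                start = i;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("RoC", Set.of()));       // [Ro, C]
        System.out.println(split("RoC", Set.of("RoC")));  // [RoC]
    }
}
```

So adding "RoC" to protwords.txt is one way to keep it intact; the other is turning off splitOnCaseChange for the query analyzer.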


-----Original Message-----
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
Sent: Wednesday, 28 November 2012 17:19
To: solr-user@lucene.apache.org
Subject: RE: Search differences between solr 1.4.0 and 3.6.1

Ok, I'll test that and let you know.

Is there some test I can easily do to confirm that it was really a side effect
of the bug?

____
Frederico Azeiteiro
Developer
 


-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, 28 November 2012 13:39
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

You need to add the generateNumberParts=1 attribute - assuming you actually 
want the number generated.
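As a sketch of what that attribute controls: WordDelimiterFilter breaks "GAMES12" at the letter/digit boundary, and generateNumberParts decides whether the trailing "12" is emitted as its own token. A simplified model (illustrative only, not the real filter):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the letter/digit boundary: "GAMES12" splits into a word
// part and a number part; generateNumberParts decides whether the
// number part is emitted at all.
public class NumberParts {
    static List<String> tokens(String input, boolean generateNumberParts) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= input.length(); i++) {
            boolean boundary = i == input.length()
                    || Character.isDigit(input.charAt(i)) != Character.isDigit(input.charAt(i - 1));
            if (boundary) {
                String part = input.substring(start, i);
                // number parts are dropped when generateNumberParts is off
                if (!Character.isDigit(part.charAt(0)) || generateNumberParts) {
                    out.add(part);
                }
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("GAMES12", true));   // [GAMES, 12]
        System.out.println(tokens("GAMES12", false));  // [GAMES]
    }
}
```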

The fact that your schema worked in 1.4 was probably simply a side effect of 
this bug:
https://issues.apache.org/jira/browse/SOLR-1706
"wrong tokens output from WordDelimiterFilter depending upon options"

-- Jack Krupansky



RE: Search differences between solr 1.4.0 and 3.6.1

2012-11-28 Thread Frederico Azeiteiro
Sorry, ignore the
"<http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml>" line.
Somehow that text appeared when I copied and pasted the XML from IE and I did
not notice, but it is not part of the schema... :)

Still can't figure this thing out...

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, 28 November 2012 12:52
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

Well, I get the same results in 1.4 and 3.6. The only difference is I didn't
put <http://cbrsrvmtr04:8983/solr/WISE/admin/file/?file=schema.xml> in.

In both cases the 12 is missing from the query analysis but is present in the
index analysis, due to catenateNumbers being 1 in one case and 0 in the other.

So I'm guessing there's something else going on that you're overlooking, but I
don't have any good clue.

Best
Erick




RE: Search differences between solr 1.4.0 and 3.6.1

2012-11-28 Thread Frederico Azeiteiro
I just reloaded both indexes to make sure that all definitions are loaded.
In the Analysis tool I can see differences, even though the fields are defined
the same way:

Query Analyser for 3.6.1
org.apache.solr.analysis.WordDelimiterFilterFactory {protected=protwords.txt, 
splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, 
luceneMatchVersion=LUCENE_36, generateWordParts=1, catenateAll=0, 
catenateNumbers=0}
term text: GAMES

Query Analyser for 1.4.0
org.apache.solr.analysis.WordDelimiterFilterFactory {protected=protwords.txt, 
splitOnCaseChange=1, generateNumberParts=0, catenateWords=0, 
generateWordParts=1, catenateAll=0, catenateNumbers=0}
term text: GAMES | 12
 
The "12" is lost on query for 3.6.1.
The only diference I can see on the field definition is the 
"luceneMatchVersion=LUCENE_36"... Could it cause this issue?

Thank you.
Frederico

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, 27 November 2012 12:26
To: solr-user@lucene.apache.org
Subject: Re: Search differences between solr 1.4.0 and 3.6.1

Using the definition you provided, I don't get the same output. Are you sure
you are doing what you think? The generateNumberParts=0 keeps the '12' from
making it through the filter in both 1.4 and 3.6, so I suspect you're not
doing something quite the same way in both.

Perhaps looking at index tokenization in one and query in the other?

Best
Erick


On Mon, Nov 26, 2012 at 9:06 AM, Frederico Azeiteiro < 
frederico.azeite...@cision.com> wrote:

> Hi,
>
>
>
> While updating our SOLR to 3.6.1 I noticed some results differences 
> when using search strings with letters+number.
>
> For a text field defined as:
>
> <analyzer type="index">
>
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> <filter class="solr.WordDelimiterFilterFactory"
> protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
> catenateNumbers="1" catenateWords="1" generateNumberParts="0"
> generateWordParts="1" stemEnglishPossessive="0"/>
>
> </analyzer>
>
> <analyzer type="query">
>
> <filter class="solr.SynonymFilterFactory"
> expand="true" synonyms="synonyms.txt"/>
>
> <filter class="solr.WordDelimiterFilterFactory"
> protected="protwords.txt" splitOnCaseChange="1" catenateAll="0"
> catenateNumbers="0" catenateWords="0" generateNumberParts="0"
> generateWordParts="1"/>
>
> </analyzer>
>
>
>
> Searching for string GAMES12 returns a lot of results on 3.6.1 that 
> are not returned on 1.4.0.
>
>
>
> It looks like WordDelimiterFilterFactory is acting differently in 3.6.1:
> the numeric part of the keyword is being ignored and the search is
> performed using only GAMES.
>
>
>
> Analysis output for 1.4.0:
>
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, 
> catenateWords=0, generateWordParts=1, catenateAll=0, 
> catenateNumbers=0}
>
> term position     1      2
> term text         GAMES  12
> term type         word   word
> source start,end  0,5    5,7
> payload
>
>
>
>
>
> And for 3.6.1:
>
>
>
> org.apache.solr.analysis.WordDelimiterFilterFactory
> {protected=protwords.txt, splitOnCaseChange=1, generateNumberParts=0, 
> catenateWords=0, luceneMatchVersion=LUCENE_36, generateWordParts=1, 
> catenateAll=0, catenateNumbers=0}
>
> position        1
> term text       GAMES
> startOffset     0
> endOffset       5
> type            word
> positionLength  1
>
>
>
>
>
> Is this something that can be modified/fixed to return the same results?
>
>
>
> Thank you.
>
>
>
> Regards,
>
> Frederico
>
>
>
>
>
>


RE: Error loading class solr.CJKBigramFilterFactory

2012-11-15 Thread Frederico Azeiteiro
:)
Just installed 3.6.1 and it's working just fine.

Something must be wrong with my Tomcat/Solr install.

Thank you Robert.

//Frederico
 


-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, 14 November 2012 19:18
To: solr-user@lucene.apache.org
Subject: Re: Error loading class solr.CJKBigramFilterFactory

I'm sure. I added it to 3.6 ;)

You must have something funky with your tomcat configuration, like an exploded 
war with different versions of jars or some other form of jar hell.
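One quick way to probe for this kind of jar hell is to ask the running JVM whether the class is visible, and which jar it was actually loaded from. A small diagnostic sketch (run it with the same classpath the servlet container uses; the class name below is just the one from the error):

```java
import java.security.CodeSource;

// Diagnostic for jar hell: check whether a class is visible to this JVM
// and report where it was loaded from.
public class ClasspathCheck {
    static boolean isVisible(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            System.out.println(className + " loaded from "
                    + (src != null ? src.getLocation() : "the bootstrap classloader"));
            return true;
        } catch (ClassNotFoundException e) {
            System.out.println(className + " NOT visible on this classpath");
            return false;
        }
    }

    public static void main(String[] args) {
        isVisible("org.apache.solr.analysis.CJKBigramFilterFactory");
    }
}
```

If two different lucene-analyzers jars show up for related classes, that is the exploded-war problem Robert describes.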



RE: Error loading class solr.CJKBigramFilterFactory

2012-11-14 Thread Frederico Azeiteiro
Are you sure about that?

We have it working on:

Solr Specification Version: 3.5.0.2011.11.22.14.54.38
Solr Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:54:38
Lucene Specification Version: 3.5.0
Lucene Implementation Version: 3.5.0 1204988 - simon - 2011-11-22 14:46:51
Current Time: Wed Nov 14 17:30:07 WET 2012
Server Start Time:Wed Nov 14 11:40:36 WET 2012

??

Thanks,
Frederico 


-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Wednesday, 14 November 2012 16:28
To: solr-user@lucene.apache.org
Subject: Re: Error loading class solr.CJKBigramFilterFactory

On Wed, Nov 14, 2012 at 8:12 AM, Frederico Azeiteiro 
 wrote:
> To make some further testing I installed SOLR 3.5.0 using the default
> Jetty server.
>
> When tried to start SOLR using the same schema I get:
>
>
>
> SEVERE: org.apache.solr.common.SolrException: Error loading class 
> 'solr.CJKBigramFilterFactory'

This filter was added in 3.6, so it's expected that it wouldn't be found.


Error loading class solr.CJKBigramFilterFactory

2012-11-14 Thread Frederico Azeiteiro
Hi, 

 

I've been testing some CJK tokenizers and I managed to get acceptable
results using:

[fieldType definition stripped by the mail archive; the analyzer chain
included solr.CJKWidthFilterFactory and solr.CJKBigramFilterFactory]
 

The tests have been done using SOLR 3.5.0 on TomCat7.

 

To make some further testing I installed SOLR 3.5.0 using the default Jetty
server.

When tried to start SOLR using the same schema I get:

 

SEVERE: org.apache.solr.common.SolrException: Error loading class
'solr.CJKBigramFilterFactory'

SEVERE: org.apache.solr.common.SolrException: Error loading class
'solr.CJKWidthFilterFactory'

 

Should these classes come with v. 3.5.0 by default?

Do I need to install anything or copy any lib?

 

Thank you all.

Frederico

 



Solr 3.5.0 - different behaviour on rows?

2012-03-15 Thread Frederico Azeiteiro
Hi all,

 

Just testing SOLR 3.5.0, and I noticed a different behavior in this new
version:

select?rows=10&q=sig%3a("54ba3e8fd3d5d8371f0e01c403085a0c")&?

 

this query returns no results on my indexes; it works on SOLR 1.4.0 but
returns "java.lang.OutOfMemoryError: Java heap space" on SOLR 3.5.0.

 

Is this normal? As there are no results, why the OutOfMemoryError?

Is there some memory allocated based on the rows number?
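In general, yes: the top-hits collectors size their priority queue from the requested start+rows before any matching happens, which is why a very large rows value can exhaust the heap even with zero results (assuming a large rows was actually sent somewhere, since the query shown asks for only 10). A toy model of that allocation pattern, not the actual Lucene classes:

```java
// Simplified model of why a large rows/start can exhaust the heap even
// when nothing matches: the hit queue is allocated up front, sized by
// the number of requested hits, before any document is collected.
public class HitQueueModel {
    private final Object[] slots;  // one slot per potentially returned hit

    HitQueueModel(int start, int rows) {
        slots = new Object[start + rows];  // allocated before searching
    }

    int size() {
        return slots.length;
    }

    public static void main(String[] args) {
        // even with zero matching documents, this much is reserved
        System.out.println("pre-allocated slots: " + new HitQueueModel(0, 10).size());
    }
}
```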

 

Regards,

Frederico

 



Recover index

2011-10-28 Thread Frederico Azeiteiro
Hello all,

 

When moving a SOLR index to another instance I lost the files:

segments.gen

segments_xk

 

I have the .cfs file complete.

 

What are my options to recover the data?

Any ideas that I can test?

 

Thank you.



Frederico Azeiteiro

 



RE: Using MLT feature

2011-04-08 Thread Frederico Azeiteiro
Yes, I guess that could be an option, but I'm not very experienced with Java
development and SOLR modifications.
As my main goal was to create a similar sig in C#, I just use the C# method to
create the sig myself before indexing, instead of SOLR's Deduplicate function.

That way, when searching, I can use the same method, certain that the sig is
the same.
As the algorithm used is the same as TextProfileSignature's, the result is the
same as using SOLR deduplication.

Frederico 
 


-----Original Message-----
From: lboutros [mailto:boutr...@gmail.com]
Sent: Friday, 8 April 2011 10:11
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Couldn't you extend the TextProfileSignature and modify the TokenComparator
class to use lexical order when tokens have the same frequency?

Ludovic.
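A sketch of the tie-break being suggested here, assuming a simplified Token holder (not the real class): sort by descending count first and lexical order second, so equal-frequency tokens always come out in a deterministic order regardless of how the map iterated:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the suggested tie-break: descending count first, lexical
// order second, so equal-frequency tokens sort deterministically.
public class StableTokenOrder {
    record Token(String text, int cnt) {}

    static final Comparator<Token> BY_COUNT_THEN_TEXT =
            Comparator.comparingInt((Token t) -> t.cnt()).reversed()
                      .thenComparing(Token::text);

    public static void main(String[] args) {
        List<Token> tokens = new ArrayList<>(List.of(
                new Token("war", 2), new Token("clan", 1), new Token("army", 1)));
        tokens.sort(BY_COUNT_THEN_TEXT);
        // "army" and "clan" (both cnt=1) now have a fixed relative order
        tokens.forEach(t -> System.out.println(t.text() + " " + t.cnt()));
    }
}
```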



-
Jouve
France.
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-MLT-feature-tp2774454p2794622.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Using MLT feature

2011-04-08 Thread Frederico Azeiteiro
Hi.

Yes, I managed to create a stable comparator in C# for the profile.
The problem is before that, in:

...
tokens.put(s, tok);
...

Imagine you have 2 tokens with the same frequency: with the stable sort
comparator for the profile, it will maintain the original order.
The problem is that the original order comes from the way the tokens are
inserted into the 'tokens' HashMap, and not from the order they appear in the
original text.
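The behaviour described above is easy to reproduce: HashMap iterates in an internal bucket order, not insertion order, while a LinkedHashMap would preserve insertion order and make the profile reproducible across ports:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// HashMap iterates in an internal bucket order, not insertion order, so
// two ports that insert the same tokens can still disagree on iteration
// order. LinkedHashMap preserves insertion order and is reproducible.
public class TokenOrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> hash = new HashMap<>();
        Map<String, Integer> linked = new LinkedHashMap<>();
        for (String token : new String[] {"zebra", "army", "clan", "war"}) {
            hash.put(token, 1);
            linked.put(token, 1);
        }
        System.out.println("HashMap order:       " + hash.keySet());   // unspecified
        System.out.println("LinkedHashMap order: " + linked.keySet()); // [zebra, army, clan, war]
    }
}
```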

Frederico

-----Original Message-----
From: lboutros [mailto:boutr...@gmail.com]
Sent: Friday, 8 April 2011 09:49
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

It seems that tokens are sorted by frequency:

...
Collections.sort(profile, new TokenComparator());
...


and

private static class TokenComparator implements Comparator {
  public int compare(Token t1, Token t2) {
    return t2.cnt - t1.cnt;  // descending by token count
  }
}

and cnt is the token count.

Ludovic.



-
Jouve
France.


RE: Using MLT feature

2011-04-07 Thread Frederico Azeiteiro
Well, at this point I'm more dedicated to the Deduplicate issue.

Using a min token length of 4 I'm getting nice comparison results. MLT returns
a lot of similar docs that I don't consider similar, even after tuning the
parameters.

While finishing this issue, I found out that the signature also contains the
field names, meaning that if you build the signature from both the title and
text fields, your signature will be a hash of ("text"+"text value"+"title"+
"title value").

In any case, I found that the HashMap used in the hash algorithm inserts the
tokens in some internal order that I can't understand :), and so it is
impossible to replicate in a C# implementation.
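The field-name observation above can be sketched as follows: the hashed input interleaves field names and values in a fixed order, so a C# port has to concatenate the same "name + value" pairs in the same order before hashing. A simplified MD5-based stand-in (not the exact Solr code):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in: hash the concatenation of field names and values.
// Any port must reproduce the exact same concatenation order.
public class FieldSignature {
    static String signature(Map<String, String> fields) {
        try {
            StringBuilder sb = new StringBuilder();
            fields.forEach((name, value) -> sb.append(name).append(value));
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();  // order matters
        doc.put("title", "Army deploys");
        doc.put("text", "clan war kills 11");
        System.out.println(signature(doc));
    }
}
```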

Thank you for all your help,
Frederico 


-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, 7 April 2011 04:09
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

A "fuzzy signature" system will not work here. You are right, you want
to try MLT instead.

Lance


RE: Using MLT feature

2011-04-06 Thread Frederico Azeiteiro
Yes, I had already checked the code for it and used it to compile a C# method
that returns the same signature.

But I have a strange issue:
For instance, using minTokenLen=2 and the default QUANT_RATE, passing the text
"frederico" (simple text, no big deal here):

1. using my C# app I get "8b92e01d67591dfc60adf9576f76a055"
2. using SOLR, passing a doc with HeadLine "frederico", I get
"8d9a5c35812ba75b8383d4538b91080f" in my signature field.
3. using a Java app (I'm not a Java expert..) built from the code of SOLR's
SignatureUpdateProcessorFactory class (please check the code below), I get
"8b92e01d67591dfc60adf9576f76a055".

Java app code:

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.common.util.StrUtils;
import org.apache.solr.update.processor.TextProfileSignature;

// build a TextProfileSignature with default params and feed it one value
TextProfileSignature textProfileSignature = new TextProfileSignature();
NamedList params = new NamedList();
params.add("", "");
SolrParams solrParams = SolrParams.toSolrParams(params);
textProfileSignature.init(solrParams);
textProfileSignature.add("frederico");

// hex-encode the raw signature bytes, one nibble at a time
byte[] signature = textProfileSignature.getSignature();
char[] arr = new char[signature.length << 1];
for (int i = 0; i < signature.length; i++) {
    int b = signature[i];
    int idx = i << 1;
    arr[idx] = StrUtils.HEX_DIGITS[(b >> 4) & 0xf];
    arr[idx + 1] = StrUtils.HEX_DIGITS[b & 0xf];
}
String sigString = new String(arr);
System.out.println(sigString);




Here's my processor config:

<updateRequestProcessorChain>
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">HeadLine</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <int name="minTokenLen">2</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

So both my apps (Java and C#) return the same signature, but SOLR returns a
different one...
Can anyone see what I am doing wrong?

Thank you once again.

Frederico

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, 5 April 2011 15:20
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature

If you check the code for TextProfileSignature [1] you'll notice the init
method reading params. You can set those params as you did. Reading the
Javadoc [2] might help as well. But what's not documented in the Javadoc is
how QUANT is computed; it rounds.

[1]: 
http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.4/src/java/org/apache/solr/update/processor/TextProfileSignature.java?view=markup
[2]: 
http://lucene.apache.org/solr/api/org/apache/solr/update/processor/TextProfileSignature.html
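A sketch of that rounding, based on the linked TextProfileSignature source (treat as approximate): the maximum token frequency is scaled by the quant rate and rounded, with a floor of 2 whenever any token repeats, otherwise 1:

```java
// Sketch of the QUANT computation: scale the max token frequency by
// quantRate, round, and floor at 2 (when any token repeats) or 1.
public class QuantCalc {
    static int quant(int maxFreq, float quantRate) {
        int quant = Math.round(maxFreq * quantRate);
        if (quant < 2) {
            quant = maxFreq > 1 ? 2 : 1;  // rounding floor
        }
        return quant;
    }

    public static void main(String[] args) {
        System.out.println(quant(1, 0.01f));    // short texts: quant stays 1
        System.out.println(quant(1000, 0.01f)); // large texts: quant grows
    }
}
```

This floor is why tiny wording changes can flip the signature on short texts: the quantization bucket jumps as soon as any token count crosses it.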

On Tuesday 05 April 2011 16:10:08 Frederico Azeiteiro wrote:
> Thank you, I'll try to create a c# method to create the same sig of SOLR,
> and then compare both sigs before index the doc. This way I can avoid the
> indexation of existing docs.
> 
> If anyone needs to use this parameter (as this info is not on the wiki),
> you can add the option
> 
> <int name="minTokenLen">5</int>
> 
> On the processor tag.
> 
> Best regards,
> Frederico 
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: terça-feira, 5 de Abril de 2011 12:01
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> > Sorry, the reply I made yesterday was directed to Markus and not the
> > list...
> > 
> > Here are my thoughts on this. At this point I'm a little confused about
> > whether SOLR is a good option to find near-duplicate docs.
> > 
> > >> Yes there is, try setting overwriteDupes to true and documents yielding
> > 
> > the same signature will be overwritten
> > 
> > The problem is that I don't want to overwrite the doc, I need to
> > maintain the original version (because the doc has other fields I need
> > to maintain).
> > 
> > >>If you need both fuzzy and exact matching then add a second
> > 
> > update processor inside the chain and create another signature field.
> > 
> > I just need the fuzzy search but the quick tests I made return
> > different signatures for what I consider duplicate docs.
> > "Army deploys as clan war kills 11 in Philippine south"
> > "Army deploys as clan war kills 11 in Philippine south."
> > 
> > Same sig for the above 2 strings, that's ok.
> > 
> > But a different sig was created for:
> > "Army deploys as clan war kills 11 in Philippine south the."
> > 
> > Is there a way to setup the TextProfileSignature parameters to adjust
> > the "s

RE: Using MLT feature

2011-04-05 Thread Frederico Azeiteiro
Thank you, I'll try to create a C# method that generates the same signature as 
SOLR, and then compare both signatures before indexing the doc. This way I can 
avoid indexing existing docs.
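That pre-check could look roughly like the following (shown in Java rather than C#; the /select handler and the `signature` field name are assumptions matching this thread's setup):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class DedupeCheck {
    // Build the lookup URL; rows=0 is enough because only numFound matters.
    public static String existsQueryUrl(String solrBase, String sig) {
        try {
            return solrBase + "/select?q="
                    + URLEncoder.encode("signature:" + sig, "UTF-8") + "&rows=0";
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The indexing app would fetch this URL, parse numFound, and only send the add if numFound == 0. Note there is a race window between the check and the add if several indexers run concurrently.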

If anyone needs to use this parameter (as this info is not on the wiki), you 
can add the option

<int name="minTokenLen">5</int>

on the processor tag.

Best regards,
Frederico 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Tuesday, 5 April 2011 12:01
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature



On Tuesday 05 April 2011 12:19:33 Frederico Azeiteiro wrote:
> Sorry, the reply I made yesterday was directed to Markus and not the
> list...
> 
> Here are my thoughts on this. At this point I'm a little confused about
> whether SOLR is a good option for finding near-duplicate docs.
> 
> >> Yes there is, try setting overwriteDupes to true and documents yielding
> 
> the same signature will be overwritten
> 
> The problem is that I don't want to overwrite the doc, I need to
> maintain the original version (because the doc has other fields I need
> to maintain).
> 
> >>If you need both fuzzy and exact matching then add a second
> 
> update processor inside the chain and create another signature field.
> 
> I just need the fuzzy search but the quick tests I made return
> different signatures for what I consider duplicate docs.
> "Army deploys as clan war kills 11 in Philippine south"
> "Army deploys as clan war kills 11 in Philippine south."
> 
> Same sig for the above 2 strings, that's ok.
> 
> But a different sig was created for:
> "Army deploys as clan war kills 11 in Philippine south the."
> 
> Is there a way to set up the TextProfileSignature parameters to adjust
> the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?
> 
> Do you think that these parameters can help creating the same sig for
> the above example?

You can only fix this by increasing minTokenLen to 4 to prevent `the` from 
being added to the list of tokens but this may affect other signatures. 
Possibly more documents will then get the same signature. Messing around with 
quantRate won't do much good because all your tokens have the same frequency 
(1) so quant will always be 1 in this short text. That's why 
TextProfileSignature works less well for short texts.

http://nutch.apache.org/apidocs-1.2/org/apache/nutch/crawl/TextProfileSignature.html
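That quant computation, transcribed (simplified) from the Nutch source linked above:

```java
public class QuantCalc {
    // quant = round(maxFreq * quantRate), with a floor of 1
    // (or 2 when maxFreq > 1).
    public static int quant(int maxFreq, float quantRate) {
        int q = Math.round(maxFreq * quantRate);
        if (q < 2) q = (maxFreq > 1) ? 2 : 1;
        return q;
    }
}
```

For a short headline every token occurs once, so maxFreq = 1 and quant = 1 regardless of quantRate, which is why tweaking quantRate has no effect on such texts.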

> 
> Is anyone using the TextProfileSignature with success?
> 
> Thank you,
> Frederico
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, 4 April 2011 16:47
> To: solr-user@lucene.apache.org
> Cc: Frederico Azeiteiro
> Subject: Re: Using MLT feature
> 
> > Hi again,
> > I guess I was wrong on my earlier post... There's no automated way to
> > avoid indexing the duplicate doc.
> 
> Yes there is, try setting overwriteDupes to true and documents yielding the
> same signature will be overwritten. If you need both fuzzy and exact matching
> then add a second update processor inside the chain and create another
> signature field.
> 
> > I guess I have 2 options:
> > 
> > 1. Create a temp index with signatures and then have an app that for each
> > new doc verifies if sig exists on my primary index. If not, add the
> > article.
> > 
> > 2. Before adding the doc, create a signature (using the same algorithm that
> > SOLR uses) on my indexing app and then verify if the signature exists before
> > adding.
> > 
> > Am I thinking the right way here? :)
> > 
> > Thank you,
> > Frederico
> > 
> > 
> > 
> > -Original Message-
> > From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> Sent: Monday, 4 April 2011 11:59
> > To: solr-user@lucene.apache.org
> > Subject: RE: Using MLT feature
> > 
> > Thank you Markus, it looks great.
> > 
> > But the wiki is not very detailed on this.
> > Do you mean if I:
> > 
> > 1. Create:
> > 
> > <updateRequestProcessorChain name="dedupe">
> >   <processor
> > class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
> >     <bool name="enabled">true</bool>
> >     <bool name="overwriteDupes">false</bool>
> >     <str name="signatureField">signature</str>
> >     <str name="fields">headline,body,medianame</str>
> >     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
> >   </processor>
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> > 
> > 2. Add the request as the default update request
>

RE: Using MLT feature

2011-04-05 Thread Frederico Azeiteiro
Sorry, the reply I made yesterday was directed to Markus and not to the
list...

Here are my thoughts on this. At this point I'm a little confused about whether
SOLR is a good option for finding near-duplicate docs.

>> Yes there is, try setting overwriteDupes to true and documents yielding
the same signature will be overwritten

The problem is that I don't want to overwrite the doc, I need to
maintain the original version (because the doc has other fields I need
to maintain).

>>If you need both fuzzy and exact matching then add a second
update processor inside the chain and create another signature field.

I just need the fuzzy search but the quick tests I made return
different signatures for what I consider duplicate docs. 
"Army deploys as clan war kills 11 in Philippine south"
"Army deploys as clan war kills 11 in Philippine south."

Same sig for the above 2 strings, that's ok.

But a different sig was created for:
"Army deploys as clan war kills 11 in Philippine south the."
 
Is there a way to set up the TextProfileSignature parameters to adjust
the "sensitivity" on SOLR (QUANT_RATE or MIN_TOKEN_LEN)?

Do you think that these parameters can help creating the same sig for
the above example?

Is anyone using the TextProfileSignature with success?

Thank you,
Frederico 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, 4 April 2011 16:47
To: solr-user@lucene.apache.org
Cc: Frederico Azeiteiro
Subject: Re: Using MLT feature


> Hi again,
> I guess I was wrong on my earlier post... There's no automated way to
> avoid indexing the duplicate doc.

Yes there is, try setting overwriteDupes to true and documents yielding the
same signature will be overwritten. If you need both fuzzy and exact matching
then add a second update processor inside the chain and create another
signature field.

> 
> I guess I have 2 options:
> 
> 1. Create a temp index with signatures and then have an app that for each
> new doc verifies if sig exists on my primary index. If not, add the
> article.
> 
> 2. Before adding the doc, create a signature (using the same algorithm that
> SOLR uses) on my indexing app and then verify if the signature exists before
> adding.
> 
> Am I thinking the right way here? :)
> 
> Thank you,
> Frederico
>  
> 
> 
> -Original Message-
> From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com]
> Sent: Monday, 4 April 2011 11:59
> To: solr-user@lucene.apache.org
> Subject: RE: Using MLT feature
> 
> Thank you Markus, it looks great.
> 
> But the wiki is not very detailed on this.
> Do you mean if I:
> 
> 1. Create:
> 
> 
> <updateRequestProcessorChain name="dedupe">
>   <processor
> class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <bool name="overwriteDupes">false</bool>
>     <str name="signatureField">signature</str>
>     <str name="fields">headline,body,medianame</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> 2. Add the request as the default update request
> 3. Add a "signature" indexed field to my schema.
> 
> Then,
> When adding a new doc to my index, it is only added if not considered a
> duplicate using a Lookup3Signature on the field defined? All duplicates
> are ignored and not added to my index?
> Is it as simple as that?
> 
> Does it work even if the medianame should be an exact match (not a similar
> match as the headline and bodytext are)?
> 
> Thank you for your help,
> 
> 
> Frederico Azeiteiro
> Developer
>  
> 
> 
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, 4 April 2011 10:48
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> http://wiki.apache.org/solr/Deduplication
> 
> On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> > Hi,
> > 
> > The idea is to not index if something similar (headline+bodytext) exists
> > for the same exact medianame.
> > 
> > Do you mean I would need to index the doc first (maybe in a temp index)
> > and then use the MLT feature to find similar docs before adding to the
> > final index?
> > 
> > Thanks,
> > Frederico
> > 
> > 
> > -Original Message-
> > From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> > Sent: segunda-feira, 4 de Abril de 2011 10:22
> > To: solr-user@lucene.apache.org
> > Subject: Re: Using MLT feature
> > 
> > Do you want to not index if something similar? Or don't index if exact.
> > I would look into a hash code of the document if you don't want to index
> &

RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi again,
I guess I was wrong on my earlier post... There's no automated way to avoid 
indexing the duplicate doc.

I guess I have 2 options: 

1. Create a temp index with signatures and then have an app that for each new 
doc verifies if sig exists on my primary index. 
If not, add the article.

2. Before adding the doc, create a signature (using the same algorithm that 
SOLR uses) on my indexing app and then verify if signature exists before adding.

Am I thinking the right way here? :)

Thank you,
Frederico 
 


-Original Message-----
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: Monday, 4 April 2011 11:59
To: solr-user@lucene.apache.org
Subject: RE: Using MLT feature

Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:


<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

2. Add the chain as the default update request processor chain 
3. Add a "signature" indexed field to my schema.

Then,
When adding a new doc to my index, it is only added if not considered a 
duplicate using a Lookup3Signature on the field defined?
All duplicates are ignored and not added to my index? 
Is it as simple as that?

Does it work even if the medianame should be an exact match (not a similar match 
as the headline and bodytext are)?

Thank you for your help,

____
Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, 4 April 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is to not index if something similar (headline+bodytext) exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: Monday, 4 April 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact. Similar, though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> 
>  wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MorelikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> 
> way:
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found then I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Thank you Markus, it looks great.

But the wiki is not very detailed on this. 
Do you mean if I:

1. Create:


<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signature</str>
    <str name="fields">headline,body,medianame</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

2. Add the chain as the default update request processor chain 
3. Add a "signature" indexed field to my schema.

Then,
When adding a new doc to my index, it is only added if not considered a 
duplicate using a Lookup3Signature on the field defined?
All duplicates are ignored and not added to my index? 
Is it as simple as that?

Does it work even if the medianame should be an exact match (not a similar match 
as the headline and bodytext are)?

Thank you for your help,

____
Frederico Azeiteiro
Developer
 


-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Monday, 4 April 2011 10:48
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

http://wiki.apache.org/solr/Deduplication

On Monday 04 April 2011 11:34:52 Frederico Azeiteiro wrote:
> Hi,
> 
> The idea is to not index if something similar (headline+bodytext) exists for
> the same exact medianame.
> 
> Do you mean I would need to index the doc first (maybe in a temp index)
> and then use the MLT feature to find similar docs before adding to final
> index?
> 
> Thanks,
> Frederico
> 
> 
> -Original Message-
> From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com]
> Sent: Monday, 4 April 2011 10:22
> To: solr-user@lucene.apache.org
> Subject: Re: Using MLT feature
> 
> Do you want to not index if something similar? Or don't index if exact.
> I would look into a hash code of the document if you don't want to index
> exact. Similar, though, I think has to be based off a document in the
> index.
> 
> On Apr 4, 2011, at 5:16, Frederico Azeiteiro
> 
>  wrote:
> > Hi,
> > 
> > 
> > 
> > I would like to hear your opinion about the MLT feature and if it's a
> > good solution to what I need to implement.
> > 
> > 
> > 
> > My index has fields like: headline, body and medianame.
> > 
> > What I need to do is, before adding a new doc, verify if a similar doc
> > exists for this media.
> > 
> > 
> > 
> > My idea is to use the MorelikeThisHandler
> > (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
> 
> way:
> > For each new doc, perform a MLT search with q= medianame and
> > stream.body=headline+bodytext.
> > 
> > If no similar docs are found then I can safely add the doc.
> > 
> > 
> > 
> > Is this feasible using the MLT handler? Is it a good approach? Is there
> > a better way to perform this comparison?
> > 
> > 
> > 
> > Thank you for your help.
> > 
> > 
> > 
> > Best regards,
> > 
> > 
> > 
> > Frederico Azeiteiro

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

The idea is to not index if something similar (headline+bodytext) exists for
the same exact medianame.

Do you mean I would need to index the doc first (maybe in a temp index)
and then use the MLT feature to find similar docs before adding to final
index?

Thanks,
Frederico


-Original Message-
From: Chris Fauerbach [mailto:chris.fauerb...@gmail.com] 
Sent: Monday, 4 April 2011 10:22
To: solr-user@lucene.apache.org
Subject: Re: Using MLT feature

Do you want to not index if something similar? Or don't index if exact.
I would look into a hash code of the document if you don't want to index
exact. Similar, though, I think has to be based off a document in the
index.
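For the exact-match case, a content hash is straightforward to sketch (Java; the field choice and MD5 are illustrative assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DocHash {
    // MD5 over the concatenated fields: identical field values give an
    // identical key. The NUL separator prevents ("ab","c") colliding
    // with ("a","bc").
    public static String key(String headline, String body, String medianame) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest((headline + '\u0000' + body + '\u0000' + medianame)
                    .getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Storing this key in an indexed field and checking it before each add gives exact-duplicate detection; it will not, however, catch near-duplicates that differ by a single character.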

On Apr 4, 2011, at 5:16, Frederico Azeiteiro
 wrote:

> Hi,
> 
> 
> 
> I would like to hear your opinion about the MLT feature and if it's a
> good solution to what I need to implement.
> 
> 
> 
> My index has fields like: headline, body and medianame.
> 
> What I need to do is, before adding a new doc, verify if a similar doc
> exists for this media.
> 
> 
> 
> My idea is to use the MorelikeThisHandler
> (http://wiki.apache.org/solr/MoreLikeThisHandler) in the following
way:
> 
> 
> 
> For each new doc, perform a MLT search with q= medianame and
> stream.body=headline+bodytext.
> 
> If no similar docs are found then I can safely add the doc.
> 
> 
> 
> Is this feasible using the MLT handler? Is it a good approach? Is there
> a better way to perform this comparison?
> 
> 
> 
> Thank you for your help.
> 
> 
> 
> Best regards,
> 
> 
> 
> Frederico Azeiteiro
> 
> 
> 


Using MLT feature

2011-04-04 Thread Frederico Azeiteiro
Hi,

 

I would like to hear your opinion about the MLT feature and whether it's a
good solution for what I need to implement.

 

My index has fields like: headline, body and medianame.

What I need to do is, before adding a new doc, verify if a similar doc
exists for this media.

 

My idea is to use the MoreLikeThisHandler
(http://wiki.apache.org/solr/MoreLikeThisHandler) in the following way:

 

For each new doc, perform an MLT search with q=medianame and
stream.body=headline+bodytext.

If no similar docs are found then I can safely add the doc.
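A sketch of building that request (Java; the /mlt handler path and parameter names follow the MoreLikeThisHandler wiki page — here the media restriction goes in fq rather than q, and the new doc's text goes in stream.body; verify both against your solrconfig.xml):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class MltCheck {
    // Build an MLT handler request: the new doc's text is the content
    // stream, fq restricts candidates to the same media, and mlt.fl
    // names the fields compared for similarity.
    public static String mltUrl(String solrBase, String medianame, String text) {
        try {
            return solrBase + "/mlt"
                    + "?fq=" + URLEncoder.encode("medianame:\"" + medianame + "\"", "UTF-8")
                    + "&mlt.fl=" + URLEncoder.encode("headline,body", "UTF-8")
                    + "&stream.body=" + URLEncoder.encode(text, "UTF-8")
                    + "&rows=1";
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
```

If the response has zero matches (or matches below a score threshold the app chooses), the doc can be added.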

 

Is this feasible using the MLT handler? Is it a good approach? Is there
a better way to perform this comparison?

 

Thank you for your help.

 

Best regards,

____

Frederico Azeiteiro

 



Strange query behaviour using splitOnCaseChange=1

2011-01-11 Thread Frederico Azeiteiro
Hi all,

 

I had indexed a text with the word "InterContinental" with fieldType
text (with the default filters, just removing the
solr.SnowballPorterFilterFactory).

 

As far as I understand, using the filter solr.WordDelimiterFilterFactory
with splitOnCaseChange="1", this word is indexed as:

 

term text

inter

continental

intercontinental
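A rough model of how splitOnCaseChange (with catenation enabled) yields those three terms — a sketch, not the actual filter, which also handles digits, delimiters, and token positions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CaseSplit {
    // Split on lower->upper boundaries, lowercase the parts, then append
    // the catenated whole word (mimicking catenateWords/catenateAll).
    public static List<String> tokens(String word) {
        String[] parts = word.split("(?<=[a-z])(?=[A-Z])");
        List<String> out = new ArrayList<>();
        for (String p : parts) out.add(p.toLowerCase(Locale.ROOT));
        if (parts.length > 1) out.add(word.toLowerCase(Locale.ROOT));
        return out;
    }
}
```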

 

When I search for "continental" the article is returned. 

When searching for "intercontinental" the article is returned

When searching for "Inter Continental" the article is returned

When searching for "Inter AND Continental" the article is returned

When searching for "InterContinental" the article is NOT returned

 

Can anyone explain why the last search didn't return the article?

 

Thank you,

 



Frederico Azeiteiro

 



RE: wildcard and proximity searches

2010-10-05 Thread Frederico Azeiteiro
Hi Mark,
unfortunately it's still on my ToDo list... :(
 
I don't know if it allows "solr mail*"~10. I hope so, as I'll need that too 
in the future.
 
Frederico



From: Mark N [mailto:nipen.m...@gmail.com]
Sent: Tue 05-10-2010 11:29
To: solr-user@lucene.apache.org
Subject: Re: wildcard and proximity searches



Hi

Were you successful in trying SOLR-1604 to allow wildcard queries in
phrases?

Also, does this plugin allow us to use proximity with wildcards:
*  "solr mail*"~10 *

Is this the right approach to support these functionalities?

thanks
Mark





On Wed, Aug 4, 2010 at 2:24 PM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Thanks for your idea.
>
> At this point I'm logging each query time. My idea is to divide my
> queries into "normal queries" and "heavy queries". I have some heavy
> queries that take 1 or 2 minutes to get results. But they have, for
> instance, (*word1* AND *word2* AND word3*). I guess that these will be
> always slower (could be a little faster with
> "ReversedWildcardFilterFactory") but they will never be ready in a few
> seconds. For now, I just increased the timeout for those :) (using
> solrnet).
>
> My priority at the moment is phrase queries like "word1* word2*
> word3". After this is working, I'll try to optimize the "heavy queries".
>
> Frederico
>
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Wednesday, 4 August 2010 01:41
> To: solr-user@lucene.apache.org
> Subject: Re: wildcard and proximity searches
>
> Frederico Azeiteiro wrote:
> >
> >>> But it is unusual to use both leading and trailing * operators. Why
> >>> are you doing this?
> >
> > Yes I know, but I have a few queries that need this. I'll try the
> > "ReversedWildcardFilterFactory".
> >
> >
> >
>
> ReverseWildcardFilter will help leading wildcard, but will not help
> trying to use a query with BOTH leading and trailing wildcard. it'll
> still be slow. Solr/lucene isn't good at that; I didn't even know Solr
> would do it at all in fact.
>
> If you really needed to do that, the way to play to solr/lucene's way of
>
> doing things, would be to have a field where you actually index each
> _character_ as a separate token. Then leading and trailing wildcard
> search is basically reduced to a "phrase search", but where the words
> are actually characters.   But then you're going to get an index where
> pretty much every token belongs to every document, which Solr isn't that
>
> great at either, but then you can apply "commongram" stuff on top to
> help that out a lot too. Not quite sure what the end result will be,
> I've never tried it.  I'd only use that weird special "char as token"
> field for queries that actually required leading and trailing wildcards.
>
> Figuring out how to set up your analyzers, and what (if anything) you're
> going to have to do client-app-side to transform the user's query into
> something that'll end up searching like a "phrase search where each
> 'word' is a character" is left as an exercise for the reader. :)
>
> Jonathan
>



--
Nipen Mark




RE: timestamp field

2010-08-11 Thread Frederico Azeiteiro
Hi Jan,

Dah, I didn't know that :(

I always thought it used the server time. 

Anyway, just out of curiosity: the hour is UTC but NOT the time in London right 
now.

London is UTC+1 (same as here in Portugal) :).

So, London solr users should have the same "problem".
Well, I must be careful when using this field.

Thanks for your answer,
Frederico

-Original Message-
From: Jan Høydahl / Cominvent [mailto:jan@cominvent.com] 
Sent: Wednesday, 11 August 2010 12:17
To: solr-user@lucene.apache.org
Subject: Re: timestamp field

Hi,

Which time zone are you located in? Do you have DST?

Solr uses UTC internally for dates, which means that "NOW" will be the time in 
London right now :) Does that appear to be right 4 u?
Also see this thread: http://search-lucene.com/m/hqBed2jhu2e2/
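Client code therefore has to format and parse these timestamps explicitly in UTC; in Java, Solr's canonical date form looks like this:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDate {
    // Solr's canonical date form: UTC with a trailing 'Z'. Formatting with
    // the JVM's default zone is what makes local wall-clock times look "off".
    public static String format(Date d) {
        SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(d);
    }
}
```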

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11. aug. 2010, at 13.02, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I have on my schema
> 
> <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" /> 
> 
> 
> 
> This field is returned as 
> 
> 2010-08-11T10:11:03.354Z
> 
> 
> 
> For an article added at 2010-08-11T11:11:03.354Z!
> 
> 
> 
> And the server has the time of 2010-08-11T11:11:03.354Z...
> 
> 
> 
> This is a w2003 server using solr 1.4. 
> 
> 
> 
> Any guess of what could be wrong here?
> 
> 
> 
> Thanks,
> 
> Frederico
> 
> 
> 
> 
> 



timestamp field

2010-08-11 Thread Frederico Azeiteiro
Hi,

 

I have on my schema

<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" /> 

 

This field is returned as 

2010-08-11T10:11:03.354Z

 

For an article added at 2010-08-11T11:11:03.354Z!

 

And the server has the time of 2010-08-11T11:11:03.354Z...

 

This is a w2003 server using solr 1.4. 

 

Any guess of what could be wrong here?

 

Thanks,

Frederico

 

 



RE: wildcard and proximity searches

2010-08-04 Thread Frederico Azeiteiro
Thanks for your idea.

At this point I'm logging each query time. My idea is to divide my
queries into "normal queries" and "heavy queries". I have some heavy
queries that take 1 or 2 minutes to get results. But they have, for
instance, (*word1* AND *word2* AND word3*). I guess that these will be
always slower (could be a little faster with
"ReversedWildcardFilterFactory") but they will never be ready in a few
seconds. For now, I just increased the timeout for those :) (using
solrnet).

My priority at the moment is phrase queries like "word1* word2*
word3". After this is working, I'll try to optimize the "heavy queries".

Frederico


-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Wednesday, 4 August 2010 01:41
To: solr-user@lucene.apache.org
Subject: Re: wildcard and proximity searches

Frederico Azeiteiro wrote:
>
>>> But it is unusual to use both leading and trailing * operators. Why are
>>> you doing this?
>
> Yes I know, but I have a few queries that need this. I'll try the
> "ReversedWildcardFilterFactory". 
>
>
>   

ReverseWildcardFilter will help leading wildcard, but will not help 
trying to use a query with BOTH leading and trailing wildcard. it'll 
still be slow. Solr/lucene isn't good at that; I didn't even know Solr 
would do it at all in fact.

If you really needed to do that, the way to play to solr/lucene's way of

doing things, would be to have a field where you actually index each 
_character_ as a separate token. Then leading and trailing wildcard 
search is basically reduced to a "phrase search", but where the words 
are actually characters.   But then you're going to get an index where 
pretty much every token belongs to every document, which Solr isn't that

great at either, but then you can apply "commongram" stuff on top to 
help that out a lot too. Not quite sure what the end result will be, 
I've never tried it.  I'd only use that weird special "char as token" 
field for queries that actually required leading and trailing wildcards.

Figuring out how to set up your analyzers, and what (if anything) you're
going to have to do client-app-side to transform the user's query into 
something that'll end up searching like a "phrase search where each 
'word' is a character" is left as an exercise for the reader. :)

Jonathan
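That query transformation might look roughly like this (a sketch, assuming a field that indexes one character per token):

```java
public class CharPhrase {
    // Rewrite the term inside *word* as a phrase of single characters,
    // e.g. "word" -> "\"w o r d\"", so that a leading+trailing wildcard
    // becomes an ordinary phrase search on the char-per-token field.
    public static String toCharPhrase(String term) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < term.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(term.charAt(i));
        }
        return sb.append('"').toString();
    }
}
```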


RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

> a) I think wildcard search is by default "case sensitive"?
> Is there a way to make it case insensitive?
>> Wildcard searches are not analyzed. For case-insensitive search you can
>> lowercase query terms at the client side (while using a lowercase filter
>> at index time), e.g. Mail* => mail*
> 
> I discovered that the normal query type doesn't work with wildcards
> and so I'm using the "Filter Query" to query these. 
>> I don't understand this. Wildcard search works with the q parameter, if
>> that is what you are asking: &q=mail*

For the 2 points above, my bad. I'm already using the "lowercasefilter"
but I was not lowercasing the query with wildcards (the others are lowered
by the analyser). So it's working fine now! In my tests yesterday
I was probably testing &q=Mail* and &fq=mail* (and didn't notice the
difference) and read somewhere that it wasn't possible (probably on an
older solr version), so I got the wrong conclusion that it wasn't
working. 

>> But it is unusual to use both leading and trailing * operators. Why are
>> you doing this?

Yes I know, but I have a few queries that need this. I'll try the
"ReversedWildcardFilterFactory". 

>> By default it is not supported. With SOLR-1604 it is possible.
Ok then. I guess "SOLR-1604" is the answer for most of my problems. I'm
going to give it a try and then I'll share some feedback.

Thanks for your help and sorry for my newbie confusions. :)
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Friday, 30 July 2010 12:09
To: solr-user@lucene.apache.org
Subject: RE: wildcard and proximity searches

> a) I think wildcard search is by default "case sensitive"?
> Is there a way to make it case insensitive?

Wildcard searches are not analyzed. For case-insensitive search you can
lowercase query terms at the client side (while using a lowercase filter at
index time), e.g. Mail* => mail*
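A sketch of that client-side lowering — only wildcard terms are touched, since plain terms go through the normal analysis chain anyway (the space-splitting is a simplification; a real query parser would respect phrases and field prefixes):

```java
import java.util.Locale;

public class WildcardLower {
    // Wildcard terms bypass analysis, so lowercase them on the client;
    // plain terms are left for the analyzer to handle.
    public static String normalize(String query) {
        StringBuilder out = new StringBuilder();
        for (String term : query.split(" ")) {
            if (out.length() > 0) out.append(' ');
            out.append(term.indexOf('*') >= 0 || term.indexOf('?') >= 0
                    ? term.toLowerCase(Locale.ROOT) : term);
        }
        return out.toString();
    }
}
```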

 
> I discovered that the normal query type doesn't work with wildcards
> and so I'm using the "Filter Query" to query these. 

I don't understand this. Wildcard search works with the q parameter, if
that is what you are asking: &q=mail*

> field my
> queries are much slower (I have some queries like *word* or
> *word1* or
> *word2* that take about one minute to perform)
> Is there a way to optimize these queries (without removing
> the wildcards
> :))?

It is normal for a leading wildcard search to be slow. Using
ReversedWildcardFilterFactory at index time can speed it up.

But it is unusual to use both leading and trailing * operators. Why are
you doing this? 

> c)Is there a way to do phrase queries with wildcards? Like
> "This solr*
> mail*"? Because the tests I made, when using quotes I think
> the wildcards are ignored.

By default it is not supported. With SOLR-1604 is it possible.

> d)How exactly works the pf (phrase fields) and ps (phrase
> slop)
> parameters and what's the difference for the proximity
> searches (ex:
> "word word2"~20)?

These parameters are specific to the dismax query parser. 
http://wiki.apache.org/solr/DisMaxQParserPlugin



  


RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

Thank you. I'll be happy to test it if I manage to install it OK. I'm a
newbie at solr but I'm going to try the instructions in the thread to
load it.

Some other doubts I have about wildcard searches:

a) I think wildcard search is by default "case sensitive"? Is there a
way to make it case insensitive?

b) I have about 6000 queries to run (they could have wildcards, proximity
searches or just normal queries). I discovered that the normal query
type doesn't work with wildcards and so I'm using the "Filter Query" to
query these. Is this field slower? I notice that using this field my
queries are much slower (I have some queries like *word* or *word1* or
*word2* that take about one minute to perform).
Is there a way to optimize these queries (without removing the wildcards
:))?

c) Is there a way to do phrase queries with wildcards? Like "This solr*
mail*"? Because in the tests I made, when using quotes I think the
wildcards are ignored.

d) How exactly do the pf (phrase fields) and ps (phrase slop)
parameters work, and what's the difference from proximity
searches (e.g. "word word2"~20)?

Sorry for the long email and thank you for your help...
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 30 de Julho de 2010 10:57
To: solr-user@lucene.apache.org
Subject: Re: wildcard and proximity searches


> What approach should I use to perform wildcard and proximity
> searches?
> 
>  
> 
> Like: "solr mail*"~10
> 
>  
> 
> For getting docs where solr is within 10 words of "mailing"
> for
> instance?


You can do it with the plug-in described here:
https://issues.apache.org/jira/browse/SOLR-1604
It would be great if you test it and give feedback.



  


wildcard and proximity searches

2010-07-29 Thread Frederico Azeiteiro
Hi,

 

What approach should I use to perform wildcard and proximity searches?

 

Like: "solr mail*"~10

 

For getting docs where solr is within 10 words of "mailing" for
instance?

 

Thanks,

Frederico

 

 



java.lang.NullPointerException

2010-07-13 Thread Frederico Azeiteiro
Hi again,

I changed the search options to decrease my query size, and now I got
past the "URI too long" error from the other thread.

I already added :

819200

819200

 

in the Jetty config, but now I'm stuck again on:

 

13/Jul/2010 9:41:38 org.apache.solr.common.SolrException log

SEVERE: java.lang.NullPointerException

at java.io.StringReader.&lt;init&gt;(Unknown Source)

 

My query string is now about 10,000 chars.

Could this be a memory issue? 

 

Thank You,

Frederico

 



RE: Query: URl too long

2010-07-12 Thread Frederico Azeiteiro
OK, I posted on the SolrNet forum asking how I can shorten the URL string
by using the POST method.

But I'm giving SolrJ a try. I think that may be the right way to do it.


-Original Message-
From: Mauricio Scheffer [mailto:mauricioschef...@gmail.com] 
Sent: segunda-feira, 12 de Julho de 2010 14:31
To: solr-user@lucene.apache.org
Subject: Re: Query: URl too long

Frederico,
This is indeed a SolrNet issue. You can switch to POST in queries by
implementing an ISolrConnection decorator. In the Get() method you'd
build a POST request instead of the standard GET.
Please use the SolrNet forum for further questions about SolrNet.

Cheers,
Mauricio
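Whatever the client library, the underlying fix is the same: send the parameters in the request body rather than in the URL, since Solr's select handler also accepts form-encoded POST requests. A minimal sketch with Python's standard library (the URL, field name, and values are placeholders):

```python
from urllib.parse import urlencode
from urllib.request import Request

# The same parameters that would normally go in the query string
params = {"q": "fieldName:value1 fieldName:value2", "wt": "json"}
body = urlencode(params).encode("ascii")

# Putting the query in the POST body keeps it out of the request line,
# so the server's URI-length limit no longer applies
req = Request("http://localhost:8983/solr/select",
              data=body,
              headers={"Content-Type": "application/x-www-form-urlencoded"})
# urlopen(req) would execute it, but that needs a running Solr instance
```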

On Mon, Jul 12, 2010 at 9:33 AM, kenf_nc 
wrote:

>
> Frederico,
> You should also pose your question on the SolrNet forum,
> http://groups.google.com/group/solrnet?hl=en
> Switching from GET to POST isn't a Solr issue, but a SolrNet issue.
> --
> View this message in context:
>
http://lucene.472066.n3.nabble.com/Query-URl-too-long-tp959990p960208.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: Query: URl too long

2010-07-12 Thread Frederico Azeiteiro
Not an option, because the query has other fields to query as well.
The values are generated from a list of choices (which could grow to 5000
strings of 7 chars each).

I don't know if this could be considered off-topic (please advise...),
but:

I'm doing some tests with Lucene (Lucene.Net 2.9.2), and the results for
date range queries are not similar (0 hits on Lucene, 900 with Solr).
Does Lucene support date range queries?

Thank you for your help.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: segunda-feira, 12 de Julho de 2010 13:16
To: solr-user@lucene.apache.org
Subject: RE: Query: URl too long

> Yes, i guess i can't create an URI
> object that long.
> 
> Can someone remember other options?

You can shorten your string by not repeating OR and the field name, e.g.

"fieldName: value1 OR fieldName: value2 OR fieldName: value3..."

q=value1 value2 value3&q.op=OR&df=fieldName

By the way, how are you generating these value1, value2, etc.? If the above
does not solve your problem, you can embed this logic in a custom
SearchHandler.
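A quick sketch of the size difference (values are placeholders; q.op and df are standard Solr request parameters):

```python
from urllib.parse import urlencode

values = ["value1", "value2", "value3"]  # in practice ~2000 entries

# Verbose form: the field name and OR are repeated for every value
verbose = "fieldName:" + " OR fieldName:".join(values)

# Compact form: df names the field once, q.op supplies the operator once
compact = urlencode({"q": " ".join(values), "q.op": "OR", "df": "fieldName"})

print(len(verbose) > len(compact))  # True, and the gap grows with the list
```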


  


RE: Query: URl too long

2010-07-12 Thread Frederico Azeiteiro
Yes, I guess I can't create a URI object that long.

Can someone suggest other options?
I'm thinking about options that avoid the HTTP request... 

My best bet is using Lucene again for searching, but keeping Solr for indexing.

Do you think this is a good approach? 



-Original Message-----
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: segunda-feira, 12 de Julho de 2010 12:10
To: solr-user@lucene.apache.org
Subject: RE: Query: URl too long

Hi,

A closer look shows that the problem is not in the request but in the
creation of the URI object.

The exception is thrown when trying to access the URI object inside the
UriBuilder.

Trying to Google it, but without luck...


-Original Message-
From: Jon Poulton [mailto:jon.poul...@vyre.com] 
Sent: segunda-feira, 12 de Julho de 2010 11:56
To: solr-user@lucene.apache.org
Subject: Re: Query: URl too long

Hi there,
We had a similar issue. It's an easy fix, simply change the request type
from GET to POST. 

Jon

On 12 Jul 2010, at 11:18, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I need to perform a search using a list of values (about 2000).
> 
> 
> 
> I'm using SolrNET QueryInList function that creates the searchstring
> like:
> 
> 
> 
> "fieldName: value1 OR fieldName: value2 OR fieldName: value3..." (2000
> values)
> 
> 
> 
> This method created a string with about 100 000 chars and the Web
> Request fails with URI too long (C#).
> 
> 
> 
> I'm trying to update an old Lucene app that performs this kind of
> searches. 
> 
> How can I achieve this with Solr?
> 
> 
> 
> What are my options here?
> 
> 
> 
> Thank you,
> 
> Frederico
> 



RE: Query: URl too long

2010-07-12 Thread Frederico Azeiteiro
Hi,

A closer look shows that the problem is not in the request but in the
creation of the URI object.

The exception is thrown when trying to access the URI object inside the
UriBuilder.

Trying to Google it, but without luck...


-Original Message-
From: Jon Poulton [mailto:jon.poul...@vyre.com] 
Sent: segunda-feira, 12 de Julho de 2010 11:56
To: solr-user@lucene.apache.org
Subject: Re: Query: URl too long

Hi there,
We had a similar issue. It's an easy fix, simply change the request type
from GET to POST. 

Jon

On 12 Jul 2010, at 11:18, Frederico Azeiteiro wrote:

> Hi,
> 
> 
> 
> I need to perform a search using a list of values (about 2000).
> 
> 
> 
> I'm using SolrNET QueryInList function that creates the searchstring
> like:
> 
> 
> 
> "fieldName: value1 OR fieldName: value2 OR fieldName: value3..." (2000
> values)
> 
> 
> 
> This method created a string with about 100 000 chars and the Web
> Request fails with URI too long (C#).
> 
> 
> 
> I'm trying to update an old Lucene app that performs this kind of
> searches. 
> 
> How can I achieve this with Solr?
> 
> 
> 
> What are my options here?
> 
> 
> 
> Thank you,
> 
> Frederico
> 



Query: URl too long

2010-07-12 Thread Frederico Azeiteiro
Hi,

 

I need to perform a search using a list of values (about 2000).

 

I'm using the SolrNet QueryInList function, which creates a search string
like:

 

"fieldName: value1 OR fieldName: value2 OR fieldName: value3..." (2000
values)

 

This method creates a string of about 100,000 chars, and the web
request fails with "URI too long" (C#).

 

I'm trying to update an old Lucene app that performs this kind of
search. 

How can I achieve this with Solr?

 

What are my options here?

 

Thank you,

Frederico



RE: steps to improve search

2010-07-02 Thread Frederico Azeiteiro
Thanks Leonardo, I didn't know that tool, very good!

So I see what is wrong:

SnowballPorterFilterFactory and StopFilterFactory (both used at index and 
query time).

I tried removing the Snowball filter and changing the stop filter to "ignoreCase=false" on 
QUERY, and restarted Solr.

But now I get no results :(.

On index analysis I get (result of filters):
paying  for it
paying
paying
paying
pay

For Query analysis (result of filters):
paying  for it
paying  for it
paying
paying
paying

This means that, in the end, the word indexed is "pay" and the word searched is 
"paying"?

Is it necessary to reindex the data?

Thanks
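A toy illustration of what the analysis screens above show (this is not the real Porter/Snowball algorithm, just a made-up suffix stripper): once the index-time chain stems and the query-time chain does not, the indexed term and the query term no longer line up, which is why the data has to be reindexed whenever the index-time analyzer changes.

```python
def toy_stem(word: str) -> str:
    # Crude stand-in for a stemmer: strip a few common suffixes
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

indexed_term = toy_stem("paying")   # index-time chain still stems -> "pay"
query_term = "paying"               # query-time chain no longer stems
print(indexed_term, query_term, indexed_term == query_term)
```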

-Original Message-
From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] 
Sent: sexta-feira, 2 de Julho de 2010 12:58
To: solr-user@lucene.apache.org
Subject: Re: steps to improve search

most likely due to:
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
StopFilterFactory

you get those "fake" matches. Try going into the admin, in the analysis
section. In there you can "simulate" the indexing/searching of a document, and see
how it's actually searched/indexed. It will give you some clues...

On Fri, Jul 2, 2010 at 1:50 PM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> For the example given, I need the full expression "paying for it", so
> yes all the words.
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: sexta-feira, 2 de Julho de 2010 12:30
> To: solr-user@lucene.apache.org
> Subject: RE: steps to improve search
>
> > I need to know how to achieve more accurates queries (like
> > the example below...) using these filters.
>
> do you want that all terms - you search - must appear in returned
> documents?
>
> You can change default operator of QueryParser to AND. either in
> schema.xml or appending &q.op=AND you your search url. I am assuming you
> are not using dismax.
>
>
>
>


RE: steps to improve search

2010-07-02 Thread Frederico Azeiteiro
For the example given, I need the full expression "paying for it", so
yes all the words.
-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 2 de Julho de 2010 12:30
To: solr-user@lucene.apache.org
Subject: RE: steps to improve search

> I need to know how to achieve more accurates queries (like
> the example below...) using these filters.

do you want that all terms - you search - must appear in returned
documents?

You can change default operator of QueryParser to AND. either in
schema.xml or appending &q.op=AND you your search url. I am assuming you
are not using dismax.


  


RE: steps to improve search

2010-07-02 Thread Frederico Azeiteiro
I'm using double quotes (") surrounding the text.

My Query: Headline:("paying for it") on solr admin interface

Some results:
...l stop paying tax until council pays for dam...
"Why paying extra doesn't always pay!"
"...pay cut as M&S investor pressure pays off"
"Can't pay or won't pay: the debt collector call"

What could be wrong here?
Thanks.
-Original Message-
From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] 
Sent: sexta-feira, 2 de Julho de 2010 12:30
To: solr-user@lucene.apache.org
Subject: Re: steps to improve search

No, you explained it all right, but then didn't understand the answer. Searching
with the " surrounding the text you are searching for has exactly the
effect you are looking for. Try it...

On Fri, Jul 2, 2010 at 1:23 PM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> I'm sorry, maybe I didn’t explain correctly.
>
> The issue is using the default text FIELD TYPE, not the default text FIELD.
> The "text" field type uses a  lot of filters on indexing.
> I need to know how to achieve more accurates queries (like the example
> below...) using these filters.
>
>
> -Original Message-
> From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com]
> Sent: sexta-feira, 2 de Julho de 2010 12:07
> To: solr-user@lucene.apache.org
> Subject: Re: steps to improve search
>
> Try
> field:"text to search"
>
> On Fri, Jul 2, 2010 at 12:57 PM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > Hi,
> >
> > I'm using the default text field type on my schema.
> >
> >
> >
> > Is there a quick way to do more accurate searches like searching for
> > "paying for it" only return docs with the full expression "paying for
> > it",  and not return articles with word "pay" as it does now?
> >
> >
> >
> > Thanks,
> >
> > Frederico
> >
> >
>


RE: steps to improve search

2010-07-02 Thread Frederico Azeiteiro
I'm sorry, maybe I didn't explain correctly. 

The issue is using the default text FIELD TYPE, not the default text FIELD.
The "text" field type uses a lot of filters on indexing. 
I need to know how to achieve more accurate queries (like the example 
below...) using these filters.


-Original Message-
From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] 
Sent: sexta-feira, 2 de Julho de 2010 12:07
To: solr-user@lucene.apache.org
Subject: Re: steps to improve search

Try
field:"text to search"

On Fri, Jul 2, 2010 at 12:57 PM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi,
>
> I'm using the default text field type on my schema.
>
>
>
> Is there a quick way to do more accurate searches like searching for
> "paying for it" only return docs with the full expression "paying for
> it",  and not return articles with word "pay" as it does now?
>
>
>
> Thanks,
>
> Frederico
>
>


steps to improve search

2010-07-02 Thread Frederico Azeiteiro
Hi,

I'm using the default text field type on my schema.

 

Is there a quick way to do more accurate searches, so that searching for
"paying for it" only returns docs with the full expression "paying for
it", and does not return articles with the word "pay" as it does now?

 

Thanks,

Frederico



RE: Where to check optimize status

2010-06-29 Thread Frederico Azeiteiro
Thank you, but I didn't find anything like a "Merge Thread", and I continued to 
have the lock file. 

The segments were not merged, so I stopped Solr and restarted it. 
The lock disappeared, but I guess the optimization didn't complete. 

I'll try again tomorrow.

-Original Message-
From: Alexander Rothenberg [mailto:a.rothenb...@fotofinder.net] 
Sent: terça-feira, 29 de Junho de 2010 12:20
To: solr-user@lucene.apache.org
Subject: Re: Where to check optimize status

To determine if the optimize is still in progress, you can look at the 
admin frontend on the "THREAD DUMP" page for something like a "Lucene Merge 
Thread". If it's there, then the optimize is still running. Also, the index file size 
and file names in your index dir will be changing a lot... 


On Tuesday 29 June 2010 12:54:54 Frederico Azeiteiro wrote:
> Hi,
> I'm using solr1.4.0 default installation.
> Is there a place where I can find the optimization status.
> I sent a optimize http request and it should had finish by now, but I
> still see the lock file on index folder.
> Can I see somewhere if the optimization is still running?
> Thanks,
> Frederico Azeiteiro


-- 
Alexander Rothenberg
Fotofinder GmbH USt-IdNr. DE812854514
Software Entwicklung    Web: http://www.fotofinder.net/
Potsdamer Str. 96       Tel: +49 30 25792890
10785 Berlin            Fax: +49 30 257928999

Geschäftsführer:Ali Paczensky
Amtsgericht:Berlin Charlottenburg (HRB 73099)
Sitz:   Berlin


Where to check optimize status

2010-06-29 Thread Frederico Azeiteiro
Hi,

 

I'm using solr1.4.0 default installation.

 

Is there a place where I can find the optimization status? 

I sent an optimize HTTP request and it should have finished by now, but I
still see the lock file in the index folder.

 

Can I see somewhere if the optimization is still running?

 

Thanks,

 

Frederico Azeiteiro



RE: run on reboot on windows

2010-05-03 Thread Frederico Azeiteiro
Hi Ahmed,

I need to achieve that also. Did you manage to install it as a service and
start Solr with Jetty?
After installing and starting Jetty as a service, how do you start Solr?

Thanks,
Frederico

-Original Message-
From: S Ahmed [mailto:sahmed1...@gmail.com] 
Sent: segunda-feira, 3 de Maio de 2010 01:05
To: solr-user@lucene.apache.org
Subject: Re: run on reboot on windows

Thanks, for some reason I was looking for a solution outside of
jetty/tomcat, when that was the obvious way to get things restarted :)

On Sun, May 2, 2010 at 7:53 PM, Dave Searle
wrote:

> Tomcat is installed as a service on windows. Just go into service
> control panel and set startup type to automatic
>
> Sent from my iPhone
>
> On 3 May 2010, at 00:43, "S Ahmed"  wrote:
>
> > its not tomcat/jetty that's the issue, its how to get things to re-
> > start on
> > a windows server (tomcat and jetty don't run as native windows
> > services) so
> > I am a little confused..thanks.
> >
> > On Sun, May 2, 2010 at 7:37 PM, caman
> > wrote:
> >
> >>
> >> Ahmed,
> >>
> >>
> >>
> >> Best is if you take a look at the documentation of jetty or tomcat.
> >> SOLR
> >> can
> >> run on any web container, it's up to you how you  configure your
web
> >> container to run
> >>
> >>
> >>
> >> Thanks
> >>
> >> Aboxy
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> From: S Ahmed [via Lucene]
> >>
> >> [mailto:ml-node+772174-2097041460-124...@n3.nabble.com]
> >> Sent: Sunday, May 02, 2010 4:33 PM
> >> To: caman
> >> Subject: Re: run on reboot on windows
> >>
> >>
> >>
> >> By default it uses Jetty, so your saying Tomcat on windows server
> >> 2008/
> >> IIS7
> >>
> >> runs as a native windows service?
> >>
> >> On Sun, May 2, 2010 at 12:46 AM, Dave Searle <[hidden email]>wrote:
> >>
> >>
> >>> Set tomcat6 service to auto start on boot (if running tomat)
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On 2 May 2010, at 02:31, "S Ahmed" <[hidden email]> wrote:
> >>>
>  Hi,
> 
>  I'm trying to get Solr to run on windows, such that if it reboots
>  the Solr
>  service will be running.
> 
>  How can I do this?
> >>>
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
>
http://lucene.472066.n3.nabble.com/run-on-reboot-on-windows-tp770892p772178.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>


search within sentence or paragraph

2010-03-30 Thread Frederico Azeiteiro
Hi all,

 

Is it possible to search for a combination of words within the same
sentence or paragraph?

 

Ex: American and McDonalds 

 

Returns: "McDonalds is an American company"

Does not return: "...went to McDonalds. After that we saw the American
flag..."

 

Is this possible?

Frederico Azeiteiro



RE: Cant commit on 125 GB index

2010-03-13 Thread Frederico Azeiteiro
Yes, the HTTP request is timing out even when using timeout values of 10m.
 
Normally the commit takes about 10s. I did an optimize (it took 6h) and it 
looks good for now... 
 
59m? Well, I didn't wait that long; I restarted the Solr instance and tried 
again.
 
I'll try to use autocommit in the near future. 
 
Using autocommit, how can I check how many commits are happening at the moment 
and when they started? Is there a way to monitor and know what is happening 
behind the scenes in "real time"?
 
I'm using Solr 1.4 with Jetty.



From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Sat 13-03-2010 23:31
To: solr-user@lucene.apache.org
Subject: Re: Cant commit on 125 GB index



What is timing out? The external HTTP request? Commit times are a
sawtooth and slowly increase. My record is 59 minutes, but I was doing
benchmarking.

On Thu, Mar 11, 2010 at 1:46 AM, Frederico Azeiteiro
 wrote:
> Hi,
>
> I'm having timeouts commiting on a 125 GB index with about 2200
> docs.
>
>
>
> I'm inserting new docs every 5m and commiting after that.
>
>
>
> I would like to try the autocommit option and see if I can get better
> results. I need the docs indexed available for searching in about 10
> minutes after the insert.
>
>
>
> I was thinking of using something like
>
>
>
> <autoCommit>
>   <maxDocs>5000</maxDocs>
>   <maxTime>86000</maxTime>
> </autoCommit>
>
>
>
> I update about 4000 docs every 15m.
>
>
>
> Can you share your thoughts on this config?
>
> Do you think this will solve my commits timeout problem?
>
>
>
> Thanks,
>
> Frederico
>
>



--
Lance Norskog
goks...@gmail.com
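One detail worth checking for the timeout itself (my own suggestion, not something from this thread): in Solr 1.4 the commit update message accepts waitFlush and waitSearcher attributes, so the HTTP call can return before the new searcher has finished warming. A sketch with Python's standard library (the URL is a placeholder):

```python
from urllib.request import Request

# With waitSearcher="false" the request returns without blocking until
# the new searcher is registered; the commit still completes server-side,
# so the client-side HTTP timeout is far less likely to trip.
xml = b'<commit waitFlush="false" waitSearcher="false"/>'
req = Request("http://localhost:8983/solr/update",
              data=xml,
              headers={"Content-Type": "text/xml"})
# urlopen(req) would send it; omitted here since it needs a live Solr
```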




Cant commit on 125 GB index

2010-03-11 Thread Frederico Azeiteiro
Hi, 

I'm having timeouts committing on a 125 GB index with about 2200
docs.

 

I'm inserting new docs every 5m and committing after that.

 

I would like to try the autocommit option and see if I can get better
results. I need the docs indexed available for searching in about 10
minutes after the insert.

 

I was thinking of using something like

 



<autoCommit>
  <maxDocs>5000</maxDocs>
  <maxTime>86000</maxTime>
</autoCommit>



 

I update about 4000 docs every 15m.

 

Can you share your thoughts on this config?

Do you think this will solve my commit timeout problem?

 

Thanks,

Frederico



RE: search and count ocurrences

2010-03-10 Thread Frederico Azeiteiro
Thanks Chris.

Could something like that be implemented in C#? :)

Does anyone have any link where I can start digging?
This is not an urgent matter, just something to investigate and implement in 
the near future.

Frederico


-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: terça-feira, 9 de Março de 2010 23:39
To: solr-user@lucene.apache.org
Subject: Re: search and count ocurrences


: I need to implement a search where i should count the number of times
: the string appears on the search field, 
: 
: ie: only return articles that mention the word 'HP' at least 2x.
...
: Is there a way that SOLR does this type of operation for me?

you'd have to implement it in a custom QParser -- if all you are worried 
about is simple TermQuery style matches, then this should be fairly 
trivial using SpanNearQuery.


-Hoss
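For reference, the client-side post-filtering Frederico describes can be as simple as the following sketch (the field name and threshold are hypothetical; this counts whole-word, case-insensitive occurrences in the returned documents):

```python
import re

def mentions_at_least(text: str, word: str, n: int) -> bool:
    # Count whole-word, case-insensitive occurrences of `word` in `text`
    return len(re.findall(r"\b%s\b" % re.escape(word), text, re.IGNORECASE)) >= n

docs = [{"Headline": "HP releases HP printer"}, {"Headline": "HP only once"}]
kept = [d for d in docs if mentions_at_least(d["Headline"], "HP", 2)]
print(len(kept))  # 1
```

The server-side SpanNearQuery approach avoids fetching documents that will be thrown away, but the above works with any stock Solr.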



search and count ocurrences

2010-03-02 Thread Frederico Azeiteiro
Hi,

I need to implement a search where I count the number of times
a string appears in the searched field, 

i.e.: only return articles that mention the word 'HP' at least 2x.

 

I'm currently doing this after the SOLR search with my own methods. 

Is there a way that SOLR does this type of operation for me?

 

Thanks,

Frederico

 



fieldType "text"

2010-03-02 Thread Frederico Azeiteiro
Hi,

I'm using the default "text"  field type that comes with the example.

 

When searching for simple words such as 'HP' or 'TCS', Solr is returning
results that contain 'HP1' or 'T&CS'.

Is there a way to avoid this?

 

Thanks,

Frederico



RE: Solrsharp

2010-02-27 Thread Frederico Azeiteiro
Hi Sachin,
 
Yes, I had to make some patches too (range queries didn't work very well...) and 
yes, I thought about changing the ParameterJoin. 
 
I'm already using SolrSharp for indexing without problems, but for searches I 
guess I'm going to give SolrNet a try, mostly because of the lack of 
support/feedback with SolrSharp... 
 
Does anyone around here use C# with Solr? What client do you use?
 
Another idea is to perform the search directly on the Solr index using the 
Lucene API. I previously used the C# Lucene API for a long time without any 
problems. Maybe it's the quicker solution...
 
Thanks for your feedback.



From: Sachin [mailto:sachinni...@aim.com]
Sent: Sat 27-02-2010 12:04
To: solr-user@lucene.apache.org
Subject: Re: Solrsharp




 Solr# does not have built-in support for "NOT" searches; you would have to 
tweak the Solr# library to do that (take a look at how the ParameterJoin is 
used, and add one for Not). I have faced quite a few issues with using Solr# in the 
past, like unclosed TCP connections, no spellchecker, JSON support, etc., and had 
to patch it quite frequently. I guess your best bet would be to take a look at 
some other client like SolrNet: http://code.google.com/p/solrnet/. Disclaimer: 
I haven't evaluated SolrNet myself, but it looks to be more robust than Solr# 
and is more actively maintained.

S








-Original Message-
From: Frederico Azeiteiro 
To: solr-user@lucene.apache.org
Sent: Fri, Feb 26, 2010 9:54 pm
Subject: Solrsharp


Hi,



I don't know if this list includes this kind of help, but I'm using
SolrSharp with C# to operate Solr. Please advise if this is off-topic.



I'm having a little trouble making a search with excluded terms using
the query parameters.



Does anyone here use SolrSharp? Did you manage to exclude terms
in searches?



Br

Frederico











Solrsharp

2010-02-26 Thread Frederico Azeiteiro
Hi,

 

I don't know if this list includes this kind of help, but I'm using
SolrSharp with C# to operate Solr. Please advise if this is off-topic.

 

I'm having a little trouble making a search with excluded terms using
the query parameters.

 

Does anyone here use SolrSharp? Did you manage to exclude terms
in searches?

 

Br

Frederico

 

 



Reindex after changing defaultSearchField?

2010-02-17 Thread Frederico Azeiteiro
Hi,

 

If I change the "defaultSearchField" in the core schema, do I need to
recreate the index?

 

Thanks,

Frederico

 



RE: query all filled field?

2010-02-04 Thread Frederico Azeiteiro
I've analyzed my indexing application and checked the XML before executing the 
HTTP request, and the field is empty:



It should be empty in Solr. 

Probably something on the way between my application (.NET) and Solr (Jetty 
on Ubuntu) adds the whitespace.

Anyway, I'll try to remove the field, but as I validate each doc against the Solr 
schema, I must make some adjustments and stop validating the doc. 
I don't know if that will be acceptable...

Thanks for your help.


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 17:22
To: solr-user@lucene.apache.org
Subject: RE: query all filled field?

> XML update. I'm serializing the doc
> in .NET, and then using solsharp to
> insert/update the doc to SOLR.
> 
> The result is:
> 
> 
>     
> 
> 
> Dows this means I'm adding a whitespace on XML Update?

Yes exactly. You can remove   from your 
 
 
...
 
 
if value of fieldX.trim() is equal to "" when preparing your xml.


  


RE: query all filled field?

2010-02-04 Thread Frederico Azeiteiro
XML update. I'm serializing the doc in .NET, and then using SolrSharp to
insert/update the doc in Solr.

The result is:



Does this mean I'm adding a whitespace on XML update?

Frederico


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 16:26
To: solr-user@lucene.apache.org
Subject: RE: query all filled field?


> Theoretically yes,it's correct, but i
> have about 1/10 of the docs with
> this field not empty and the rest is empty.
> 
> Most of the articles have the field empty as I can see when
> query *:*.

How are you adding documents to Solr? XML update, DIH?

Probably you are adding a whitespace value to that field.

When you query q=*:*&fl=fieldX, what do you see?
Do you see 





or 





  


RE: query all filled field?

2010-02-04 Thread Frederico Azeiteiro
Theoretically yes, it's correct, but I have about 1/10 of the docs with
this field not empty and the rest empty.

Most of the articles have the field empty, as I can see when querying *:*.

So the queries don't make sense...

-Original Message-
From: Ankit Bhatnagar [mailto:abhatna...@vantage.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 14:56
To: 'solr-user@lucene.apache.org'
Subject: RE: query all filled field?


That's correct.
If you want to find "missing values", 
i.e. fields for which a value is not present, then you will use -fieldX:[* TO *]


Ankit

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, February 04, 2010 9:41 AM
To: solr-user@lucene.apache.org
Subject: RE: query all filled field?


> *:* AND -fieldX:[* TO *] - returns 0 docs
> 
> fieldX:(a*) - return docs, so I'm sure that there's docs
> with this field filled.
> 
> Any other ideias what could be wrong?

There is nothing wrong in this scenario. 
If -fieldX:[* TO *] returns 0 docs, it means that all of your documents
have that fieldX filled. Therefore fieldX:[* TO *] returns your whole
dataset, the same as *:* 



  


RE: query all filled field?

2010-02-04 Thread Frederico Azeiteiro
I tried another one:

fieldX:["" TO *], and it returns articles with the field filled :), so I guess 
I'm getting there. 
But I also tried fieldX:[" " TO *] and got a few more results than the first 
one...

Is there a real difference between these, and are the results really 
all docs with the field not empty? 

Thanks again,
Frederico
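On the difference between the two ranges: for a string field, range endpoints are compared lexicographically against the raw indexed terms, and the empty string sorts before a single space, which sorts before any visible character, so the two queries have different lower bounds. Python compares strings by code point the same way, so the ordering can be sanity-checked directly:

```python
# Ordering that the [x TO *] lower bounds rely on:
# ""  <  " "  <  "0"  <  "A"  <  "a"
terms = ["a", "", "A", " ", "0"]
print(sorted(terms))  # ['', ' ', '0', 'A', 'a']
```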

-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 10:55
To: solr-user@lucene.apache.org
Subject: RE: query all filled field?

Thanks, but still no luck with that:

*:* AND -fieldX:[* TO *] - returns 0 docs

fieldX:(a*) - returns docs, so I'm sure there are docs with this field filled.

Any other ideas what could be wrong?

Frederico

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 05:38
To: solr-user@lucene.apache.org
Subject: Re: query all filled field?

Queries that start with minus or NOT don't work. You have to do this:
 *:* AND -fieldX:[* TO *]

On Wed, Feb 3, 2010 at 5:04 AM, Frederico Azeiteiro
 wrote:
> Hum, strange.. I reindexed some docs with the field corrected.
>
> Now I'm sure the field is filled because:
>
> "fieldX:(*a*)" returns docs.
>
> But "fieldX:[* TO *]" is returning the same as "*.*" (all results)
>
> I tried with "-fieldX:[* TO *]" and I get no results at all.
>
> I wonder if someone has tried this before with success?
>
> The field is indexed as string, indexed=true and stored=true.
>
> Thanks,
> Frederico
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: quarta-feira, 3 de Fevereiro de 2010 11:48
> To: solr-user@lucene.apache.org
> Subject: Re: query all filled field?
>
>
>> Is it possible to query some field in order to get only not
>> empty
>> documents?
>>
>>
>>
>> All documents where field x is filled?
>
> Yes. q=x:[* TO *] will bring documents that has non-empty x field.
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


RE: query all filled field?

2010-02-04 Thread Frederico Azeiteiro
Thanks, but still no luck with that:

*:* AND -fieldX:[* TO *] - returns 0 docs

fieldX:(a*) - returns docs, so I'm sure there are docs with this field filled.

Any other ideas what could be wrong?

Frederico

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: quinta-feira, 4 de Fevereiro de 2010 05:38
To: solr-user@lucene.apache.org
Subject: Re: query all filled field?

Queries that start with minus or NOT don't work. You have to do this:
 *:* AND -fieldX:[* TO *]

On Wed, Feb 3, 2010 at 5:04 AM, Frederico Azeiteiro
 wrote:
> Hum, strange.. I reindexed some docs with the field corrected.
>
> Now I'm sure the field is filled because:
>
> "fieldX:(*a*)" returns docs.
>
> But "fieldX:[* TO *]" is returning the same as "*.*" (all results)
>
> I tried with "-fieldX:[* TO *]" and I get no results at all.
>
> I wonder if someone has tried this before with success?
>
> The field is indexed as string, indexed=true and stored=true.
>
> Thanks,
> Frederico
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: quarta-feira, 3 de Fevereiro de 2010 11:48
> To: solr-user@lucene.apache.org
> Subject: Re: query all filled field?
>
>
>> Is it possible to query some field in order to get only not
>> empty
>> documents?
>>
>>
>>
>> All documents where field x is filled?
>
> Yes. q=x:[* TO *] will bring documents that has non-empty x field.
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


RE: query all filled field?

2010-02-03 Thread Frederico Azeiteiro
Hmm, strange... I reindexed some docs with the field corrected.

Now I'm sure the field is filled, because:

"fieldX:(*a*)" returns docs.

But "fieldX:[* TO *]" is returning the same as "*:*" (all results).

I tried with "-fieldX:[* TO *]" and I get no results at all.

I wonder if someone has tried this before with success?

The field is indexed as string, indexed=true and stored=true.

Thanks,
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: quarta-feira, 3 de Fevereiro de 2010 11:48
To: solr-user@lucene.apache.org
Subject: Re: query all filled field?


> Is it possible to query some field in order to get only not
> empty
> documents?
> 
>  
> 
> All documents where field x is filled?

Yes. q=x:[* TO *] will bring documents that has non-empty x field.


  


RE: query all filled field?

2010-02-03 Thread Frederico Azeiteiro
OK, if anyone needs it:

I tried fieldX:[* TO *].
I think this is correct.

In my case, I found out that I was not indexing this field correctly,
because the values are all empty. :)



-Original Message-
From: Frederico Azeiteiro [mailto:frederico.azeite...@cision.com] 
Sent: quarta-feira, 3 de Fevereiro de 2010 11:34
To: solr-user@lucene.apache.org
Subject: query all filled field?

Hi all,

 

Is it possible to query some field in order to get only not empty
documents?

 

All documents where field x is filled?

 

Thanks,

Frederico

 

 

 

 



query all filled field?

2010-02-03 Thread Frederico Azeiteiro
Hi all,

 

Is it possible to query some field in order to get only not empty
documents?

 

All documents where field x is filled?

 

Thanks,

Frederico

 

 

 

 



RE: Problem comitting on 40GB index

2010-01-13 Thread Frederico Azeiteiro
The hanging hasn't happened again since yesterday. I never ran out of space 
again. This is still a dev environment, so the number of searches is very low. 
Maybe I'm just lucky...

Where can I see the garbage collection info?

-Original Message- 
From: Marc Des Garets [mailto:marc.desgar...@192.com] 
Sent: Wednesday, 13 January 2010 17:20
To: solr-user@lucene.apache.org
Subject: RE: Problem committing on 40GB index

Just curious, have you checked whether the hanging you are experiencing is
garbage collection related?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: 13 January 2010 13:33
To: solr-user@lucene.apache.org
Subject: Re: Problem committing on 40GB index

That's my understanding... But fortunately, disk space is cheap.


On Wed, Jan 13, 2010 at 5:01 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Sorry, my bad... I replied to an existing mailing list message, only changing
> the subject... I didn't know about this "hijacking" problem. It will not happen
> again.
>
> Just to close this issue: if I understand correctly, for a 40G index I will
> need, in order to run an optimize:
> - 40G if all activity on the index is stopped
> - 80G if the index is being searched
> - 120G if the index is being searched and a commit is performed.
>
> Is this correct?
>
> Thanks.
> Frederico
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, 12 January 2010 19:18
> To: solr-user@lucene.apache.org
> Subject: Re: Problem committing on 40GB index
>
> Huh?
>
> On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
> wrote:
>
> >
> > : Subject: Problem committing on 40GB index
> > : In-Reply-To: <
> > 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>
> >
> > http://people.apache.org/~hossman/#threadhijack
> > Thread Hijacking on Mailing Lists
> >
> > When starting a new discussion on a mailing list, please do not reply to
> > an existing message, instead start a fresh email.  Even if you change the
> > subject line of your email, other mail headers still track which thread
> > you replied to and your question is "hidden" in that thread and gets less
> > attention.   It makes following discussions in the mailing list archives
> > particularly difficult.
> > See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
> >
> >
> >
> > -Hoss
> >
> >
>


RE: Problem committing on 40GB index

2010-01-13 Thread Frederico Azeiteiro
Sorry, my bad... I replied to an existing mailing list message, only changing the
subject... I didn't know about this "hijacking" problem. It will not happen again.

Just to close this issue: if I understand correctly, for a 40G index I will need,
in order to run an optimize:
- 40G if all activity on the index is stopped
- 80G if the index is being searched
- 120G if the index is being searched and a commit is performed.

Is this correct?

Thanks.
Frederico
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, 12 January 2010 19:18
To: solr-user@lucene.apache.org
Subject: Re: Problem committing on 40GB index

Huh?

On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
wrote:

>
> : Subject: Problem committing on 40GB index
> : In-Reply-To: <
> 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email.  Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention.   It makes following discussions in the mailing list archives
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
>
>
>
> -Hoss
>
>
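The disk-space rule of thumb discussed in this thread can be written out as a small helper. This is a sketch of the worst-case arithmetic only; the multipliers follow Erick's rule of thumb from the thread, not measured Lucene behavior:

```python
# Worst-case extra disk needed to optimize an index, per the rule of
# thumb in this thread (assumption, not a measured Lucene guarantee).
def optimize_headroom_gb(index_gb, searchers_open=False, committing=False):
    """Rough upper bound on disk needed to run an optimize."""
    factor = 1                  # the rewritten copy of the index itself
    if searchers_open:
        factor += 1             # old segments pinned by open readers
    if committing:
        factor += 1             # an additional in-flight copy during commit
    return index_gb * factor

print(optimize_headroom_gb(40))                                      # 40
print(optimize_headroom_gb(40, searchers_open=True))                 # 80
print(optimize_headroom_gb(40, searchers_open=True, committing=True))  # 120
```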


RE: Problem committing on 40GB index

2010-01-12 Thread Frederico Azeiteiro
I restarted Solr and stopped all searches. After that, the commit was normal
(2 secs), and it's been working for 3h without problems (indexing and a few
searches too)... I haven't done an optimize yet, mainly because I have no deletes
on the index and performance is OK, so I don't think I need to optimize.

I had tried this procedure a few times in the morning and the commit always hung,
so I have no explanation for why it suddenly started working.
I'm committing every 2 minutes (because I need the results updated on searches),
so probably when I have more simultaneous searches the commit will hang again,
right?

Sorry for the newbie questions, and thanks for your help and explanation, Erick.

BR, 
Frederico

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, 12 January 2010 15:15
To: solr-user@lucene.apache.org
Subject: Re: Problem committing on 40GB index

Rebooting the machine certainly closes the searchers, but
depending upon how you shut it down there may be stale files.
After reboot (but before you start Solr), how much space
is used on your disk? If it's 40G, you have no stale files.
Yes, IR is IndexReader, which is a searcher.

I'll have to leave it to others if you don't have stale files
hanging around, although if you're optimizing while
searchers are running, you'll use up to 3X the index size...

Otherwise I'll have to leave it to others for additional insights

Best
Erick

On Tue, Jan 12, 2010 at 9:22 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi Erick,
>
> I'm a newbie to Solr... By IR, you mean a searcher? Is there a place where I
> can check the open searchers? And shouldn't rebooting the machine have closed
> those searchers?
>
> Thanks,
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, 12 January 2010 13:54
> To: solr-user@lucene.apache.org
> Subject: Re: Problem committing on 40GB index
>
> There are several possibilities:
>
> 1> you have some process holding open your indexes, probably
> other searchers. You *probably* are OK just committing
> new changes if there is exactly *one* searcher keeping
> your index open. If you have some process whereby
> you periodically open a new search but you fail to close
> the old one, then you'll use up an extra 40G for every
> version of your index held open by your processes. That's
>confusing... I'm saying that if you open any number of IRs,
>you'll have 40G consumed. Then if you add
>some more documents and open *another* IR,  you'll have
>another 40G consumed. They'll stay around until you close
>your readers.
>
> 2> If you optimize, there can be up to 3X the index size being
>consumed if you also have a previous reader opened.
>
> So I suspect that sometime recently you've opened another
> IR.
>
> HTH
> Erick
>
>
>
> On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
> frederico.azeite...@cision.com> wrote:
>
> > Hi all,
> >
> > I started working with Solr about a month ago, and everything was
> > running well, both indexing and searching documents.
> >
> > I have a 40GB index with about 10,000,000 documents available. I index
> > 3K docs every 10 minutes and commit after each insert.
> >
> > Since yesterday, I can't commit any articles to the index. I can search
> > fine, and index documents without committing. But when I start the commit,
> > it takes a long time and eats all of the available disk space
> > left (60GB). The commit eventually stops with a full disk, and I have to
> > restart Solr to get the 60GB returned to the system.
> >
> > Before this, the commit took a few seconds to complete.
> >
> > Can someone help me debug the problem? Where should I start? Should I
> > try to copy the index to another machine with more free space and try to
> > commit? Should I try an optimize?
> >
> > Log for the last commit I tried:
> >
> > INFO: start
> > commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> > alse)
> > (Then, after a long time...)
> > Exception in thread "Lucene Merge Thread #0"
> > org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> > No space left on device
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> > ncurrentMergeScheduler.java:351)
> >at
> > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> > entMergeScheduler.java:315)
> > Caused by: java.io.IOException: No space left on device
> >
> > I'm using Ubuntu 9.04 and Solr 1.4.0.
> >
> > Thanks in advance,
> >
> > Frederico
> >
>
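Erick's point about open IndexReaders can be sketched as a tiny disk model. This is a deliberately simplified model of segment generations and pinned files (an assumption for illustration, not Lucene's actual file-handling code):

```python
# Simplified model: deleted segment files are only reclaimable once no
# open reader pins them, so each open reader can hold a full 40G
# generation of the index alive on disk.
live_files = {"segments_gen1": 40}      # file name -> size in GB
reader_pins = [set(live_files)]         # an open reader pins generation 1

# A commit writes a new generation; gen1 is logically deleted, but its
# files cannot be reclaimed while the reader still pins them.
live_files["segments_gen2"] = 40
pinned = set().union(*reader_pins)
deletable = set(live_files) - pinned - {"segments_gen2"}

disk_used = sum(live_files.values())
print(disk_used)          # 80 GB on disk for a "40 GB" index
print(sorted(deletable))  # nothing reclaimable while the reader is open
```

Closing (or reopening and releasing) the old reader is what lets the first 40G be reclaimed, which matches the behaviour described above.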


RE: Problem committing on 40GB index

2010-01-12 Thread Frederico Azeiteiro
Hi Erick,

I'm a newbie to Solr... By IR, you mean a searcher? Is there a place where I can
check the open searchers? And shouldn't rebooting the machine have closed those
searchers?

Thanks,

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, 12 January 2010 13:54
To: solr-user@lucene.apache.org
Subject: Re: Problem committing on 40GB index

There are several possibilities:

1> you have some process holding open your indexes, probably
 other searchers. You *probably* are OK just committing
 new changes if there is exactly *one* searcher keeping
 your index open. If you have some process whereby
 you periodically open a new search but you fail to close
 the old one, then you'll use up an extra 40G for every
 version of your index held open by your processes. That's
confusing... I'm saying that if you open any number of IRs,
you'll have 40G consumed. Then if you add
some more documents and open *another* IR,  you'll have
another 40G consumed. They'll stay around until you close
your readers.

2> If you optimize, there can be up to 3X the index size being
consumed if you also have a previous reader opened.

So I suspect that sometime recently you've opened another
IR.....

HTH
Erick



On Tue, Jan 12, 2010 at 8:03 AM, Frederico Azeiteiro <
frederico.azeite...@cision.com> wrote:

> Hi all,
>
> I started working with Solr about a month ago, and everything was
> running well, both indexing and searching documents.
>
> I have a 40GB index with about 10,000,000 documents available. I index
> 3K docs every 10 minutes and commit after each insert.
>
> Since yesterday, I can't commit any articles to the index. I can search
> fine, and index documents without committing. But when I start the commit,
> it takes a long time and eats all of the available disk space
> left (60GB). The commit eventually stops with a full disk, and I have to
> restart Solr to get the 60GB returned to the system.
>
> Before this, the commit took a few seconds to complete.
>
> Can someone help me debug the problem? Where should I start? Should I
> try to copy the index to another machine with more free space and try to
> commit? Should I try an optimize?
>
> Log for the last commit I tried:
>
> INFO: start
> commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
> alse)
> (Then, after a long time...)
> Exception in thread "Lucene Merge Thread #0"
> org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
> No space left on device
>at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
> ncurrentMergeScheduler.java:351)
>at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
> entMergeScheduler.java:315)
> Caused by: java.io.IOException: No space left on device
>
> I'm using Ubuntu 9.04 and Solr 1.4.0.
>
> Thanks in advance,
>
> Frederico
>


Problem committing on 40GB index

2010-01-12 Thread Frederico Azeiteiro
Hi all,

I started working with Solr about a month ago, and everything was
running well, both indexing and searching documents.

I have a 40GB index with about 10,000,000 documents available. I index
3K docs every 10 minutes and commit after each insert.

Since yesterday, I can't commit any articles to the index. I can search
fine, and index documents without committing. But when I start the commit,
it takes a long time and eats all of the available disk space
left (60GB). The commit eventually stops with a full disk, and I have to
restart Solr to get the 60GB returned to the system.

Before this, the commit took a few seconds to complete.

Can someone help me debug the problem? Where should I start? Should I
try to copy the index to another machine with more free space and try to
commit? Should I try an optimize?

Log for the last commit I tried:

INFO: start
commit(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=f
alse)
(Then, after a long time...)
Exception in thread "Lucene Merge Thread #0"
org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException:
No space left on device
at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(Co
ncurrentMergeScheduler.java:351)
at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(Concurr
entMergeScheduler.java:315)
Caused by: java.io.IOException: No space left on device

I'm using Ubuntu 9.04 and Solr 1.4.0.

Thanks in advance,

Frederico
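Since the failure mode in the log above is a merge running out of disk mid-commit, one defensive option is to check free space before issuing a commit. This is only a sketch: the 2x margin is the rule of thumb from this thread, not a Solr guarantee, and the index path is hypothetical:

```python
# Sketch: refuse to commit unless the filesystem holding the index has
# enough headroom for merges (margin is an assumption, not a guarantee).
import shutil

def safe_to_commit(index_path, index_size_bytes, margin=2.0):
    """Return True if free space is at least margin * index size."""
    free = shutil.disk_usage(index_path).free
    return free >= index_size_bytes * margin

# e.g. for a 40 GB index, require roughly 80 GB free before committing
print(safe_to_commit("/", 40 * 1024**3))
```

A check like this won't prevent the merge itself from needing the space, but it turns a crashed half-finished commit into an early, recoverable refusal.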