Re: MoreLikeThis Question

2012-02-15 Thread Michael Jakl
Hi!

On Wed, Feb 15, 2012 at 07:27, Jamie Johnson jej2...@gmail.com wrote:
 Is there any way with MLT to say get similar based on all fields, or is
 it always a requirement to specify the fields?

It doesn't seem to be possible. But you could append the fields parameter
in solrconfig.xml:

<lst name="appends">
  <str name="mlt.fl">...</str>
</lst>
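
For context, such an appends section lives inside the handler definition in
solrconfig.xml; a minimal sketch (the handler name and the field list are
just an illustration, not from the original mail):

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="appends">
    <str name="mlt.fl">title,body</str>
  </lst>
</requestHandler>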

Cheers,
Michael


RE: OR-FilterQuery

2012-02-15 Thread spring
  q=some text
  fq=id:(1 OR 2 OR 3...)
 
  Would it be better to use q=some text AND id:(1 OR 2 OR 3...)?
 
 1. These two options have different scoring.
 2. If you hit the same fq=id:(1 OR 2 OR 3...) many times, you benefit
 from reading the DocSet from the cache in heap instead of searching
 on disk.
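
Concretely, the two variants look like this (URL encoding omitted):

q=some text&fq=id:(1 OR 2 OR 3)       <- id clause cached in the filterCache, no effect on scores
q=some text AND id:(1 OR 2 OR 3)      <- id clause contributes to scoring, not cached separately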

OK, understood.
Thank you.



RE: OR-FilterQuery

2012-02-15 Thread spring
 In other words, there's no attempt to decompose the fq clause
 and store parts of it in the cache, it's exact-match or
 nothing.

Ah ok, thank you.



Solr as part of an API to unburden databases

2012-02-15 Thread Ramo Karahasan
Hi,

 

does anyone on the mailing list use Solr as an API to avoid database
queries? I know that this depends on the type of data. Imagine you have
something like the Quora Q&A system, which is mostly just text. If I
embedded some of these Q&As into my personal site and invoked the Quora
API, I guess they would do some database operations.

Would it be possible to call the Quora API so that it internally calls Solr
and streams the results back to my website?

This should be highly configurable, but the advantage would be that it
would unburden the databases.

 

There would be something like a three-layer architecture:

Client -> API (does some authorization/authentication checks) -> Solr
Solr -> API (maybe filters the data, removes unofficial data, etc.) -> Client

 

 

I'm not really familiar with that kind of architecture, and therefore don't
know if it makes any sense.

Any comments are appreciated!

 

Best regards,

Ramo



MoreLikeThis Requesthandler

2012-02-15 Thread Molidor, Robert
Hi,
I'm quite new to Solr. We want to find similar documents based on a 
MoreLikeThis query. In general this works fine and gives us reasonable results. 
Now we want to influence the result score by ranking more recent documents 
higher than older documents. Is this possible with the MoreLikeThis 
Requesthandler? If so, how can we achieve this?

Thanks in advance,
Robert



Error Indexing in solr 3.5

2012-02-15 Thread mechravi25
Hi,

When I tried to index in Solr 3.5, I got the following exception:

org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at com.quartz.test.FullImport.callIndex(FullImport.java:80)
at
com.quartz.test.GetObjectTypes.checkObjectTypeProp(GetObjectTypes.java:245)
at com.quartz.test.GetObjectTypes.execute(GetObjectTypes.java:640)
at com.quartz.test.QuartzSchedMain.main(QuartzSchedMain.java:55)
Caused by: java.lang.RuntimeException: Invalid version or the data in not in
'javabin' format
at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)



I placed the latest SolrJ 3.5 jar in the example/solr/lib directory and then
re-started the server, but I am still getting the above exception.

Please let me know if I am missing anything.




Re: Highlighting stopwords

2012-02-15 Thread O. Klein

Koji Sekiguchi wrote
 
 (12/02/14 22:25), O. Klein wrote:
 I have not been able to find any logic in the behavior of hl.q and how it
 analyses the query. Could you explain how it is supposed to work?
 
 Nothing special on hl.q. If you use hl.q, the value of it will be used for
 highlighting rather than the value of q. There's no tricks, I think.
 
 koji
 -- 
 Apache Solr Query Log Visualizer
 http://soleami.com/
 

Field definitions:
content_text (no stopwords, only synonyms in index)
content_hl (stopwords, synonyms in index and query, and only field in hl.fl)

Searching is done with edismax on content_text

1. If I use a query like hl.q=spell Check, it doesn't highlight terms with
uppercase; synonyms get highlighted (all fields have LowerCaseFilterFactory).

2. hl.q=content_hl:(spell Check) also highlights terms with uppercase, but
synonyms are not highlighted.

3. hl.q=content_hl:(spell Check) content_text:(spell Check) highlights terms
with uppercase and synonyms, but sometimes there are no highlights at all.

So if 1 also highlighted terms with uppercase I would get the behavior I
need. I can do this on the client side, but maybe it's a bug?



Re: Solr binary response for C#?

2012-02-15 Thread Jan Høydahl
Hi,

I just created a JIRA to investigate an Avro-based serialization format for 
Solr: https://issues.apache.org/jira/browse/SOLR-3135
You're welcome to contribute. Guess we'll first need to define schemas, then 
create an AvroResponseWriter, and then add support in the C# Solr client.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 15:14, Erick Erickson wrote:

 It's not as compact as binary format, but would just using something
 like JSON help enough? This is really simple, just specify
 wt=json (there's a method to set this on the server, at least in Java).
 
 Otherwise, you might get a more knowledgeable response on a
 C#/Java list; I'm frankly clueless.
 
 Best
 Erick
 
 On Mon, Feb 13, 2012 at 1:15 PM, naptowndev naptowndev...@gmail.com wrote:
 Admittedly I'm new to this, but the project we're working on feeds results
 from Solr to an ASP.net application.  Currently we are using XML, but our
 payloads can be rather large, some up to 17MB.  We are looking for a way to
  minimize that payload and increase performance, and I'm curious if there's
  anything anyone has been working on that creates a binary response that can
  be read by C# (similar to the javabin response built into Solr).
 
 That, or if anyone has experience implementing an external protocol like
 Thrift with Solr and consuming it with C# - again all in the effort to
 increase performance across the wire and while being consumed.
 
 Any help and direction would be greatly appreciated!
 
 Thanks!
 



Re: Stemming and accents (HunspellStemFilterFactory)

2012-02-15 Thread Jan Høydahl
Or, if you know that you'll always strip accents in your search, you may 
pre-process your pt_PT.dic to remove the accents from it and use that custom 
dictionary in Solr instead.

Another alternative could be to extend HunSpellFilter so that it can take in 
the class name of a TokenFilter class to apply when parsing the dictionary into 
memory.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 16:27, Chantal Ackermann wrote:

 Hi Bráulio,
 
 I don't know about HunspellStemFilterFactory especially but concerning
 accents:
 
 There are several accent filters that will remove accents from your
 tokens. If the Hunspell filter factory requires the accents, then simply
 add the accent filters after Hunspell in your index and query filter
 chains.
 
 You would then have Hunspell produce the tokens as result of the
 stemming and only afterwards the accents would be removed (your example:
 'forum' instead of 'fórum'). Do the same on the query side in case
 someone inputs accents.
 
 Accent filters are:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
 (lowercases, as well!)
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
 
 and others on that page.
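
 A sketch of the chain described above (the tokenizer choice and the
 dictionary/affix file names are assumptions, not from the original mails):

 <fieldType name="text_pt" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.HunspellStemFilterFactory"
             dictionary="pt_PT.dic" affix="pt_PT.aff"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>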
 
 Chantal
 
 
 On Tue, 2012-02-14 at 14:48 +0100, Bráulio Bhavamitra wrote:
 Hello all,
 
  I'm evaluating the HunspellStemFilterFactory and found that it works with a
  pt_PT dictionary.
  
  For example, if I search for 'fóruns' it stems it to 'fórum' and then finds
  'fórum' references.
  
  But if I search for 'foruns' (without the accent),
  then HunspellStemFilterFactory cannot stem the
  word, as it doesn't exist in its dictionary.
  
  Is there any way to make HunspellStemFilterFactory work regardless of accent
  differences?
 
 best,
 bráulio
 



Re: Semantic autocomplete with Solr

2012-02-15 Thread Jan Høydahl
Check out 
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
You can feed it anything, such as a log of previous searches, or a pre-computed 
dictionary of item + color combinations that exist in your DB etc.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 23:46, Roman Chyla wrote:

 done something along these lines:
 
 https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality
 
 but you would need MontySolr for that - 
 https://github.com/romanchyla/montysolr
 
 roman
 
 On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi
 octavian.covals...@gmail.com wrote:
 Hey guys,
 
  Has anyone done any kind of smart autocomplete? Let's say we have a web
  store, and we'd like to autocomplete users' searches. So if I type in
  jacket, the next word suggested should be something related to jacket
  (color, fabric), etc...
  
  It seems to me I have to structure this data in a particular way, but
  structured that way I could do it without Solr, so I was wondering if Solr
  could help us.
 
 Thank you in advance.



Re: Solr as part of an API to unburden databases

2012-02-15 Thread Tomas Zerolo
On Wed, Feb 15, 2012 at 11:48:14AM +0100, Ramo Karahasan wrote:
 Hi,
 
  
 
  does anyone on the mailing list use Solr as an API to avoid database
  queries? [...]

Like in a... cache?

Why not use a cache then? (memcached, for example, but there are more).

Regards
-- tomás


Re: Solr soft commit feature

2012-02-15 Thread Nagendra Nagarajayya


If you are looking for NRT functionality with Solr 3.5, you may want to 
take a look at Solr 3.5 with RankingAlgorithm. This allows you to 
add/update documents without a commit while being able to search 
concurrently. The add/update performance to add 1m docs is about 5000 
docs in about 498 ms  with one concurrent searcher. You can get more 
information about Solr 3.5 with RankingAlgorithm from here:


http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:

Hi All,
Is there a way to soft commit in the current released version of solr 3.5?

Regards,
Dipti Srivastava










Re: MoreLikeThis Question

2012-02-15 Thread Chantal Ackermann
Hi,

you would not want to include the unique ID and similar stuff, though?
No idea whether it would impact the number of hits but it would most
probably influence the scoring if nothing else.

E.g. if you compare by certain fields, I would expect that a score of
1.0 indicates a match on all of those fields (haven't tested that
explicitly, though). If the unique ID is included you could never reach
that score.

Just my 2 cents...

Chantal


On Wed, 2012-02-15 at 07:27 +0100, Jamie Johnson wrote:
  Is there any way with MLT to say get similar based on all fields, or is
  it always a requirement to specify the fields?



Re: Solr as part of an API to unburden databases

2012-02-15 Thread Chantal Ackermann
  
   does anyone on the mailing list use Solr as an API to avoid database
   queries? [...]
 
 Like in a... cache?
 
 Why not use a cache then? (memcached, for example, but there are more).
 

Good point. A cache only uses lookup by one kind of cache key while SOLR
provides lookup by ... well... any search configuration that your index
setup (mainly the schema) supports.

If the database queries always do a find by unique id, then use a
cache. Otherwise using SOLR is a valid option.


Chantal



Re: Error Indexing in solr 3.5

2012-02-15 Thread Chantal Ackermann
Hi,

I've got these errors when my client used a different SolrJ version from
the SOLR server it connected to:

SERVER 3.5 <-- responding --- CLIENT on some other version

You haven't provided any information on your client, though.

Chantal

On Wed, 2012-02-15 at 13:09 +0100, mechravi25 wrote:
 Hi,
 
  When I tried to index in Solr 3.5, I got the following exception:
 
 org.apache.solr.client.solrj.SolrServerException: Error executing query
   at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at com.quartz.test.FullImport.callIndex(FullImport.java:80)
   at
 com.quartz.test.GetObjectTypes.checkObjectTypeProp(GetObjectTypes.java:245)
   at com.quartz.test.GetObjectTypes.execute(GetObjectTypes.java:640)
   at com.quartz.test.QuartzSchedMain.main(QuartzSchedMain.java:55)
 Caused by: java.lang.RuntimeException: Invalid version or the data in not in
 'javabin' format
   at 
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
   at
 org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
   at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
   at
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
   at
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
 
 
 
  I placed the latest SolrJ 3.5 jar in the example/solr/lib directory and then
  re-started the server, but I am still getting the above exception.
 
 Please let me know if I am missing anything.
 
 



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Yonik Seeley
On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com wrote:
 I would like to be able to facet based on the time of
 day items are purchased across a date span.  I was hoping that I could
 do a query of something like date:[NOW-1WEEK TO NOW] and then specify
 I wanted facet broken into hourly bins.  Is this possible?  Do I

Will range faceting do everything you need?
http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
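
A sketch of such a request with hourly bins over the last week (the field
name is an assumption, and the '+' in the gap must be URL-encoded as %2B):

facet=true
&facet.range=purchase_date
&facet.range.start=NOW/DAY-7DAYS
&facet.range.end=NOW/DAY
&facet.range.gap=%2B1HOUR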

-Yonik
lucidimagination.com


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Jamie Johnson
I think it would if I indexed the time information separately.  Which
was my original thought, but I was hoping to store this in one field
instead of 2.  So my idea was I'd store the time portion as a
number (an int might suffice from 0 to 24 since I only need this to
have that level of granularity) then do range queries over that.  I
couldn't think of a way to do this using the date field though because
it would give me bins broken up by hours in a particular day,
something like

2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
2012-01-01-02:00:00 - 2012-01-01-03:00:00 5

But what I really want is just the time portion across all days

00:00:00 - 01:00:00 10
01:00:00 - 02:00:00 20
02:00:00 - 03:00:00 5

I would then use the date field to limit the time range in which the
facet was operating.  Does that make sense?  Is there a more efficient
way of doing this?

On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com wrote:
 I would like to be able to facet based on the time of
 day items are purchased across a date span.  I was hoping that I could
 do a query of something like date:[NOW-1WEEK TO NOW] and then specify
 I wanted facet broken into hourly bins.  Is this possible?  Do I

 Will range faceting do everything you need?
 http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range

 -Yonik
 lucidimagination.com


Re: MoreLikeThis Question

2012-02-15 Thread Jamie Johnson
Yes, agreed that ID would be one that would need to be ignored.  I
don't think specifying them is too difficult; I was just curious whether
it was possible to do this or not.

On Wed, Feb 15, 2012 at 8:41 AM, Chantal Ackermann
chantal.ackerm...@btelligent.de wrote:
 Hi,

 you would not want to include the unique ID and similar stuff, though?
 No idea whether it would impact the number of hits but it would most
 probably influence the scoring if nothing else.

 E.g. if you compare by certain fields, I would expect that a score of
 1.0 indicates a match on all of those fields (haven't tested that
 explicitly, though). If the unique ID is included you could never reach
 that score.

 Just my 2 cents...

 Chantal


 On Wed, 2012-02-15 at 07:27 +0100, Jamie Johnson wrote:
  Is there any way with MLT to say get similar based on all fields, or is
  it always a requirement to specify the fields?



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Yonik Seeley
On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:
 I think it would if I indexed the time information separately.  Which
 was my original thought, but I was hoping to store this in one field
 instead of 2.  So my idea was I'd store the time portion as a
 number (an int might suffice from 0 to 24 since I only need this to
 have that level of granularity) then do range queries over that.  I
 couldn't think of a way to do this using the date field though because
 it would give me bins broken up by hours in a particular day,
 something like

 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5

 But what I really want is just the time portion across all days

 00:00:00 - 01:00:00 10
 01:00:00 - 02:00:00 20
 02:00:00 - 03:00:00 5

 I would then use the date field to limit the time range in which the
 facet was operating.  Does that make sense?  Is there a more efficient
 way of doing this?

Hmm, no, there's no way to do this.
Even if you were to write a custom faceting component, it seems like
it would still be very expensive to derive the hour of the day from ms
for every doc.

-Yonik
lucidimagination.com




 On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com wrote:
 I would like to be able to facet based on the time of
 day items are purchased across a date span.  I was hoping that I could
 do a query of something like date:[NOW-1WEEK TO NOW] and then specify
 I wanted facet broken into hourly bins.  Is this possible?  Do I

 Will range faceting do everything you need?
 http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range

 -Yonik
 lucidimagination.com


Re: Semantic autocomplete with Solr

2012-02-15 Thread Octavian Covalschi
Thank you! I'll check them out.

On Wed, Feb 15, 2012 at 6:50 AM, Jan Høydahl jan@cominvent.com wrote:

 Check out
 http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
 You can feed it anything, such as a log of previous searches, or a
 pre-computed dictionary of item + color combinations that exist in your
 DB etc.

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 14. feb. 2012, at 23:46, Roman Chyla wrote:

  done something along these lines:
 
 
 https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality
 
  but you would need MontySolr for that -
 https://github.com/romanchyla/montysolr
 
  roman
 
  On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi
  octavian.covals...@gmail.com wrote:
  Hey guys,
 
   Has anyone done any kind of smart autocomplete? Let's say we have a web
   store, and we'd like to autocomplete users' searches. So if I type in
   jacket, the next word suggested should be something related to jacket
   (color, fabric), etc...
  
   It seems to me I have to structure this data in a particular way, but
   structured that way I could do it without Solr, so I was wondering if
   Solr could help us.
 
  Thank you in advance.




Re: Facet on TrieDateField field without including date

2012-02-15 Thread Ted Dunning
Use multiple fields and you get what you want.  The extra fields are going
to cost very little and will have a positive impact.

On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:

 I think it would if I indexed the time information separately.  Which
 was my original thought, but I was hoping to store this in one field
  instead of 2.  So my idea was I'd store the time portion as a
 number (an int might suffice from 0 to 24 since I only need this to
 have that level of granularity) then do range queries over that.  I
 couldn't think of a way to do this using the date field though because
 it would give me bins broken up by hours in a particular day,
 something like

 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5

 But what I really want is just the time portion across all days

 00:00:00 - 01:00:00 10
 01:00:00 - 02:00:00 20
 02:00:00 - 03:00:00 5

 I would then use the date field to limit the time range in which the
 facet was operating.  Does that make sense?  Is there a more efficient
 way of doing this?

 On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com
 wrote:
  I would like to be able to facet based on the time of
  day items are purchased across a date span.  I was hoping that I could
  do a query of something like date:[NOW-1WEEK TO NOW] and then specify
  I wanted facet broken into hourly bins.  Is this possible?  Do I
 
  Will range faceting do everything you need?
  http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
 
  -Yonik
  lucidimagination.com



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Jamie Johnson
Thanks guys that's what I figured, just wanted to make sure I was
going down the right path.

On Wed, Feb 15, 2012 at 9:55 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 Use multiple fields and you get what you want.  The extra fields are going
 to cost very little and will have a positive impact.

 On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:

 I think it would if I indexed the time information separately.  Which
 was my original thought, but I was hoping to store this in one field
  instead of 2.  So my idea was I'd store the time portion as a
 number (an int might suffice from 0 to 24 since I only need this to
 have that level of granularity) then do range queries over that.  I
 couldn't think of a way to do this using the date field though because
 it would give me bins broken up by hours in a particular day,
 something like

 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5

 But what I really want is just the time portion across all days

 00:00:00 - 01:00:00 10
 01:00:00 - 02:00:00 20
 02:00:00 - 03:00:00 5

 I would then use the date field to limit the time range in which the
 facet was operating.  Does that make sense?  Is there a more efficient
 way of doing this?

 On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
 yo...@lucidimagination.com wrote:
  On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com
 wrote:
  I would like to be able to facet based on the time of
  day items are purchased across a date span.  I was hoping that I could
  do a query of something like date:[NOW-1WEEK TO NOW] and then specify
  I wanted facet broken into hourly bins.  Is this possible?  Do I
 
  Will range faceting do everything you need?
  http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
 
  -Yonik
  lucidimagination.com



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Chantal Ackermann
I've done something like that by calculating the hours during indexing
time (in the script part of the DIH config using java.util.Calendar
which gives you all those field values without effort). I've also
extracted information on which weekday it is (using the integer
constants of Calendar).
If you need this only for one timezone it is straight forward but if the
queries come from different time zones you'll have to shift
appropriately.

I found that pre-calculating has the advantage that you end up with very
simple data: simple integers. And it makes it quite easy to build more
complex queries on that. For example, I have created a grid (built from
facets) where the columns are the weekdays and the rows are the hours of
the day. The facets are created using a field containing the combination of
weekday and hour of day.
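
A minimal Java sketch of that index-time extraction (class, method, and
field names are illustrative; in DIH this logic would sit in a script
transformer):

import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class TimeFacetFields {
    // Derive simple integer facet values from a document's date.
    public static String weekdayHourKey(Date purchaseDate) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.setTime(purchaseDate);
        int hourOfDay = cal.get(Calendar.HOUR_OF_DAY); // 0..23
        int weekday = cal.get(Calendar.DAY_OF_WEEK);   // 1 (Sunday) .. 7 (Saturday)
        // Combined weekday-by-hour key for the facet grid, e.g. "2_13" = Monday, 13:00.
        return weekday + "_" + hourOfDay;
    }
}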


Chantal



On Wed, 2012-02-15 at 15:49 +0100, Yonik Seeley wrote:
 On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson jej2...@gmail.com wrote:
  I think it would if I indexed the time information separately.  Which
  was my original thought, but I was hoping to store this in one field
   instead of 2.  So my idea was I'd store the time portion as a
  number (an int might suffice from 0 to 24 since I only need this to
  have that level of granularity) then do range queries over that.  I
  couldn't think of a way to do this using the date field though because
  it would give me bins broken up by hours in a particular day,
  something like
 
  2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
  2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
  2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
 
  But what I really want is just the time portion across all days
 
  00:00:00 - 01:00:00 10
  01:00:00 - 02:00:00 20
  02:00:00 - 03:00:00 5
 
  I would then use the date field to limit the time range in which the
  facet was operating.  Does that make sense?  Is there a more efficient
  way of doing this?
 
 Hmm, no, there's no way to do this.
 Even if you were to write a custom faceting component, it seems like
 it would still be very expensive to derive the hour of the day from ms
 for every doc.
 
 -Yonik
 lucidimagination.com
 
 
 
 
  On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
  yo...@lucidimagination.com wrote:
  On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson jej2...@gmail.com wrote:
  I would like to be able to facet based on the time of
  day items are purchased across a date span.  I was hoping that I could
  do a query of something like date:[NOW-1WEEK TO NOW] and then specify
  I wanted facet broken into hourly bins.  Is this possible?  Do I
 
  Will range faceting do everything you need?
  http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
 
  -Yonik
  lucidimagination.com



Solr multiple cores - multiple databases approach

2012-02-15 Thread Radu Toev
Hello,

I have a use case where I'm trying to integrate Solr:
 - 2 databases with the same schema
 - I want to index multiple entities from those databases
My question is: what is the best way of approaching this?
 - should I create a core for each database and, inside that core, create a
document with all the information that I need?


Re: Solr multiple cores - multiple databases approach

2012-02-15 Thread Em
Hello Radu,

  - I want to index multiple entities from those databases
Do you want to combine data from both databases within one document, or are
you just interested in indexing both databases on their own?

If the second applies: you can do it within one core by using a field
(e.g. source) to filter on, or create a core per database, which
would completely separate both indices from each other.
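
At query time that separating field would then be applied as a filter, e.g.
(field and value names are purely illustrative):

q=your query&fq=source:db1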

It depends on your use case and access patterns. To tell you more, you
should provide us with more information.

Regards,
Em

Am 15.02.2012 16:23, schrieb Radu Toev:
 Hello,
 
 I have a use case where I'm trying to integrate Solr:
  - 2 databases with the same schema
  - I want to index multiple entities from those databases
 My question is: what is the best way of approaching this?
  - should I create a core for each database and, inside that core, create a
 document with all the information that I need?
 


Re: Solr soft commit feature

2012-02-15 Thread Dipti Srivastava
Hi Nagendra,

Certainly interesting! Would this work in a Master/slave setup where the
reads are from the slaves and all writes are to the master?

Regards,
Dipti Srivastava


On 2/15/12 5:40 AM, Nagendra Nagarajayya nnagaraja...@transaxtions.com
wrote:


If you are looking for NRT functionality with Solr 3.5, you may want to
take a look at Solr 3.5 with RankingAlgorithm. This allows you to
add/update documents without a commit while being able to search
concurrently. The add/update performance to add 1m docs is about 5000
docs in about 498 ms  with one concurrent searcher. You can get more
information about Solr 3.5 with RankingAlgorithm from here:

http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:
 Hi All,
 Is there a way to soft commit in the current released version of solr
3.5?

 Regards,
 Dipti Srivastava














Search for hashtags and mentions

2012-02-15 Thread Rohit
Hi,

 

We are using Solr version 3.5 to search through Tweets. I am using
WordDelimiterFilterFactory with the following setting, to be able to search
for @username or #hashtags:

 

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
        preserveOriginal="1" handleAsChar="@#"/>

 

I saw the following patch, but it doesn't seem to be working as I expected;
am I missing something?

 

https://issues.apache.org/jira/browse/SOLR-2059 

 

But searching for @username is also returning results for just username, and
#hashtag is just returning results for hashtag. How can I achieve this?

 

Regards,

Rohit



problem with accents

2012-02-15 Thread R M


Hi,
I've got a problem with the configuration of Solr.
I have defined a new type of data, text_fr, to handle accents like é, à, è. I
have added this to my fieldtype definition:
<filter class="solr.ISOLatin1AccentFilterFactory"/>
Everything seems to be OK; the data are added correctly. But when I go to
http://localhost:8983/solr/admin to do a search, there is a problem.
If I search for cherche and cherché the results are different, although they
should be the same, shouldn't they?
Thank you guys
Romain
  

Re: problem with accents

2012-02-15 Thread Erick Erickson
Did you specify the correct field with the search? If you just entered the
word in the search box without the field, the search would be made against
your default search field (defined in schema.xml).

If you go to the full interface link on the admin page, you can then click
the debug:enable checkbox, which will give you a lot more information
about what the parsed query looks like.

Best
Erick

On Wed, Feb 15, 2012 at 2:12 PM, R M killg...@hotmail.com wrote:


  Hi,
  I've got a problem with the configuration of Solr.
  I have defined a new type of data, text_fr, to handle accents like é, à, è. I 
  have added this to my fieldtype definition:
  <filter class="solr.ISOLatin1AccentFilterFactory"/>
  Everything seems to be OK; the data are added correctly. But when I go to 
  http://localhost:8983/solr/admin to do a search, there is a problem.
  If I search for cherche and cherché the results are different, although they 
  should be the same, shouldn't they?
  Thank you guys
  Romain



update extracted docs

2012-02-15 Thread Harold Frayman
Hi

I have a solr 3.5 database which is populated by using /update/extract
(configured pretty much as per the examples) and additional metadata. The
uploads are handled by a Perl-driven webapp which uses WebService::Solr
(which does behind-the-scenes POSTing). That all works fine.

When I come to update the metadata associated with the stored docs, again
using my perl web app, I find the solr doc (by id), amend or append all the
changed metadata and use /update to re-post them. Again that works fine ...
but I'm getting nervous because I'm not sure why it works.

If I try to update only the changed fields for a single doc, the unchanged
fields are removed. Slightly surprising, but if that's what I should
expect, it's not difficult to accept.

So how come using /update doesn't remove the text content (and the indexing
on it) which was originally obtained using /update/extract? And can I
depend on it being there in future, after optimization, for example?

And if I can't, what is the best technique for updating metadata under
these circumstances?

Harold Frayman



Re: Search for hashtags and mentions

2012-02-15 Thread Emmanuel Espina
Do you want to index the hashtags and usernames to different fields?
Probably using

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory

will solve your problem.

However, I don't fully understand the problem when you search.

Thanks
Emmanuel


2012/2/15 Rohit ro...@in-rev.com:
 Hi,



  We are using Solr version 3.5 to search through Tweets. I am using
  WordDelimiterFilterFactory with the following setting, to be able to search
  for @username or #hashtags:



  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
          preserveOriginal="1" handleAsChar="@#"/>



  I saw the following patch, but it doesn't seem to be working as I expected;
  am I missing something?



 https://issues.apache.org/jira/browse/SOLR-2059



  But searching for @username is also returning results for just username, and
  #hashtag is just returning results for hashtag. How can I achieve this?



 Regards,

 Rohit



Re: update extracted docs

2012-02-15 Thread Emmanuel Espina
Solr (and Lucene) does not update documents. It deletes the old one and
replaces it with a new one when it has the same id.
So if you create a document with only the changed fields and the same
id, and upload that one, the old one will be erased and replaced with
the new one. So THAT behaviour is to be expected.

For updating documents you simply add the entire document again with
the modified fields; or, if that is an expensive procedure and you want to
avoid re-extracting the metadata, you can store all the fields, retrieve
the full document, create a new document with all the fields (even the
unmodified ones), and use the /update handler to add it again.
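
A rough SolrJ 3.5 sketch of that read-modify-re-add loop (the URL, document
id, and field names are illustrative only; multi-valued fields are copied
via getFieldValues):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class MetadataUpdate {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        // 1. Fetch the full stored document by its unique id.
        SolrDocument old = server.query(new SolrQuery("id:doc-42"))
                                 .getResults().get(0);

        // 2. Copy every stored field into a fresh input document.
        SolrInputDocument doc = new SolrInputDocument();
        for (String field : old.getFieldNames()) {
            doc.addField(field, old.getFieldValues(field));
        }

        // 3. Overwrite the changed metadata and re-add; Solr replaces the
        //    old document because the unique id is unchanged.
        doc.setField("author", "New Author");
        server.add(doc);
        server.commit();
    }
}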

Does that answer your question?

Thanks
Emmanuel






2012/2/15 Harold Frayman harold.fray...@guardian.co.uk:
 Hi

 I have a solr 3.5 database which is populated by using /update/extract
 (configured pretty much as per the examples) and additional metadata. The
  uploads are handled by a Perl-driven webapp which uses WebService::Solr
  (which does behind-the-scenes POSTing). That all works fine.

 When I come to update the metadata associated with the stored docs, again
 using my perl web app, I find the solr doc (by id), amend or append all the
 changed metadata and use /update to re-post them. Again that works fine ...
 but I'm getting nervous because I'm not sure why it works.

 If I try to update only the changed fields for a single doc, the unchanged
 fields are removed. Slightly surprising, but if that's what I should
 expect, it's not difficult to accept.

 So how come using /update doesn't remove the text content (and the indexing
 on it) which was originally obtained using /update/extract? And can I
 depend on it being there in future, after optimization, for example?

 And if I can't, what is the best technique for updating metadata under
 these circumstances?

 Harold Frayman



Re: feeding mahout cluster output back to solr

2012-02-15 Thread abhayd
I was looking at this
http://java.dzone.com/videos/configuring-mahout-clustering

seems like it's possible, but can anyone shed more light, especially on the
part about mapping clusters back to the original docs?

abhay



Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Robert Stewart
I implemented an index shrinker and it works.  I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore.  I'm actually using Lucene.Net for this project, so the code
is C# using the Lucene.Net 2.9.2 API.  But the basic idea is:

Create an IndexReader wrapper that only enumerates the terms you want
to keep, and that removes terms from documents when returning
documents.

Use the SegmentMerger to re-write each segment (where each segment is
wrapped by the wrapper class), writing the new segment to a new directory.
Collect the SegmentInfos and do a commit in order to create a new
segments file in the new index directory.

Done - you now have a shrunk index with specified terms removed.

Implementation uses separate thread for each segment, so it re-writes
them in parallel.  Took about 15 minutes to do 770,000 doc index on my
macbook.


On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote:
 I have roughly read the code of the 4.0 trunk; maybe it's feasible.
    SegmentMerger.add(IndexReader) will add the to-be-merged readers;
    merge() will call
      mergeTerms(segmentWriteState);
      mergePerDoc(segmentWriteState);

    mergeTerms() will construct fields from the IndexReaders:

    for(int readerIndex=0; readerIndex<mergeState.readers.size(); readerIndex++) {
      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
      final Fields f = r.reader.fields();
      final int maxDoc = r.reader.maxDoc();
      if (f != null) {
        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
        fields.add(f);
      }
      docBase += maxDoc;
    }

    So if you wrap your IndexReader and override its fields() method,
 maybe it will work for merging terms.

    For DocValues, it can also override AtomicReader.docValues(): just
 return null for the fields you want to remove. Maybe it should
 traverse CompositeReader's getSequentialSubReaders() and wrap each
 AtomicReader.

    Other things like term vectors and norms are similar.
 On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote:

  I was thinking if I make a wrapper class that aggregates another
  IndexReader and filters out terms I don't want anymore, it might work.   And
  then pass that wrapper into SegmentMerger.  I think if I filter out terms
  on GetFieldNames(...) and Terms(...) it might work.

  Something like:

  HashSet<string> ignoredTerms = ...;

  FilteringIndexReader wrapper = new FilteringIndexReader(reader);

  SegmentMerger merger = new SegmentMerger(writer);

  merger.add(wrapper);

  merger.Merge();





 On Feb 14, 2012, at 1:49 AM, Li Li wrote:

  for method 2, delete is wrong. we can't delete terms.
    you also should hack with the tii and tis file.
 
  On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
 
  method1, dumping data
  for stored fields, you can traverse the whole index and save it to
  somewhere else.
  for indexed but not stored fields, it may be more difficult.
     if the indexed and not stored field is not analyzed(fields such as
  id), it's easy to get from FieldCache.StringIndex.
     But for analyzed fields, though theoretically it can be restored from
  term vector and term position, it's hard to recover from index.
 
  method 2, hack with metadata
  1. indexed fields
       delete by query, e.g. field:*
  2. stored fields
        because all fields are stored sequentially. it's not easy to
 delete
  some fields. this will not affect search speed. but if you want to get
  stored fields,  and the useless fields are very long, then it will slow
  down.
        also it's possible to hack with it. but need more effort to
  understand the index file format  and traverse the fdt/fdx file.
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
 
  this will give you some insight.
 
 
  On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com
 wrote:
 
  Lets say I have a large index (100M docs, 1TB, split up between 10
  indexes).  And a bunch of the stored and indexed fields are not
 used in
  search at all.  In order to save memory and disk, I'd like to rebuild
 that
  index *without* those fields, but I don't have original documents to
  rebuild entire index with (don't have the full-text anymore, etc.).  Is
  there some way to rebuild or optimize an existing index with only a
 sub-set
  of the existing indexed fields?  Or alternatively is there a way to
 avoid
  loading some indexed fields at all ( to avoid loading term infos and
 terms
  index ) ?
 
  Thanks
  Bob
 
 
 




Size of suggest dictionary

2012-02-15 Thread Mike Hugo
Hello,

We're building an auto suggest component based on the label field of
documents.  Is there a way to see how many terms are in the dictionary, or
how much memory it's taking up?  I looked on the statistics page but didn't
find anything obvious.

Thanks in advance,

Mike

ps- here's the config:

<searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestlabel</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">label</str>
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>

<requestHandler name="suggestlabel"
                class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestlabel</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggestlabel</str>
  </arr>
</requestHandler>


Date formatting issue

2012-02-15 Thread Zajkowski, Radoslaw
Hi all, here's an interesting one. In my XML importer, if I use a very simple
xpath like this:

<field column="property_update_date" xpath="/document/property_update_date"
       dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" locale="en"/>

I will get the date properly imported. However, if I use this expression for
another node, which is nested:

<field column="audience_customers_release_date"
       xpath="/document/audiences/audience/audience_release_date[../audience_name='Customers']"
       dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" locale="en"/>

I will receive this type of exception:  java.text.ParseException: Unparseable 
date: Tue Aug 16 20:10:23 EDT 2011

I have to use the xpath above as I have a few of those release date nodes and I
need to flatten them so we can look at dates per audience/group.

I've also run just /document/audiences/audience/audience_release_date and it
works; however, I need a more precise result than that, since different groups
could have different release dates.

Any help greatly appreciated,

Radek.


Radoslaw Zajkowski
Senior Developer
O°
proximity
CANADA
t: 416-972-1505 ext.7306
c: 647-281-2567
f: 416-944-7886





Re: Size of suggest dictionary

2012-02-15 Thread Em
Hello Mike,

have a look at Solr's Schema Browser. Click on FIELDS, select label
and have a look at the number of distinct (term-)values.

Regards,
Em


Am 15.02.2012 23:07, schrieb Mike Hugo:
 Hello,
 
 We're building an auto suggest component based on the label field of
 documents.  Is there a way to see how many terms are in the dictionary, or
 how much memory it's taking up?  I looked on the statistics page but didn't
 find anything obvious.
 
 Thanks in advance,
 
 Mike
 
 ps- here's the config:
 
  <searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggestlabel</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <str name="field">label</str>
      <str name="buildOnOptimize">true</str>
    </lst>
  </searchComponent>
  
  <requestHandler name="suggestlabel"
                  class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggestlabel</str>
      <str name="spellcheck.count">10</str>
    </lst>
    <arr name="components">
      <str>suggestlabel</str>
    </arr>
  </requestHandler>
  
 


Re: Query in starting solr 3.5

2012-02-15 Thread Chris Hostetter

: WARNING: XML parse warning in solrres:/dataimport.xml, line 2, column 95:
: Include operation failed, reverting to fallback. Resource error reading file
: as XML (href='solr/conf/solrconfig_master.xml'). Reason: Can't find resource
: 'solr/conf/solrconfig_master.xml' in classpath or
: '/solr/apache-solr-3.5.0/example/multicore/core1/conf/',
: cwd=/solr/apache-solr-3.5.0/example
: 
: The partial content of dataimport file that I used in solr1.4 is as follows
: 
: <xi:include href="solr/conf/solrconfig_master.xml"
:     xmlns:xi="http://www.w3.org/2001/XInclude">

I *think* what happened there is that some fixes were made to what path 
is used for relative includes -- before it was inconsistent and 
undefined, and now it's a true relative path from where you do the 
include.  So in your case, (i think) it is looking for 
/solr/apache-solr-3.5.0/example/multicore/core1/conf/solr/conf/solrconfig_master.xml
 
and not finding it -- so just fix the path to be what you actually want it 
to be relative to that file.
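
For example, if solrconfig_master.xml actually sits in the core's own conf
directory next to the including file, the href would shrink to just the file
name (the path is an assumption about your layout; keep your xi:fallback
children as before):

  <xi:include href="solrconfig_master.xml"
      xmlns:xi="http://www.w3.org/2001/XInclude">
    ...
  </xi:include>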

(If you look for SOLR-1656 in Solr's CHANGES.txt file it has all the 
details)

: The 3 files given in Fallback tag are present in the location. Does solr 3.5
: support fallback? Can someone please suggest a solution?

I think the fallback should be working fine (particularly since they are 
absolute paths in your case) ... nothing about that error says it's not; 
it actually says it's using the fallback because the include itself is 
failing. (So unless you see a *subsequent* error, you are getting the 
fallbacks.)

: WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
: emulation. You should at some point declare and reindex to at least 3.0,
: because 2.4 emulation is deprecated and will be removed in 4.0. This
: parameter will be mandatory in 4.0.
: 
: The solution i got after googling is to apply a patch. Is there any other

citation please?  where did you read that you need a patch to get rid of 
that warning?

This warning is just letting you know that, in the absence of explicit 
configuration, it's assuming you want the legacy behavior you would get if 
you explicitly configured the option with LUCENE_24.

if you add this line to your solrconfig.xml...

  <luceneMatchVersion>LUCENE_24</luceneMatchVersion>

...no behavior will change, and the warning will go away.  But as the 
warning points out, you should give serious consideration (on every 
upgrade) to whether or not you can re-index after the upgrade, and then change 
it to the current value (LUCENE_35) to eliminate some buggy behavior that 
is kept only for back-compat with existing indexes.


-Hoss


Re: Language specific tokenizer for purpose of multilingual search in single-core solr,

2012-02-15 Thread Chris Hostetter

: I want to do multilingual search in single-core solr. That requires to
: define language specific tokenizers in scheme.xml. Say for example, I have
: two tokenizers, one for English (en) and one for simplified Chinese
: (zh-cn). Can I just put the following definitions together in one schema.xml,
: and both sets of the files ( stopwords, synonym, and protwords) in one
: directory? 

absolutely.


-Hoss


Re: Search for hashtags and mentions

2012-02-15 Thread Erick Erickson
We need the rest of your fieldType, it's quite possible
that other parts of it are stripping out the characters
in question. Try looking at the admin/analysis page.

If that doesn't help, please show us the whole fieldType
definition and the results of attaching debugQuery=on
to the URL.

Best
Erick

On Wed, Feb 15, 2012 at 2:04 PM, Rohit ro...@in-rev.com wrote:
 Hi,



  We are using Solr version 3.5 to search through Tweets. I am using
  WordDelimiterFilterFactory with the following setting, to be able to search
  for @username or #hashtags:



  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
          preserveOriginal="1" handleAsChar="@#"/>



  I saw the following patch, but it doesn't seem to be working as I expected;
  am I missing something?



 https://issues.apache.org/jira/browse/SOLR-2059



  But searching for @username is also returning results for just username, and
  #hashtag is just returning results for hashtag. How can I achieve this?



 Regards,

 Rohit



Spatial Search and faceting

2012-02-15 Thread Eric Grobler
Hi Solr community,

I am doing a spatial search and then facet by city.
Is it possible to then sort the faceted cities by distance?

We would like to display the hits per city, but sort them by distance.

Thanks & Regards
Ericz

q=iphone
fq={!bbox}
sfield=geopoint
pt=49.594857,8.468614
d=50
fl=id,description,city,geopoint

facet=true
facet.field=city
f.city.facet.limit=10
f.city.facet.sort=count //geodist() asc


Re: Search for hashtags and mentions

2012-02-15 Thread Robert Muir
On Wed, Feb 15, 2012 at 2:04 PM, Rohit ro...@in-rev.com wrote:
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
          preserveOriginal="1" handleAsChar="@#"/>

There is no such parameter as 'handleAsChar'. If you want to do this,
you need to use a custom types file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
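
A sketch of what that could look like (the file name is an assumption; '#'
starts a comment in the types file, so it is given as its unicode escape):

In the fieldType:

  <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
          types="wdfftypes.txt" .../>

wdfftypes.txt:

  @ => ALPHA
  \u0023 => ALPHA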

-- 
lucidimagination.com


Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Li Li
Great. I think you could make it a public tool; maybe others also need such
functionality.

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.com wrote:

  I implemented an index shrinker and it works.  I reduced my test index
  from 6.6 GB to 3.6 GB by removing a single shingled field I did not
  need anymore.  I'm actually using Lucene.Net for this project, so the code
  is C# using the Lucene.Net 2.9.2 API.  But the basic idea is:

 Create an IndexReader wrapper that only enumerates the terms you want
 to keep, and that removes terms from documents when returning
 documents.

  Use the SegmentMerger to re-write each segment (where each segment is
  wrapped by the wrapper class), writing the new segment to a new directory.
  Collect the SegmentInfos and do a commit in order to create a new
  segments file in the new index directory.

 Done - you now have a shrunk index with specified terms removed.

 Implementation uses separate thread for each segment, so it re-writes
 them in parallel.  Took about 15 minutes to do 770,000 doc index on my
 macbook.


 On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote:
  I have roughly read the codes of 4.0 trunk. maybe it's feasible.
 SegmentMerger.add(IndexReader) will add to be merged Readers
 merge() will call
   mergeTerms(segmentWriteState);
   mergePerDoc(segmentWriteState);
 
mergeTerms() will construct fields from IndexReaders
  for(int readerIndex=0; readerIndex<mergeState.readers.size(); readerIndex++) {
    final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
    final Fields f = r.reader.fields();
    final int maxDoc = r.reader.maxDoc();
    if (f != null) {
      slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
      fields.add(f);
    }
    docBase += maxDoc;
  }
 So If you wrapper your IndexReader and override its fields() method,
  maybe it will work for merge terms.
 
 for DocValues, it can also override AtomicReader.docValues(). just
  return null for fields you want to remove. maybe it should
  traverse CompositeReader's getSequentialSubReaders() and wrapper each
  AtomicReader
 
 other things like term vectors norms are similar.
  On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com
 wrote:
 
  I was thinking if I make a wrapper class that aggregates another
  IndexReader and filter out terms I don't want anymore it might work.
 And
  then pass that wrapper into SegmentMerger.  I think if I filter out
 terms
  on GetFieldNames(...) and Terms(...) it might work.
 
  Something like:
 
   HashSet<string> ignoredTerms = ...;
  
   FilteringIndexReader wrapper = new FilteringIndexReader(reader);
  
   SegmentMerger merger = new SegmentMerger(writer);
  
   merger.add(wrapper);
  
   merger.Merge();
 
 
 
 
 
  On Feb 14, 2012, at 1:49 AM, Li Li wrote:
 
   for method 2, delete is wrong. we can't delete terms.
 you also should hack with the tii and tis file.
  
   On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
  
   method 1, dumping data:
   For stored fields, you can traverse the whole index and save them
   somewhere else.
   For indexed-but-not-stored fields it is more difficult. If such a
   field is not analyzed (fields such as an id), it's easy to get back
   from FieldCache.StringIndex. But for analyzed fields, though they
   can theoretically be restored from term vectors and term positions,
   they are hard to recover from the index.

   method 2, hacking the metadata:
   1. indexed fields
      delete by query, e.g. field:*
   2. stored fields
      Because all fields are stored sequentially, it's not easy to
      delete some of them. This will not affect search speed, but if
      you load stored fields and the useless fields are very long, it
      will slow things down.
      It's also possible to hack here, but it needs more effort to
      understand the index file format and to traverse the fdt/fdx
      files.
  
 
 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
  
   this will give you some insight.
  
  
   On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart 
 bstewart...@gmail.com
  wrote:
  
   Let's say I have a large index (100M docs, 1 TB, split up between 10
   indexes), and a bunch of the stored and indexed fields are not used
   in search at all.  In order to save memory and disk, I'd like to
   rebuild that index *without* those fields, but I don't have the
   original documents to rebuild the entire index with (don't have the
   full-text anymore, etc.).  Is there some way to rebuild or optimize
   an existing index with only a sub-set of the existing indexed
   fields?  Or alternatively, is there a way to avoid loading some
   indexed fields at all (to avoid loading the term infos and terms
   index)?
  
   Thanks
   Bob
  
  
  
 
 



Re: Spatial Search and faceting

2012-02-15 Thread William Bell
One way to do it is to group by city and then sort=geodist() asc:

select?group=true&group.field=city&sort=geodist()+asc&rows=10&fl=city

It might require two calls to Solr to get it the way you want.
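
A hedged sketch of what the two calls might look like, reusing Eric's
parameters from below (untested; grouping needs Solr 3.3 or later):

    1) cities ordered by distance (one group per city, nearest doc first):
    select?q=iphone&fq={!bbox}&sfield=geopoint&pt=49.594857,8.468614&d=50
      &group=true&group.field=city&sort=geodist()+asc&rows=10&fl=city

    2) hit counts per city via plain field faceting:
    select?q=iphone&fq={!bbox}&sfield=geopoint&pt=49.594857,8.468614&d=50
      &rows=0&facet=true&facet.field=city&f.city.facet.limit=-1

then stitch the counts from call 2 onto the distance-ordered city list
from call 1 on the client side.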

On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler impalah...@googlemail.com wrote:
 Hi Solr community,

 I am doing a spatial search and then facet by city.
 Is it possible to then sort the faceted cities by distance?

 We would like to display the hits per city, but sort them by distance.

 Thanks & Regards
 Ericz

 q=iphone
 fq={!bbox}
 sfield=geopoint
 pt=49.594857,8.468614
 d=50
 fl=id,description,city,geopoint

 facet=true
 facet.field=city
 f.city.facet.limit=10
 f.city.facet.sort=count //geodist() asc



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


RE: Search for hashtags and mentions

2012-02-15 Thread Rohit
Got the problem: I need to use the types= parameter so that characters like
# and @ are not treated as delimiters by WordDelimiterFilterFactory.

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: 16 February 2012 06:22
To: solr-user@lucene.apache.org
Subject: Re: Search for hashtags and mentions

On Wed, Feb 15, 2012 at 2:04 PM, Rohit ro...@in-rev.com wrote:
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
 catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
 preserveOriginal="1" handleAsChar="@#"/>

There is no such parameter as 'handleAsChar'. If you want to do this,
you need to use a custom types file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-- 
lucidimagination.com



is it possible to run deltaimport command with out delta query?

2012-02-15 Thread nagarjuna
Hi all,
  I am new to Solr. Can anybody explain delta-import and the delta query
to me? I also have the questions below:
1. Is it possible to run a delta-import without a deltaQuery?
2. Is it possible to write a delta query without having a last_modified
column in the database? If yes, please explain.


Please help me, anybody.
Thanks in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-it-possible-to-run-deltaimport-command-with-out-delta-query-tp3749328p3749328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Using Solr for a rather busy Yellow Pages-type index - good idea or not really?

2012-02-15 Thread Alexey Verkhovsky
Hi, all,

I'm new here. Used Solr on a couple of projects before, but didn't need to
dive deep into anything until now. These days, I'm doing a spike for a
yellow pages type search server with the following technical requirements:

~10 mln listings in the database. A listing has a name, address,
description, coordinates and a number of tags / filtering fields; no more
than a kilobyte all told; i.e. theoretically the whole thing should fit in
RAM without sharding. A typical query is either a full-text match on name
and/or description within a bounding box, or some combination of tag
matches within a bounding box. Bounding boxes are 1 to 50 km wide, and
contain up to 10^5 unfiltered listings (the average is more like 10^3).
More than 50% of all the listings are in the frequently requested bounding
boxes; however, a vast majority of listings are almost never displayed
(because they don't match the other filters).

Data never changes (i.e., a daily batch update; rebuild of the entire
index and restart of all search servers is feasible, as long as it takes
minutes, not hours). This thing ideally should serve up to 10^3 requests
per second on a small (as in, less than 10 commodity boxes) cluster. In
other words, a typical request should be CPU bound and take ~100-200 msec
to process. Because of coordinates (that are almost never the same),
caching of queries makes no sense; from what little I understand about
Lucene internals, caching of filters probably doesn't make sense either.

After perusing documentation and some googling (but almost no source-code
exploration yet), I understand what the schema and the queries will look
like, and now have to figure out a specific configuration that fits the
performance/scalability requirements. Here is what I'm thinking:

1. Search server is an internal service that uses embedded Solr for the
indexing part, with RAMDirectoryFactory as index storage (see the config
sketch after this list).
2. All data is in some sort of persistent storage on a file system, and is
loaded into memory when a search server starts up.
3. Data updates are handled as "update the persistent storage, start
another cluster, load the world into RAM, flip the load balancer, kill the
old cluster".
4. Solr returns IDs with relevance scores; actual presentations of listings
(as JSON documents) are constructed outside of Solr and cached in
Memcached, as mostly static content with a few templated bits, like
distance <%= DISTANCE_TO(-123.0123, 45.6789) %>.
5. All Solr caching is switched off.
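
Regarding point 1, a minimal solrconfig.xml fragment for the RAM-backed
index (a sketch only; note the index lives just as long as the process,
which matches the load-everything-on-startup plan):

    <directoryFactory name="DirectoryFactory"
                      class="solr.RAMDirectoryFactory"/>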

Obviously, we are not the first people to do something like this with Solr,
so I'm hoping for some collective wisdom on the following:

Does this sound like a feasible set of requirements in terms of
performance and scalability for Solr? Are we on the right path to solving
this problem well? If not, what should we be doing instead? What nasty
technical/architectural gotchas are we probably missing at this stage?

One particular piece of advice I'd be really happy to hear is "you may not
need RAMDirectoryFactory if you use some combination of a fast distributed
file system and caching instead".

Also, is there a blog, wiki page or mailing-list thread where a similar
problem is discussed? Yes, we have seen
http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good
introduction, but it is outdated and doesn't go into the nasty bits anyway.

Many thanks in advance,
-- Alex Verkhovsky


RE: is it possible to run deltaimport command with out delta query?

2012-02-15 Thread Ramo Karahasan
Hi,

you may want to have a look at
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport

hth,
Ramo
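
To illustrate the trick on that wiki page (a sketch only - the table and
column names are made up, and it still assumes some change marker such as
a last_modified column, see question 2):

    <entity name="item" pk="id"
            query="SELECT * FROM item
                   WHERE '${dataimporter.request.clean}' != 'false'
                      OR last_modified > '${dataimporter.last_index_time}'"/>

A full rebuild is then /dataimport?command=full-import, and an incremental
update is /dataimport?command=full-import&clean=false - no deltaQuery
needed. Without any change-tracking column there is nothing to filter on,
so some marker in the database is still required.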

-Original Message-
From: nagarjuna [mailto:nagarjuna.avul...@gmail.com] 
Sent: Thursday, 16 February 2012 07:27
To: solr-user@lucene.apache.org
Subject: is it possible to run deltaimport command with out delta query?

Hi all,
  I am new to Solr. Can anybody explain delta-import and the delta query
to me? I also have the questions below: 1. Is it possible to run a
delta-import without a deltaQuery?
2. Is it possible to write a delta query without having a last_modified
column in the database? If yes, please explain.


Please help me, anybody.
Thanks in advance.

--
View this message in context:
http://lucene.472066.n3.nabble.com/is-it-possible-to-run-deltaimport-command
-with-out-delta-query-tp3749328p3749328.html
Sent from the Solr - User mailing list archive at Nabble.com.