Re: Query by range of price

2014-01-20 Thread rachun
Hi Raymond, 

I keep trying to encode the '&', but when I look at the Solr log it shows me
that '%26'.
I'm using urlencode but it didn't work. What should I do? I'm using SolrPHPClient.
Please suggest.

Thank you very much, 
Rachun 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-by-range-of-price-tp4111655p4112256.html
Sent from the Solr - User mailing list archive at Nabble.com.


Memory Usage on Windows Os while indexing

2014-01-20 Thread onetwothree
Facts:


OS Windows server 2008

4 Cpu
8 GB Ram

Tomcat Service version 7.0 (64 bit)

Only running Solr
Optional JVM parameters set xmx = 3072, xms = 1024
Solr version 4.5.0.

One Core instance (both for querying and indexing)
*Schema config:*
minGramSize=2 maxGramSize=20
most of the fields are stored = true (required)

*Solr config:*
ramBufferSizeMB: 100
maxIndexingThreads: 8
directoryFactory: MMapDirectory
autocommit: maxdocs 1, maxtime 15000, opensearcher false
cache (defaults): 
filtercache initialsize:512 size: 512 autowarm: 0
queryresultcache initialsize:512 size: 512 autowarm: 0
documentcache initialsize:512 size: 512 autowarm: 0

Problem description:


We're using a .Net service (based on Solr.Net) for updating and inserting
documents on a single Solr core instance. The size of the documents sent to Solr
varies from 1 KB up to 8 MB; we're sending the documents in batches, using one
or multiple threads. The current size of the Solr index is about 15GB.

The indexing service runs around 4 to 5 hours per day to complete all
inserts and updates to Solr. While the indexing process is running, the
Tomcat process memory usage keeps growing up to 7GB RAM (using the Process
Explorer monitoring tool) and does not drop, even after 24 hours. After a
restart of Tomcat, or a Reload Core in the Solr Admin, the memory drops back
to 1 to 2 GB RAM. When using a tool like VisualVM to monitor the Tomcat
process, the memory usage of Tomcat seems fine; memory consumption stays within
the defined JVM startup params (see image).

So it seems that filesystem buffers are consuming all the leftover memory
and don't release it, even after quite some time? Is there a way to
handle this behaviour so that not all memory is consumed? Are there
other alternatives? Best practices?

http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png 

Thanks in advance




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Memory-Usage-on-Windows-Os-while-indexing-tp4112262.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query by range of price

2014-01-20 Thread Raymond Wiker
That's exactly what I would expect from url-encoding '&'. So, the thing
that you're doing works as it should, but you're probably doing something
that you should not do (in this case, urlencode).

I have not used SolrPHPClient myself, but from the example at
http://code.google.com/p/solr-php-client/wiki/FAQ#How_Can_I_Use_Additional_Parameters_%28like_fq,_facet,_etc%29
it appears that you should not do any urlencoding yourself, at all.
Further, if you're using data that is already urlencoded, you should
urldecode it before handing it over to SolrPHPClient.
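
For illustration, a minimal sketch of what that looks like with solr-php-client
(the host, port, core path, field names and query below are placeholders, and the
require path may differ depending on how the library is installed):

  require_once 'Apache/Solr/Service.php';

  // Plain, un-encoded values everywhere: the client builds and encodes the
  // request URL itself, so there is no need to call urlencode() or to join
  // parameters with a literal '&'.
  $solr = new Apache_Solr_Service('localhost', 8983, '/solr/');

  $query  = 'category:shoes';                        // hypothetical query
  $params = array(
      'sort'        => 'price_min asc,update_date desc',
      'facet'       => 'true',
      'facet.query' => 'price_min:[* TO 1300]',
  );
  $results = $solr->search($query, 0, 10, $params);  // offset 0, 10 rows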


On Mon, Jan 20, 2014 at 10:34 AM, rachun rachun.c...@gmail.com wrote:

 Hi Raymond,

 I keep trying to encode the '&', but when I look at the Solr log it shows
 me
 that '%26'.
 I'm using urlencode but it didn't work. What should I do? I'm using
 SolrPHPClient.
 Please suggest.

 Thank you very much,
 Rachun



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-by-range-of-price-tp4111655p4112256.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query by range of price

2014-01-20 Thread Raymond Wiker
Followup: I *think* something like this should work:

$results = $solr->search($query, $start, $rows, array('sort' => 'price_min asc,update_date desc',
    'facet.query' => 'price_min:[* TO 1300]'));
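
For reference, a rough sketch of reading the results back out; it assumes
solr-php-client's usual behaviour of exposing the decoded Solr response through
property access (the property names follow the standard Solr JSON response, so
double-check them against your client version):

  // $results is the Apache_Solr_Response returned by $solr->search() above.
  $numFound = $results->response->numFound;              // total hits
  $counts   = $results->facet_counts->facet_queries;     // facet.query counts
  $inRange  = $counts->{'price_min:[* TO 1300]'};        // docs with price_min up to 1300
  echo $numFound . ' hits, ' . $inRange . ' of them with price_min in [* TO 1300]' . "\n";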


On Mon, Jan 20, 2014 at 11:05 AM, Raymond Wiker rwi...@gmail.com wrote:

 That's exactly what I would expect from url-encoding '&'. So, the thing
 that you're doing works as it should, but you're probably doing something
 that you should not do (in this case, urlencode).

 I have not used SolrPHPClient myself, but from the example at
 http://code.google.com/p/solr-php-client/wiki/FAQ#How_Can_I_Use_Additional_Parameters_%28like_fq,_facet,_etc%29
 it appears that you should not do any urlencoding yourself, at all.
 Further, if you're using data that is already urlencoded, you should
 urldecode it before handing it over to SolrPHPClient.


 On Mon, Jan 20, 2014 at 10:34 AM, rachun rachun.c...@gmail.com wrote:

 Hi Raymond,

 I keep trying to encode the '&', but when I look at the Solr log it shows
 me
 that '%26'.
 I'm using urlencode but it didn't work. What should I do? I'm using
 SolrPHPClient.
 Please suggest.

 Thank you very much,
 Rachun



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-by-range-of-price-tp4111655p4112256.html
 Sent from the Solr - User mailing list archive at Nabble.com.





LSH in Solr/Lucene

2014-01-20 Thread Shashi Kant
Hi folks, have any of you successfully implemented LSH (MinHash) in
Solr? If so, could you share some details of how you went about it?

I know LSH is available in Mahout, but I was hoping someone has a
Solr or Lucene implementation.

Thanks


Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Yago Riveiro
The fact that you see such high memory consumption is probably a consequence of
some heap memory only being released after a full GC. With the
VisualVM tool you can try to force a full GC and see if the memory is released.


/yago
—
/Yago Riveiro

On Mon, Jan 20, 2014 at 10:03 AM, onetwothree joydivis...@telenet.be
wrote:

 Facts:
 OS Windows server 2008
 4 Cpu
 8 GB Ram
 Tomcat Service version 7.0 (64 bit)
 Only running Solr
 Optional JVM parameters set xmx = 3072, xms = 1024
 Solr version 4.5.0.
 One Core instance (both for querying and indexing)
 *Schema config:*
 minGramSize=2 maxGramSize=20
 most of the fields are stored = true (required)
 *Solr config:*
 ramBufferSizeMB: 100
 maxIndexingThreads: 8
 directoryFactory: MMapDirectory
 autocommit: maxdocs 1, maxtime 15000, opensearcher false
 cache (defaults): 
 filtercache initialsize:512 size: 512 autowarm: 0
 queryresultcache initialsize:512 size: 512 autowarm: 0
 documentcache initialsize:512 size: 512 autowarm: 0
 Problem description:
 We're using a .Net Service (based on Solr.Net) for updating and inserting
 documents on a single Solr Core instance. The size of documents sent to Solr
 vary from 1 Kb up to 8Mb, we're sending the documents in batches, using one
 or multiple threads. The current size of the Solr Index is about 15GB.
 The indexing service is running around 4 a 5 hours per day, to complete all
 inserts and updates to Solr. While the indexing process is running the
 Tomcat process memory usage keeps growing up to  7GB Ram (using Process
 Explorer monitor tool) and does not reduce, even after 24 hours. After a
 restart of Tomcat, or a Reload Core in the Solr Admin the memory drops back
 to 1 a 2 GB Ram. While using a tool like VisualVM to monitor the Tomcat
 process, the memory usage of Tomcat seems ok, memory consumption is in range
 of defined jvm startup params (see image).
 So it seems that filesystem buffers are consuming all the leftover memory??,
 and don't release memory, even after a quite amount of time? Is there a way
 handle this behaviour, in a way that not all memory is consumed? Are there
 other alternatives? Best practices?
 http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png 
 Thanks in advance
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Memory-Usage-on-Windows-Os-while-indexing-tp4112262.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Yago Riveiro
Another thing: Solr makes heavy use of the OS cache to cache the index and gain
performance. This can be another reason why the Solr process shows a high
allocated memory value.


/yago
—
/Yago Riveiro

On Mon, Jan 20, 2014 at 10:03 AM, onetwothree joydivis...@telenet.be
wrote:

 Facts:
 OS Windows server 2008
 4 Cpu
 8 GB Ram
 Tomcat Service version 7.0 (64 bit)
 Only running Solr
 Optional JVM parameters set xmx = 3072, xms = 1024
 Solr version 4.5.0.
 One Core instance (both for querying and indexing)
 *Schema config:*
 minGramSize=2 maxGramSize=20
 most of the fields are stored = true (required)
 *Solr config:*
 ramBufferSizeMB: 100
 maxIndexingThreads: 8
 directoryFactory: MMapDirectory
 autocommit: maxdocs 1, maxtime 15000, opensearcher false
 cache (defaults): 
 filtercache initialsize:512 size: 512 autowarm: 0
 queryresultcache initialsize:512 size: 512 autowarm: 0
 documentcache initialsize:512 size: 512 autowarm: 0
 Problem description:
 We're using a .Net Service (based on Solr.Net) for updating and inserting
 documents on a single Solr Core instance. The size of documents sent to Solr
 vary from 1 Kb up to 8Mb, we're sending the documents in batches, using one
 or multiple threads. The current size of the Solr Index is about 15GB.
 The indexing service is running around 4 a 5 hours per day, to complete all
 inserts and updates to Solr. While the indexing process is running the
 Tomcat process memory usage keeps growing up to  7GB Ram (using Process
 Explorer monitor tool) and does not reduce, even after 24 hours. After a
 restart of Tomcat, or a Reload Core in the Solr Admin the memory drops back
 to 1 a 2 GB Ram. While using a tool like VisualVM to monitor the Tomcat
 process, the memory usage of Tomcat seems ok, memory consumption is in range
 of defined jvm startup params (see image).
 So it seems that filesystem buffers are consuming all the leftover memory??,
 and don't release memory, even after a quite amount of time? Is there a way
 handle this behaviour, in a way that not all memory is consumed? Are there
 other alternatives? Best practices?
 http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png 
 Thanks in advance
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Memory-Usage-on-Windows-Os-while-indexing-tp4112262.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Multi Lingual Analyzer

2014-01-20 Thread David Philip
Hi,



  I have a query on multi-lingual analysis.


 Which one of the approaches below is best?


1. To develop a translator that translates any language to
English and then use the standard English analyzer to analyse – using the translator
both at index time and at search time?

2. To develop a language-specific analyzer and use it by
creating a specific field only for that language?

We have client data coming in different languages: Kannada and Telugu, with
others later. This data is basically text written by the customer in that
language.


The requirement is to develop analyzers specific to these languages.



Thanks - David


Re: Search Suggestion Filtering

2014-01-20 Thread Alessandro Benedetti
Hi guys, following this thread I have some questions:

1) Regarding LUCENE-5350, what is the quoted "context"? Is the context a
filter query?

2) Regarding https://issues.apache.org/jira/browse/SOLR-5378, do we have
the final documentation available?

Cheers


2014/1/16 Hamish Campbell hamish.campb...@koordinates.com

 Thank you Jorge. We looked at phrase suggestions from previous user
 queries, but they're not so useful in our case. However, I have a follow-up
 question about similar functionality that I'll post shortly.

 The list might like to know that I've come up with a quick and exceedingly
 dirty hack of a solution that works for our limited case.

 You have been warned!

 Note that we're using django-haystack to actually interact with Solr:

 1. Set nonFuzzyPrefix of the Suggester to 4.
 2. At index time, the haystack index will build suggestion terms by
 extracting the relevant terms and prefixing with a 4 (alpha) character
 reference for the target instance.
 3. At search time, the user's query is split, terms are prefixed and
 concatenated. The new query is sent to solr and the results are cleaned of
 references before returned to the front end.

 I'm not proud of it, but it works. =D



 On Fri, Jan 17, 2014 at 3:13 AM, Jorge Luis Betancourt González 
 jlbetanco...@uci.cu wrote:

  In a custom application we have, we use a separated core (under Solr
  3.6.1) to store the queries used by the users and then provide the
  autocomplete feature. In our case we need to filter some phrases that we
  don't need to be suggested to the users. I built a custom
  UpdateRequestProcessor to implement this logic, so we define this
 blocking
  patterns in some external source of information (DB, files, etc.). For
 the
  suggestions per-se we use as a base
  https://github.com/cominvent/autocomplete configuration, described in
  www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
  which is pretty usable as it comes. I found (personally) this approach
 way
  more flexible than the original suggester component, but it involves
  storing the user's queries into a separate core.
 
  Greetings,
 
  - Original Message -
  From: Hamish Campbell hamish.campb...@koordinates.com
  To: solr-user@lucene.apache.org
  Sent: Wednesday, January 15, 2014 9:10:16 PM
  Subject: Re: Search Suggestion Filtering
 
  Thanks Tomás, I'll take a look.
 
  Still interested to hear from anyone about using queries to populate the
  list - I'm willing to give up a bit of performance for the flexibility it
  would provide.
 
 
  On Thu, Jan 16, 2014 at 1:06 PM, Tomás Fernández Löbbe 
  tomasflo...@gmail.com wrote:
 
   I think your use case is the one described in LUCENE-5350, maybe you
 want
   to take a look to the patch and comments there.
  
   Tomás
  
  
   On Wed, Jan 15, 2014 at 12:58 PM, Hamish Campbell 
   hamish.campb...@koordinates.com wrote:
  
Hi all,
   
I'm looking into options for filtering the search suggestions
  dictionary.
   
Using Solr 4.6.0, Suggester component and fst.FuzzyLookupFactory
 using
  a
field based dictionary, we're indexing records for a multi-tenanted
  SaaS
platform. SearchHandler records are always filtered by the particular
client warehouse (e.g. by domain), however we need a way to apply a
   similar
filter to the spell check dictionary to prevent leaking terms between
clients. In other words: when client A searches for a document title
  they
should not receive spelling suggestions for client B's document
 titles.
   
This has been asked a couple of times, on the mailing list and on
StackOverflow. Some of the suggested approaches:
   
1. Use dynamic fields to create dictionaries per-warehouse (mentioned
   here:
   
   
  
 
 http://lucene.472066.n3.nabble.com/Filtering-down-terms-in-suggest-tt4069627.html
)
   
That might be a reasonable option for us (we already considered a
  similar
approach), but at what point does this stop scaling efficiently? How
  many
dynamic fields are too many?
   
2. Run a query to populate the suggestion list (also mentioned in
 that
thread)
   
If I understand this correctly, this would give us a lot of
 flexibility
   and
power: for example to give a more nuanced result set using the users
permissions to expose private documents in their spelling
 suggestions.
   
I expect this would be a slow query, but our total document count is
currently relatively small (on the order of 10^3 objects) and I
 imagine
   you
could create a specific word index with the appropriate fields to
 keep
   this
in check. Is this a feasible approach, and if so, how do you build a
dynamic suggestion list?
   
3. Other options:
   
It seems like this is a common problem - and we could through some
resources at building an extension to provide some limited suggestion
dictionary filtering. Is anyone already doing something similar, or

Re: Query by range of price

2014-01-20 Thread rachun
Thank you very much Mr. Raymond

You just saved my world ;)
It worked, and the *sort by* conditions work too,
but facet.query=price_min:[* TO 1300] is not working yet; I will try to
google for the right solution.

Million thanks _/|\_
Rachun.
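
A follow-up note on the facet.query part, with a minimal sketch under the same
solr-php-client assumptions as above: facet.query only asks Solr to report a
count for that query alongside the normal results; it does not narrow the
result set itself. If the goal is to return only documents within the price
range, an fq filter is the usual way to do that:

  $params = array(
      'sort' => 'price_min asc,update_date desc',
      'fq'   => 'price_min:[* TO 1300]',   // restrict results to the price range
  );
  $results = $solr->search($query, 0, 10, $params);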



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-by-range-of-price-tp4111655p4112272.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error when creating collection in Solr 4.6

2014-01-20 Thread Uwe Reh

Hi,

I had the same problem.
In my case the error was, I had a copy/paste typo in my solr.xml.

<str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>
!^! Ouch!

With the type 'bool' instead of 'str' it works definitely better. ;-)

Uwe



On 28.11.2013 08:53, lansing wrote:

Thank you for your replies,
I am using the new-style discovery
It worked after adding this setting :
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-creating-collection-in-Solr-4-6-tp4103536p4103696.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Toke Eskildsen
On Mon, 2014-01-20 at 11:02 +0100, onetwothree wrote:
 Optional JVM parameters set xmx = 3072, xms = 1024
 directoryFactory: MMapDirectory

[...]

 So it seems that filesystem buffers are consuming all the leftover memory??,
 and don't release memory, even after a quite amount of time?

As long as the memory is indeed leftover, that is the optimal strategy.
Maybe Uwe's explanation of MMapDirectory will help:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Regards,
Toke Eskildsen, State and University Library, Denmark




RE: Indexing URLs from websites

2014-01-20 Thread Markus Jelsma
Well it is hard to get a specific anchor because there is usually more than 
one. The content of the anchors field should be correct. What would you expect 
if there are multiple anchors? 
 
-Original message-
 From:Teague James teag...@insystechinc.com
 Sent: Friday 17th January 2014 18:13
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 Progress!
 
 I changed the value of that property in nutch-default.xml and I am getting 
 the anchor field now. However, the stuff going in there is a bit random and 
 doesn't seem to correlate to the pages I'm crawling. The primary objective is 
 that when there is something on the page that is a link to a file 
 ...href=/blah/somefile.pdfGet the PDF!... (using ... to prevent actual 
 code in the email) I want to capture that URL and the anchor text Get the 
 PDF! into field(s).
 
 Am I going in the right direction on this?
 
 Thank you so much for sticking with me on this - I really appreciate your 
 help!
 
 -Original Message-
 From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
 Sent: Friday, January 17, 2014 6:46 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Indexing URLs from websites
 
 
 
  
  
 -Original message-
  From:Teague James teag...@insystechinc.com
  Sent: Thursday 16th January 2014 20:23
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Okay. I had used that previously and I just tried it again. The following 
  generated no errors:
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  Solr is still not getting an anchor field and the outlinks are not 
  appearing in the index anywhere else.
  
  To be sure I deleted the crawl directory and did a fresh crawl using:
  
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50
  
  Then
  
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
  crawl/linkdb -dir crawl/segments/
  
  No errors, but no anchor fields or outlinks. One thing in the response from 
  the crawl that I found interesting was a line that said:
  
  LinkDb: internal links will be ignored.
 
 Good catch! That is likely the problem. 
 
  
  What does that mean?
 
 <property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
   the same host are ignored.  This is an effective way to limit the
   size of the link database, keeping only the highest quality
   links.
   </description>
 </property>
 
 So change the property, rebuild the linkdb and try reindexing once again :)
 
  
  -Original Message-
  From: Markus Jelsma [mailto:markus.jel...@openindex.io]
  Sent: Thursday, January 16, 2014 11:08 AM
  To: solr-user@lucene.apache.org
  Subject: RE: Indexing URLs from websites
  
  Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
  k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
  [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
  [-filter] [-normalize]
  
  You must point to the linkdb via the -linkdb parameter. 
   
  -Original message-
   From:Teague James teag...@insystechinc.com
   Sent: Thursday 16th January 2014 16:57
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Okay. I changed my solrindex to this:
   
   bin/nutch solrindex http://localhost/solr/ crawl/crawldb 
   crawl/linkdb
   crawl/segments/20140115143147
   
   I got the same errors:
   Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path 
   does not exist: file:/.../crawl/linkdb/crawl_fetch
   Input path does not exist: file:/.../crawl/linkdb/crawl_parse
   Input path does not exist: file:/.../crawl/linkdb/parse_data Input 
   path does not exist: file:/.../crawl/linkdb/parse_text Along with a 
   Java stacktrace
   
   Those linkdb folders are not being created.
   
   -Original Message-
   From: Markus Jelsma [mailto:markus.jel...@openindex.io]
   Sent: Thursday, January 16, 2014 10:44 AM
   To: solr-user@lucene.apache.org
   Subject: RE: Indexing URLs from websites
   
   Hi - you cannot use wildcards for segments. You need to give one segment 
   or a -dir segments_dir. Check the usage of your indexer command. 

   -Original message-
From:Teague James teag...@insystechinc.com
Sent: Thursday 16th January 2014 16:43
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hello Markus,

I do get a linkdb folder in the crawl folder that gets created - but it 
is created at the time that I execute the command automatically by 
Nutch. I just tried to use solrindex against yesterday's cawl and did 
not get any errors, but did not get the anchor field or any of the 
outlinks. I used this command:
bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb 
crawl/linkdb crawl/segments/*

I then tried:
bin/nutch solrindex 

Re: Changing existing index to use block-join

2014-01-20 Thread dev


Quoting Mikhail Khludnev mkhlud...@griddynamics.com:


On Sat, Jan 18, 2014 at 11:25 PM, d...@geschan.de wrote:


So, my question now: can I change my existing index in just adding a
is_parent and a _root_ field and saving the journal id there like I did
with j-id or do I have to reindex all my documents?



Absolutely, to use block-join you need to index nested documents as blocks,
as it's described at
http://blog.griddynamics.com/2013/09/solr-block-join-support.html eg
https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml



Thank you for the clarification.
But is there no way to add new children without indexing the parent
document and all existing children again?


So, in the example on GitHub, if I want to add new sizes and colors to
an existing T-Shirt, I have to reindex the already existing T-Shirt
and all its variations again?


I understand that the blocks are created at index time, so I can't
change an existing index to build blocks just by adding the _root_
field, but I don't get why it's not possible to add new children, or
did I misinterpret your statement?


Thanks,
-Gesh



Re: Multi Lingual Analyzer

2014-01-20 Thread Erick Erickson
It Depends (tm). Approach (2) will give you better, more specific
search results. (1) is simpler to implement and might be good
enough...



On Mon, Jan 20, 2014 at 5:21 AM, David Philip
davidphilipshe...@gmail.com wrote:
 Hi,



   I have a query on Multi-Lingual Analyser.


  Which one of the  below is the best approach?


 1.1.To develop a translator that translates a/any language to
 English and then use standard English analyzer to analyse – use translator,
 both at index time and while search time?

 2.  2.  To develop a language specific analyzer and use that by
 creating specific field only for that language?

 We have client data coming in different Languages: Kannada and Telegu and
 others later.This data is basically the text written by customer in that
 language.


 Requirement is to develop analyzers particular for these language.



 Thanks - David


[OT] Use Cases for Taming Text, 2nd ed.

2014-01-20 Thread Grant Ingersoll
Hi Solr Users,

Drew Farris, Tom Morton and I are currently working on the 2nd Edition of 
Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting 
interested parties who would be willing to contribute to a chapter on practical 
use cases (i.e. you have something in production and are willing to write about 
it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine 
learning using Mahout, OpenNLP or MALLET -- ideally you are using combinations 
of 2 or more of these to solve your problems.  We are especially interested in 
large scale use cases in eCommerce, Advertising, social media analytics, fraud, 
etc.

The writing process is fairly straightforward.  A section roughly equates to 
somewhere between 3 - 10 pages, including diagrams/pictures.  After writing, 
there will be some feedback from editors and us, but otherwise the process is 
fairly simple.

In order to participate, you must have permission from your company to write on 
the topic.  You would not need to divulge any proprietary information, but we 
would want enough information for our readers to gain a high-level 
understanding of your use case.  In exchange for your participation, you will 
have your name and company published on that section of the book as well as in 
the acknowledgments section.  If you have a copy of Lucene in Action or Mahout 
In Action, it would be similar to the use case sections in those books.

If you are interested, please respond privately to me using my 
gsing...@apache.org email address with this subject line.

Thanks,
Grant, Drew, Tom







Getting all search words relevant for the document to be found

2014-01-20 Thread Tomaz Kveder
Hi!

I need a little help from you. 

We have complex documents stored in a database, and on the page we show them from
the database. We index them but do not store them in Solr, so we can't use the Solr
highlighter. Still, we would like to highlight the search words found in
the document. What approach would you suggest?

Our approach and idea is hidden in this basic question:
Is it possible to get the list of all search words with which a specific
document was found (with all the language varieties of each word)?

Let me explain what I mean with a simplified example. We index the sentence:
The big cloud is very dark. The user puts these words in the search box:
clouds dark rain.

Can I get from Solr that that particular document was found because of the words
cloud and dark, so we can highlight them in the content?

Of course we can highlight the exact words the user put in the search field. But
that's not enough. We would also like to highlight all the language varieties
that the document was found on.

Thanks!

Best regards,

Tomaz




Re: Changing existing index to use block-join

2014-01-20 Thread Mikhail Khludnev
On Mon, Jan 20, 2014 at 6:11 PM, d...@geschan.de wrote:


 Zitat von Mikhail Khludnev mkhlud...@griddynamics.com:

  On Sat, Jan 18, 2014 at 11:25 PM, d...@geschan.de wrote:

  So, my question now: can I change my existing index in just adding a
 is_parent and a _root_ field and saving the journal id there like I did
 with j-id or do I have to reindex all my documents?


 Absolutely, to use block-join you need to index nested documents as
 blocks,
 as it's described at
 http://blog.griddynamics.com/2013/09/solr-block-join-support.html eg
 https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml


 Thank you for the clarification.
 But there is no way to add new children without indexing the parent
 document and all existing childs again?

Yes. There is no way to add children incrementally. You need to nuke the whole
block and re-add it with all necessary children.



 So, in the example on github, if I want to add new sizes and colors to an
 existing T-Shirt, I have to reindex the already existing T-Shirt and all
 it's variations again?

Completely reindex t-shirts with all skus.



 I understand that the blocks are created at index time, so I can't change
 an existing index to build blocks just in adding the _root_ field, but I
 don't get why it's not possible to add new children or did I missinterpret
 your statement?


Block join relies on internal Lucene docnums, which are defined by the order
in which documents have been indexed.

this might help
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene



 Thanks,
 -Gesh




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread Shawn Heisey

On 1/20/2014 3:02 AM, onetwothree wrote:

OS Windows server 2008

4 Cpu
8 GB Ram


snip


We're using a .Net Service (based on Solr.Net) for updating and inserting
documents on a single Solr Core instance. The size of documents sent to Solr
vary from 1 Kb up to 8Mb, we're sending the documents in batches, using one
or multiple threads. The current size of the Solr Index is about 15GB.

The indexing service is running around 4 a 5 hours per day, to complete all
inserts and updates to Solr. While the indexing process is running the
Tomcat process memory usage keeps growing up to  7GB Ram (using Process
Explorer monitor tool) and does not reduce, even after 24 hours. After a
restart of Tomcat, or a Reload Core in the Solr Admin the memory drops back
to 1 a 2 GB Ram. While using a tool like VisualVM to monitor the Tomcat
process, the memory usage of Tomcat seems ok, memory consumption is in range
of defined jvm startup params (see image).

So it seems that filesystem buffers are consuming all the leftover memory??,
and don't release memory, even after a quite amount of time? Is there a way
handle this behaviour, in a way that not all memory is consumed? Are there
other alternatives? Best practices?

http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png


That picture seems to be a very low-res copy of your screenshot.  I 
can't really make it out.  I can tell you that it's completely normal 
for the OS disk cache (the filesystem buffers you mention) to take up 
all leftover memory.  If an application requests some of that memory, 
the OS will instantly give it up.


First, I'm going to explain something about memory reporting and Solr 
that I've noticed, then I will give you some news you probably won't like.


The numbers reported by visualvm are a true picture of Java heap memory 
usage.  The actual memory usage for Solr will be just a little bit more 
than those numbers.  In the newest versions of Solr, there seems to be a 
side effect of the Java MMAP implementation that results in incorrect 
memory usage reporting at the operating system level.  Here's a top 
output on one of my Solr servers running CentOS, sorted by memory 
usage.  The process at the top of the list is Solr.


https://www.dropbox.com/s/y1nus7lpzlb1mp9/solr-memory-usage-2014-01-20%2010.28.28.png

Some quick numbers for you:  The machine has 64GB of RAM.  Solr shows a 
virtual memory size of 59.2GB.  My indexes take up 51293336 of disk 
space, and Solr has a 6GB heap, so 59.2GB is not out of line for the 
virtual memory size.


Now for where things get weird: There is 48GB of RAM taken up by the 
cached value, which is the OS disk cache.  The screenshot also shows 
that Solr is using 22GB of resident RAM.  If you add the 48GB in the OS 
disk cache and the 22GB of resident RAM for Solr, you get 70GB ... which 
is more memory than the machine even HAS, so we know something's off.  
The 'shared' memory for Solr is 15GB, which when you subtract it from 
the 22GB, gives you 7GB, which is much more realistic with a 6GB heap, 
and also makes it fit within the total system RAM.


The news that you probably won't like:

I'm assuming that the whole reason you looked into memory usage was 
because you're having performance problems.  With 8GB of RAM and 3GB 
given to Solr, you basically have a little bit less than 5GB of RAM for 
the OS disk cache.  With that much RAM, most people can effectively 
cache an index up to about 10GB before performance problems show up.  
Your index is 15GB.  You need more total system RAM.  If Solr isn't 
crashing, you can probably leave the heap at 3GB with no problem.


http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Facet count mismatch.

2014-01-20 Thread Luis Cappa Banda
Hello!

I've installed a classic two-shard Solr 4.5 topology without SolrCloud,
balanced with an HA proxy. I've got a *copyField* like this:

  <field name="tagValues" type="string" indexed="true" stored="true"
         multiValued="false"/>

Copied from this one:

  <field name="tags" type="searchableTextTokenized" indexed="true"
         stored="true" multiValued="false"/>

  <!-- Fieldtype used in fields available to test searching -->
  <fieldType name="searchableTextTokenized" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory"
                 pattern="[\s\t\n\?\!\¿\¡:,;@\\.,\\(\\)\\{\\}\\/\\-]+" />
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>


When faceting with *tagValues* field I've got a total count of 3:


  facet_counts: {
    facet_queries: {},
    facet_fields: {
      tagsValues: [
        "sucks", 3
      ]
    },
    facet_dates: {},
    facet_ranges: {}
  }



But when searching like this with *tagValues*, the total number of documents
is not three, but two:



  params: {
    facet: true,
    shards: "solr1.test:8081/comments/data,solr2.test:8080/comments/data",
    facet.mincount: 1,
    facet.sort: count,
    q: "tagsValues:sucks",
    facet.limit: -1,
    facet.field: tagsValues,
    wt: json
  }



Any idea of what's happening here? I'm confused, :-/

Regards,


-- 
- Luis Cappa


Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We are testing our shiny new Solr Cloud architecture but we are
experiencing some issues when doing bulk indexing.

We have 5 solr cloud machines running and 3 indexing machines (separate
from the cloud servers). The indexing machines pull off ids from a queue
then they index and ship documents over via a CloudSolrServer. It appears
that the indexers are too fast, because the load (particularly disk IO) on
the Solr Cloud machines spikes through the roof, making the entire cluster
unusable. It's kind of odd because the total index size is not even
large, i.e. about 10GB. Are there any optimizations/enhancements I could try to
help alleviate these problems?

I should note that for the above collection we only have 1 shard that's
replicated across all machines, so all machines have the full index.

Would we benefit from switching to a ConcurrentUpdateSolrServer where all
updates get sent to 1 machine and 1 machine only? We could then remove that
machine from the cluster that handles user requests.

Thanks for any input.


Re: Facet count mismatch.

2014-01-20 Thread Ahmet Arslan
Hi Luis,

Do you have deletions? What happens when you expunge Deletes?

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22

Ahmet


On Monday, January 20, 2014 10:08 PM, Luis Cappa Banda luisca...@gmail.com 
wrote:

Hello!

I've installed a classical two shards Solr 4.5 topology without SolrCloud
balancing with an HA proxy. I've got a *copyField* like this:

* field name=tagValues type=string indexed=true stored=true
multiValued=false/*

Copied from this one:

* field name=tags type=searchableTextTokenized indexed=true
stored=true multiValued=false/*

* !-- Fieldtype used in fields available to test searching --*
*    fieldType name=searchableTextTokenized class=solr.TextField
positionIncrementGap=100*
* analyzer*
* tokenizer class=solr.PatternTokenizerFactory
pattern=[\s\t\n\?\!\¿\¡:,;@\\.,\\(\\)\\{\\}\\/\\-]+ /*
* filter class=solr.ASCIIFoldingFilterFactory/*
* filter class=solr.LowerCaseFilterFactory/*
* filter class=solr.ReversedWildcardFilterFactory/*
* filter class=solr.RemoveDuplicatesTokenFilterFactory/*
* /analyzer      *
*    /fieldType*


When faceting with *tagValues* field I've got a total count of 3:


   - facet_counts:
   {
      - facet_queries: { },
      - facet_fields:
      {
         - tagsValues:
         [
            - sucks,
            - 3
            ]
         },
      - facet_dates: { },
      - facet_ranges: { }
      }



Bug when searching like this with *tagValues* the total number of documents
is not three, but two:



   - params:
   {
      - facet: true,
      - shards:
      solr1.test:8081/comments/data,solr2.test:8080/comments/data,
      - facet.mincount: 1,
      - facet.sort: count,
      - q: tagsValues:sucks,
      - facet.limit: -1,
      - facet.field: tagsValues,
      - wt: json
      }



Any idea of what's happening here? I'm confused, :-/

Regards,


-- 
- Luis Cappa


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Erick Erickson
Questions: How often do you commit your updates? What is your
indexing rate in docs/second?

In a SolrCloud setup, you should be using a CloudSolrServer. If the
server is having trouble keeping up with updates, switching to CUSS
probably wouldn't help.

So I suspect there's something not optimal about your setup that's
the culprit.

Best,
Erick

On Mon, Jan 20, 2014 at 4:00 PM, Software Dev static.void@gmail.com wrote:
 We are testing our shiny new Solr Cloud architecture but we are
 experiencing some issues when doing bulk indexing.

 We have 5 solr cloud machines running and 3 indexing machines (separate
 from the cloud servers). The indexing machines pull off ids from a queue
 then they index and ship over a document via a CloudSolrServer. It appears
 that the indexers are too fast because the load (particularly disk io) on
 the solr cloud machines spikes through the roof making the entire cluster
 unusable. It's kind of odd because the total index size is not even
 large..ie,  10GB. Are there any optimization/enhancements I could try to
 help alleviate these problems?

 I should note that for the above collection we have only have 1 shard thats
 replicated across all machines so all machines have the full index.

 Would we benefit from switching to a ConcurrentUpdateSolrServer where all
 updates get sent to 1 machine and 1 machine only? We could then remove this
 machine from our cluster than that handles user requests.

 Thanks for any input.


Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We have a soft commit every 5 seconds and a hard commit every 30. As
far as docs/second, I would guess around 200/sec, which doesn't seem that
high.


On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson erickerick...@gmail.comwrote:

 Questions: How often do you commit your updates? What is your
 indexing rate in docs/second?

 In a SolrCloud setup, you should be using a CloudSolrServer. If the
 server is having trouble keeping up with updates, switching to CUSS
 probably wouldn't help.

 So I suspect there's something not optimal about your setup that's
 the culprit.

 Best,
 Erick

 On Mon, Jan 20, 2014 at 4:00 PM, Software Dev static.void@gmail.com
 wrote:
  We are testing our shiny new Solr Cloud architecture but we are
  experiencing some issues when doing bulk indexing.
 
  We have 5 solr cloud machines running and 3 indexing machines (separate
  from the cloud servers). The indexing machines pull off ids from a queue
  then they index and ship over a document via a CloudSolrServer. It
 appears
  that the indexers are too fast because the load (particularly disk io) on
  the solr cloud machines spikes through the roof making the entire cluster
  unusable. It's kind of odd because the total index size is not even
  large..ie,  10GB. Are there any optimization/enhancements I could try to
  help alleviate these problems?
 
  I should note that for the above collection we have only have 1 shard
 thats
  replicated across all machines so all machines have the full index.
 
  Would we benefit from switching to a ConcurrentUpdateSolrServer where all
  updates get sent to 1 machine and 1 machine only? We could then remove
 this
  machine from our cluster than that handles user requests.
 
  Thanks for any input.



Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
updates get sent to one machine or something?


On Mon, Jan 20, 2014 at 2:42 PM, Software Dev static.void@gmail.comwrote:

 We commit have a soft commit every 5 seconds and hard commit every 30. As
 far as docs/second it would guess around 200/sec which doesn't seem that
 high.


 On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
 erickerick...@gmail.comwrote:

 Questions: How often do you commit your updates? What is your
 indexing rate in docs/second?

 In a SolrCloud setup, you should be using a CloudSolrServer. If the
 server is having trouble keeping up with updates, switching to CUSS
 probably wouldn't help.

 So I suspect there's something not optimal about your setup that's
 the culprit.

 Best,
 Erick

 On Mon, Jan 20, 2014 at 4:00 PM, Software Dev static.void@gmail.com
 wrote:
  We are testing our shiny new Solr Cloud architecture but we are
  experiencing some issues when doing bulk indexing.
 
  We have 5 solr cloud machines running and 3 indexing machines (separate
  from the cloud servers). The indexing machines pull off ids from a queue
  then they index and ship over a document via a CloudSolrServer. It
 appears
  that the indexers are too fast because the load (particularly disk io)
 on
  the solr cloud machines spikes through the roof making the entire
 cluster
  unusable. It's kind of odd because the total index size is not even
  large..ie,  10GB. Are there any optimization/enhancements I could try
 to
  help alleviate these problems?
 
  I should note that for the above collection we have only have 1 shard
 thats
  replicated across all machines so all machines have the full index.
 
  Would we benefit from switching to a ConcurrentUpdateSolrServer where
 all
  updates get sent to 1 machine and 1 machine only? We could then remove
 this
  machine from our cluster than that handles user requests.
 
  Thanks for any input.





Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Mark Miller
What version are you running?

- Mark

On Jan 20, 2014, at 5:43 PM, Software Dev static.void@gmail.com wrote:

 We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
 updates get sent to one machine or something?
 
 
 On Mon, Jan 20, 2014 at 2:42 PM, Software Dev 
 static.void@gmail.comwrote:
 
 We commit have a soft commit every 5 seconds and hard commit every 30. As
 far as docs/second it would guess around 200/sec which doesn't seem that
 high.
 
 
 On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
 erickerick...@gmail.comwrote:
 
 Questions: How often do you commit your updates? What is your
 indexing rate in docs/second?
 
 In a SolrCloud setup, you should be using a CloudSolrServer. If the
 server is having trouble keeping up with updates, switching to CUSS
 probably wouldn't help.
 
 So I suspect there's something not optimal about your setup that's
 the culprit.
 
 Best,
 Erick
 
 On Mon, Jan 20, 2014 at 4:00 PM, Software Dev static.void@gmail.com
 wrote:
 We are testing our shiny new Solr Cloud architecture but we are
 experiencing some issues when doing bulk indexing.
 
 We have 5 solr cloud machines running and 3 indexing machines (separate
 from the cloud servers). The indexing machines pull off ids from a queue
 then they index and ship over a document via a CloudSolrServer. It
 appears
 that the indexers are too fast because the load (particularly disk io)
 on
 the solr cloud machines spikes through the roof making the entire
 cluster
 unusable. It's kind of odd because the total index size is not even
 large..ie,  10GB. Are there any optimization/enhancements I could try
 to
 help alleviate these problems?
 
 I should note that for the above collection we have only have 1 shard
 thats
 replicated across all machines so all machines have the full index.
 
 Would we benefit from switching to a ConcurrentUpdateSolrServer where
 all
 updates get sent to 1 machine and 1 machine only? We could then remove
 this
 machine from our cluster than that handles user requests.
 
 Thanks for any input.
 
 
 



Re: Solr Cloud Bulk Indexing Questions

2014-01-20 Thread Software Dev
4.6.0


On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller markrmil...@gmail.com wrote:

 What version are you running?

 - Mark

 On Jan 20, 2014, at 5:43 PM, Software Dev static.void@gmail.com
 wrote:

  We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all
  updates get sent to one machine or something?
 
 
  On Mon, Jan 20, 2014 at 2:42 PM, Software Dev static.void@gmail.com
 wrote:
 
  We commit have a soft commit every 5 seconds and hard commit every 30.
 As
  far as docs/second it would guess around 200/sec which doesn't seem that
  high.
 
 
  On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson 
 erickerick...@gmail.comwrote:
 
  Questions: How often do you commit your updates? What is your
  indexing rate in docs/second?
 
  In a SolrCloud setup, you should be using a CloudSolrServer. If the
  server is having trouble keeping up with updates, switching to CUSS
  probably wouldn't help.
 
  So I suspect there's something not optimal about your setup that's
  the culprit.
 
  Best,
  Erick
 
  On Mon, Jan 20, 2014 at 4:00 PM, Software Dev 
 static.void@gmail.com
  wrote:
  We are testing our shiny new Solr Cloud architecture but we are
  experiencing some issues when doing bulk indexing.
 
  We have 5 solr cloud machines running and 3 indexing machines
 (separate
  from the cloud servers). The indexing machines pull off ids from a
 queue
  then they index and ship over a document via a CloudSolrServer. It
  appears
  that the indexers are too fast because the load (particularly disk io)
  on
  the solr cloud machines spikes through the roof making the entire
  cluster
  unusable. It's kind of odd because the total index size is not even
  large..ie,  10GB. Are there any optimization/enhancements I could try
  to
  help alleviate these problems?
 
  I should note that for the above collection we have only have 1 shard
  thats
  replicated across all machines so all machines have the full index.
 
  Would we benefit from switching to a ConcurrentUpdateSolrServer where
  all
  updates get sent to 1 machine and 1 machine only? We could then remove
  this
  machine from our cluster than that handles user requests.
 
  Thanks for any input.
 
 
 




Re: Multi Lingual Analyzer

2014-01-20 Thread Benson Margulies
MT is not nearly good enough to allow approach 1 to work.

On Mon, Jan 20, 2014 at 9:25 AM, Erick Erickson erickerick...@gmail.com wrote:
 It Depends (tm). Approach (2) will give you better, more specific
 search results. (1) is simpler to implement and might be good
 enough...



 On Mon, Jan 20, 2014 at 5:21 AM, David Philip
 davidphilipshe...@gmail.com wrote:
 Hi,



   I have a query on Multi-Lingual Analyser.


  Which one of the  below is the best approach?


 1.1.To develop a translator that translates a/any language to
 English and then use standard English analyzer to analyse – use translator,
 both at index time and while search time?

 2.  2.  To develop a language specific analyzer and use that by
 creating specific field only for that language?

 We have client data coming in different Languages: Kannada and Telegu and
 others later.This data is basically the text written by customer in that
 language.


 Requirement is to develop analyzers particular for these language.



 Thanks - David


Optimizing index on Slave

2014-01-20 Thread Salman Akram
All,

I know that normally the index should be optimized on the master and then
replicated to the slaves, but we have an issue with network bandwidth.

We optimize indexes weekly (total size is around 1.5TB). We have a few slaves
set up on the local network, so replicating the whole index is not a big
issue.

However, we also have one slave in another city (on a backup network) which
of course gets replicated over the internet, which is quite slow and expensive.
We want to avoid copying the complete index every week after optimization
and were wondering whether it is possible to optimize it independently on that slave so
that there is no delta between master and slave. We tried to do it, but
the slave still replicated from the master.


-- 
Regards,

Salman Akram


Re: Memory Usage on Windows Os while indexing

2014-01-20 Thread onetwothree
Thanks for the reply, dropbox image added.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Memory-Usage-on-Windows-Os-while-indexing-tp4112262p4112403.html
Sent from the Solr - User mailing list archive at Nabble.com.