find all two word phrases that appear in more than one document

2013-09-09 Thread Ali, Saqib
Dear Solr Ninjas,

We would like to run a query that returns two-word phrases that appear in
more than one document. For example, take the string "Solr Ninja". Since it
appears in more than one document in our Solr instance, the query should
return it. The query should find all such phrases across all the documents
in our Solr instance, where each phrase is a pair of adjacent words taken
from the documents in the Solr index.

Any ideas on how to write this query?

Thanks.


Re: find all two word phrases that appear in more than one document

2013-09-09 Thread Ali, Saqib
Thanks Alexandre. I looked at the wiki page for the TermsComponent. But I
am not sure if I follow. Do you have an example or some better document?
Thanks! :)


On Mon, Sep 9, 2013 at 8:17 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 The phrases are usually called n-grams or shingles.

 You can probably use ShingleFilterFactory to create your shingles (possibly
 with outputUnigrams=false) and then use TermsComponent (
 http://wiki.apache.org/solr/TermsComponent) to list the results.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Tue, Sep 10, 2013 at 8:22 AM, Ali, Saqib docbook@gmail.com wrote:

  Dear Solr Ninjas,
 
  We would like to run a query that returns two-word phrases that appear in
  more than one document. For example, take the string "Solr Ninja". Since it
  appears in more than one document in our Solr instance, the query should
  return it. The query should find all such phrases across all the documents
  in our Solr instance, where each phrase is a pair of adjacent words taken
  from the documents in the Solr index.
 
  Any ideas on how to write this query?
 
  Thanks.
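
A rough sketch of what the shingle approach computes, written in plain Python rather than Solr. It assumes whitespace tokenization and lowercasing, which a real analyzer chain may do differently; the function names and sample documents are made up for illustration. In Solr itself the shingles would come from ShingleFilterFactory, and the "more than one document" cut-off would likely be expressed with the TermsComponent parameter terms.mincount=2.

```python
from collections import defaultdict


def two_word_shingles(text):
    """Return the set of adjacent two-word phrases in one document."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


def shared_shingles(docs, min_docs=2):
    """Count documents per shingle and keep those seen in at least min_docs."""
    doc_freq = defaultdict(int)
    for text in docs:
        # A set per document, so a phrase repeated inside one document
        # still counts only once toward its document frequency.
        for shingle in two_word_shingles(text):
            doc_freq[shingle] += 1
    return {s: n for s, n in doc_freq.items() if n >= min_docs}


docs = [
    "Dear Solr Ninjas please help",
    "the Solr Ninjas replied quickly",
    "a single unrelated document",
]
print(shared_shingles(docs))
```

Counting document frequency rather than raw occurrences is the point: a phrase that appears ten times in a single document should not qualify.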
 



removing duplicates

2013-08-21 Thread Ali, Saqib
hello,

We have documents that are duplicates, i.e. the ID is different but the rest of
the fields are the same. Is there a query that can remove the duplicates and
leave just one copy of each document in Solr? There is one numeric field that we
can key off to find duplicates.

Please advise.

Thanks


Re: removing duplicates

2013-08-21 Thread Ali, Saqib
Thanks Aloke and Robert. Can you please give me code/query snippets?
(newbie here)


On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal alghos...@gmail.com wrote:

 Hi,

 Facet by one of the duplicate fields (probably by the numeric field that
 you mentioned) and set facet.mincount=2.

 Regards,
 Aloke


 On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib docbook@gmail.com wrote:

  hello,
 
  We have documents that are duplicates, i.e. the ID is different but the rest of
  the fields are the same. Is there a query that can remove the duplicates and
  leave just one copy of each document in Solr? There is one numeric field that we
  can key off to find duplicates.
 
  Please advise.
 
  Thanks
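
A minimal sketch of the faceting suggestion, building the request URL only. The base URL and the field name dup_key are placeholders, not taken from the thread; the query parameters (facet, facet.field, facet.mincount, facet.limit, rows) are standard Solr faceting parameters. Note that this only lists the values that occur in two or more documents. Deleting the extra copies would still be a separate step.

```python
from urllib.parse import urlencode


def duplicate_facet_url(base_url, field):
    """Build a select URL that lists values of `field` shared by 2+ documents."""
    params = [
        ("q", "*:*"),            # match every document
        ("rows", 0),             # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", 2),   # drop values that occur only once
        ("facet.limit", -1),     # return all qualifying values
        ("wt", "json"),
    ]
    return base_url + "/select?" + urlencode(params)


# Hypothetical host, core, and field name.
url = duplicate_facet_url("http://localhost:8983/solr/collection1", "dup_key")
print(url)
```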
 



Re: [solr 4.4.0] SPLITSHARD and core autodiscovery

2013-08-02 Thread Ali, Saqib
Dmitry,

That is expected behaviour. You need to manually remove the original core.

Thanks.


On Fri, Aug 2, 2013 at 6:03 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hello list,

 I was wondering whether what I see with the split shard is correct behaviour or
 whether something is wrong.

 Following this article:

 http://searchhub.org/2013/06/19/shard-splitting-in-solrcloud/

 I have issued a low-level core split query:


 http://localhost:8982/solr/admin/cores?core=core1&action=SPLIT&path=multicore/core11&path=multicore/core12

 which has completed successfully. Two new index directories got created
 under example/multicore directory.

 What didn't happen is core autodiscovery.  That is, the dashboard page
 still shows the original core core1.

 Is this expected or a bug?

 On a separate note: after splitting a core into two new cores, how does the
 search routing work in a non-SolrCloud environment? Is this taken
 care of by Solr (via the original core), or is it a client-side task?

 Thanks,

 Dmitry



uniqueKey: string vs. long integer

2013-08-01 Thread Ali, Saqib
We have an application that was developed by a third party. It
uses uniqueKey that is a long integer instead of a string. Will there be
any repercussions of using a long integer instead of string for the
uniqueKey?

Thanks! :)


Re: uniqueKey: string vs. long integer

2013-08-01 Thread Ali, Saqib
I think I have found an issue with using a long integer for the uniqueKey:
document routing using the ! notation will not work with a long integer uniqueKey :(


Thanks Jack and Robi
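
A small illustration of why the ! notation and a long uniqueKey do not mix, assuming the routed ID takes the form shardKey!docId. The helper names and the shard key tenantA are made up; the check simply asks whether a value could be stored in a signed 64-bit integer field.

```python
def composite_id(shard_key, doc_id):
    """Build a 'shardKey!docId' style ID (assumed form of the ! notation)."""
    return f"{shard_key}!{doc_id}"


def fits_long_field(value):
    """Return True if the value parses as a signed 64-bit integer."""
    try:
        number = int(value)
    except ValueError:
        return False
    return -2**63 <= number < 2**63


routed = composite_id("tenantA", 1439513570014188310)
print(routed, fits_long_field(routed))
print("1439513570014188310", fits_long_field("1439513570014188310"))
```

The plain numeric ID fits a long field, while the routed form is only representable as a string.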


On Thu, Aug 1, 2013 at 10:05 AM, Petersen, Robert 
robert.peter...@mail.rakuten.com wrote:

 Hi guys,

 We have used an integer as our unique key since solr 1.3 with no problems
 at all.  We never thought of using anything else because our solr unique
 key is based upon our product sku data base field which is defined as an
 integer also.   We're on solr 3.6.1 currently.

 Thanks
 Robi

 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: Thursday, August 01, 2013 9:27 AM
 To: solr-user@lucene.apache.org
 Subject: Re: uniqueKey: string vs. long integer

 Although I cringe at the thought of anybody using anything other than a
 string for the unique key for a document, I can't point to any part of Solr
 that will absolutely fail. I wouldn't be surprised if there were a few
 nooks and crannies in Solr that might depend on the type of the ID, or at
 least depend on it being able to be converted to and from string. I'm not sure
 if SolrCloud has any dependence on the document ID field type.

 Could you inquire as to why this third party chose to go with a non-string
 document key? Just curious if they perceived some advantage. I mean, is the
 key used in numeric calculations? Can it be negative? Is it ever sorted?

 But as a Solr best practice, I'd advise against it.

 -- Jack Krupansky

 -Original Message-
 From: Ali, Saqib
 Sent: Thursday, August 01, 2013 12:02 PM
 To: solr-user@lucene.apache.org
 Subject: uniqueKey: string vs. long integer

 We have an application that was developed by a third party. It uses
 uniqueKey that is a long integer instead of a string. Will there be any
 repercussions of using a long integer instead of string for the uniqueKey?

 Thanks! :)






Re: FieldCollapsing issues in SolrCloud 4.4

2013-07-31 Thread Ali, Saqib
Hello Paul,

Can you please explain what you mean by:
To get the exact number of groups, you need to shard along your grouping
field

Thanks! :)


On Wed, Jul 31, 2013 at 3:08 AM, Paul Masurel paul.masu...@gmail.com wrote:

 Do you mean you get different results with group=true?
 numFound is supposed to return the number of ungrouped hits.

 To get the number of groups, you are expected to
 set group.ngroups=true.
 Even then, the result will only give you an upperbound
 in a distributed environment.
 To get the exact number of groups, you need to shard along
 your grouping field.

 If you have many groups, you may also experience a huge performance
 hit, as the current implementation has been heavily optimized for a low
 number of groups (e.g. e-commerce categories).

 Paul



 On Wed, Jul 31, 2013 at 1:59 AM, Ali, Saqib docbook@gmail.com wrote:

  Hello all,
 
  Is anyone experiencing issues with the numFound when using group=true in
  SolrCloud 4.4?
 
  Sometimes the results are off for us.
 
  I will post more details shortly.
 
  Thanks.
 



 --
 __

  Masurel Paul
  e-mail: paul.masu...@gmail.com
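
One way to picture the upper-bound remark, as a toy model rather than Solr's actual code: assume each shard counts its own distinct groups and the per-shard counts are simply added up. The shard contents below are invented. When a group's documents are spread over several shards the sum over-counts; when every group lives on exactly one shard, the sum is exact.

```python
def summed_group_count(shards):
    """Add up the number of distinct groups seen on each shard."""
    return sum(len(set(shard)) for shard in shards)


def true_group_count(shards):
    """Count distinct groups across the whole collection."""
    return len({value for shard in shards for value in shard})


# Each inner list holds the grouping-field value of every document on a shard.
mixed = [["a", "b", "c"], ["b", "c", "d"]]      # groups b and c span both shards
by_group = [["a", "a", "b"], ["c", "d", "d"]]   # sharded along the grouping field

print(summed_group_count(mixed), true_group_count(mixed))
print(summed_group_count(by_group), true_group_count(by_group))
```

"Sharding along the grouping field" corresponds to the second layout: all documents with the same group value end up on the same shard.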



Using HP SiteScope to monitor individual Solr shards

2013-07-30 Thread Ali, Saqib
We would like to use HP SiteScope to monitor the availability of
the individual Solr shards. Any ideas on how we can do that? Is there a
shard based URL that is a sure shot of knowing that the shard is feeling
healthy?

Thanks! :)


FieldCollapsing issues in SolrCloud 4.4

2013-07-30 Thread Ali, Saqib
Hello all,

Is anyone experiencing issues with the numFound when using group=true in
SolrCloud 4.4?

Sometimes the results are off for us.

I will post more details shortly.

Thanks.


Re: monitor jvm heap size for solrcloud

2013-07-26 Thread Ali, Saqib
You can use SPM (I think):
http://sematext.com/spm/solr-performance-monitoring/


On Fri, Jul 26, 2013 at 1:36 PM, Joshi, Shital shital.jo...@gs.com wrote:

 We have SolrCloud cluster (5 shards and 2 replicas) on 10 boxes. While
 running stress tests, we want to monitor JVM heap size across 10 nodes. Is
 there a utility which would connect to all nodes' jmx port and display all
 bean details for the cloud?

 Thanks!





zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Ali, Saqib
Hello all,

Every time I issue a SPLITSHARD using Collections API, the zkHost attribute
in the solr.xml goes missing. I have to manually edit the solr.xml to add
zkHost after every SPLITSHARD.

Any thoughts on what could be causing this?

Thanks.


Re: zkHost in solr.xml goes missing after SPLITSHARD using Collections API

2013-07-23 Thread Ali, Saqib
Thanks Alan and Shawn. Just installed Solr 4.4, and no longer experiencing
the issue.

Thanks! :)


On Tue, Jul 23, 2013 at 7:21 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/23/2013 7:50 AM, Alan Woodward wrote:
  Can you try upgrading to the just-released 4.4?  Solr.xml persistence
 had all kinds of bugs in 4.3, which should have been fixed now.

 The 4.4.0 release has been finalized and uploaded, but the download link
 hasn't been changed yet because the mirror network isn't fully
 synchronized yet.  It is available from many mirrors, but until the
 website download links get changed, there's not yet a direct way to
 access it.

 Here's some generic instructions for situations where the new version is
 done, but the official announcement isn't out yet:

 http://lucene.apache.org/solr/

 1) Go to the Solr website (URL above) and click on the latest version
 download button, which at this moment is 4.3.1.  Wait for the redirect
 to take you to a mirror list.

 2) Click on one of the mirrors, the best option is usually the one right
 on top that the website chose for you.

 3) When the file list comes up, click the Parent Directory link.  If
 this isn't showing, it will most likely be labelled with .. instead.

 4) If a directory for the new version (in this case 4.4.0) is listed,
 click on it and then click the file that you want to download.

 If the new version is not listed, click the Back button on your browser
 twice, then go back to step 2, but this time choose a different mirror.

 One last reminder: This only works right before a release is officially
 announced.  These instructions cannot be used while a release is still
 in development.

 Thanks,
 Shawn




maximum number of documents per shard?

2013-07-23 Thread Ali, Saqib
still 2.1 billion documents?


Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-17 Thread Ali, Saqib
Thanks Erick!

I have added the instructions for running SolrCloud on Jboss:
http://wiki.apache.org/solr/SolrCloud%20using%20Jboss

I will refine the instructions further, and also post some screenshots.

Thanks.


On Sun, Jul 14, 2013 at 5:05 AM, Erick Erickson erickerick...@gmail.com wrote:

 Done, sorry it took so long, hadn't looked at the list in a couple of days.


 Erick

 On Fri, Jul 12, 2013 at 5:46 PM, Ali, Saqib docbook@gmail.com wrote:
  username: saqib
 
 
  On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com
 wrote:
 
  Hello,
 
  Can you please add me to the ContributorsGroup? I would like to add
  instructions for setting up SolrCloud using Jboss.
 
  thanks.
 
 



Re: Where to specify numShards when startup up a cloud setup

2013-07-16 Thread Ali, Saqib
What does the solr.xml look like on the nodes?


On Tue, Jul 16, 2013 at 2:36 PM, Robert Stewart robert_stew...@epam.com wrote:

 I want to script the creation of N solr cloud instances (on ec2).

 But it's not clear to me where I would specify the numShards setting.
 From documentation, I see you can specify on the first node you start
 up, OR alternatively, use the collections API to create a new collection
 - but in that case you need first at least one running SOLR instance.  I
 want to push all solr instances with similar configuration onto N instances
 and just run them with some number of shards pre-set somehow.  Where can I
 put numShards configuration setting?

 What I want to do:

 1) push solr configuration to zookeeper ensemble using zkCli command-line
 tool.
 2) create N instances of SOLR running on Ec2, pointing to the same
 zookeeper
 3) start all SOLR instances which will become a cloud setup with M shards
 (where M < N), and N-M replicas.

 Currently everything starts up with 1 shard and N replicas.

 I already have one single collection pre-configured.
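
For the Collections API route mentioned above, a hedged sketch that only builds the CREATE request URL. The node URL, collection name, and the numbers are placeholders; action, name, numShards and replicationFactor are Collections API parameters. As noted in the question, this needs at least one Solr node already running.

```python
from urllib.parse import urlencode


def create_collection_url(node_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL with an explicit shard count."""
    params = [
        ("action", "CREATE"),
        ("name", name),
        ("numShards", num_shards),
        ("replicationFactor", replication_factor),
    ]
    return node_url + "/admin/collections?" + urlencode(params)


# Hypothetical layout: 6 nodes hosting 3 shards with 2 copies of each.
url = create_collection_url("http://localhost:8983/solr", "collection1", 3, 2)
print(url)
```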



Re: Book contest idea - feedback requested

2013-07-15 Thread Ali, Saqib
Hello Alex,

This sounds like an excellent idea! :)

Saqib


On Mon, Jul 15, 2013 at 8:11 PM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 Hello,

 Packt Publishing has kindly agreed to let me run a contest with e-copies of
 my book as prizes:
 http://www.packtpub.com/apache-solr-for-indexing-data/book

 Since my book is about learning Solr and targeted at beginners and early
 intermediates, here is what I would like to do. I am asking for feedback on
 whether people on the mailing list like the idea or have specific
 objections to it.

 1) The basic idea is to get Solr users to write and vote on what they find
 hard with Solr, especially in understanding the features (as contrasted
 with just missing ones).
 2) I'll probably set it up as a User Voice forum, which has all the
 mechanisms for suggesting and voting on ideas. With an easier interface
 than Jira
 3) The top N voted ideas will get the books as prizes and I will try to
 fix/document/create JIRAs for those issues.
 4) I am hoping to specifically reach out to the communities where Solr is a
 component and where they don't necessarily hang out on our mailing list. I
 am thinking SolrNet, Drupal, project Blacklight, Cloudera, CrafterCMS,
 SiteCore, Typo3, SunSpot, Nutch. Obviously, anybody and everybody from this
 list would be absolutely welcome to participate as well.

 Yes? No? Suggestions?

 Also, if you are maintainer of one of the products/services/libraries that
 has Solr in it and want to reach out to your community yourself, I think it
 would be a lot better than if I did it. Contact me directly and I will let
 you know what template/FAQ I want you to include in the announcement
 message when it is ready.

 Thank you all in advance for the comments and suggestions.

 Regards,
Alex.

 Personal website: http://www.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: Clearing old nodes from zookeper without restarting solrcloud cluster

2013-07-15 Thread Ali, Saqib
Hello Luis,

I don't think that is possible. If you delete clusterstate.json from
zookeeper, you will need to restart the nodes. I could be very wrong
about this.

Saqib


On Mon, Jul 15, 2013 at 8:50 PM, Luis Carlos Guerrero Covo 
lcguerreroc...@gmail.com wrote:

 I know that you can clear zookeeper's data directory using the CLI with the
 clear command, I just want to know if it's possible to update the cluster's
 state without wiping everything out. Does anyone have any ideas/suggestions?


 On Mon, Jul 15, 2013 at 11:21 AM, Luis Carlos Guerrero Covo 
 lcguerreroc...@gmail.com wrote:

  Hi,
 
  Is there an easy way to clear zookeeper of all offline solr nodes without
  restarting the cluster? We are having some stability issues and we think it
  may be due to the leader querying old offline nodes.
 
  thank you,
 
  Luis Guerrero
 



 --
 Luis Carlos Guerrero Covo
 M.S. Computer Engineering
 (57) 3183542047



add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-12 Thread Ali, Saqib
Hello,

Can you please add me to the ContributorsGroup? I would like to add
instructions for setting up SolrCloud using Jboss.

thanks.


Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-12 Thread Ali, Saqib
username: saqib


On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib docbook@gmail.com wrote:

 Hello,

 Can you please add me to the ContributorsGroup? I would like to add
 instructions for setting up SolrCloud using Jboss.

 thanks.




java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2013-07-12 Thread Ali, Saqib
I am getting a java.lang.OutOfMemoryError: Requested array size exceeds VM
limit on certain queries.

Please advise:

19:25:02,632 INFO  [org.apache.solr.core.SolrCore]
(http-oktst1509.company.tld/12.5.105.96:8180-9) [collection1] webapp=/solr
path=/select
params={sort=sent_date+asc&distrib=false&wt=javabin&version=2&rows=2147483647&df=text&fl=id&shard.url=
12.5.105.96:8180/solr/collection1/&NOW=1373675102627&start=0&q=thread_id:1439513570014188310&isShard=true&fq=domain:company.tld+AND+owner:11782344&fsv=true}
hits=1 status=0 QTime=1
19:25:02,637 ERROR [org.apache.solr.servlet.SolrDispatchFilter]
(http-oktst1509.company.tld/12.5.105.96:8180-2)
null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Requested
array size exceeds VM limit
at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
at
org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit


preferred container for running SolrCloud

2013-07-11 Thread Ali, Saqib
1) Jboss
2) Jetty
3) Tomcat
4) Other..

?


Re: preferred container for running SolrCloud

2013-07-11 Thread Ali, Saqib
With the embedded Zookeeper or a separate Zookeeper? Also, have you run into any
issues with running SolrCloud on jetty?


On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com wrote:

 We're running under jetty.

 Sent from my iPhone

 On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com wrote:

  1) Jboss
  2) Jetty
  3) Tomcat
  4) Other..
 
  ?



Re: preferred container for running SolrCloud

2013-07-11 Thread Ali, Saqib
Thanks Walter. And the container..


On Thu, Jul 11, 2013 at 7:55 PM, Walter Underwood wun...@wunderwood.org wrote:

 Embedded Zookeeper is only for dev. Production needs to run a ZK cluster.
  --wunder

 On Jul 11, 2013, at 7:27 PM, Ali, Saqib wrote:

  With the embedded Zookeeper or a separate Zookeeper? Also, have you run into any
  issues with running SolrCloud on jetty?
 
 
  On Thu, Jul 11, 2013 at 7:01 PM, Saikat Kanjilal sxk1...@hotmail.com
 wrote:
 
  We're running under jetty.
 
  Sent from my iPhone
 
  On Jul 11, 2013, at 6:06 PM, Ali, Saqib docbook@gmail.com
 wrote:
 
  1) Jboss
  2) Jetty
  3) Tomcat
  4) Other..
 
  ?
 







SolrJ and SolrCloud

2013-07-08 Thread Ali, Saqib
Hello all,

We have an app that uses the SolrJ and instantiates using HttpSolrServer.

Now that we would like to move to SolrCloud, can we still use the same app,
or do we HAVE to switch to

CloudSolrServer server = new CloudSolrServer(?);

right away?

Or will pointing to one instance using HttpSolrServer suffice for now?

Thanks.


Re: SolrJ and SolrCloud

2013-07-08 Thread Ali, Saqib
Thanks Mark!


On Mon, Jul 8, 2013 at 10:46 AM, Mark Miller markrmil...@gmail.com wrote:


 On Jul 8, 2013, at 1:40 PM, Ali, Saqib docbook@gmail.com wrote:

  Hello all,
 
  We have an app that uses the SolrJ and instantiates using HttpSolrServer.
 
  Now that we would like to move to SolrCloud, can we still use the same
 app,
  or do we HAVE to switch to
 
  CloudSolrServer server = new CloudSolrServer(?);
 
  right away?
 
  Or will pointing to one instance using HttpSolrServer suffice for now?

 Yes, it will.

 - Mark

 
  Thanks.




SolrCloud on Jboss

2013-07-08 Thread Ali, Saqib
Hello,

Does anyone have step-by-step instructions for running SolrCloud on Jboss?

Thanks


solrj distributed solr example

2013-07-05 Thread Ali, Saqib
Hello all,

Can anyone please share a solrj example for distributed solr?

Thanks! :)


Re: [Announcement] Norch- a search engine for node.js

2013-07-05 Thread Ali, Saqib
Very interesting. What is the upper limit on the number of documents?

Thanks! :)


On Fri, Jul 5, 2013 at 11:53 AM, Fergus McDowall
fergusmcdow...@gmail.com wrote:

 Here is some news that might be of interest to users and implementers of
 Solr


 http://blog.comperiosearch.com/blog/2013/07/05/norch-a-search-engine-for-node-js/

 Norch (http://fergiemcdowall.github.io/norch/) is a search engine written
 for Node.js. Norch uses the Node search-index module which is in turn
 written using the super fast levelDB library that Google open-sourced in
 2011.

 The aim of Norch is to make a simple, fast search server, that requires
 minimal configuration to set up. Norch sacrifices complex functionality for
 a limited robust feature set, that can be used to set up a free test search
 engine for most enterprise scenarios.

 Currently Norch features

 Full text search
 Stopword removal
 Faceting
 Filtering
 Relevance weighting (tf-idf)
 Field weighting
 Paging (offset and resultset length)

 Norch can index any data that is marked up in the appropriate JSON format

 Download the first release of Norch (0.2.1) here (
 https://github.com/fergiemcdowall/norch/releases)



2.1billion+ document

2013-07-05 Thread Ali, Saqib
Question regarding the 2.1 billion+ document.

I understand that a single instance of solr has a limit of 2.1 billion
documents.

We currently have a single solr server. If we reach 2.1billion documents
limit, what is involved in moving to the Solr DistributedSearch?

Thanks! :)


Re: 2.1billion+ document

2013-07-05 Thread Ali, Saqib
Hello Otis,

I was thinking more in terms of Solr DistributedSearch rather than
SolrCloud. I was hoping to add another Solr instance, when the time comes.
This is a low-use application, but with a lot of data. Uptime and query speed
are not important. However, we would like to be able to index more than
2.1 billion documents when the time comes.

Any advice will be highly appreciated.


Thanks!!! :)
Saqib


On Fri, Jul 5, 2013 at 6:23 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

 Hi,

 It's a broad question, but it starts with getting a few servers,
 putting Solr 4.3.1 on it (soon 4.4), setting up Zookeeper, creating a
 Solr Collection (index) with N shards and M replicas, and reindexing
 your old data to this new cluster, which you can expand with new nodes
 over time.  If you have specific questions...

 Otis
 --
 Solr & ElasticSearch Support -- http://sematext.com/
 Performance Monitoring -- http://sematext.com/spm



 On Fri, Jul 5, 2013 at 8:42 PM, Ali, Saqib docbook@gmail.com wrote:
  Question regarding the 2.1 billion+ document.
 
  I understand that a single instance of solr has a limit of 2.1 billion
  documents.
 
  We currently have a single solr server. If we reach 2.1billion documents
  limit, what is involved in moving to the Solr DistributedSearch?
 
  Thanks! :)



Re: 2.1billion+ document

2013-07-05 Thread Ali, Saqib
Thanks Jason! That was very helpful.

I read on the solr wiki that:
Documents must have a unique key and the unique key must be stored
(stored=true in schema.xml)

What is this unique key? Is this just an id that we define in the schema.xml
that is unique across all documents? We have something as follows:
<field name="id" type="long" indexed="true" stored="true"/>

Will this suffice?



Thanks.

On Fri, Jul 5, 2013 at 7:45 PM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 Saqib:

 At the simplest level:

 1)  Source the machine
 2)  Install Java
 3)  Install a servlet container of your choice
 4)  Copy your Solr WAR and conf directories as desired (probably a rough
 mirror of your current single server)
 5)  Start it up and start sending data there
 6)  Query both by simply adding:
  shards=host1/solr/collection,host2/solr/collection
 7)  Profit

 Or, in shorthand:

 1)  Install new Solr instance and start indexing data there
 2)  Add the shards parameter to your queries with both (or more) servers
 3)  …
 4)  Profit

 Now…we usually want to be concerned about how to manage the data so that
 we don't send duplicates.  Without SolrCloud it is our responsibility to
 delegate traffic for updates and deletes.  We also like to think a bit more
 about how to take advantage of our lovely parallelism to increase index or
 query time.  We should also consider strategies to isolate domain data to
 single shards so as to allow isolated queries against dedicated data models
 in single shards.

 But if you just want the basics, it really is as easy as described above.

 Jason
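
Step 6 above, sketched as a URL builder. The host names, port, collection name and query are placeholders; shards is the distributed-search request parameter, a comma-separated list of host/path entries without the http:// prefix. The safe argument just keeps the slashes, commas and colons readable in the printed URL.

```python
from urllib.parse import urlencode


def distributed_query_url(query_host, collection, query, shard_hosts):
    """Build a select URL that fans the query out to every listed shard."""
    shards = ",".join(f"{host}/solr/{collection}" for host in shard_hosts)
    params = [("q", query), ("shards", shards)]
    # Any one of the instances can receive the request and merge the results.
    return (f"http://{query_host}/solr/{collection}/select?"
            + urlencode(params, safe="/,:"))


url = distributed_query_url(
    "host1:8983", "collection1", "label:solr", ["host1:8983", "host2:8983"]
)
print(url)
```

Nothing here routes updates: without SolrCloud, deciding which instance receives each document stays with the indexing client, as the message above points out.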


 On Jul 5, 2013, at 7:36 PM, Ali, Saqib docbook@gmail.com wrote:

  Hello Otis,
 
  I was thinking more in terms of Solr DistributedSearch rather than
  SolrCloud. I was hoping to add another Solr instance, when the time
 comes.
  This is a low-use application, but with a lot of data. Uptime and query speed
  are not important. However, we would like to be able to index more than
  2.1 billion documents when the time comes.
 
  Any advice will be highly appreciated.
 
 
  Thanks!!! :)
  Saqib
 
 
  On Fri, Jul 5, 2013 at 6:23 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  It's a broad question, but it starts with getting a few servers,
  putting Solr 4.3.1 on it (soon 4.4), setting up Zookeeper, creating a
  Solr Collection (index) with N shards and M replicas, and reindexing
  your old data to this new cluster, which you can expand with new nodes
  over time.  If you have specific questions...
 
  Otis
  --
  Solr & ElasticSearch Support -- http://sematext.com/
  Performance Monitoring -- http://sematext.com/spm
 
 
 
  On Fri, Jul 5, 2013 at 8:42 PM, Ali, Saqib docbook@gmail.com
 wrote:
  Question regarding the 2.1 billion+ document.
 
  I understand that a single instance of solr has a limit of 2.1 billion
  documents.
 
  We currently have a single solr server. If we reach 2.1billion
 documents
  limit, what is involved in moving to the Solr DistributedSearch?
 
  Thanks! :)
 




Re: Moving from single Solr instance to Solr Cloud

2013-07-04 Thread Ali, Saqib
Hello Furkan,

We are using Solr 4.3

Thanks


On Thu, Jul 4, 2013 at 1:43 AM, Furkan KAMACI furkankam...@gmail.com wrote:

 Which version of Solr you are using?

 2013/7/4 Ali, Saqib docbook@gmail.com

  We have a single Solr instance with a lot of indexed documents. Now we would
  like to move to a SolrCloud implementation.
 
  Can we move the existing index to SolrCloud? If so, how? Or do we need to
  reindex our data in SolrCloud?
 
  Thanks,
  Saqib
 



Use case indexed=false stored=false field

2013-07-03 Thread Ali, Saqib
Hello all,


What would be the use case for such a field:

<field name="stored_on" type="tdate" indexed="false" stored="false"/>


and

<field name="summary" type="string" indexed="false" stored="false"/>


?


Thanks.


Re: Use case indexed=false stored=false field

2013-07-03 Thread Ali, Saqib
very interesting. thank you all for the explanation!!! :)


On Wed, Jul 3, 2013 at 8:32 AM, Jack Krupansky j...@basetechnology.com wrote:

 Setting both indexed and stored to false means to ignore input values for
 that field.

 The effective use case is that these fields may have values in the update
 input stream and they will be ignored. Without these field definitions,
 those same field values would cause exceptions - references to undefined
 fields. In other words, you are telling Solr that it is okay to have inputs
 for these fields - simply ignore them.

 But... you could still have update processors that look at the values of
 ignored fields and maybe assign them to other, non-ignored fields.

 -- Jack Krupansky

 -Original Message- From: Ali, Saqib
 Sent: Wednesday, July 03, 2013 11:22 AM
 To: solr-user@lucene.apache.org
 Subject: Use case indexed=false stored=false field


 Hello all,


 What would be the use case for such a field:

 <field name="stored_on" type="tdate" indexed="false" stored="false"/>


 and

 <field name="summary" type="string" indexed="false" stored="false"/>


 ?


 Thanks.



unused fields in Solr schema.xml increase the index size

2013-07-03 Thread Ali, Saqib
Hello all,

Do unused fields in Solr schema.xml increase the size of the index files?

Should we be cleaning up those fields?

Thanks.

Saqib


Moving from single Solr instance to Solr Cloud

2013-07-03 Thread Ali, Saqib
We have a single Solr instance with a lot of indexed documents. Now we would
like to move to a SolrCloud implementation.

Can we move the existing index to SolrCloud? If so, how? Or do we need to
reindex our data in SolrCloud?

Thanks,
Saqib


omitTermFreqAndPositions=true in easy English, please?

2013-07-03 Thread Ali, Saqib
Hello,

Can anyone please explain omitTermFreqAndPositions=true to me in easy
English, please?

Thanks.


Re: unused fields in Solr schema.xml increase the index size

2013-07-03 Thread Ali, Saqib
Thanks Jack! That was very helpful.


On Wed, Jul 3, 2013 at 9:54 AM, Jack Krupansky j...@basetechnology.com wrote:

 If never used, they take up zero space in the index.

 If they were used but are no longer used, they're still there, but any new
 or replaced documents will not take up any space for the unused fields
 (subject to the fact that deleted fields still exist until a
 merge/optimize compresses them away.)

 But, yes, you should try to keep your schema clean - but if the fields
 are still populated in some of the documents, you might eventually find
 some need to reference them.

 You should keep your schema and config files in a version control system
 so that you can always go back or view differences.

 -- Jack Krupansky

 -Original Message- From: Ali, Saqib
 Sent: Wednesday, July 03, 2013 11:55 AM
 To: solr-user@lucene.apache.org
 Subject: unused fields in Solr schema.xml increase the index size


 Hello all,

 Do unused fields in Solr schema.xml increase the size of the index files?

 Should we be cleaning up those fields?

 Thanks.

 Saqib



Re: Use case indexed=false stored=false field

2013-07-03 Thread Ali, Saqib
Thank you Shawn for the excellent use case. :)


On Wed, Jul 3, 2013 at 9:34 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/3/2013 9:22 AM, Ali, Saqib wrote:

 What would be the use case for such a field:

  <field name="stored_on" type="tdate" indexed="false" stored="false"/>


 and

  <field name="summary" type="string" indexed="false" stored="false"/>


 I have a field like this in my schema. That field is used as one of the
 source fields that get copied to my catchall field.  I don't need the
 field by itself, but I use it in conjunction with other fields.

 If I can get the app developers to switch over to using edismax more, I
 will get rid of the catchall field and then set that field to indexed and
 not stored.

 Thanks,
 Shawn
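A rough sketch of the use case Shawn describes, assuming a simplified model of how copyField works (the raw input value is copied to the destination before either field is indexed or stored). The function index_document and the field name catchall are hypothetical; this is not Solr code.

```python
def index_document(doc, fields, copy_fields):
    """Return (stored, indexed) views of one document.

    fields maps a field name to {"indexed": bool, "stored": bool}.
    copy_fields is a list of (source, dest) pairs, like <copyField/> entries.
    """
    # Start with the raw values supplied by the client.
    values = {name: [value] for name, value in doc.items()}
    # Copy raw source values into their destination fields.
    for source, dest in copy_fields:
        if source in doc:
            values.setdefault(dest, []).append(doc[source])
    # Keep only what each field's flags say should be kept.
    stored = {n: v for n, v in values.items() if fields[n]["stored"]}
    indexed = {n: v for n, v in values.items() if fields[n]["indexed"]}
    return stored, indexed


fields = {
    "summary": {"indexed": False, "stored": False},   # source only
    "catchall": {"indexed": True, "stored": False},   # hypothetical catchall
}
stored, indexed = index_document(
    {"summary": "quarterly report"}, fields, [("summary", "catchall")]
)
print(stored)   # nothing is stored
print(indexed)  # the summary text is searchable only through catchall
```

The summary field itself leaves no trace in the index; it exists only so that its value can be routed into the catchall field.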




Re: omitTermFreqAndPositions=true in easy English, please?

2013-07-03 Thread Ali, Saqib
Jack,

Thanks for the explanation!

We have a multi-valued field as follows:
<field name="label" type="string" indexed="true" stored="true"
multiValued="true"/>

Most of these labels are phrases of two or more words, e.g.
1) Google Reader
2) Google Mail
3) Google Cloud Storage

etc. etc.

If we add omitTermFreqAndPositions=true to this field:
<field name="label" type="string" indexed="true" stored="true"
multiValued="true" omitTermFreqAndPositions="true"/>

Will we be able to execute queries like:
label: (Google Cloud Storage) ?

Thanks.




On Wed, Jul 3, 2013 at 8:23 PM, Jack Krupansky j...@basetechnology.com wrote:

 Use it if you have a text field and simply want to be able to query whether
 individual terms are present in the text, without needing to know either how
 frequently the terms occur or whether some terms are present in phrases.
 So, you can do AND and OR for individual terms in that field, but not
 phrases, and there is no scoring difference whether a term occurs once or a
 thousand times in that field for each document. A lot less information
 needs to be stored in the index.

 -- Jack Krupansky

 -Original Message- From: Ali, Saqib
 Sent: Wednesday, July 03, 2013 10:31 PM
 To: solr-user@lucene.apache.org
 Subject: omitTermFreqAndPositions=true in easy English, please?


 Hello,

 Can anyone please explain omitTermFreqAndPositions=true to me in easy
 English, please?

 Thanks.
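
A minimal sketch of what Jack describes, assuming a toy inverted index rather than Lucene's actual index format. With frequencies and positions kept, each posting records where a term occurs; with them omitted, a posting records only that the term occurs. The function names and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs, omit_tf_and_positions):
    """Map each term to {doc_id: positions}, or {doc_id: None} when omitted."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if omit_tf_and_positions:
                index[term][doc_id] = None  # presence only
            else:
                index[term].setdefault(doc_id, []).append(pos)
    return index


def match_all(index, terms):
    """AND query: ids of documents that contain every term."""
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_phrase(index, terms):
    """Phrase query: the terms must occur at consecutive positions."""
    hits = set()
    for doc_id in match_all(index, terms):
        starts = index[terms[0]][doc_id]
        if starts is None:
            raise ValueError("positions were omitted; phrase queries need them")
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.add(doc_id)
    return hits


docs = {1: "cloud storage from google", 2: "storage in the cloud"}
full = build_index(docs, omit_tf_and_positions=False)
slim = build_index(docs, omit_tf_and_positions=True)

print(match_all(slim, ["cloud", "storage"]))     # AND still works
print(match_phrase(full, ["cloud", "storage"]))  # phrase needs positions
```

The slim index answers AND and OR queries exactly as the full one does, but it can neither run the phrase query nor tell how many times a term occurs in a document, which is why it is smaller.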



Re: omitTermFreqAndPositions=true in easy English, please?

2013-07-03 Thread Ali, Saqib
So do I have to change my query to
label: (Google Cloud Storage) ?

or will Solr add AND / OR behind the scenes?


On Wed, Jul 3, 2013 at 9:54 PM, Jack Krupansky j...@basetechnology.com wrote:

 Yes, but it is simply doing an AND or OR of the individual terms - no
 phrases or implied ordering of the terms.


 -- Jack Krupansky




Re: omitTermFreqAndPositions=true in easy English, please?

2013-07-03 Thread Ali, Saqib
sorry change the query to:
label:  (Google AND Cloud AND Storage)

or will Solr add AND / OR behind the scenes?





Re: omitTermFreqAndPositions=true in easy English, please?

2013-07-03 Thread Ali, Saqib
So in this case, since the field type is string, does adding
omitTermFreqAndPositions=true really help in reducing the index size?

<field name="label" type="string" indexed="true" stored="true"
multiValued="true"/>



On Wed, Jul 3, 2013 at 10:00 PM, Jack Krupansky j...@basetechnology.com wrote:

 Oops... I wasn't reading carefully enough - frequencies and positions only
 relate to tokenized fields (text) - not string fields.

 That doesn't impact your ability to do AND and OR of discrete string terms
 of a multivalued string field.

 -- Jack Krupansky
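
A small sketch of the distinction Jack draws here, assuming a simplified model of analysis: a string field indexes each whole value as a single term, while a tokenized text field splits the value into words. The helper names are hypothetical, and the text analyzer is reduced to lowercasing and whitespace splitting.

```python
def analyze_string(value):
    """string field: the entire value is one term, unchanged."""
    return [value]


def analyze_text(value):
    """Simplified text field: lowercase, then split on whitespace."""
    return value.lower().split()


labels = ["Google Reader", "Google Mail", "Google Cloud Storage"]

string_terms = {t for v in labels for t in analyze_string(v)}
text_terms = {t for v in labels for t in analyze_text(v)}

# On the string field only complete label values exist as terms.
print("Google Cloud Storage" in string_terms)
print("Google" in string_terms)
# On the text field the individual words exist as terms.
print("google" in text_terms)
```

Under this model, each value of the multivalued string field has exactly one term at one position, so there are no per-word frequencies or positions to omit. It also suggests that a query for an individual word such as Google would only match a label whose entire value is Google; matching a full label would mean querying the whole value, for example as a quoted string.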





copyField and storage requirements

2013-07-02 Thread Ali, Saqib
Newbie question:

We have the following fields defined in the schema:

<field name="content" type="text_general" indexed="true" stored="false"/>
<field name="teaser" type="text_general" indexed="false" stored="true"/>
<copyField source="content" dest="teaser" maxChars="80"/>

The content field holds about 500KB of data.

My question is whether Solr stores the entire contents of that 500KB
content field?

We want to minimize the stored data in the Solr index, that is why we added
the copyField teaser.

Thanks
Saqib


Re: copyField and storage requirements

2013-07-02 Thread Ali, Saqib
Thanks Shawn.

Here is the text_general type definition. We would like to bring the
storage requirement down to a minimum for those 500KB content documents. We
just need basic full-text search.

Thanks!!! :)




<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>



On Tue, Jul 2, 2013 at 11:35 AM, Shawn Heisey s...@elyograg.org wrote:

 On 7/2/2013 12:22 PM, Ali, Saqib wrote:
  Newbie question:
 
  We have the following fields defined in the schema:
 
  <field name="content" type="text_general" indexed="true" stored="false"/>
  <field name="teaser" type="text_general" indexed="false" stored="true"/>
  <copyField source="content" dest="teaser" maxChars="80"/>
 
  The content field holds about 500KB of data.
 
  My question is whether Solr stores the entire contents of that 500KB
  content field?
 
  We want to minimize the stored data in the Solr index, that is why we
  added the copyField teaser.

 With that config, the entire 500KB will not be _stored_ .. but it will
 affect the index size because you are indexing it.  Exactly what degree
 that will be depends on the definition of the text_general type.

 Thanks,
 Shawn
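
A minimal sketch of the effect of maxChars in the config above, assuming a simplified model in which copyField copies the raw source value and truncates it to maxChars characters. The function apply_copy_field is hypothetical and only mimics the behaviour; it is not Solr code.

```python
def apply_copy_field(doc, source, dest, max_chars=None):
    """Copy doc[source] into dest, truncated to max_chars when given."""
    out = dict(doc)
    if source in doc:
        value = doc[source]
        out[dest] = value if max_chars is None else value[:max_chars]
    return out


# Roughly 500KB of single-byte text standing in for the content field.
doc = {"content": "x" * 500_000}
result = apply_copy_field(doc, "content", "teaser", max_chars=80)

# Only teaser is declared stored="true", so only its 80 characters are
# kept as stored data; content is indexed but not stored.
print(len(result["content"]), len(result["teaser"]))
```

In this model the stored data per document shrinks to the 80-character teaser, while the full content still contributes to the index size through its indexed terms, as Shawn notes.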




Storing Solr Index on NFS

2013-04-15 Thread Ali, Saqib
Greetings,

Are there any issues with storing Solr Indexes on a NFS share? Also any
recommendations for using NFS for Solr indexes?

Thanks,
Saqib


Re: Storing Solr Index on NFS

2013-04-15 Thread Ali, Saqib
Hello Walter,

Thanks for the response. That has been my experience in the past as well.
But I was wondering if there are new things in Solr 4 and NFS 4.1 that make
the storing of indexes on a NFS mount feasible.

Thanks,
Saqib


On Mon, Apr 15, 2013 at 9:47 AM, Walter Underwood wun...@wunderwood.org wrote:

 On Apr 15, 2013, at 9:40 AM, Ali, Saqib wrote:

  Greetings,
 
  Are there any issues with storing Solr Indexes on a NFS share? Also any
  recommendations for using NFS for Solr indexes?

 I recommend that you do not put Solr indexes on NFS.

 It can be very slow; I measured indexing as 100X slower on NFS a few years
 ago.

 It is not safe to share Solr index files between two Solr servers, so
 there is no benefit to NFS.

 wunder
 --
 Walter Underwood
 wun...@wunderwood.org






Re: secure deployment of solr.war on jboss

2013-04-01 Thread Ali, Saqib
Thanks. Are you using an iptables firewall on the JBoss host to prevent
access from other systems? Or are you using some JBoss configuration for that?

Thanks,
Saqib


On Mon, Apr 1, 2013 at 6:25 AM, adityab aditya_ba...@yahoo.com wrote:

 Hi Ali,

 We have Solr 4.2 on JBoss running on a separate VM behind a firewall. Only IT
 Administration and our FrontEnd Application Server are able to access the
 Solr servers in production.




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/secure-deployment-of-solr-war-on-jboss-tp4052754p4052899.html
 Sent from the Solr - User mailing list archive at Nabble.com.



secure deployment of solr.war on jboss

2013-03-31 Thread Ali, Saqib
Hello all,

We are using Apache Solr 4.2 in our application to provide search
capabilities. We are deploying the solr.war file to jboss along with our
application.

Any suggestions on proper security controls for this type of solr setup?

Also, Solr is now accessible to everyone from the
http://jboss_host/solr URL. How can we prevent /solr/ from being accessible
by all IP addresses? We would like to restrict access to certain IP addresses,
namely the jboss_host and a couple of other management API hosts.

Any help will be much appreciated.

Thanks,
Saqib


Re: What is the graceful shutdown API for Solrj embedded?

2013-02-07 Thread Ali, Saqib
Hello Alex,

I asked a similar question on server fault:
http://serverfault.com/a/474442/156440


On Wed, Feb 6, 2013 at 7:05 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Hello,

 When I CTRL-C the example Solr, it prints a bunch of graceful shutdown
 messages.  I assume it shuts down safe and without corruption issues.

 When I do that to Solrj (embedded, not remote), it just drops dead.

 I found CoreContainer.shutdown(), which looks about right and does
 terminate Solrj but it prints out a completely different set of messages.

 Is CoreContainer.shutdown() the right method for Solrj (4.1)? Is there more
 than just one call?

 And what happens if you just Ctrl-C a Solrj instance? The wiki says nothing
 about shutdown, so I can imagine a lot of people probably think it is ok to
 just kill it. Is there a danger of corruption?

 Regards,
 Alex.
 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



Re: Configuring the jetty shipped with Solr

2013-02-05 Thread Ali, Saqib
Thanks Alex.

I was able to bind jetty to 127.0.0.1 so that it only accepts connections
from localhost using the following:
<Set name="host"><SystemProperty name="jetty.host" default="127.0.0.1"/></Set>
But how do I set it so that it can accept connections from certain
non-localhost IP addresses as well?

Thanks.
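
For illustration only, here is a sketch of the allowlist logic being asked about. The host setting above only chooses which interface Jetty listens on; admitting a few remote addresses would need some filter in front of Solr (a firewall rule or similar) that applies a check like this one. The networks listed and the function is_allowed are hypothetical, and nothing here is Jetty or Solr configuration.

```python
import ipaddress

# Hypothetical allowlist: localhost plus one management subnet.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("10.0.0.0/24"),
]


def is_allowed(client_ip, allowed=ALLOWED_NETWORKS):
    """Return True if client_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowed)


print(is_allowed("127.0.0.1"))    # local connections pass
print(is_allowed("10.0.0.17"))    # a host on the management subnet passes
print(is_allowed("192.168.1.5"))  # anything else is rejected
```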



On Mon, Feb 4, 2013 at 5:06 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 I believe that, in the example directory (that is, relative to start.jar),
 the contexts directory has the URL mapping for Solr (/solr), the etc
 directory has some global Jetty properties, and solr-webapp/webapp/WEB-INF
 contains some Solr-specific Jetty configuration.

 Beware, however, that the last one is a decompressed version of
 webapps/solr.war. I don't know whether it ever gets overridden after the
 first time it is decompressed.

 No idea where the actual IP address directive is, though.

 Regards,
Alex.

 Personal blog: http://blog.outerthoughts.com/
 LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
 - Time is the quality of nature that keeps events from happening all at
 once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


 On Mon, Feb 4, 2013 at 6:41 PM, Ali, Saqib docbook@gmail.com wrote:

  Hello all,
 
  How do I change the configuration for the Jetty that is shipped with
 Apache
  Solr? Where are the configuration files located? I want to restrict the
 IP
  address that can connect to that instance of Solr
 
  Thanks,
  Saqib
 



Configuring the jetty shipped with Solr

2013-02-04 Thread Ali, Saqib
Hello all,

How do I change the configuration for the Jetty that is shipped with Apache
Solr? Where are the configuration files located? I want to restrict the IP
address that can connect to that instance of Solr

Thanks,
Saqib