Re: Ability to load solrcore.properties from zookeeper

2015-05-30 Thread Steve Davids
Sorry for not responding back earlier, I went ahead and created a ticket
here:

https://issues.apache.org/jira/browse/SOLR-7613

It does look somewhat trivial if you just update the current loading
mechanism as Chris describes, I can provide a patch for that if you want.
Though, if you want to go the refactoring route I can leave it to Alan to
take a crack at it.

Thanks,

-Steve

On Fri, May 29, 2015 at 3:29 AM, Alan Woodward a...@flax.co.uk wrote:

 Yeah, you could do it like that.  But looking at it further, I think
 solrcore.properties is actually being loaded in entirely the wrong place -
 it should be done by whatever is creating the CoreDescriptor, and then
 passed in as a Properties object to the CD constructor.  At the moment, you
 can't refer to a property defined in solrcore.properties within your
 core.properties file.

 I'll open a JIRA if Steve hasn't already done so

 Alan Woodward
 www.flax.co.uk


 On 28 May 2015, at 17:57, Chris Hostetter wrote:

 
  : certainly didn't intend to write it like this!).  The problem here will
  : be that CoreDescriptors are currently built entirely from
  : core.properties files, and the CoreLocators that construct them don't
  : have any access to zookeeper.
 
  But they do have access to the CoreContainer which is passed to the
  CoreDescriptor constructor -- it has all the ZK access you'd need at the
  time when loadExtraProperties() is called.
 
  correct?
 
  as fleshed out in my last email...
 
  :  patch:  IIUC CoreDescriptor.loadExtraProperties is the relevant method ...
  :  it would need to build up the path including the core name and get the
  :  system level resource loader (CoreContainer.getResourceLoader()) to access
  :  it since the core doesn't exist yet so there is no core level
  :  ResourceLoader to use.
 
 
  -Hoss
  http://www.lucidworks.com/
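
For illustration, here is a rough sketch of the loading change Chris describes:
read solrcore.properties through the container-level resource loader, since no
core-level loader exists yet at that point. This is hypothetical code, not the
patch on SOLR-7613; the property-file path and the error handling are
assumptions.

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.solr.core.CoreContainer;

// Hypothetical sketch only; not the actual SOLR-7613 patch.
public class ExtraPropertiesSketch {

    /** Load coreName/conf/solrcore.properties via the container's resource loader. */
    public static Properties loadExtraProperties(CoreContainer container, String coreName) {
        Properties props = new Properties();
        String path = coreName + "/conf/solrcore.properties";   // assumed layout
        try (InputStream in = container.getResourceLoader().openResource(path)) {
            props.load(in);
        } catch (IOException e) {
            // no solrcore.properties present (or unreadable): fall back to empty properties
        }
        return props;
    }
}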




Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
Thanks Toke for the input.

I think the plan is to facet only on class_u1, class_u2 for queries from
user1, etc. So faceting would not happen on all fields on a single query.
But still.

I did not design the schema, I just found out about the number of fields and
advised against it when they asked for a second opinion. We did not get to
discuss a different schema, but if we get to that point I will take this plan
into consideration for sure.

xavi

On Sat, May 30, 2015 at 10:17 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 xavi jmlucjav jmluc...@gmail.com wrote:
  The reason for such a large number of fields:
  - users create dynamically 'classes' of documents, say one user creates
 10
  classes on average
  - for each 'class', the fields are created like this:
 unique_id_+fieldname
  - there are potentially hundreds of thousands of users.

 Switch to a scheme where you control the names of fields outside of Solr,
 but share the fields internally:

 User 1 has 10 custom classes: u1_a, u1_b, u1_c, ... u1_j
 Internally they are mapped to class1, class2, class3, ... class10

 User 2 uses 2 classes: u2_horses, u2_elephants
 Internally they are mapped to class1, class2

 When User 2 queries field u2_horses, you rewrite the query to use class1
 instead.

  There is faceting in each users' fields.
  So this will result in 1M fields, very sparsely populated.

 If you are faceting on all of them and if you are not using DocValues,
 this will explode your memory requirements with vanilla Solr: UnInverted
 faceting maintains a separate map from all documentIDs to field values
 (ordinals for Strings) for _all_ the facet fields. Even if you only had 10
 million documents and even if your 1 million facet fields all had just 1
 value, represented by 1 bit, it would still require 10M * 1M * 1 bits in
 memory, which is 10 terabits, or roughly 1.25 terabytes, of RAM.

 - Toke Eskildsen



Re: Deleting Fields

2015-05-30 Thread Erick Erickson
Faceting on very high cardinality fields can use up memory, no doubt
about that. I think the entire delete question was a red herring, but
you know that already ;)

So I think you can forget about the delete stuff. Although do note
that if you do re-index your old documents, the new version won't have
the field, and as segments are merged the deleted documents will have
all their resources reclaimed, effectively deleting the field from the
old docs. So you could gradually re-index your corpus and get this
stuff out of there.

Best,
Erick

On Sat, May 30, 2015 at 5:18 AM, Joseph Obernberger
j...@lovehorsepower.com wrote:
 Thank you Erick.  I was thinking that it actually went through and removed
 the index data; thank you for the clarification.  What happened was I had
 some bad data that created a lot of fields (some 8000).  I was getting some
 errors adding new fields where Solr could not talk to ZooKeeper, and I
 thought it may be because there are so many fields.  The index size is some
 420 million docs.
 I'm hesitant to try to re-create as when the shards crash, they leave a
 write.lock file in HDFS, and I need to manually delete that file (on 27
 machines) before bringing them back up.
 I believe this is the stack trace - but this looks to be related to facets,
 and I'm not 100% sure that this is the correct trace!  Sorry - if it
 happens again I will update.

 ERROR - 2015-05-29 20:39:34.707; [UNCLASS shard9 core_node14 UNCLASS]
 org.apache.solr.common.SolrException; null:java.lang.RuntimeException:
 java.lang.OutOfMemoryError: unable to create new native thread
 at
 org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:463)
 at
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
 at
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at
 org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
 at
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
 at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
 at
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:714)
 at
 java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
 at
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
 at
 org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:637)
 at
 org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:280)
 at
 org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:106)
 at
 org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:222)
 at
 

Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread Erick Erickson
Nothing's really changed in that area lately. Your co-worker is
perhaps confusing the statement that Solr has no a-priori limit on
the number of distinct fields that can be in a corpus with supporting
an infinite number of fields. Not having a built-in limit is much
different than supporting

Whether Solr breaks with thousands and thousands of fields is pretty
dependent on what you _do_ with those fields. Simply doing keyword
searches isn't going to put the same memory pressure on as, say,
faceting on them all (even if in different queries).

I'd really ask why so many fields are necessary though.

Best,
Erick

On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com wrote:
 Hi guys,

 someone I work with has been advised that currently Solr can support
 'infinite' number of fields.

 I thought there was a practical limitation of say thousands of fields (for
  sure less than a million), or things can start to break (I think I
  remember seeing memory issues reported on the mailing list by several
 people).


 Was there any change I missed lately that makes having say 1M fields in
 Solr practical??

 thanks


Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread Toke Eskildsen
xavi jmlucjav jmluc...@gmail.com wrote:
 The reason for such a large number of fields:
 - users create dynamically 'classes' of documents, say one user creates 10
 classes on average
 - for each 'class', the fields are created like this: unique_id_+fieldname
 - there are potentially hundreds of thousands of users.

Switch to a scheme where you control the names of fields outside of Solr, but 
share the fields internally:

User 1 has 10 custom classes: u1_a, u1_b, u1_c, ... u1_j
Internally they are mapped to class1, class2, class3, ... class10

User 2 uses 2 classes: u2_horses, u2_elephants
Internally they are mapped to class1, class2

When User 2 queries field u2_horses, you rewrite the query to use class1 
instead.

 There is faceting in each users' fields.
 So this will result in 1M fields, very sparsely populated.

If you are faceting on all of them and if you are not using DocValues, this 
will explode your memory requirements with vanilla Solr: UnInverted faceting 
maintains a separate map from all documentIDs to field values (ordinals for 
Strings) for _all_ the facet fields. Even if you only had 10 million documents 
and even if your 1 million facet fields all had just 1 value, represented by 1 
bit, it would still require 10M * 1M * 1 bits in memory, which is 10 terabits, 
or roughly 1.25 terabytes, of RAM.

- Toke Eskildsen
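
As an illustration of the mapping layer Toke describes, here is a minimal
sketch. The class, method and field names are made up, and the mapping table
would live in your application (or a small external store), not in Solr.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of aliasing per-user field names to a shared pool of internal fields.
// Illustrative only: persistence, concurrency and error handling are omitted.
public class FieldAliasSketch {

    // per-user mapping: user-visible field name -> shared internal field name
    private final Map<String, Map<String, String>> aliases = new HashMap<>();

    /** Register a user field, assigning the next free internal field (class1, class2, ...). */
    public String register(String userId, String userField) {
        Map<String, String> userMap = aliases.computeIfAbsent(userId, k -> new HashMap<>());
        String internal = userMap.get(userField);
        if (internal == null) {
            internal = "class" + (userMap.size() + 1);   // next free shared field name
            userMap.put(userField, internal);
        }
        return internal;
    }

    /** Rewrite a user-visible field name (in a query or document) to the internal name. */
    public String resolve(String userId, String userField) {
        return aliases.getOrDefault(userId, Collections.<String, String>emptyMap()).get(userField);
    }

    public static void main(String[] args) {
        FieldAliasSketch sketch = new FieldAliasSketch();
        sketch.register("u2", "u2_horses");      // -> class1
        sketch.register("u2", "u2_elephants");   // -> class2
        // A query on u2_horses is rewritten to use the shared internal field:
        System.out.println(sketch.resolve("u2", "u2_horses"));   // prints class1
    }
}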


Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread Jack Krupansky
Anything more than a few hundred seems very suspicious.

Anything more than a few dozen or 50 or 75 seems suspicious as well.

The point should not be how crazy can you get with Solr, but that craziness
should be avoided altogether!

Solr's design is optimal for a large number of relatively small documents,
not large documents.


-- Jack Krupansky

On Sat, May 30, 2015 at 3:05 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Nothing's really changed in that area lately. Your co-worker is
 perhaps confusing the statement that Solr has no a-priori limit on
 the number of distinct fields that can be in a corpus with supporting
 an infinite number of fields. Not having a built-in limit is much
 different than supporting

 Whether Solr breaks with thousands and thousands of fields is pretty
 dependent on what you _do_ with those fields. Simply doing keyword
 searches isn't going to put the same memory pressure on as, say,
 faceting on them all (even if in different queries).

 I'd really ask why so many fields are necessary though.

 Best,
 Erick

 On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com wrote:
  Hi guys,
 
  someone I work with has been advised that currently Solr can support
  'infinite' number of fields.
 
  I thought there was a practical limitation of say thousands of fields
 (for
  sure less than a million), or things can start to break (I think I
   remember seeing memory issues reported on the mailing list by several
  people).
 
 
  Was there any change I missed lately that makes having say 1M fields in
  Solr practical??
 
  thanks



Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread Toke Eskildsen
xavi jmlucjav jmluc...@gmail.com wrote:
 I think the plan is to facet only on class_u1, class_u2 for queries from
 user1, etc. So faceting would not happen on all fields on a single query.

I understand that, but most of the created structures stay in memory between 
calls (DocValues helps here). Your heap will slowly fill up as more and more 
users perform faceted queries on their content.

- Toke Eskildsen


Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
The reason for such a large number of fields:
- users create dynamically 'classes' of documents, say one user creates 10
classes on average
- for each 'class', the fields are created like this: unique_id_+fieldname
- there are potentially hundreds of thousands of users.

There is faceting in each users' fields.

So this will result in 1M fields, very sparsely populated. I warned them
this did not sound like a good design to me, but apparently someone very
knowledgeable in solr said this will work out fine. That is why I wanted to
double check...

On Sat, May 30, 2015 at 9:22 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 Anything more than a few hundred seems very suspicious.

 Anything more than a few dozen or 50 or 75 seems suspicious as well.

 The point should not be how crazy can you get with Solr, but that craziness
 should be avoided altogether!

 Solr's design is optimal for a large number of relatively small documents,
 not large documents.


 -- Jack Krupansky

 On Sat, May 30, 2015 at 3:05 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Nothing's really changed in that area lately. Your co-worker is
  perhaps confusing the statement that Solr has no a-priori limit on
  the number of distinct fields that can be in a corpus with supporting
  an infinite number of fields. Not having a built-in limit is much
  different than supporting
 
  Whether Solr breaks with thousands and thousands of fields is pretty
  dependent on what you _do_ with those fields. Simply doing keyword
  searches isn't going to put the same memory pressure on as, say,
  faceting on them all (even if in different queries).
 
  I'd really ask why so many fields are necessary though.
 
  Best,
  Erick
 
  On Sat, May 30, 2015 at 6:18 AM, xavi jmlucjav jmluc...@gmail.com
 wrote:
   Hi guys,
  
   someone I work with has been advised that currently Solr can support
   'infinite' number of fields.
  
   I thought there was a practical limitation of say thousands of fields
  (for
   sure less than a million), or things can start to break (I think I
    remember seeing memory issues reported on the mailing list by several
   people).
  
  
   Was there any change I missed lately that makes having say 1M fields in
   Solr practical??
  
   thanks
 



Re: any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
On Sat, May 30, 2015 at 11:15 PM, Toke Eskildsen t...@statsbiblioteket.dk
wrote:

 xavi jmlucjav jmluc...@gmail.com wrote:
  I think the plan is to facet only on class_u1, class_u2 for queries from
  user1, etc. So faceting would not happen on all fields on a single query.

  I understand that, but most of the created structures stay in memory
 between calls (DocValues helps here). Your heap will slowly fill up as more
 and more users perform faceted queries on their content.

got it...priceless info, thanks!



 - Toke Eskildsen



Re: Storing positions and offsets vs FieldType IndexOptions DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS

2015-05-30 Thread Andy Lee
I also ran into the same problem. Could you tell me why? Thanks





Issues in compiling Solr code.

2015-05-30 Thread Aniket Kumar
Hi All,


I am trying to build and compile Solr . I have been following the below
link .

https://wiki.apache.org/solr/HowToCompileSolr

I have taken the latest version of the code and have run the ant clean compile
command, followed by ant dist. Both steps were successful, but no war
was created (as per the wiki, a war should be created and then copied to the
...\solr\example\webapps folder). I cannot find any webapps folder under the
example folder either!


Can someone please help me and let me know what steps should be taken to
bring up the server, and whether I am missing something here.

In case , there's an updated wiki link, please redirect me to the same.


Appreciate the help.



Cheers,
Aniket


Optimal FS block size for small documents in Solr?

2015-05-30 Thread Gili Nachum
Hi, What would be an optimal FS block size to use?

Using Solr 4.7.2, I have a RAID-5 array of SSD drives currently configured with
a 128KB block size.
Can I expect better indexing/query time performance with a smaller block
size (say 8K)?
Considering my documents are almost always smaller than 8K.
I assume all stored fields would fit into one block which is good, but what
will Lucene prefer for reading a long posting list and other data
structures?

Any rules of thumb or anyone that had experimented on this?


Re: Deleting Fields

2015-05-30 Thread Joseph Obernberger
Thank you Erick.  I was thinking that it actually went through and 
removed the index data; thank you for the clarification.  What happened 
was I had some bad data that created a lot of fields (some 8000).  I was 
getting some errors adding new fields where Solr could not talk to 
ZooKeeper, and I thought it may be because there are so many fields.  
The index size is some 420 million docs.
I'm hesitant to try to re-create as when the shards crash, they leave a 
write.lock file in HDFS, and I need to manually delete that file (on 27 
machines) before bringing them back up.
I believe this is the stack trace - but this looks to be related to 
facets, and I'm not 100% sure that this is the correct trace!  Sorry - 
if it happens again I will update.


ERROR - 2015-05-29 20:39:34.707; [UNCLASS shard9 core_node14 UNCLASS] 
org.apache.solr.common.SolrException; null:java.lang.RuntimeException: 
java.lang.OutOfMemoryError: unable to create new native thread
at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:463)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)

at org.eclipse.jetty.server.Server.handle(Server.java:368)
at 
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at 
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at 
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)

at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
at 
org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at 
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at 
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at 
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at 
org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:637)
at 
org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:280)
at 
org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:106)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:222)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1984)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:829)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:446)

... 26 more

Then later:

ERROR - 2015-05-29 21:57:22.370; [UNCLASS shard9 core_node14 UNCLASS] 
org.apache.solr.common.SolrException; null:java.lang.RuntimeException: 
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:854)
at 

Re: How To: Debuging the whole indexing process

2015-05-30 Thread Aman Tandon
Please help me here

With Regards
Aman Tandon

On Sat, May 30, 2015 at 12:43 AM, Aman Tandon amantandon...@gmail.com
wrote:

 Thanks Alex, yes it is for my testing, to understand the code/process flow
 actually.

 Any other ideas.

 With Regards
 Aman Tandon

 On Fri, May 29, 2015 at 12:48 PM, Alexandre Rafalovitch 
 arafa...@gmail.com wrote:

 In production or in test? I assume in test.

 This level of detail usually implies some sort of Java debugger and java
 instrumentation enabled. E.g. Chronon, which is commercial but can be
 tried
 as a plugin with IntelliJ Idea full version trial.

 Regards,
 Alex
 On 29 May 2015 4:38 pm, Aman Tandon amantandon...@gmail.com wrote:

  Hi,
 
   I want to debug the whole indexing process, the life cycle of indexing
   (each and every function call, going from function to function), from the
   posting of the data.xml to the creation of the various index files (_fnm,
   _fdt, etc.). So how/what should I set up and start? Please help; I will
   be thankful to you.
 
 
 
  
  
    <add>
      <doc>
        <field name="title"><![CDATA[Aman Tandon]]></field>
        <field name="job_role"><![CDATA[Search Engineer]]></field>
      </doc>
    </add>
 
 
  With Regards
  Aman Tandon
 





Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-30 Thread Matteo Grolla
Wow,
thanks both for the suggestions

Erick: good point about the uneven shard load
I'm not worried about the growth of a particular shard; in that case I'd use 
shard splitting and, if necessary, add a server to the cluster
but even if I manage to spread docs of typeA producer evenly on the 
cluster I could have an uneven query distribution (the two problems are very 
similar)
at time t I could have a shard queried by 11 type A producers 
while another shard is being queried by a single type A producer, not ideal
So I could use few bits (0 or 1) of the composite id for typeA 
producer's docs to avoid those kinds of problems

For typeB and typeC producers the problems discussed above seem unlikely, so 
I'd like to weigh the pros and cons of sharding on userid
pros
I'm reducing the size of the problem, instead of searching across the 
whole repository I'm searching only a part of it
cons
I could have uneven distribution of documents and queries across the 
cluster (unlikely, there are lots of users of typeB, typeC)
docs for one user aren't searched in parallel using more shards
this could be useful if one user produces so many docs that it would 
benefit from sharding (should happen only for typeA)

I think the pro is appealing; under this hypothesis, if users of type B, C 
increase I can scale the system without many concerns

Do you agree? 


Il giorno 29/mag/2015, alle ore 20:18, Reitzel, Charles ha scritto:

 Thanks, Erick.   I appreciate the sanity check.
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Thursday, May 28, 2015 5:50 PM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality using 
 compositeId to enable shard splitting
 
 Charles:
 
 You raise good points, and I didn't mean to say that co-locating docs due to 
 some criteria was never a good idea. That said, it does add administrative 
 complexity that I'd prefer to avoid unless necessary.
 
 I suppose it largely depends on what the load and response SLAs are.
 If there's 1 query/second peak load, the sharding overhead for queries is 
 probably not noticeable. If there are 1,000 QPS, then it might be worth it.
 
 Measure, measure, measure..
 
 I think your composite ID understanding is fine.
 
 Best,
 Erick
 
 On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles 
 charles.reit...@tiaa-cref.org wrote:
 We have used a similar sharding strategy for exactly the reasons you say.   
 But we are fairly certain that the # of documents per user ID is < 5000 and, 
 typically, < 500.   Thus, we think the overhead of distributed searches 
 clearly outweighs the benefits.   Would you agree?   We have done some load 
 testing (with 100's of simultaneous users) and performance has been good 
 with data and queries distributed evenly across shards.
 
 In Matteo's case, this model appears to apply well to user types B and C.
 Not sure about user type A, though.At  100,000 docs per user per year, 
 on average, that load seems ok for one node.   But, is it enough to benefit 
 significantly from a parallel search?
 
 With a 2 part composite ID, each part will contribute 16 bits to a 32 bit 
 hash value, which is then compared to the set of hash ranges for each active 
 shard.   Since the user ID will contribute the high-order bytes, it will 
 dominate in matching the target shard(s).   But dominance doesn't mean the 
 lower order 16 bits will always be ignored, does it?   I.e. if the original 
 shard has been split, perhaps multiple times, isn't it possible that one 
 user ID's documents will be spread over multiple shards?
 
 In Matteo's case, it might make sense to specify fewer bits to the user ID 
 for user category A.   I.e. what I described above is the default for 
 userId!docId.   But if you use userId/8!docId/24 (8 bits for userId and 24 
 bits for the document ID), then couldn't one user's docs be split over 
 multiple shards, even without splitting?
 
 I'm just making sure I understand how composite ID sharding works correctly. 
   Have I got it right?  Has any of this logic changed in 5.x?
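
 For reference, a rough sketch of how a two-part composite ID maps onto the
 32-bit routing hash. This is a simplification of the idea, not the actual
 CompositeIdRouter code; Solr uses MurmurHash3 for the per-part hashes, and a
 plain String.hashCode() stands in for it here.

 // Simplified illustration of composite-ID routing: high bits from the shard key,
 // remaining bits from the document id. Not Solr's actual CompositeIdRouter code.
 public class CompositeIdSketch {

     static int routeHash(String shardKey, String docId, int keyBits) {
         int keyMask = (keyBits == 0) ? 0 : -1 << (32 - keyBits);   // top keyBits bits
         int docMask = (keyBits == 32) ? 0 : -1 >>> keyBits;        // remaining low bits
         return (hash(shardKey) & keyMask) | (hash(docId) & docMask);
     }

     static int hash(String s) {
         return s.hashCode();   // stand-in; Solr actually uses MurmurHash3
     }

     public static void main(String[] args) {
         // "userId!docId" with the default split: 16 bits from the user id, 16 from the doc id.
         System.out.printf("user1!doc42   -> %08x%n", routeHash("user1", "doc42", 16));
         // "userId/8!docId": only the top 8 bits come from the user id, so one user's
         // documents spread over a wider slice of the hash ring and can land on
         // several shards, depending on how the shard ranges divide that slice.
         System.out.printf("user1/8!doc42 -> %08x%n", routeHash("user1", "doc42", 8));
     }
 }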
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, May 21, 2015 11:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality 
 using compositeId to enable shard splitting
 
 I question your base assumption:
 
 bq: So shard by document producer seems a good choice
 
 Because what this _also_ does is force all of the work for a query onto one 
 node and all indexing for a particular producer ditto. And will cause you to 
 manually monitor your shards to see if some of them grow out of proportion 
 to others. And
 
 I think it would be much less hassle to just let Solr distribute the docs as 
 it may based on the uniqueKey and forget about it. Unless you want, say, to 
 do joins etc There 

Unsubscribe me

2015-05-30 Thread Neha Jatav
Unsubscribe me


Re: Unsubscribe me

2015-05-30 Thread François Schiettecatte
Quoting Erik from two days ago:

Please follow the instructions here:

http://lucene.apache.org/solr/resources.html. Be sure to use the exact same 
e-mail you used to subscribe.


 On May 30, 2015, at 6:07 AM, Lalit Kumar 4 lkum...@sapient.com wrote:
 
 Please unsubscribe me as well
 
 On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote:
 Unsubscribe me



Re: Unsubscribe me

2015-05-30 Thread Lalit Kumar 4
Please unsubscribe me as well

On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote:
Unsubscribe me


Re: Deleting Fields

2015-05-30 Thread Steve Rowe
Hi Joseph,

 On May 30, 2015, at 8:18 AM, Joseph Obernberger j...@lovehorsepower.com 
 wrote:
 
 Thank you Erick.  I was thinking that it actually went through and removed 
 the index data; thank you for the clarification.

I added more info to the Schema API page about this not being true.  Here’s 
what I’ve got so far - let me know if you think we should add more warnings 
about this:

-
Re-index after schema modifications!

If you modify your schema, you will likely need to re-index all documents. If 
you do not, you may lose access to documents, or not be able to interpret them 
properly, e.g. after replacing a field type.

Modifying your schema will never modify any documents that are already indexed. 
Again, you must re-index documents in order to apply schema changes to them.

[…]

When modifying the schema with the API, a core reload will automatically occur 
in order for the changes to be available immediately for documents indexed 
thereafter.  Previously indexed documents will not be automatically handled - 
they must be re-indexed if they used schema elements that you changed.
-

Steve

any changes about limitations on huge number of fields lately?

2015-05-30 Thread xavi jmlucjav
Hi guys,

someone I work with has been advised that currently Solr can support
'infinite' number of fields.

I thought there was a practical limitation of say thousands of fields (for
sure less than a million), or things can start to break (I think I
remember seeing memory issues reported on the mailing list by several
people).


Was there any change I missed lately that makes having say 1M fields in
Solr practical??

thanks


Re: Issues in compiling Solr code.

2015-05-30 Thread Shawn Heisey
On 5/30/2015 1:59 AM, Aniket Kumar wrote:
 Hi All,
 
 
 I am trying to build and compile Solr . I have been following the below
 link .
 
 https://wiki.apache.org/solr/HowToCompileSolr
 
 I have taken the latest version of the code and have run the ant clean compile
 command, followed by ant dist. Both steps were successful, but no war
 was created (as per the wiki, a war should be created and then copied to the
 ...\solr\example\webapps folder). I cannot find any webapps folder under the
 example folder either!

As of Solr 5.0, things have changed.  Solr is still a war, but the war
is embedded in the example, and the example has changed quite a lot.
Run "ant example" instead of "ant dist" and then you will find that you
can start Solr by typing "bin/solr start" or "bin\solr start" on
Windows.  You might also consider typing "ant package", which will create
entire .zip and .tgz packages similar to what you download from the website.

Here's the beginnings of a discussion about why we are changing the
packaging:

https://wiki.apache.org/solr/WhyNoWar

Our startup infrastructure and documentation still need work, especially
in the parts describing how to take Solr to production.  When I find
some free time, I will be doing what I can to make this better.

Thanks,
Shawn



Re: docValues: Can we apply synonym

2015-05-30 Thread Upayavira
What I'm suggesting is that you have two fields, one for searching, one
for faceting.

You may find you can't use docValues for your field type, in which case
Solr will just use caches to improve faceting performance.

Upayavira
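
A minimal schema sketch of that two-field setup. The field and type names are
only examples (text_general and string are the stock example types), not a
prescription for this particular schema.

<!-- analyzed field used for searching -->
<field name="city" type="text_general" indexed="true" stored="true"/>

<!-- un-analyzed copy used for faceting; docValues requires a string/primitive type -->
<field name="city_facet" type="string" indexed="true" stored="false" docValues="true"/>

<copyField source="city" dest="city_facet"/>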

On Sat, May 30, 2015, at 01:50 AM, Aman Tandon wrote:
 Hi Upayavira,
 
  How will copyField help in my scenario when I have to add the synonym
  to a docValues-enabled field?
 
 With Regards
 Aman Tandon
 
 On Sat, May 30, 2015 at 1:18 AM, Upayavira u...@odoko.co.uk wrote:
 
  Use copyField to clone the field for faceting purposes.
 
  Upayavira
 
  On Fri, May 29, 2015, at 08:06 PM, Aman Tandon wrote:
   Hi Erick,
  
    Thanks for the suggestion. We are using this query parser plugin (
    *SynonymExpandingExtendedDismaxQParserPlugin*) to manage multi-word
    synonyms. So it does work slower than edismax, and that's why it is not in
    contrib, right? (I am asking this question because we are using it for all
    our searches, to handle 10 multiword synonyms like ice cube, icecube, etc.)
  
   *Moreover I thought a solution for this docValue problem*
  
    I need to make the city field *multivalued*, and by this I mean I will add
    the synonym (*mumbai, bombay*) as an extra value to that field if
    present.
    Now the search operation will work fine as before.
  
   
    <field name="city">mumbai</field>
    <field name="city">bombay</field>
  
  
    The only problem is that we would have to remove the 'city alias/synonym
    facets' when we are providing results to the clients.
  
   *mumbai, 1000*
  
  
   With Regards
   Aman Tandon
  
   On Fri, May 29, 2015 at 7:26 PM, Erick Erickson erickerick...@gmail.com
  
   wrote:
  
Do take time for performance testing with that parser. It can be slow
depending on your
data as I remember. That said it solves the problem it set out to
solve so if it meets
your SLAs, it can be a life-saver.
   
Best,
Erick
   
   
On Fri, May 29, 2015 at 2:35 AM, Alessandro Benedetti
benedetti.ale...@gmail.com wrote:
 Even if a little bit outdated, that query parser is really really
  cool to
 manage synonyms !
 +1 !

 2015-05-29 1:01 GMT+01:00 Aman Tandon amantandon...@gmail.com:

 Thanks chris.

 Yes we are using it for handling multiword synonym problem.

 With Regards
 Aman Tandon

 On Fri, May 29, 2015 at 12:38 AM, Reitzel, Charles 
 charles.reit...@tiaa-cref.org wrote:

  Again, I would recommend using Nolan Lawson's
  SynonymExpandingExtendedDismaxQParserPlugin.
 
 
  http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
 
  -Original Message-
  From: Aman Tandon [mailto:amantandon...@gmail.com]
  Sent: Wednesday, May 27, 2015 6:42 PM
  To: solr-user@lucene.apache.org
  Subject: Re: docValues: Can we apply synonym
 
   Ok, and what synonym processor are you talking about? Maybe it could
 help.
 
  With Regards
  Aman Tandon
 
  On Thu, May 28, 2015 at 4:01 AM, Reitzel, Charles 
  charles.reit...@tiaa-cref.org wrote:
 
   Sorry, my bad.   The synonym processor I mention works
  differently.
 It's
   an extension of the EDisMax query processor and doesn't require
field
   level synonym configs.
  
   -Original Message-
   From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
   Sent: Wednesday, May 27, 2015 6:12 PM
   To: solr-user@lucene.apache.org
   Subject: RE: docValues: Can we apply synonym
  
   But the query analysis isn't on a specific field, it is applied
  to
the
   query string.
  
   -Original Message-
   From: Aman Tandon [mailto:amantandon...@gmail.com]
   Sent: Wednesday, May 27, 2015 6:08 PM
   To: solr-user@lucene.apache.org
   Subject: Re: docValues: Can we apply synonym
  
   Hi Charles,
  
    The problem here is that docValues works only with primitive data
    types like String, int, etc. So how could we apply a synonym on a
    primitive data type?
  
   With Regards
   Aman Tandon
  
   On Thu, May 28, 2015 at 3:19 AM, Reitzel, Charles 
   charles.reit...@tiaa-cref.org wrote:
  
Is there any reason you cannot apply the synonyms at query
  time?
 Applying synonyms at indexing time has problems, e.g.
  polluting
the
term frequency for synonyms added, preventing distance
  queries,
...
   
Since city names often have multiple terms, e.g. New York, Den
Hague, etc., I would recommend using Nolan Lawson's
SynonymExpandingExtendedDismaxQParserPlugin.   Tastes great,
  less
   filling.
   
   
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
   
We found this to fix synonyms like ny for New York and
  vice
 versa.
Haven't tried it with docValues, tho.
   
-Original Message-
From: Aman Tandon 

Re: Optimal FS block size for small documents in Solr?

2015-05-30 Thread Upayavira


On Sat, May 30, 2015, at 09:51 AM, Gili Nachum wrote:
 Hi, What would be an optimal FS block size to use?
 
  Using Solr 4.7.2, I have a RAID-5 array of SSD drives currently configured
 with
 a 128KB block size.
 Can I expect better indexing/query time performance with a smaller block
 size (say 8K)?
 Considering my documents are almost always smaller than 8K.
 I assume all stored fields would fit into one block which is good, but
 what
 will Lucene prefer for reading a long posting list and other data
 structures.
 
 Any rules of thumb or anyone that had experimented on this?

I'm gonna start this response with the observation that I don't know
anything about the topic you are asking about.

So, with that out of the way, a Lucene index is write-once, that is,
when you do a commit, all of the data that makes up your index is
written to disk - that is, all documents making up a single commit are
written into a set of files, making a segment.

Therefore, it isn't the size of a document that matters, more the number
and size of documents making up a single commit. There's a lot more to
it too, e.g. whether fields are stored, how they are analysed, etc.

You could do a simple experiment. Write a little app that pushes docs to
Solr and commits, then look at the file sizes on disk. Then repeat with
more documents, see what impact on file sizes. I suspect you can answer
your question relatively easily.

Upayavira
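
A sketch of the experiment suggested above, using SolrJ. The URL, collection
name and field names are assumptions, and on Solr 4.x the client class is
HttpSolrServer rather than HttpSolrClient.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Index a batch of small documents, commit, then compare the segment file sizes
// under the core's data/index directory before and after.
public class CommitSizeProbe {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "probe-" + i);
            doc.addField("body_txt", "small document number " + i);
            client.add(doc);
        }
        client.commit();   // writes a new segment to disk
        client.close();
    }
}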


Re: Optimal FS block size for small documents in Solr?

2015-05-30 Thread Shawn Heisey
On 5/30/2015 2:51 AM, Gili Nachum wrote:
 Hi, What would be an optimal FS block size to use?
 
 Using Solr 4.7.2, I have a RAID-5 array of SSD drives currently configured with
 a 128KB block size.
 Can I expect better indexing/query time performance with a smaller block
 size (say 8K)?
 Considering my documents are almost always smaller than 8K.
 I assume all stored fields would fit into one block which is good, but what
 will Lucene prefer for reading a long posting list and other data
 structures.

Generally speaking, RAID levels that use striping should have the
largest block size you can make, which for most modern RAID controllers
is 1MB or 2MB.  When you make the stripe size very small, reading and
writing even small files requires hitting all the disks.  With large
stripes, accessing data randomly is more likely to have one read hit one
disk while another read hits another disk.

For Lucene/Solr, there might be benefits to smaller block sizes, but I
believe that they might cause more problems than they solve.

There are some additional things to think about:

If your server has its memory appropriately sized, then you will have
enough RAM to let your operating system cache your index entirely.  For
queries, you will only rarely be hitting the disk ... so disk speed and
layout don't matter much at all, and you will only need to be concerned
about *write* speed for indexing.

RAID levels 3 through 6 (and any derivations like level 50) are
*horrible* if there is very much write activity -- for a Solr install,
that means indexing, and to a slightly lesser extent, logging.

When you write to a RAID5 array, you slow *everything* down.  Even
*reads* that happen at the same time as writes are strongly affected by
those writes.  It is the nature of RAID5.  If your system is entirely
read-only, then RAID5 is awesome ... but RAID10 is better.  RAID10 *is*
initially more expensive than RAID5 ... but the performance and
reliability benefits are completely worth the additional expense.

Additional reading material below.  I do highly recommend reading at
least the first link:

http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
http://www.baarf.com/

The RAID10 stripe size should be at least 1MB if your controller
supports blocks that large.

Thanks,
Shawn