Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Roman Chyla
Hi,
inStockSkusBitSet.get(currentChildDocNumber)

Is that child a lucene id? If yes, does it include offset? Every index
segment starts at a different point, but docs are numbered from zero. So to
check them against the full index bitset, I'd be doing
Bitset.exists(indexBase + docid)

Just one thing to check

Roman
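For illustration, here is a minimal sketch of the offset Roman describes, written as a DelegatingCollector used by a PostFilter (Solr 5.x API; names other than inStockSkusBitSet are hypothetical). collect() receives segment-local doc ids, while a FixedBitSet taken from a top-level DocSet is addressed by global ids, so the segment's docBase has to be added before the lookup:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

class InStockCollector extends DelegatingCollector {
  private final FixedBitSet inStockSkusBitSet; // built from the whole-index DocSet
  private int segmentDocBase;                  // start of the current segment in global doc id space

  InStockCollector(FixedBitSet inStockSkusBitSet) {
    this.inStockSkusBitSet = inStockSkusBitSet;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    super.doSetNextReader(context);            // keep the delegate chain wired up
    this.segmentDocBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    if (!inStockSkusBitSet.get(segmentDocBase + doc)) {
      return;                                  // not an in-stock SKU; skip it
    }
    super.collect(doc);                        // forward matches to the delegate
  }
}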
On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com wrote:

 Hi everyone,

 I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl
 through grandchild documents during a search through the parents and filter
 out documents based on statistics gathered from aggregating the
 grandchildren together.  I've been successful in getting the logic correct,
 but it does not perform so well - I'm grabbing too many documents from the
 index along the way.  I'm trying to filter out grandchild documents which
 are not relevant to the statistics I'm collecting, in order to reduce the
 number of document objects pulled from the IndexReader.

 I've implemented the following code in my DelegatingCollector.collect:

 if (inStockSkusBitSet == null) {
     SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // cast from IndexSearcher to expose getDocSet
     inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
     inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // cast from DocSet to expose getBits
     inStockSkusBitSet = inStockSkusBitDocSet.getBits();
 }


 My BitDocSet reports a size which matches a standard query for the more
 limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also
 reports this same cardinality.  Based on that fact, it seems that the
 getDocSet call itself must be working properly, and returning the right
 number of documents.  However, when I try to filter out grandchild
 documents using either BitDocSet.exists or BitSet.get (passing over any
 grandchild document which doesn't exist in the bitdocset or return true
 from the bitset), I get about 1/3 fewer results than I'm supposed to.   It
 seems many documents that should match the filter, are being excluded, and
 documents which should not match the filter, are being included.

 I'm trying to use it in either of these ways:

 if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
 if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

 The currentChildDocNumber is simply the docNumber which is passed to
 DelegatingCollector.collect, decremented until I hit a document that
 doesn't belong to the parent document.

 I can't seem to figure out a way to actually use the BitDocSet (or its
 derivatives) to quickly eliminate document IDs.  It seems like this is how
 it's supposed to be used.  What am I getting wrong?

 Sorry if this is a newbie question, I've never written a PostFilter
 before, and frankly, the documentation out there is a little sketchy
 (mostly for version 4) - so many classes have changed names and so many of
 the more well-documented techniques are deprecated or removed now, it's
 tough to follow what the current best practice actually is.  I'm using the
 block join functionality heavily so I'm trying to keep more current than
 that.  I would be happy to send along the full source privately if it would
 help figure this out, and plan to write up some more elaborate instructions
 (updated for Solr 5) for the next person who decides to write a PostFilter
 and work with block joins, if I ever manage to get this performing well
 enough.

 Thanks for any pointers!  Totally open to doing this an entirely different
 way.  I read DocValues might be a more elegant approach but currently that
 would require reindexing, so trying to avoid that.

 Also, I've been wondering if the query above would read from the filter
 cache or not.  The query is constructed like this:


 private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
 private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
 ...

 inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
 objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
 inStockSkusQuery = new BooleanQuery();
 inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
 inStockSkusQuery.add(objectTypeSkuHistoryTermQuery,
 BooleanClause.Occur.MUST);
 --
 Steve

 


Re: Collection APIs to create collection and custom cores naming

2015-08-03 Thread Erick Erickson
See: https://issues.apache.org/jira/browse/SOLR-6719

It's not clear that we'll support this, so this may just be a doc
change. How would you properly support having more than one replica?
Or, for that matter, having more than one shard? Property.name would
have to do something to make the core names unique.

I agree that for single-shard, single replica situations it's a
reasonable thing to do, but I'm not at all sure the effort is worth
the gain for that one case. Yes, you can create a bunch of rules that
would allow you to map selected names to a long, comma separated
string or something like that, but it just doesn't seem worth the
effort.

Best,
Erick

On Mon, Aug 3, 2015 at 1:48 AM, davidphilip cherian
davidphilipcher...@gmail.com wrote:
 How to use the 'property.name=value' parameter in the API example [1] to
 modify the core.properties value of 'name'?

 While creating the collection with the query below [2], the core names
 become 'aggregator_shard1_replica1' and 'aggregator_shard2_replica1'. I
 wanted to have a specific/custom name for each of these cores. I tried
 passing the params as property.name=name&name=aggregator_s1, but it did not
 work.

 Editing the core.properties key/value pair to name=aggregator_s1 after the
 collection is created works! But I was looking for a way to set this
 property with the create request itself.

 [2]
 http://example.com:8983/solr/admin/collections?action=CREATE&name=aggregator&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=aggregator_config&property.name=name&name=aggregator_s1

 [1]
 https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi,
I am new to Solr development and have the same requirement. With the help of
Googling I have already gained some knowledge, such as how many shards have
to be created for that amount of data.

I want to ask for some suggestions: there are so many methods to do indexing,
such as DIH, the post tool, and SolrJ.

Please suggest which way I should do it:
1.) Should I use SolrJ?
2.) Should I use DIH?
3.) Should I use the post method (in terminal)?

Or is there any other way for indexing that amount of data?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220469.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
That's still a VERY open question. The answer is Yes, but the details
depend on the shape and source of your data. And the search you are
anticipating.

Is this a lot of entries with a small number of fields, or a relatively
small number of entries with huge field counts? Do you need to store/return
all those fields or just search them?

Is the content coming as one huge file (in which format?) or from an
external source such as a database?

And so on.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 11:42, Mugeesh Husain muge...@gmail.com wrote:
 Hi,
 I am new in solr development and have a same requirement and I have already
 got some knowledge such as how many shard have to created such amount of
 data at all. with help of googling.

 I want to take Some suggestion there are so many method to do indexing such
 as DIH,solr,Solrj.

 Please suggest me in which way i have to do it.
 1.) Should i  use Solrj
 1.) Should i  use DIH
 1.) Should i  use post method(in terminal)

 or Is there any other way for indexing such amount of data.




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220469.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Joel Bernstein
Your findings are the expected behavior for the Collapsing qparser. The
Collapsing qparser requires records in the same collapsed field to be
located on the same shard. The typical approach for this is to use
composite Id routing to ensure that documents with the same collapse field
land on the same shard.

We should make this clear in the documentation.

Joel Bernstein
http://joelsolr.blogspot.com/
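
As an illustration of the composite-id routing Joel mentions, here is a small SolrJ 5.x sketch (ZooKeeper hosts, collection and field names are assumptions drawn from the mail): prefixing the unique id with the collapse key as "dupid!uniquePart" makes Solr route every document sharing that key to the same shard, which is what the Collapsing qparser needs.

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdIndexer {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      client.setDefaultCollection("AcaColl");
      SolrInputDocument doc = new SolrInputDocument();
      String dupid = "900";                     // the collapse/signature value
      doc.addField("id", dupid + "!" + "1002"); // "routeKey!uniquePart" routes by prefix
      doc.addField("storeid_s", "1002");
      doc.addField("dupid_s", dupid);
      doc.addField("title_pqth", "Dupe Record #2");
      client.add(doc);
      client.commit();
    }
  }
}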

On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee peter@proquest.com wrote:

 From my reading of the solr docs (e.g.
 https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
 and https://cwiki.apache.org/confluence/display/solr/Result+Grouping),
 I've been under the impression that these two methods (result grouping and
 collapsing query parser) can both be used to eliminate duplicates from a
 result set (in our case, we have a duplication field that contains a
 'signature' that identifies duplicates. We use our own signature for a
 variety of reasons that are tied to complex business requirements.).

 In a test environment I scattered 15 duplicate records (with another 10
 unique records) across a test system running Solr Cloud (Solr version
 5.2.1) that had 4 shards and a replication factor of 2. I tried both result
 grouping and the collapsing query parser to remove duplicates. The result
 grouping worked as expected...the collapsing query parser did not.

 My results in using the collapsing query parser showed that Solr was in
 fact including into the result set one of the duplicate records from each
 shard (that is, I received FOUR duplicate records...and turning on debug
 showed that each of the four records came from a  unique shard)...when I
 was expecting solr to do the collapsing on the aggregated result and return
 only ONE of the duplicated records across ALL shards. It appears that solr
 is performing the collapsing query parsing on each individual shard, but
 then NOT performing the operation on the aggregated results from each shard.

 I have searched through the forums and checked the documentation as
 carefully as I can. I find no documentation or mention of this effect (one
 record being returned per shard) when using collapsing query parsing.

 Is this a known behavior? Am I just doing something wrong? Am I missing
 some search parameter? Am I simply not understanding correctly how this is
 supposed to work?

 For reference, I am including below the search url and the response I
 received. Any insights would be appreciated.

 Query:
 http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

 Response (note that dupid_s = 900 is the duplicate value and that I have
 added comments in the output ***comment*** pointing out which shard
 responses came from):

 {
   "responseHeader":{
     "status":0,
     "QTime":31,
     "params":{
       "debugQuery":"true",
       "indent":"true",
       "q":"*:*",
       "wt":"json",
       "fq":"{!collapse field=dupid_s}",
       "rows":"1000"}},
   "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
     {
       "storeid_s":"1002",
       "dupid_s":"900",   ***AcaColl_shard2_replica2***
       "title_pqth":["Dupe Record #2"],
       "_version_":1508241005512491008,
       "indexTime_dt":"2015-07-31T19:25:09.914Z"},
     {
       "storeid_s":"8020",
       "dupid_s":"2005",
       "title_pqth":["Unique Record #5"],
       "_version_":1508241005539753984,
       "indexTime_dt":"2015-07-31T19:25:09.94Z"},
     {
       "storeid_s":"8023",
       "dupid_s":"2008",
       "title_pqth":["Unique Record #8"],
       "_version_":1508241005540802560,
       "indexTime_dt":"2015-07-31T19:25:09.94Z"},
     {
       "storeid_s":"8024",
       "dupid_s":"2009",
       "title_pqth":["Unique Record #9"],
       "_version_":1508241005541851136,
       "indexTime_dt":"2015-07-31T19:25:09.94Z"},
     {
       "storeid_s":"1007",
       "dupid_s":"900",   ***AcaColl_shard4_replica2***
       "title_pqth":["Dupe Record #7"],
       "_version_":1508241005515636736,
       "indexTime_dt":"2015-07-31T19:25:09.91Z"},
     {
       "storeid_s":"8016",
       "dupid_s":"2001",
       "title_pqth":["Unique Record #1"],
       "_version_":1508241005526122496,
       "indexTime_dt":"2015-07-31T19:25:09.91Z"},
     {
       "storeid_s":"8019",
       "dupid_s":"2004",
       "title_pqth":["Unique Record #4"],
       "_version_":1508241005528219648,
       "indexTime_dt":"2015-07-31T19:25:09.91Z"},
     {
       "storeid_s":"1003",
       "dupid_s":"900",   ***AcaColl_shard1_replica1***
       "title_pqth":["Dupe Record #3"],
       "_version_":1508241005515636736,
       "indexTime_dt":"2015-07-31T19:25:09.917Z"},
     {
       "storeid_s":"8017",
       "dupid_s":"2002",
       "title_pqth":["Unique Record #2"],
       "_version_":1508241005518782464,
       "indexTime_dt":"2015-07-31T19:25:09.917Z"},
     {
       "storeid_s":"8018",
       "dupid_s":"2003",

Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Konstantin Gribov
Upayavira, manual commit isn't a good advice, especially with small bulks
or single document, is it? I see recommendations on using
autoCommit+autoSoftCommit instead of manual commit mostly.
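
To illustrate the alternative Konstantin points at, here is a minimal SolrJ 5.x sketch that relies on commitWithin (or server-side autoCommit/autoSoftCommit in solrconfig.xml) instead of an explicit client.commit(); the URL and field names are assumptions:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinDemo {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
    List<SolrInputDocument> docs = new ArrayList<>();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "abc123");
    doc.addField("some-text-field", "I like it when the sun shines");
    docs.add(doc);
    client.add(docs, 30000); // ask Solr to make the docs searchable within 30s; no explicit commit()
    client.close();
  }
}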

Tue, 4 Aug 2015 at 1:00, Upayavira u...@odoko.co.uk:

 SolrJ is just a SolrClient. In pseudocode, you say:

 SolrClient client = new
 SolrClient("http://localhost:8983/solr/whatever");

 List<SolrInputDocument> docs = new ArrayList<>();
 SolrInputDocument doc = new SolrInputDocument();
 doc.addField("id", "abc123");
 doc.addField("some-text-field", "I like it when the sun shines");
 docs.add(doc);
 client.add(docs);
 client.commit();

 (warning, the above is typed from memory)

 So, the question is simply how many documents do you add to docs before
 you do client.add(docs);

 And how often (if at all) do you call client.commit().

 So when you are told "Use SolrJ", really, you are being told to write
 some Java code that happens to use the SolrJ client library for Solr.

 Upayavira


 On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
  Well,
 
  If it is just file names, I'd probably use SolrJ client, maybe with
  Java 8. Read file names, split the name into parts with regular
  expressions, stuff parts into different field names and send to Solr.
  Java 8 has FileSystem walkers, etc to make it easier.
 
  You could do it with DIH, but it would be with nested entities and the
  inner entity would probably try to parse the file. So, a lot of wasted
  effort if you just care about the file names.
 
  Or, I would just do a directory listing in the operating system and
  use regular expressions to split it into CSV file, which I would then
  import into Solr directly.
 
  In all of these cases, the question would be which field is the ID of
  the record to ensure no duplicates.
 
  Regards,
 Alex.
 
  
  Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
  http://www.solr-start.com/
 
 
  On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote:
   @Alexandre  No i dont need a content of a file. i am repeating my
 requirement
  
   I have a 40 millions of files which is stored in a file systems,
   the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf
  
   I just  split all Value from a filename only,these values i have to
 index.
  
   I am interested to index value to solr not file contains.
  
   I have tested the DIH from a file system its work fine but i dont know
 how
   can i implement my code in DIH
   if my code get some value than how i can i index it using DIH.
  
   If i will use DIH then How i will make split operation and get value
 from
   it.
  
  
  
  
  
   --
   View this message in context:
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
   Sent from the Solr - User mailing list archive at Nabble.com.

-- 
Best regards,
Konstantin Gribov


DateRangeField Query throws NPE

2015-08-03 Thread Stephen Weiss
Hi everyone,

I'm running into a trouble building a query with DateRangeField.  Web-based 
queries work fine, but this code throws an NPE:


dateRangeQuery = dateRangeField.getRangeQuery(null,
    SidxS.getSchema().getField("sku_history.date_range"),
    start_date_str, end_date_str, true, true);

ERROR - 2015-08-03 23:07:10.122; [   instock_dev] 
org.apache.solr.common.SolrException; 
null:com.google.common.util.concurrent.UncheckedExecutionException: 
java.lang.NullPointerException
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at 
org.apache.solr.schema.AbstractSpatialFieldType.getStrategy(AbstractSpatialFieldType.java:403)
at 
org.apache.solr.schema.AbstractSpatialFieldType.getQueryFromSpatialArgs(AbstractSpatialFieldType.java:331)
at org.apache.solr.schema.DateRangeField.getRangeQuery(DateRangeField.java:184)
at 
com.wgsn.ginger.stockStatusQuery.getFilterCollector(stockStatusQuery.java:128)
at 
org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1148)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1609)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1485)
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:561)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:518)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:255)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
...
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at 
org.apache.solr.schema.AbstractSpatialPrefixTreeFieldType.newSpatialStrategy(AbstractSpatialPrefixTreeFieldType.java:117)
at 
org.apache.solr.schema.AbstractSpatialPrefixTreeFieldType.newSpatialStrategy(AbstractSpatialPrefixTreeFieldType.java:40)
at 
org.apache.solr.schema.AbstractSpatialFieldType$2.call(AbstractSpatialFieldType.java:406)
at 
org.apache.solr.schema.AbstractSpatialFieldType$2.call(AbstractSpatialFieldType.java:403)
at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
... 38 more


That line 117 seems to be some logging happening in that class, which is 
probably unnecessary to begin with:

log.info(this.toString() + " strat: " + strat + " maxLevels: " + grid.getMaxLevels()); //TODO output maxDetailKm

No idea which variable is returning null, kinda guessing it's the 
grid.getMaxLevels() since I don't see this being initialized by any prior 
methods in this chain.  Is this just a bug or am I using it wrong?  I'm trying 
to wrap this query together with two other queries but I can't even get the 
Query object back.

--
Steve




Re: Documentation for: solr.EnglishPossessiveFilterFactory

2015-08-03 Thread Alexandre Rafalovitch
Seems simple enough that the source answers all the questions:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_9/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishPossessiveFilter.java#L66

It just looks for a couple of versions of apostrophe followed by s or S.
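
A quick way to see that behaviour is to run the filter directly; a minimal sketch, assuming Lucene 5.x analyzers-common on the classpath:

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PossessiveDemo {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tok = new WhitespaceTokenizer();
    tok.setReader(new StringReader("John's dog's toys"));
    TokenStream ts = new EnglishPossessiveFilter(tok);
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString()); // prints John, dog, toys - the trailing 's is stripped
    }
    ts.end();
    ts.close();
  }
}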

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 17:56, Steven White swhite4...@gmail.com wrote:
 Hi Everyone,

 Does anyone know where I can find docs on <filter
 class="solr.EnglishPossessiveFilterFactory"/>?  The only one I found is the
 API doc:
 http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html
 but that's not what I'm looking for; I'm looking for one that describes in
 detail how this filter works, with examples.

 Thanks

 Steve


Documentation for: solr.EnglishPossessiveFilterFactory

2015-08-03 Thread Steven White
Hi Everyone,

Does anyone know where I can find docs on <filter
class="solr.EnglishPossessiveFilterFactory"/>?  The only one I found is the
API doc:
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html
but that's not what I'm looking for; I'm looking for one that describes in
detail how this filter works, with examples.

Thanks

Steve


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Upayavira
SolrJ is just a SolrClient. In pseudocode, you say:

SolrClient client = new
SolrClient("http://localhost:8983/solr/whatever");

List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun shines");
docs.add(doc);
client.add(docs);
client.commit();

(warning, the above is typed from memory)

So, the question is simply how many documents do you add to docs before
you do client.add(docs);

And how often (if at all) do you call client.commit().

So when you are told "Use SolrJ", really, you are being told to write
some Java code that happens to use the SolrJ client library for Solr.

Upayavira


On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
 Well,
 
 If it is just file names, I'd probably use SolrJ client, maybe with
 Java 8. Read file names, split the name into parts with regular
 expressions, stuff parts into different field names and send to Solr.
 Java 8 has FileSystem walkers, etc to make it easier.
 
 You could do it with DIH, but it would be with nested entities and the
 inner entity would probably try to parse the file. So, a lot of wasted
 effort if you just care about the file names.
 
 Or, I would just do a directory listing in the operating system and
 use regular expressions to split it into CSV file, which I would then
 import into Solr directly.
 
 In all of these cases, the question would be which field is the ID of
 the record to ensure no duplicates.
 
 Regards,
Alex.
 
 
 Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
 http://www.solr-start.com/
 
 
 On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote:
  @Alexandre  No i dont need a content of a file. i am repeating my 
  requirement
 
  I have a 40 millions of files which is stored in a file systems,
  the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf
 
  I just  split all Value from a filename only,these values i have to index.
 
  I am interested to index value to solr not file contains.
 
  I have tested the DIH from a file system its work fine but i dont know how
  can i implement my code in DIH
  if my code get some value than how i can i index it using DIH.
 
  If i will use DIH then How i will make split operation and get value from
  it.
 
 
 
 
 
  --
  View this message in context: 
  http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
  Sent from the Solr - User mailing list archive at Nabble.com.


Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Erick Erickson
This doesn't work in SolrCloud, but it really sounds like "lots of cores",
which is designed to keep the most recent N cores loaded and auto-unload
older ones; see:
http://wiki.apache.org/solr/LotsOfCores

Best,
Erick

On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:
 Is there are an easy way for a client to tell Solr to close or release the
 IndexSearcher and/or IndexWriter for a core?

 I have a use case where we're creating a lot of cores with not that many
 documents per zone (a few hundred to maybe 10's of thousands).  Writes come
 in batches, and reads also tend to be bursty, if less so than the writes.

 And we're having problems with ram usage on the server.  Poking around a
 heap dump, the problem is that every IndexSearcher or IndexWriter being
 opened is taking up large amounts of memory.

 I've looked at the unload call, and while it is unclear, it seems like it
 deletes the data on disk as well.  I don't want to delete the data on disk,
 I just want to unload the searcher and writer, and free up the memory.

 So I'm wondering if there is a call I can make when I know or suspect that
 the core isn't going to be used in the near future to release these objects
 and return the memory?  Or a configuration option I can set to do so after,
 say, being idle for 5 seconds?  It's OK for there to be a performance hit
 the first time I reopen the core.

 Thanks,

 Brian


Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
So unloading a core doesn't delete the data?  That is good to know.
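
For illustration, a SolrJ 5.x sketch of the CoreAdmin UNLOAD call with all of its delete flags left at their defaults (false), so the index stays on disk; the core name and base URL are assumptions:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadCoreDemo {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr");
    CoreAdminRequest.Unload unload = new CoreAdminRequest.Unload(false); // false = keep the index
    unload.setCoreName("core_to_idle");
    unload.setDeleteDataDir(false);     // keep data/
    unload.setDeleteInstanceDir(false); // keep the core directory and conf/
    unload.process(client);
    client.close();
  }
}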

On Mon, Aug 3, 2015 at 6:22 PM, Erick Erickson erickerick...@gmail.com
wrote:

 This doesn't work in SolrCloud, but it really sounds like lots of
 cores which is designed
 to keep the most recent N cores loaded and auto-unload older ones, see:
 http://wiki.apache.org/solr/LotsOfCores

 Best,
 Erick

 On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:
  Is there are an easy way for a client to tell Solr to close or release
 the
  IndexSearcher and/or IndexWriter for a core?
 
  I have a use case where we're creating a lot of cores with not that many
  documents per zone (a few hundred to maybe 10's of thousands).  Writes
 come
  in batches, and reads also tend to be bursty, if less so than the writes.
 
  And we're having problems with ram usage on the server.  Poking around a
  heap dump, the problem is that every IndexSearcher or IndexWriter being
  opened is taking up large amounts of memory.
 
  I've looked at the unload call, and while it is unclear, it seems like it
  deletes the data on disk as well.  I don't want to delete the data on
 disk,
  I just want to unload the searcher and writer, and free up the memory.
 
  So I'm wondering if there is a call I can make when I know or suspect
 that
  the core isn't going to be used in the near future to release these
 objects
  and return the memory?  Or a configuration option I can set to do so
 after,
  say, being idle for 5 seconds?  It's OK for there to be a performance hit
  the first time I reopen the core.
 
  Thanks,
 
  Brian



Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Some further information:

The main things use memory that I see from my heap dump are:

1. Arrays of org.apache.lucene.util.fst.FST$Arc classes- which mainly seem
to hold nulls.  The ones of these I've investigated have been held by
org.apache.lucene.util.fst.FST objects, I have 38 cores open and have over
121,000 of these arrays, taking up over 126M of space.

2. Byte arrays, of which I have 384,000 of, taking up 106M of space.

When I trace the cycle of references up, I've always ended up at an
IndexSearcher or IndexWriter class, causing me to assume the problem was
that I was simply opening up too many cores, but I could be mistaken.

This was on a freshly started system without many cores having been touched
yet- so the memory usage, while larger than I expect, isn't critical yet.
It does become critical as the number of cores increases.

Thanks,

Brian



On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:


 Is there are an easy way for a client to tell Solr to close or release the
 IndexSearcher and/or IndexWriter for a core?

 I have a use case where we're creating a lot of cores with not that many
 documents per zone (a few hundred to maybe 10's of thousands).  Writes come
 in batches, and reads also tend to be bursty, if less so than the writes.

 And we're having problems with ram usage on the server.  Poking around a
 heap dump, the problem is that every IndexSearcher or IndexWriter being
 opened is taking up large amounts of memory.

 I've looked at the unload call, and while it is unclear, it seems like it
 deletes the data on disk as well.  I don't want to delete the data on disk,
 I just want to unload the searcher and writer, and free up the memory.

 So I'm wondering if there is a call I can make when I know or suspect that
 the core isn't going to be used in the near future to release these objects
 and return the memory?  Or a configuration option I can set to do so after,
 say, being idle for 5 seconds?  It's OK for there to be a performance hit
 the first time I reopen the core.

 Thanks,

 Brian




Re: Large number of collections in SolrCloud

2015-08-03 Thread Mukesh Jha
We have similar date- and language-based collections.
We also ran into similar issues of having a huge clusterstate.json file,
which took an eternity to load up.

In our case the searches were language specific, so we moved to multiple
Solr clusters, each with a different ZK namespace per language; something
you might look at.
On 27 Jul 2015 20:47, Olivier olivau...@gmail.com wrote:

 Hi,

 I have a SolrCloud cluster with 3 nodes :  3 shards per node and
 replication factor at 3.
 The collections number is around 1000. All the collections use the same
 Zookeeper configuration.
 So when I create each collection, the ZK configuration is pulled from ZK
 and the configuration files are stored in the JVM.
 I thought that if the configuration was the same for each collection, the
 impact on the JVM would be insignifiant because the configuration should be
 loaded only once. But it is not the case, for each collection created, the
 JVM size increases because the configuration is loaded again, am I correct
 ?

 If I have a small configuration folder size, I have no problem because the
 folder size is less than 500 KB so if we count 1000 collections x 500 KB,
 the JVM impact is 500 MB.
 But we manage a lot of languages with some dictionaries so the
 configuration folder size is about 6 MB. The JVM impact is very important
 now because it can be more than 6 GB (1000 x 6 MB).

 So I would like to have the feeback of people who have a cluster with a
 large number of collections too. Do I have to change some settings to
 handle this case better ? What can I do to optimize this behaviour ?
 For now, we just increase the RAM size per node at 16 GB but we plan to
 increase the collections number.

 Thanks,

 Olivier



Re: Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Joel Bernstein
One of things to keep in mind with Grouping is that if you are relying on
an accurate group count (ngroups) then you will also have to collocate
documents based on the grouping field.

The main advantage to the Collapsing qparser plugin is it provides fast
field collapsing on high cardinality fields with an accurate group count.

If you don't need ngroups, then Grouping is usually just as fast if not
faster.


Joel Bernstein
http://joelsolr.blogspot.com/
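
For reference, a SolrJ 5.x sketch of the Grouping alternative with group.ngroups (the collection URL is an assumption); as noted above, ngroups is only accurate when documents sharing dupid_s live on the same shard:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupingNgroups {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/AcaColl")) {
      SolrQuery q = new SolrQuery("*:*");
      q.set("group", true);
      q.set("group.field", "dupid_s");
      q.set("group.ngroups", true); // total number of distinct groups
      q.set("group.limit", 1);      // top document per group, i.e. deduplicated results
      QueryResponse rsp = client.query(q);
      System.out.println("ngroups: " + rsp.getGroupResponse().getValues().get(0).getNGroups());
    }
  }
}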

On Mon, Aug 3, 2015 at 10:14 PM, Joel Bernstein joels...@gmail.com wrote:

 Your findings are the expected behavior for the Collapsing qparser. The
 Collapsing qparser requires records in the same collapsed field to be
 located on the same shard. The typical approach for this is to use
 composite Id routing to ensure that documents with the same collapse field
 land on the same shard.

 We should make this clear in the documentation.

 Joel Bernstein
 http://joelsolr.blogspot.com/

 On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee peter@proquest.com wrote:

 From my reading of the solr docs (e.g.
 https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
 and https://cwiki.apache.org/confluence/display/solr/Result+Grouping),
 I've been under the impression that these two methods (result grouping and
 collapsing query parser) can both be used to eliminate duplicates from a
 result set (in our case, we have a duplication field that contains a
 'signature' that identifies duplicates. We use our own signature for a
 variety of reasons that are tied to complex business requirements.).

 In a test environment I scattered 15 duplicate records (with another 10
 unique records) across a test system running Solr Cloud (Solr version
 5.2.1) that had 4 shards and a replication factor of 2. I tried both result
 grouping and the collapsing query parser to remove duplicates. The result
 grouping worked as expected...the collapsing query parser did not.

 My results in using the collapsing query parser showed that Solr was in
 fact including into the result set one of the duplicate records from each
 shard (that is, I received FOUR duplicate records...and turning on debug
 showed that each of the four records came from a  unique shard)...when I
 was expecting solr to do the collapsing on the aggregated result and return
 only ONE of the duplicated records across ALL shards. It appears that solr
 is performing the collapsing query parsing on each individual shard, but
 then NOT performing the operation on the aggregated results from each shard.

 I have searched through the forums and checked the documentation as
 carefully as I can. I find no documentation or mention of this effect (one
 record being returned per shard) when using collapsing query parsing.

 Is this a known behavior? Am I just doing something wrong? Am I missing
 some search parameter? Am I simply not understanding correctly how this is
 supposed to work?

 For reference, I am including below the search url and the response I
 received. Any insights would be appreciated.

 Query:
  http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

 Response (note that dupid_s = 900 is the duplicate value and that I have
 added comments in the output ***comment*** pointing out which shard
 responses came from):

  {
    "responseHeader":{
      "status":0,
      "QTime":31,
      "params":{
        "debugQuery":"true",
        "indent":"true",
        "q":"*:*",
        "wt":"json",
        "fq":"{!collapse field=dupid_s}",
        "rows":"1000"}},
    "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900",   ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900",   ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,

RE: Do not match on high frequency terms

2015-08-03 Thread Swedish, Steve
Thanks for your response. For TermsComponent, I am able to get a list of all 
terms in a field that have a document frequency under a certain threshold, but 
I was wondering if I could instead pass a list of terms, and get back only the 
terms from that list that have a document frequency under a certain threshold 
in a field. I can't find an easy way to do this, do you know if this is 
possible?

Thanks,
Steve

-Original Message-
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] 
Sent: Saturday, August 1, 2015 6:35 AM
To: solr-user solr-user@lucene.apache.org
Subject: Re: Do not match on high frequency terms

It seems like you need to develop a custom query or query parser. Regarding
SolrJ: you can try to call the TermsComponent
(http://wiki.apache.org/solr/TermsComponent,
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component). I'm
not sure how exactly to call the TermsComponent from SolrJ; I just found
https://lucene.apache.org/solr/5_2_1/solr-solrj/org/apache/solr/client/solrj/response/TermsResponse.html
to read its response.
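
For what it's worth, here is a SolrJ 5.x sketch of calling the TermsComponent (it assumes a /terms handler is configured, as in the sample configs; the field name is an assumption). terms.maxcount caps the document frequency of returned terms, which is one way to keep only the rare-enough ones:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsDemo {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/terms");
    q.setTerms(true);
    q.addTermsField("field1");
    q.setTermsLimit(-1);        // return all qualifying terms
    q.setTermsMaxCount(1000);   // only terms with docFreq <= 1000
    QueryResponse rsp = client.query(q);
    TermsResponse terms = rsp.getTermsResponse();
    for (TermsResponse.Term t : terms.getTerms("field1")) {
      System.out.println(t.getTerm() + " -> " + t.getFrequency());
    }
    client.close();
  }
}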

On Fri, Jul 31, 2015 at 11:31 PM, Swedish, Steve steve.swed...@noblis.org
wrote:

 Hello,

 I'm hoping someone might be able to help me out with this as I do not 
 have very much solr experience. Basically, I am wondering if it is 
 possible to not match on terms that have a document frequency above a 
 certain threshold. For my situation, a stop word list will be 
 unrealistic to maintain, so I was wondering if there may be an 
 alternative solution using term document frequency to identify common terms.

 What would actually be ideal is if I could somehow use the 
 CommonTermsQuery. The problem I ran across when looking at this option 
 was that the CommonTermsQuery seems to only work for queries on one 
 field at a time (unless I'm mistaken). However, I have a query of the 
 structure
 q=(field1:(blah) AND (field2:(blah) OR field3:(blah))) OR 
 field1:(blah) OR
 (field2:(blah) AND field3:(blah)). If there are any ideas on how to 
 use the CommonTermsQuery with this query structure, that would be great.

 If it's possible to extract the document frequency for terms in my 
 query before the query is run, allowing me to remove the high 
 frequency terms from the query first, that could also be a valid 
 solution. I'm using solrj as well, so a solution that works with solrj would 
 be appreciated.

 Thanks,
 Steve




--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Indexing issues after cluster restart.

2015-08-03 Thread Fadi Mohsen
Hi, using Solr 5.2: after restarting the cluster, I get the exceptions below

org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes
published as DOWN in our cluster state.

followed by :

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms , collection:

So I looked into the ZK tree and saw a load of entries in /overseer/queue, so
I went off and cleaned up (removed all entries), restarted the cluster, and
am now able to index again.

What might be the cause to this?

Regards


Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Upayavira
There are two things that are likely to cause the timeouts you are
seeing, I'd say.

Firstly, your server is overloaded - that can be handled by adding
additional replicas.

However, it doesn't seem like this is the case, because the second query
works fine.

Secondly, you are hitting garbage collection issues. This seems more
likely to me. You have 40m documents inside a 6Gb heap. That seems
relatively tight to me. What that means is that Java may well not have
enough space to create all the objects it needs inside a single commit
cycle, forcing a garbage collection which can cause application pauses,
which would fit with what you are seeing.

I'd suggest using the jstat -gcutil command (I think I have that right)
to watch the number of garbage collections taking place. You will
quickly see from that if garbage collection is your issue. The
simplistic remedy would be to allow your JVM a bit more memory.

The other concern I have is that Solr (and Lucene) is intended for high
read/low write scenarios. Its index structure is highly tuned for this
scenario. If you are doing a lot of writes, then you will be creating a
lot of index churn which will require more frequent merges, consuming
both CPU and memory in the process. It may be worth looking at *how* you
use Solr, and see whether, for example, you can separate your documents
into slow moving, and fast moving parts, to better suit the Lucene index
structures. Or to consider whether a Lucene based system is best for
what you are attempting to achieve.

For garbage collection, see here for a good Solr related write-up:

  http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

Upayavira

On Mon, Aug 3, 2015, at 12:29 AM, Jay Potharaju wrote:
 Shawn,
 Thanks for the feedback. I agree that increasing timeout might alleviate
 the timeout issue. The main problem with increasing timeout is the
 detrimental effect it will have on the user experience, therefore can't
 increase it.
 I have looked at the queries that threw errors, next time I try it
 everything seems to work fine. Not sure how to reproduce the error.
 My concern with increasing the memory to 32GB is what happens when the
 index size grows over the next few months.
 One of the other solutions I have been thinking about is to rebuild
 index(weekly) and create a new collection and use it. Are there any good
 references for doing that?
 Thanks
 Jay
 
 On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org
 wrote:
 
  On 8/2/2015 8:29 AM, Jay Potharaju wrote:
   The document contains around 30 fields and have stored set to true for
   almost 15 of them. And these stored fields are queried and updated all
  the
   time. You will notice that the deleted documents is almost 30% of the
   docs.  And it has stayed around that percent and has not come down.
   I did try optimize but that was disruptive as it caused search errors.
   I have been playing with merge factor to see if that helps with deleted
   documents or not. It is currently set to 5.
  
   The server has 24 GB of memory out of which memory consumption is around
  23
   GB normally and the jvm is set to 6 GB. And have noticed that the
  available
   memory on the server goes to 100 MB at times during a day.
   All the updates are run through DIH.
 
  Using all availble memory is completely normal operation for ANY
  operating system.  If you hold up Windows as an example of one that
  doesn't ... it lies to you about available memory.  All modern
  operating systems will utilize memory that is not explicitly allocated
  for the OS disk cache.
 
  The disk cache will instantly give up any of the memory it is using for
  programs that request it.  Linux doesn't try to hide the disk cache from
  you, but older versions of Windows do.  In the newer versions of Windows
  that have the Resource Monitor, you can go there to see the actual
  memory usage including the cache.
 
   Every day at least once i see the following error, which result in search
   errors on the front end of the site.
  
   ERROR org.apache.solr.servlet.SolrDispatchFilter -
   null:org.eclipse.jetty.io.EofException
  
   From what I have read these are mainly due to timeout and my timeout is
  set
   to 30 seconds and cant set it to a higher number. I was thinking maybe
  due
   to high memory usage, sometimes it leads to bad performance/errors.
 
  Although this error can be caused by timeouts, it has a specific
  meaning.  It means that the client disconnected before Solr responded to
  the request, so when Solr tried to respond (through jetty), it found a
  closed TCP connection.
 
  Client timeouts need to either be completely removed, or set to a value
  much longer than any request will take.  Five minutes is a good starting
  value.
 
  If all your client timeout is set to 30 seconds and you are seeing
  EofExceptions, that means that your requests are taking longer than 30
  seconds, and you likely have some performance issues.  It's also
  possible that 

Re: Multiple boost queries on a specific field

2015-08-03 Thread bengates
Hello Chris,

This totally does the trick. I drastically improved relevancy. Thank you
much for your advices !

- Ben



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-boost-queries-on-a-specific-field-tp4217678p4220396.html
Sent from the Solr - User mailing list archive at Nabble.com.


Trouble getting langid.map.individual setting to work in Solr 5.0.x

2015-08-03 Thread David Smith
I am trying to use the “langid.map.individual” setting to allow field “a” to
detect as, say, English, and be mapped to “a_en”, while in the same document,
field “b” detects as, say, German and is mapped to “b_de”.

What happens in my tests is that the global language is detected (for example, 
German), but BOTH fields are mapped to “_de” as a result.  I cannot get 
individual detection or mapping to work.  Am I mis-understanding the purpose of 
this setting?

Here is the resulting document from my test:


  {
    "id": "1005!22345",
    "language": [
      "de"
    ],
    "a_de": "A title that should be detected as English with high confidence",
    "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
    "_version_": 1508494723734569000
  }


I expected “a_de” to be “a_en”, and the “language” multi-valued field to have 
“en” and “de”.

Here is my configuration in solrconfig.xml:


<updateRequestProcessorChain name="langid" default="true">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid">true</str>
      <str name="langid.fl">a,b</str>
      <str name="langid.map">true</str>
      <str name="langid.map.individual">true</str>
      <str name="langid.langField">language</str>
      <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>



The debug output of lang detect, during indexing, is as follows:

---
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected 
main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field a
DEBUG - 2015-08-03 14:37:54.451; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field b
DEBUG - 2015-08-03 14:37:54.453; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.453; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing 
mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; 
loaded class org.apache.solr.common.SolrInputField from 
WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing 
old field a
DEBUG - 2015-08-03 14:37:54.455; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field a
DEBUG - 2015-08-03 14:37:54.455; 
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; 
Appending field b
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language 
detected de with certainty 0.980402022373
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing 
mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; 
org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping 
field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; 

Re: Large number of collections in SolrCloud

2015-08-03 Thread Olivier
Hi,

Thanks a lot Erick and Shawn for your answers.
I am aware that it is a very particular issue with not a common use of
Solr. I just wondered if people had the similar business case. For
information we need a very important number of collections with the same
configuration cause of legally reasons. Indeed each collection represents
one of our customers and by contract we have to separate the data of each
of them.
If we had the choice, we just would have one collection with a field name
'Customers' and we would do filter queries on it but we can't !

Anyway thanks again for your answers. For now, we finally did not add the
different languages dictionaries per collection and it is fine for 1K+
customers with more resources added to the servers.

Best,

Olivier Tavard



2015-07-27 17:53 GMT+02:00 Shawn Heisey apa...@elyograg.org:

 On 7/27/2015 9:16 AM, Olivier wrote:
  I have a SolrCloud cluster with 3 nodes :  3 shards per node and
  replication factor at 3.
  The collections number is around 1000. All the collections use the same
  Zookeeper configuration.
  So when I create each collection, the ZK configuration is pulled from ZK
  and the configuration files are stored in the JVM.
  I thought that if the configuration was the same for each collection, the
  impact on the JVM would be insignifiant because the configuration should
 be
  loaded only once. But it is not the case, for each collection created,
 the
  JVM size increases because the configuration is loaded again, am I
 correct ?
 
  If I have a small configuration folder size, I have no problem because
 the
  folder size is less than 500 KB so if we count 1000 collections x 500 KB,
  the JVM impact is 500 MB.
  But we manage a lot of languages with some dictionaries so the
  configuration folder size is about 6 MB. The JVM impact is very important
  now because it can be more than 6 GB (1000 x 6 MB).
 
  So I would like to have the feeback of people who have a cluster with a
  large number of collections too. Do I have to change some settings to
  handle this case better ? What can I do to optimize this behaviour ?
  For now, we just increase the RAM size per node at 16 GB but we plan to
  increase the collections number.

 Severe issues were noticed when dealing with many collections, and this
 was with a simple config, and completely empty indexes.  A complex
 config and actual index data would make it run that much more slowly.

 https://issues.apache.org/jira/browse/SOLR-7191

 Memory usage for the config wasn't even considered when I was working on
 reporting that issue.

 SolrCloud is highly optimized to work well when there are a relatively
 small number of collections.  I think there is work that we can do which
 will optimize operations to the point where thousands of collections
 will work well, especially if they all share the same config/schema ...
 but this is likely to be a fair amount of work, which will only benefit
 a handful of users who are pushing the boundaries of what Solr can do.
 In the open source world, a problem like that doesn't normally receive a
 lot of developer attention, and we rely much more on help from the
 community, specifically from knowledgeable users who are having the
 problem and know enough to try and fix it.

 FYI -- 16GB of RAM per machine is quite small for Solr, particularly
 when pushing the envelope.  My Solr machines are maxed at 64GB, and I
 frequently wish I could install more.

 https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

 One possible solution for your dilemma is simply adding more machines
 and spreading your collections out so each machine's memory requirements
 go down.

 Thanks,
 Shawn




Why is /query needed for Json Facet?

2015-08-03 Thread William Bell
I tried using /select and this query does not work; I cannot understand why.

Passing Parameters via JSON

We can also pass normal request parameters in the JSON body within the
"params" block:
$ curl http://localhost:8983/solr/query?fl=title,author -d '
{
  "params": {
    "q": "title:hero",
    "rows": 1
  }
}
'

Which is equivalent to:
$ curl "http://localhost:8983/solr/query?fl=title,author&q=title:hero&rows=1"


-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring)
are PDF files and the like (aka semi-structured documents) you'll
need to have Tika parse out the data you need to index. And doing
that through posting or DIH puts all the analysis on the Solr servers,
which will work, but not optimally.

Here's something to get you started:

https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Mon, Aug 3, 2015 at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:
 Hi Alexandre,
 I have 40 million files stored in a file system; the filenames are saved
 like ARIA_SSN10_0007_LOCATION_129.pdf.
 1.) I have to split out all the underscore-separated values from each
 filename, and these values have to be indexed into Solr.
 2.) I do not need the file contents (text) to be indexed.

 You told me "The answer is Yes", but I didn't get in which way you said yes.

 Thanks




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erik Hatcher
Most definitely yes, given your criteria below.  If you don’t need the text
inside the files to be parsed and indexed, a simple file system crawler that
just gets the directory listings and posts the file names, split as you’d
like, to Solr would suffice, it sounds like.
—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com http://www.lucidworks.com/
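
A minimal sketch of such a crawler in SolrJ 5.x (the directory, collection URL and field names are assumptions): it walks a directory, splits each file name like ARIA_SSN10_0007_LOCATION_129.pdf on underscores, and indexes the parts without ever opening the files:

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FilenameIndexer {
  public static void main(String[] args) throws Exception {
    SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
    List<SolrInputDocument> batch = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("/data/pdfs"))) {
      for (Path p : stream) {
        String name = p.getFileName().toString().replaceFirst("\\.pdf$", "");
        String[] parts = name.split("_");       // e.g. ARIA, SSN10, 0007, LOCATION, 129
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", name);               // the file name doubles as the unique key
        for (int i = 0; i < parts.length; i++) {
          doc.addField("part_" + i + "_s", parts[i]);
        }
        batch.add(doc);
        if (batch.size() == 1000) {             // send in batches rather than one at a time
          client.add(batch);
          batch.clear();
        }
      }
    }
    if (!batch.isEmpty()) client.add(batch);
    client.commit();
    client.close();
  }
}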




 On Aug 3, 2015, at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:
 
 Hi Alexandre,
 I have a 40 millions of files which is stored in a file systems,
 the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf
 1.)I have to split all underscore value from a filename and these value have
 to be index to the solr.
 2.)Do Not need file contains(Text) to index.
 
 You Told me The answer is Yes i didn't get in which way you said Yes.
 
 Thanks
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Erick Erickson
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the
files, just the filenames

Erick

On Mon, Aug 3, 2015 at 2:22 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 Most definitely yes given your criteria below.  If you don’t care for the 
 text to be parsed and indexed within the files, a simple file system crawler 
 that just got the directory listings and posted the file names split as you’d 
 like to Solr would suffice it sounds like.
 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com




 On Aug 3, 2015, at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:

 Hi Alexandre,
 I have a 40 millions of files which is stored in a file systems,
 the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf
 1.)I have to split all underscore value from a filename and these value have
 to be index to the solr.
 2.)Do Not need file contains(Text) to index.

 You Told me The answer is Yes i didn't get in which way you said Yes.

 Thanks




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Just to reconfirm, are you indexing file content? Because if you are,
you need to be aware that most PDFs do not extract well, as they do
not have the text flow preserved.

If you are indexing PDF files, I would run a sample through Tika
directly (that's what Solr uses under the covers anyway) and see what
the output looks like.

Apart from that, either SolrJ or DIH would work. If this is for a
production system, I'd use SolrJ with client-side Tika parsing. But
you could use DIH for a quick test run.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 13:56, Mugeesh Husain muge...@gmail.com wrote:
 Hi Alexandre,
 I have a 40 millions of files which is stored in a file systems,
 the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf
 1.)I have to split all underscore value from a filename and these value have
 to be index to the solr.
 2.)Do Not need file contains(Text) to index.

 You Told me The answer is Yes i didn't get in which way you said Yes.

 Thanks




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: HTTP Error 500 on /admin/ping request

2015-08-03 Thread Steven White
I found the issue.  With GET, the legacy code I'm calling into was written
like so:

clientResponse =
resource.contentType("application/atom+xml").accept("application/atom+xml").get();

This is a bug, and should have been:

clientResponse = resource.accept("application/atom+xml").get();

Googling the issue helped me narrow it down.  It looks like others ran
into it when moving from Solr 5.0 to 5.1 [1] [2].

Steve

[1]
http://lucene.472066.n3.nabble.com/Bad-contentType-for-search-handler-text-xml-charset-UTF-8-td4200314.html
[2] https://github.com/solariumphp/solarium/issues/326


On Mon, Aug 3, 2015 at 2:16 PM, Steven White swhite4...@gmail.com wrote:

 Yes, my application is in Java, no I cannot switch to SolrJ because I'm
 working off legacy code for which I don't have the luxury to refactor..

 If my application is sending the wrong Content-Type HTTP header, which
 part is it and why the same header is working for the other query paths
 such as: /solr/db/config/requestHandler?wt=xml or
 /solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml ?

 Steve

 On Mon, Aug 3, 2015 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 8/3/2015 11:34 AM, Steven White wrote:
  Hi Everyone,
 
  I cannot figure out why I'm getting HTTP Error 500 off the following
 code:

 snip

  Ping query caused exception: Bad contentType for search handler
  :application/atom+xml

 Your application is sending an incorrect Content-Type HTTP header that
 Solr doesn't know how to handle.

 If your application is Java, why are you not using SolrJ?  You'll likely
 find that to be a lot easier to use than even a REST client.

 Thanks,
 Shawn





Re: Why is /query needed for Json Facet?

2015-08-03 Thread William Bell
OK, I figured it out. The documentation is not updated. The default
components are as follows:

FacetModule.COMPONENT_NAME = "facet_module"

Thus, the following is the default component list with the new facet_module.

We need someone to update the solrconfig.xml and the docs.

<arr name="components">
  <str>query</str>
  <str>facet</str>
  <str>facet_module</str>
  <str>mlt</str>
  <str>highlight</str>
  <str>stats</str>
  <str>debug</str>
  <str>expand</str>
</arr>

 protected List<String> getDefaultComponents()
 {
   ArrayList<String> names = new ArrayList<>(6);
   names.add( QueryComponent.COMPONENT_NAME );
   names.add( FacetComponent.COMPONENT_NAME );
   names.add( FacetModule.COMPONENT_NAME );
   names.add( MoreLikeThisComponent.COMPONENT_NAME );
   names.add( HighlightComponent.COMPONENT_NAME );
   names.add( StatsComponent.COMPONENT_NAME );
   names.add( DebugComponent.COMPONENT_NAME );
   names.add( ExpandComponent.COMPONENT_NAME );
   return names;
 }



On Mon, Aug 3, 2015 at 11:31 AM, William Bell billnb...@gmail.com wrote:

 I tried using /select and this query does not work? Cannot understand why.
 Passing Parameters via JSON

 We can also pass normal request parameters in the JSON body within the
 params block:
 $ curl "http://localhost:8983/solr/query?fl=title,author" -d '
 {
   params: {
     q: "title:hero",
     rows: 1
   }
 }
 '

 Which is equivalent to:
 $ curl "http://localhost:8983/solr/query?fl=title,author&q=title:hero&rows=1"


 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Erik Hatcher You mean I have to use SolrJ for the indexing, right?

Can SolrJ handle the large amount of data I mentioned in my previous post?
If I use DIH, how will I split the values from the filename, etc.?

I want to start my development in the right direction; that is why I am a little
confused about which way to start on my
requirement.

Please tell me which one you are saying yes to (is the yes for SolrJ, or for DIH?)



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220550.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Large number of collections in SolrCloud

2015-08-03 Thread Erick Erickson
Hmmm, one thing that will certainly help is the new per-collection
state.json that will replace clusterstate.json. That'll reduce a lot
of chatter.

You might also get a lot of mileage out of breaking the collections
into sub-groups that are distinct thus reducing the number of
collections on each shard.

This is totally off the wall, as in I haven't thought about it
much... But what about implicit routing? That is, you take control
of what shard documents land on and specifically route the docs there.
Then, instead of one _collection_ per client you might have one
_shard_ per client. Not sure if that meets your legal requirements
either though. And, essentially since each shard is a core it might
have the exact same issues you have now with bringing up lots and lots
and lots of cores...
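
As a rough sketch of the one-shard-per-client idea (collection, shard and field
names below are invented; the create-collection parameters are the standard
Collections API ones, so double-check them against your version):

// Create the collection once with the implicit router, e.g. via:
//   /admin/collections?action=CREATE&name=clients&router.name=implicit
//     &shards=clientA,clientB&router.field=client_s&collection.configName=shared_conf
// Then documents are routed by the value of the router.field:
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ImplicitRoutingSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("clients");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("client_s", "clientA");   // lands on the shard named clientA
    client.add(doc);
    client.commit();
    client.close();
  }
}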

Speaking of which, if you're not sharding then the "Lots of Cores"
option might make sense, see:
http://wiki.apache.org/solr/LotsOfCores

But do note that this is specifically _not_ supported in SolrCloud mode.

Best,
Erick

On Mon, Aug 3, 2015 at 11:06 AM, Olivier olivau...@gmail.com wrote:
 Hi,

 Thanks a lot Erick and Shawn for your answers.
 I am aware that it is a very particular issue with not a common use of
 Solr. I just wondered if people had the similar business case. For
 information we need a very important number of collections with the same
 configuration cause of legally reasons. Indeed each collection represents
 one of our customers and by contract we have to separate the data of each
 of them.
 If we had the choice, we just would have one collection with a field name
 'Customers' and we would do filter queries on it but we can't !

 Anyway thanks again for your answers. For now, we finally did not add the
 different languages dictionaries per collection and it is fine for 1K+
 customers with more resources added to the servers.

 Best,

 Olivier Tavard



 2015-07-27 17:53 GMT+02:00 Shawn Heisey apa...@elyograg.org:

 On 7/27/2015 9:16 AM, Olivier wrote:
  I have a SolrCloud cluster with 3 nodes :  3 shards per node and
  replication factor at 3.
  The collections number is around 1000. All the collections use the same
  Zookeeper configuration.
  So when I create each collection, the ZK configuration is pulled from ZK
  and the configuration files are stored in the JVM.
  I thought that if the configuration was the same for each collection, the
  impact on the JVM would be insignifiant because the configuration should
 be
  loaded only once. But it is not the case, for each collection created,
 the
  JVM size increases because the configuration is loaded again, am I
 correct ?
 
  If I have a small configuration folder size, I have no problem because
 the
  folder size is less than 500 KB so if we count 1000 collections x 500 KB,
  the JVM impact is 500 MB.
  But we manage a lot of languages with some dictionaries so the
  configuration folder size is about 6 MB. The JVM impact is very important
  now because it can be more than 6 GB (1000 x 6 MB).
 
  So I would like to have the feeback of people who have a cluster with a
  large number of collections too. Do I have to change some settings to
  handle this case better ? What can I do to optimize this behaviour ?
  For now, we just increase the RAM size per node at 16 GB but we plan to
  increase the collections number.

 Severe issues were noticed when dealing with many collections, and this
 was with a simple config, and completely empty indexes.  A complex
 config and actual index data would make it run that much more slowly.

 https://issues.apache.org/jira/browse/SOLR-7191

 Memory usage for the config wasn't even considered when I was working on
 reporting that issue.

 SolrCloud is highly optimized to work well when there are a relatively
 small number of collections.  I think there is work that we can do which
 will optimize operations to the point where thousands of collections
 will work well, especially if they all share the same config/schema ...
 but this is likely to be a fair amount of work, which will only benefit
 a handful of users who are pushing the boundaries of what Solr can do.
 In the open source world, a problem like that doesn't normally receive a
 lot of developer attention, and we rely much more on help from the
 community, specifically from knowledgeable users who are having the
 problem and know enough to try and fix it.

 FYI -- 16GB of RAM per machine is quite small for Solr, particularly
 when pushing the envelope.  My Solr machines are maxed at 64GB, and I
 frequently wish I could install more.

 https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

 One possible solution for your dilemma is simply adding more machines
 and spreading your collections out so each machine's memory requirements
 go down.

 Thanks,
 Shawn




Re: HTTP Error 500 on /admin/ping request

2015-08-03 Thread Shawn Heisey
On 8/3/2015 11:34 AM, Steven White wrote:
 Hi Everyone,

 I cannot figure out why I'm getting HTTP Error 500 off the following code:

snip

 Ping query caused exception: Bad contentType for search handler
 :application/atom+xml

Your application is sending an incorrect Content-Type HTTP header that
Solr doesn't know how to handle.

If your application is Java, why are you not using SolrJ?  You'll likely
find that to be a lot easier to use than even a REST client.
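
 For reference, the whole ping check is only a couple of lines in SolrJ (URL and
 core name below are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class PingCheck {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/db");
    SolrPingResponse rsp = client.ping();   // goes through the /admin/ping handler
    System.out.println("ping status: " + rsp.getStatus() + ", qtime: " + rsp.getQTime() + " ms");
    client.close();
  }
}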

Thanks,
Shawn



Re: How to use BitDocSet within a PostFilter

2015-08-03 Thread Stephen Weiss
Yes that was it.  Had no idea this was an issue!
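
For anyone who hits the same thing: the fix amounts to adding the per-segment
docBase before probing the top-level bitset. A minimal sketch inside the custom
DelegatingCollector (not the actual code; inStockSkusBitSet is the FixedBitSet
from the earlier snippets):

private int docBase;

@Override
protected void doSetNextReader(org.apache.lucene.index.LeafReaderContext context)
    throws java.io.IOException {
  super.doSetNextReader(context);   // keep the delegate's leaf collector in sync
  docBase = context.docBase;        // offset of this segment in the top-level id space
}

@Override
public void collect(int localDoc) throws java.io.IOException {
  // The BitDocSet returned by SolrIndexSearcher.getDocSet() is keyed by
  // top-level doc ids, so add the segment offset before testing the bit.
  if (!inStockSkusBitSet.get(docBase + localDoc)) {
    return;   // grandchild is not in the in-stock set, skip it
  }
  super.collect(localDoc);
}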

On Monday, August 3, 2015, Roman Chyla 
roman.ch...@gmail.com wrote:
Hi,
inStockSkusBitSet.get(currentChildDocNumber)

Is that child a lucene id? If yes, does it include offset? Every index
segment starts at a different point, but docs are numbered from zero. So to
check them against the full index bitset, I'd be doing
Bitset.exists(indexBase + docid)

Just one thing to check

Roman
On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com
wrote:

 Hi everyone,

 I'm trying to write a PostFilter for Solr 5.1.0, which is meant to crawl
 through grandchild documents during a search through the parents and filter
 out documents based on statistics gathered from aggregating the
 grandchildren together.  I've been successful in getting the logic correct,
 but it does not perform so well - I'm grabbing too many documents from the
 index along the way.  I'm trying to filter out grandchild documents which
 are not relevant to the statistics I'm collecting, in order to reduce the
 number of document objects pulled from the IndexReader.

 I've implemented the following code in my DelegatingCollector.collect:

 if (inStockSkusBitSet == null) {
 SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // type cast from
 IndexSearcher to expose getDocSet.
 inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
 inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // type cast from
 DocSet to expose getBits.
 inStockSkusBitSet = inStockSkusBitDocSet.getBits();
 }


 My BitDocSet reports a size which matches a standard query for the more
 limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) also
 reports this same cardinality.  Based on that fact, it seems that the
 getDocSet call itself must be working properly, and returning the right
 number of documents.  However, when I try to filter out grandchild
 documents using either BitDocSet.exists or BitSet.get (passing over any
 grandchild document which doesn't exist in the bitdocset or return true
 from the bitset), I get about 1/3 less results than I'm supposed to.   It
 seems many documents that should match the filter, are being excluded, and
 documents which should not match the filter, are being included.

 I'm trying to use it either of these ways:

 if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
 if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

 The currentChildDocNumber is simply the docNumber which is passed to
 DelegatingCollector.collect, decremented until I hit a document that
 doesn't belong to the parent document.

 I can't seem to figure out a way to actually use the BitDocSet (or its
 derivatives) to quickly eliminate document IDs.  It seems like this is how
 it's supposed to be used.  What am I getting wrong?

 Sorry if this is a newbie question, I've never written a PostFilter
 before, and frankly, the documentation out there is a little sketchy
 (mostly for version 4) - so many classes have changed names and so many of
 the more well-documented techniques are deprecated or removed now, it's
 tough to follow what the current best practice actually is.  I'm using the
 block join functionality heavily so I'm trying to keep more current than
 that.  I would be happy to send along the full source privately if it would
 help figure this out, and plan to write up some more elaborate instructions
 (updated for Solr 5) for the next person who decides to write a PostFilter
 and work with block joins, if I ever manage to get this performing well
 enough.

 Thanks for any pointers!  Totally open to doing this an entirely different
 way.  I read DocValues might be a more elegant approach but currently that
 would require reindexing, so trying to avoid that.

 Also, I've been wondering if the query above would read from the filter
 cache or not.  The query is constructed like this:


 private Term inStockTrueTerm = new Term(sku_history.is_in_stock,
 T);
 private Term objectTypeSkuHistoryTerm = new Term(object_type,
 sku_history);
 ...

 inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
 objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
 inStockSkusQuery = new BooleanQuery();
 inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
 inStockSkusQuery.add(objectTypeSkuHistoryTermQuery,
 BooleanClause.Occur.MUST);
 --
 Steve

 


posting html files

2015-08-03 Thread Huiying Ma
Hi everyone,

I created a core with the basic config sets and schema, when I use bin/post
to post one html file, I got the error:

SimplePostTool: WARNING: IOException while reading response:
java.io.FileNotFoundException..
HTTP ERROR 404

when I go to localhost:8983/solr/core/update, I got:
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">3</int>
  </lst>
  <lst name="error">
    <str name="msg">missing content stream</str>
    <int name="code">400</int>
  </lst>
</response>

I'm really new to solr and wondering if anyone know how to index html files
according to my own schema and how to configure the schema.xml or
solrconfig file. Thank you so much!

Thanks,
Huiying


HTTP Error 500 on /admin/ping request

2015-08-03 Thread Steven White
Hi Everyone,

I cannot figure out why I'm getting HTTP Error 500 off the following code:

// Using: org.apache.wink.client
String contentType = "application/atom+xml";
URI uri = new URI("http://localhost:8983" +
"/solr/db/admin/ping?wt=xml");
Resource resource = client.resource(uri.toURL().toString());

ClientResponse clientResponse = null;

clientResponse =
resource.contentType(contentType).accept(contentType).get();

clientResponse.getStatusCode();  // Gives back: 500

Here is the call stack I get back from the call (it's also the same in
solr.log):

ERROR - 2015-08-03 17:30:29.457; [   db]
org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
Bad contentType for search handler :application/atom+xml
request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at
org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:74)
at
org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:167)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:140)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at
org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:254)
at
org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)

INFO  - 2015-08-03 17:30:29.459; [   db] org.apache.solr.core.SolrCore;
[db] webapp=/solr path=/admin/ping params={wt=xml} status=400 QTime=6
ERROR - 2015-08-03 17:30:29.459; [   db]
org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
Ping query caused exception: Bad contentType for search handler
:application/atom+xml
request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at
org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:263)
at
org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at java.lang.Thread.run(Thread.java:853)
Caused by: org.apache.solr.common.SolrException: Bad contentType for search
handler :application/atom+xml
request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at
org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:74)
at
org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:167)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:140)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at
org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:254)
... 27 more
INFO  - 2015-08-03 17:30:29.461; [   db] org.apache.solr.core.SolrCore;
[db] webapp=/solr path=/admin/ping params={wt=xml} status=500 QTime=8
ERROR - 2015-08-03 17:30:29.462; [   db]
org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException: Ping query caused exception: Bad
contentType for search handler :application/atom+xml
request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at
org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:263)
at
org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)

If I use a browser plug-in Rest Client it works just fine.

My Java code works with other paths such as
/solr/db/config/requestHandler?wt=xml or
/solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml.

Yes, I did try other content types, the outcome is the same error.

I'm using the default ping handler:

  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">solrpingquery</str>
    </lst>
    <lst name="defaults">
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>

Any clues / pointers why /admin/ping doesn't work but other query paths
do?

Thanks

Steve


Re: posting html files

2015-08-03 Thread Huiying Ma
Thanks Erik,

I'm trying to index some html files in the same format and I need to index
them according to classes and tags. I've tried data_driven_schema_configs
but I can only get the title and id but not other tags and classes I
wanted. So now I want to edit the schema in the basic_configs but turned
out that error. So do you have any good idea for me? Also, I also tried to
use bin/post to post an xml file to that same core and it worked so I'm
wondering why the html file won't work. Thank you so much!! Since I don't
know much about solr, it's really good that some one can help!

Best,
Huiying

On Mon, Aug 3, 2015 at 1:54 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 My hunch is that the basic_configs is *too* basic for your needs here.
 basic_configs does not include /update/extract - it’s very basic - stripped
 of all the “extra” components.

 Try using the default, data_driven_schema_configs instead.

 If you’re still having issues, please provide full details of what you’ve
 tried.

 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com




  On Aug 3, 2015, at 1:43 PM, Huiying Ma mahuiying...@gmail.com wrote:
 
  Hi everyone,
 
  I created a core with the basic config sets and schema, when I use
 bin/post
  to post one html file, I got the error:
 
  SimplePostTool: WARNING: IOException while reading response:
  java.io.FileNotFoundException..
  HTTP ERROR 404
 
  when I go to localhost:8983/solr/core/update, I got:
  <response>
  <lst name="responseHeader">
  <int name="status">400</int>
  <int name="QTime">3</int>
  </lst>
  <lst name="error">
  <str name="msg">missing content stream</str>
  <int name="code">400</int>
  </lst>
  </response>
 
  I'm really new to solr and wondering if anyone know how to index html
 files
  according to my own schema and how to configure the schema.xml or
  solrconfig file. Thank you so much!
 
  Thanks,
  Huiying




Re: HTTP Error 500 on /admin/ping request

2015-08-03 Thread Steven White
Yes, my application is in Java, no I cannot switch to SolrJ because I'm
working off legacy code for which I don't have the luxury to refactor..

If my application is sending the wrong Content-Type HTTP header, which part
is it and why the same header is working for the other query paths
such as: /solr/db/config/requestHandler?wt=xml
or /solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml
?

Steve

On Mon, Aug 3, 2015 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 8/3/2015 11:34 AM, Steven White wrote:
  Hi Everyone,
 
  I cannot figure out why I'm getting HTTP Error 500 off the following
 code:

 snip

  Ping query caused exception: Bad contentType for search handler
  :application/atom+xml

 Your application is sending an incorrect Content-Type HTTP header that
 Solr doesn't know how to handle.

 If your application is Java, why are you not using SolrJ?  You'll likely
 find that to be a lot easier to use than even a REST client.

 Thanks,
 Shawn




Re: posting html files

2015-08-03 Thread Erik Hatcher
My recommendation, start with the default configset 
(data_driven_schema_configs) like this:

  # grab an HTML page to use
  curl http://lucene.apache.org/solr/index.html > index.html

  bin/solr start
  bin/solr create -c html_test
  bin/post -c html_test index.html

$ curl "http://localhost:8983/solr/html_test/select?q=*:*&wt=csv"
stream_size,stream_content_type,keywords,x_parsed_by,content_encoding,distribution,title,content_type,viewport,_version_,dc_title,id,resourcename,robots
23049,text/html,apache\, apache lucene\, apache solr\, solr\, lucene   
  search\, information retrieval\, spell checking\, faceting\, inverted index\, 
open 
source,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.html.HtmlParser,UTF-8,Global,Apache
 Solr -,text/html; charset=UTF-8,minimal-ui\, initial-scale=1\, 
maximum-scale=1\, user-scalable=0,1508508085335883776,Apache Solr 
-,/Users/erikhatcher/dev/trunk/solr/index.html,/Users/erikhatcher/dev/trunk/solr/index.html,index\,follow”

If you’d like to enhance the extraction for specific xpaths, see 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika#UploadingDatawithSolrCellusingApacheTika-InputParameters
 - you can set these parameters on the upload, using -params (see the 
“Capturing and Mapping” example with -params on the bin/post) or by adjusting 
the settings of /update/extract in solrconfig.xml.
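
The same extraction parameters can also be set from SolrJ if that's more convenient;
a rough sketch (the capture/fmap values and the id are made up, see the Solr Cell
page linked above for what the parameters mean):

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class HtmlExtractSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/html_test");

    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
    req.addFile(new File("index.html"), "text/html");
    req.setParam("literal.id", "index.html");   // unique key for this document
    req.setParam("capture", "div");             // capture div content separately
    req.setParam("fmap.div", "div_txt");        // map the captured divs to a field
    req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

    client.request(req);
    client.close();
  }
}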


—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




 On Aug 3, 2015, at 2:00 PM, Huiying Ma mahuiying...@gmail.com wrote:
 
 Thanks Erik,
 
 I'm trying to index some html files in the same format and I need to index
 them according to classes and tags. I've tried data_driven_schema_configs
 but I can only get the title and id but not other tags and classes I
 wanted. So now I want to edit the schema in the basic_configs but turned
 out that error. So do you have any good idea for me? Also, I also tried to
 use bin/post to post an xml file to that same core and it worked so I'm
 wondering why the html file won't work. Thank you so much!! Since I don't
 know much about solr, it's really good that some one can help!
 
 Best,
 Huiying
 
 On Mon, Aug 3, 2015 at 1:54 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
 
 My hunch is that the basic_configs is *too* basic for your needs here.
 basic_configs does not include /update/extract - it’s very basic - stripped
 of all the “extra” components.
 
 Try using the default, data_driven_schema_configs instead.
 
 If you’re still having issues, please provide full details of what you’ve
 tried.
 
 —
 Erik Hatcher, Senior Solutions Architect
 http://www.lucidworks.com
 
 
 
 
 On Aug 3, 2015, at 1:43 PM, Huiying Ma mahuiying...@gmail.com wrote:
 
 Hi everyone,
 
 I created a core with the basic config sets and schema, when I use
 bin/post
 to post one html file, I got the error:
 
 SimplePostTool: WARNING: IOException while reading response:
 java.io.FileNotFoundException..
 HTTP ERROR 404
 
 when I go to localhost:8983/solr/core/update, I got:
 <response>
 <lst name="responseHeader">
 <int name="status">400</int>
 <int name="QTime">3</int>
 </lst>
 <lst name="error">
 <str name="msg">missing content stream</str>
 <int name="code">400</int>
 </lst>
 </response>
 
 I'm really new to solr and wondering if anyone know how to index html
 files
 according to my own schema and how to configure the schema.xml or
 solrconfig file. Thank you so much!
 
 Thanks,
 Huiying
 
 



Re: posting html files

2015-08-03 Thread Erik Hatcher
My hunch is that the basic_configs is *too* basic for your needs here.  
basic_configs does not include /update/extract - it’s very basic - stripped of 
all the “extra” components.

Try using the default, data_driven_schema_configs instead.

If you’re still having issues, please provide full details of what you’ve tried.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




 On Aug 3, 2015, at 1:43 PM, Huiying Ma mahuiying...@gmail.com wrote:
 
 Hi everyone,
 
 I created a core with the basic config sets and schema, when I use bin/post
 to post one html file, I got the error:
 
 SimplePostTool: WARNING: IOException while reading response:
 java.io.FileNotFoundException..
 HTTP ERROR 404
 
 when I go to localhost:8983/solr/core/update, I got:
 <response>
 <lst name="responseHeader">
 <int name="status">400</int>
 <int name="QTime">3</int>
 </lst>
 <lst name="error">
 <str name="msg">missing content stream</str>
 <int name="code">400</int>
 </lst>
 </response>
 
 I'm really new to solr and wondering if anyone know how to index html files
 according to my own schema and how to configure the schema.xml or
 solrconfig file. Thank you so much!
 
 Thanks,
 Huiying



Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
Hi Alexandre,
I have 40 million files stored in a file system;
the filenames are saved as ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split out the underscore-separated values from each filename, and these values have
to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.

You told me the answer is Yes; I didn't get in which way you said Yes.

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html
Sent from the Solr - User mailing list archive at Nabble.com.


reload collections timeout

2015-08-03 Thread olivier

Hi everybody,

I have about 1300 collections, 3 shards, replicationfactor = 3, 
MaxShardPerNode=3.
I have 3 boxes of 64G (32 JVM).

When I want to reload all my collections I get a timeout error.
Is there a way to run the reload asynchronously, as with collection creation
(async=requestid)?
I saw on this issue that it was implemented, but it did not seem to work.

https://issues.apache.org/jira/browse/SOLR-5477

How do I use the async mode to reload collections?

thanks a lot

Olivier Damiot



[JOB] Financial search engine company AlphaSense is looking for Search Engineers

2015-08-03 Thread Dmitry Kan
Hi fellow Solr devs / users,

I decided to resend the info on the opening assuming most of you could have
been on vacation in July. I don't intend to send it any longer :)




Company: AlphaSense https://www.alpha-sense.com/
Position: Search Engineer

AlphaSense is a one-stop financial search engine for financial research
analysts all around the world.

AlphaSense is looking for Search Engineers experienced with Lucene / Solr
and search architectures in general. Positions are open in Helsinki (
http://www.visitfinland.com/helsinki/).

Daily routine topics for our search team:

1. Sharding
2. Commit vs query performance
3. Performance benchmarking
4. Custom query syntax, lucene / solr grammars
5. Relevancy
6. Query optimization
7. Search system monitoring: cache, RAM, throughput etc
8. Automatic deployment
9. Internal tool development

We have evolved the system through a series of Solr releases starting from
1.4 to 4.10, pushing forward all our solr-level customizations.

Requirements:

1. Core Java + web services
2. Understanding of distributed search engine architecture
3. Java concurrency
4. Understanding of performance issues and approaches to tackle them
5. Clean and beautiful code + design patterns

Our search team members are active in the open source search scene, in
particular we support and develop luke toolbox:
https://github.com/dmitrykey/luke, participate in search / OS conferences
(Lucene Revolution, ApacheCon, Berlin buzzwords), review books on Solr.

Send your CV over and let's have a chat. Please e-mail me, if you have any
questions.

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Duplicate Documents

2015-08-03 Thread Tarala, Magesh
I'm using solr 4.10.2. I'm using id field as the unique key - it is passed in 
with the document when ingesting the documents into solr. When querying I get 
duplicate documents with different _version_. Out of approx. 25K unique 
documents ingested into solr, I see approx. 300 duplicates.

It is a 3 node solr cloud with one shard and 2 replicas.
I'm also using nested documents.

Thanks in advance for any insights.

--Magesh



Re: solr multicore vs sharding vs 1 big collection

2015-08-03 Thread Bill Bell
Yeah a separate by month or year is good and can really help in this case.

Bill Bell
Sent from mobile


 On Aug 2, 2015, at 5:29 PM, Jay Potharaju jspothar...@gmail.com wrote:
 
 Shawn,
 Thanks for the feedback. I agree that increasing timeout might alleviate
 the timeout issue. The main problem with increasing timeout is the
 detrimental effect it will have on the user experience, therefore can't
 increase it.
 I have looked at the queries that threw errors, next time I try it
 everything seems to work fine. Not sure how to reproduce the error.
 My concern with increasing the memory to 32GB is what happens when the
 index size grows over the next few months.
 One of the other solutions I have been thinking about is to rebuild
 index(weekly) and create a new collection and use it. Are there any good
 references for doing that?
 Thanks
 Jay
 
 On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 8/2/2015 8:29 AM, Jay Potharaju wrote:
 The document contains around 30 fields and have stored set to true for
 almost 15 of them. And these stored fields are queried and updated all
 the
 time. You will notice that the deleted documents is almost 30% of the
 docs.  And it has stayed around that percent and has not come down.
 I did try optimize but that was disruptive as it caused search errors.
 I have been playing with merge factor to see if that helps with deleted
 documents or not. It is currently set to 5.
 
 The server has 24 GB of memory out of which memory consumption is around
 23
 GB normally and the jvm is set to 6 GB. And have noticed that the
 available
 memory on the server goes to 100 MB at times during a day.
 All the updates are run through DIH.
 
 Using all availble memory is completely normal operation for ANY
 operating system.  If you hold up Windows as an example of one that
 doesn't ... it lies to you about available memory.  All modern
 operating systems will utilize memory that is not explicitly allocated
 for the OS disk cache.
 
 The disk cache will instantly give up any of the memory it is using for
 programs that request it.  Linux doesn't try to hide the disk cache from
 you, but older versions of Windows do.  In the newer versions of Windows
 that have the Resource Monitor, you can go there to see the actual
 memory usage including the cache.
 
 Every day at least once i see the following error, which result in search
 errors on the front end of the site.
 
 ERROR org.apache.solr.servlet.SolrDispatchFilter -
 null:org.eclipse.jetty.io.EofException
 
 From what I have read these are mainly due to timeout and my timeout is
 set
 to 30 seconds and cant set it to a higher number. I was thinking maybe
 due
 to high memory usage, sometimes it leads to bad performance/errors.
 
 Although this error can be caused by timeouts, it has a specific
 meaning.  It means that the client disconnected before Solr responded to
 the request, so when Solr tried to respond (through jetty), it found a
 closed TCP connection.
 
 Client timeouts need to either be completely removed, or set to a value
 much longer than any request will take.  Five minutes is a good starting
 value.
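
For illustration, generous client-side timeouts with SolrJ look roughly like this
(the URL is a placeholder; in 4.x the class is HttpSolrServer with the same setters):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class LenientTimeouts {
  public static void main(String[] args) {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
    client.setConnectionTimeout(5000);   // 5 seconds to establish the TCP connection
    client.setSoTimeout(300000);         // 5 minutes of socket read inactivity
  }
}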
 
 If all your client timeout is set to 30 seconds and you are seeing
 EofExceptions, that means that your requests are taking longer than 30
 seconds, and you likely have some performance issues.  It's also
 possible that some of your client timeouts are set a lot shorter than 30
 seconds.
 
 My objective is to stop the errors, adding more memory to the server is
 not
 a good scaling strategy. That is why i was thinking maybe there is a
 issue
 with the way things are set up and need to be revisited.
 
 You're right that adding more memory to the servers is not a good
 scaling strategy for the general case ... but in this situation, I think
 it might be prudent.  For your index and heap sizes, I would want the
 company to pay for at least 32GB of RAM.
 
 Having said that ... I've seen Solr installs work well with a LOT less
 memory than the ideal.  I don't know that adding more memory is
 necessary, unless your system (CPU, storage, and memory speeds) is
 particularly slow.  Based on your document count and index size, your
 documents are quite small, so I think your memory size is probably good
 -- if the CPU, memory bus, and storage are very fast.  If one or more of
 those subsystems aren't fast, then make up the difference with lots of
 memory.
 
 Some light reading, where you will learn why I think 32GB is an ideal
 memory size for your system:
 
 https://wiki.apache.org/solr/SolrPerformanceProblems
 
 It is possible that your 6GB heap is not quite big enough for good
 performance, or that your GC is not well-tuned.  These topics are also
 discussed on that wiki page.  If you increase your heap size, then the
 likelihood of needing more memory in the system becomes greater, because
 there will be less memory available for the disk cache.
 
 Thanks,
 Shawn
 
 
 -- 
 Thanks
 Jay Potharaju


Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Is there are an easy way for a client to tell Solr to close or release the
IndexSearcher and/or IndexWriter for a core?

I have a use case where we're creating a lot of cores with not that many
documents per zone (a few hundred to maybe 10's of thousands).  Writes come
in batches, and reads also tend to be bursty, if less so than the writes.

And we're having problems with ram usage on the server.  Poking around a
heap dump, the problem is that every IndexSearcher or IndexWriter being
opened is taking up large amounts of memory.

I've looked at the unload call, and while it is unclear, it seems like it
deletes the data on disk as well.  I don't want to delete the data on disk,
I just want to unload the searcher and writer, and free up the memory.
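
To clarify the UNLOAD flags mentioned above (a sketch only, and whether unloading
actually helps with the memory pressure is a separate question): the delete options
are opt-in, so an UNLOAD that leaves them false should not remove anything from disk.
Core name and base URL are placeholders; please verify against your Solr version.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadCoreSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");

    CoreAdminRequest.Unload unload = new CoreAdminRequest.Unload(false); // false = keep the index
    unload.setCoreName("core_to_idle");
    unload.setDeleteDataDir(false);       // keep the data directory
    unload.setDeleteInstanceDir(false);   // keep the config on disk
    unload.process(client);

    client.close();
  }
}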

So I'm wondering if there is a call I can make when I know or suspect that
the core isn't going to be used in the near future to release these objects
and return the memory?  Or a configuration option I can set to do so after,
say, being idle for 5 seconds?  It's OK for there to be a performance hit
the first time I reopen the core.

Thanks,

Brian


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Alexandre Rafalovitch
Well,

If it is just file names, I'd probably use SolrJ client, maybe with
Java 8. Read file names, split the name into parts with regular
expressions, stuff parts into different field names and send to Solr.
Java 8 has FileSystem walkers, etc to make it easier.
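
A minimal sketch of that Java 8 + SolrJ route (untested against the actual data;
the collection URL, directory and field names are invented for illustration):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FilenameIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");

    try (Stream<Path> paths = Files.walk(Paths.get("/data/pdfs"))) {
      paths.filter(p -> p.toString().endsWith(".pdf")).forEach(p -> {
        // e.g. ARIA_SSN10_0007_LOCATION_129.pdf -> [ARIA, SSN10, 0007, LOCATION, 129]
        String name = p.getFileName().toString().replaceAll("\\.pdf$", "");
        String[] parts = name.split("_");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", p.toString());        // full path as the unique key
        doc.addField("filename_s", name);
        for (int i = 0; i < parts.length; i++) {
          doc.addField("part_" + i + "_s", parts[i]);
        }
        try {
          client.add(doc);
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      });
    }
    client.commit();
    client.close();
  }
}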

You could do it with DIH, but it would be with nested entities and the
inner entity would probably try to parse the file. So, a lot of wasted
effort if you just care about the file names.

Or, I would just do a directory listing in the operating system and
use regular expressions to split it into CSV file, which I would then
import into Solr directly.

In all of these cases, the question would be which field is the ID of
the record to ensure no duplicates.

Regards,
   Alex.


Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote:
 @Alexandre  No i dont need a content of a file. i am repeating my requirement

 I have a 40 millions of files which is stored in a file systems,
 the filename saved as ARIA_SSN10_0007_LOCATION_129.pdf

 I just  split all Value from a filename only,these values i have to index.

 I am interested to index value to solr not file contains.

 I have tested the DIH from a file system its work fine but i dont know how
 can i implement my code in DIH
 if my code get some value than how i can i index it using DIH.

 If i will use DIH then How i will make split operation and get value from
 it.





 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Apache Solr Handle TeraByte Large Data

2015-08-03 Thread Mugeesh Husain
@Alexandre  No, I don't need the content of the files. I am repeating my requirement:

I have 40 million files stored in a file system;
the filenames are saved as ARIA_SSN10_0007_LOCATION_129.pdf.

I just split the values out of the filename only; these values are what I have to index.

I am interested in indexing the values into Solr, not the file contents.

I have tested DIH against a file system and it works fine, but I don't know how
I can implement my code in DIH:
if my code extracts some values, how can I index them using DIH?

If I use DIH, how will I do the split operation and get the values from
it?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
Sent from the Solr - User mailing list archive at Nabble.com.


Collapsing Query Parser returns one record per shard...was not expecting this...

2015-08-03 Thread Peter Lee
From my reading of the solr docs (e.g. 
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results 
and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've 
been under the impression that these two methods (result grouping and 
collapsing query parser) can both be used to eliminate duplicates from a 
result set (in our case, we have a duplication field that contains a 
'signature' that identifies duplicates. We use our own signature for a variety 
of reasons that are tied to complex business requirements.).

In a test environment I scattered 15 duplicate records (with another 10 unique 
records) across a test system running Solr Cloud (Solr version 5.2.1) that had 
4 shards and a replication factor of 2. I tried both result grouping and the 
collapsing query parser to remove duplicates. The result grouping worked as 
expected...the collapsing query parser did not.

My results in using the collapsing query parser showed that Solr was in fact 
including into the result set one of the duplicate records from each shard 
(that is, I received FOUR duplicate records...and turning on debug showed that 
each of the four records came from a  unique shard)...when I was expecting solr 
to do the collapsing on the aggregated result and return only ONE of the 
duplicated records across ALL shards. It appears that solr is performing the 
collapsing query parsing on each individual shard, but then NOT performing the 
operation on the aggregated results from each shard.

I have searched through the forums and checked the documentation as carefully 
as I can. I find no documentation or mention of this effect (one record being 
returned per shard) when using collapsing query parsing.

Is this a known behavior? Am I just doing something wrong? Am I missing some 
search parameter? Am I simply not understanding correctly how this is supposed 
to work?

For reference, I am including below the search url and the response I received. 
Any insights would be appreciated.

Query: 
http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value and that I have added 
comments in the output ***comment*** pointing out which shard responses came 
from):

{
  responseHeader:{
status:0,
QTime:31,
params:{
  debugQuery:true,
  indent:true,
  q:*:*,
  wt:json,
  fq:{!collapse field=dupid_s},
  rows:1000}},
  response:{numFound:14,start:0,maxScore:1.0,docs:[
  {
storeid_s:1002,
dupid_s:900, ***AcaColl_shard2_replica2***
title_pqth:[Dupe Record #2],
_version_:1508241005512491008,
indexTime_dt:2015-07-31T19:25:09.914Z},
  {
storeid_s:8020,
dupid_s:2005,
title_pqth:[Unique Record #5],
_version_:1508241005539753984,
indexTime_dt:2015-07-31T19:25:09.94Z},
  {
storeid_s:8023,
dupid_s:2008,
title_pqth:[Unique Record #8],
_version_:1508241005540802560,
indexTime_dt:2015-07-31T19:25:09.94Z},
  {
storeid_s:8024,
dupid_s:2009,
title_pqth:[Unique Record #9],
_version_:1508241005541851136,
indexTime_dt:2015-07-31T19:25:09.94Z},
  {
storeid_s:1007,
dupid_s:900, ***AcaColl_shard4_replica2***
title_pqth:[Dupe Record #7],
_version_:1508241005515636736,
indexTime_dt:2015-07-31T19:25:09.91Z},
  {
storeid_s:8016,
dupid_s:2001,
title_pqth:[Unique Record #1],
_version_:1508241005526122496,
indexTime_dt:2015-07-31T19:25:09.91Z},
  {
storeid_s:8019,
dupid_s:2004,
title_pqth:[Unique Record #4],
_version_:1508241005528219648,
indexTime_dt:2015-07-31T19:25:09.91Z},
  {
storeid_s:1003,
dupid_s:900, ***AcaColl_shard1_replica1***
title_pqth:[Dupe Record #3],
_version_:1508241005515636736,
indexTime_dt:2015-07-31T19:25:09.917Z},
  {
storeid_s:8017,
dupid_s:2002,
title_pqth:[Unique Record #2],
_version_:1508241005518782464,
indexTime_dt:2015-07-31T19:25:09.917Z},
  {
storeid_s:8018,
dupid_s:2003,
title_pqth:[Unique Record #3],
_version_:1508241005519831040,
indexTime_dt:2015-07-31T19:25:09.917Z},
  {
storeid_s:1001,
dupid_s:900, ***AcaColl_shard3_replica1***
title_pqth:[Dupe Record #1],
_version_:1508241005511442432,
indexTime_dt:2015-07-31T19:25:09.912Z},
  {
storeid_s:8021,
dupid_s:2006,
title_pqth:[Unique Record #6],
_version_:1508241005532413952,
indexTime_dt:2015-07-31T19:25:09.929Z},
  {
storeid_s:8022,