Re: How to use BitDocSet within a PostFilter
Hi,

inStockSkusBitSet.get(currentChildDocNumber): is that child doc number a Lucene id? If yes, does it include the segment offset? Every index segment starts at a different point, but docs within a segment are numbered from zero. So to check them against a bitset covering the full index, I'd be doing inStockSkusBitSet.get(docBase + docid). Just one thing to check.

Roman

On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com wrote:

Hi everyone,

I'm trying to write a PostFilter for Solr 5.1.0 that crawls through grandchild documents during a search over the parents and filters out parent documents based on statistics gathered by aggregating the grandchildren. I've been successful in getting the logic correct, but it does not perform well; I'm grabbing too many documents from the index along the way. To reduce the number of document objects pulled from the IndexReader, I'm trying to skip grandchild documents that are not relevant to the statistics I'm collecting. I've implemented the following code in my DelegatingCollector.collect:

if (inStockSkusBitSet == null) {
    SolrIndexSearcher SidxS = (SolrIndexSearcher) idxS; // cast from IndexSearcher to expose getDocSet
    inStockSkusDocSet = SidxS.getDocSet(inStockSkusQuery);
    inStockSkusBitDocSet = (BitDocSet) inStockSkusDocSet; // cast from DocSet to expose getBits
    inStockSkusBitSet = inStockSkusBitDocSet.getBits();
}

My BitDocSet reports a size that matches a standard query for the more limited set of grandchildren, and the FixedBitSet (inStockSkusBitSet) reports the same cardinality. Based on that, the getDocSet call itself must be working properly and returning the right number of documents. However, when I try to filter out grandchild documents using either BitDocSet.exists or BitSet.get (skipping any grandchild document which doesn't exist in the BitDocSet or doesn't return true from the bitset), I get about a third fewer results than I should. Many documents that should match the filter are excluded, and documents which should not match the filter are included. I'm trying to use it either of these ways:

if (!inStockSkusBitSet.get(currentChildDocNumber)) continue;
if (!inStockSkusBitDocSet.exists(currentChildDocNumber)) continue;

currentChildDocNumber is simply the doc number passed to DelegatingCollector.collect, decremented until I hit a document that doesn't belong to the parent document. I can't seem to figure out how to actually use the BitDocSet (or its derivatives) to quickly eliminate document IDs, even though this seems to be how it's supposed to be used. What am I getting wrong?

Sorry if this is a newbie question. I've never written a PostFilter before, and frankly the documentation out there is a little sketchy (mostly for version 4); so many classes have changed names, and so many of the better-documented techniques are deprecated or removed now, that it's tough to tell what current best practice actually is. I use the block join functionality heavily, so I'm trying to keep more current than that. I would be happy to send along the full source privately if it would help figure this out, and I plan to write up some more elaborate instructions (updated for Solr 5) for the next person who decides to write a PostFilter and work with block joins, if I ever manage to get this performing well enough. Thanks for any pointers! Totally open to doing this an entirely different way.
I read DocValues might be a more elegant approach, but that would currently require reindexing, so I'm trying to avoid it. Also, I've been wondering whether the query above reads from the filter cache or not. The query is constructed like this:

private Term inStockTrueTerm = new Term("sku_history.is_in_stock", "T");
private Term objectTypeSkuHistoryTerm = new Term("object_type", "sku_history");
...
inStockTrueTermQuery = new TermQuery(inStockTrueTerm);
objectTypeSkuHistoryTermQuery = new TermQuery(objectTypeSkuHistoryTerm);
inStockSkusQuery = new BooleanQuery();
inStockSkusQuery.add(inStockTrueTermQuery, BooleanClause.Occur.MUST);
inStockSkusQuery.add(objectTypeSkuHistoryTermQuery, BooleanClause.Occur.MUST);

--
Steve
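A minimal sketch of the offset handling Roman describes, assuming Solr 5.x's DelegatingCollector (which stores the current segment's offset in its protected docBase field when doSetNextReader is called); the bitset field follows the snippets above, and the class name is illustrative:

import java.io.IOException;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

class InStockSkusCollector extends DelegatingCollector {
    private final FixedBitSet inStockSkusBitSet; // global-doc-id bitset from BitDocSet.getBits()

    InStockSkusCollector(FixedBitSet inStockSkusBitSet) {
        this.inStockSkusBitSet = inStockSkusBitSet;
    }

    @Override
    public void collect(int doc) throws IOException {
        // collect() receives segment-relative ids, while the FixedBitSet is
        // indexed by index-wide ids, so add the segment offset before testing.
        if (inStockSkusBitSet.get(docBase + doc)) {
            super.collect(doc); // document is in the in-stock set: keep it
        }
        // otherwise the document is silently dropped
    }
}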
Re: Collection APIs to create collection and custom cores naming
See: https://issues.apache.org/jira/browse/SOLR-6719 It's not clear that we'll support this, so this may just be a doc change. How would you properly support having more than one replica? Or, for that matter, having more than one shard? property.name would have to do something to make the core names unique. I agree that for single-shard, single-replica situations it's a reasonable thing to do, but I'm not at all sure the effort is worth the gain for that one case. Yes, you could create a bunch of rules that map selected names to a long, comma-separated string or something like that, but it just doesn't seem worth the effort.

Best,
Erick

On Mon, Aug 3, 2015 at 1:48 AM, davidphilip cherian davidphilipcher...@gmail.com wrote:

How do I use 'property.name=value' in the API example [1] to modify the core.properties value of 'name'? When creating the collection with the query below [2], the core names become 'aggregator_shard1_replica1' and 'aggregator_shard2_replica1'. I wanted to have a specific/custom name for each of these cores. I tried passing the params as property.name=name&name=aggregator_s1, but it did not work. Editing the core.properties key-value pair to name=aggregator_s1 after the collection is created works, but I was looking to set this property with the create request itself.

[2] http://example.com:8983/solr/admin/collections?action=CREATE&name=aggregator&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=aggregator_config&property.name=name&name=aggregator_s1
[1] https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
Re: Can Apache Solr Handle TeraByte Large Data
Hi, I am new to Solr development and have the same requirement. With the help of googling I have already picked up some knowledge, such as how many shards have to be created for that amount of data. I want to take some suggestions: there are several ways to do the indexing, such as DIH, post, or SolrJ. Please suggest which way I should do it:
1.) Should I use SolrJ?
2.) Should I use DIH?
3.) Should I use the post tool (in a terminal)?
Or is there any other way to index that amount of data?
Re: Can Apache Solr Handle TeraByte Large Data
That's still a VERY open question. The answer is yes, but the details depend on the shape and source of your data, and on the searches you are anticipating. Is this a lot of entries with a small number of fields, or a (relatively) small number of entries with huge field counts? Do you need to store/return all those fields or just search them? Is the content coming as one huge file (in which format?) or from an external source such as a database? And so on.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 11:42, Mugeesh Husain muge...@gmail.com wrote:

Hi, I am new to Solr development and have the same requirement. With the help of googling I have already picked up some knowledge, such as how many shards have to be created for that amount of data. I want to take some suggestions: there are several ways to do the indexing, such as DIH, post, or SolrJ. Please suggest which way I should do it:
1.) Should I use SolrJ?
2.) Should I use DIH?
3.) Should I use the post tool (in a terminal)?
Or is there any other way to index that amount of data?
Re: Collapsing Query Parser returns one record per shard...was not expecting this...
Your findings are the expected behavior for the Collapsing qparser. The Collapsing qparser requires records with the same collapse field value to be located on the same shard. The typical approach for this is to use composite ID routing to ensure that documents with the same collapse field value land on the same shard. We should make this clear in the documentation.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee peter@proquest.com wrote:

From my reading of the Solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the impression that these two methods (result grouping and the collapsing query parser) can both be used to eliminate duplicates from a result set. (In our case, we have a duplication field that contains a 'signature' identifying duplicates; we use our own signature for a variety of reasons tied to complex business requirements.) In a test environment I scattered 15 duplicate records (along with another 10 unique records) across a test system running SolrCloud (Solr version 5.2.1) with 4 shards and a replication factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates. Result grouping worked as expected; the collapsing query parser did not. My results with the collapsing query parser showed that Solr was in fact including in the result set one of the duplicate records from each shard (that is, I received FOUR duplicate records, and turning on debug showed that each of the four came from a different shard), when I was expecting Solr to do the collapsing on the aggregated result and return only ONE of the duplicated records across ALL shards. It appears that Solr is performing the collapsing query parsing on each individual shard, but then NOT performing the operation on the aggregated results from all shards. I have searched through the forums and checked the documentation as carefully as I can, and I find no documentation or mention of this effect (one record being returned per shard) when using the collapsing query parser. Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter? Am I simply not understanding how this is supposed to work? For reference, I am including below the search URL and the response I received. Any insights would be appreciated.
Query: http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value, and that I have added comments in the output ***comment*** pointing out which shards responses came from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900",   ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900",   ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8019",
        "dupid_s":"2004",
        "title_pqth":["Unique Record #4"],
        "_version_":1508241005528219648,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"1003",
        "dupid_s":"900",   ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8017",
        "dupid_s":"2002",
        "title_pqth":["Unique Record #2"],
        "_version_":1508241005518782464,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8018",
        "dupid_s":"2003",
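For reference, a minimal SolrJ sketch of the composite-ID routing Joel recommends: with the default compositeId router, the hash of the prefix before '!' selects the shard, so every document sharing a dupid_s lands on the same shard. The field values below are taken from the test data above; everything else is illustrative.

import org.apache.solr.common.SolrInputDocument;

String dupid = "900";
String storeid = "1002";
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", dupid + "!" + storeid); // routed by the hash of "900"
doc.addField("dupid_s", dupid);
doc.addField("storeid_s", storeid);
doc.addField("title_pqth", "Dupe Record #2");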
Re: Can Apache Solr Handle TeraByte Large Data
Upayavira, manual commits aren't good advice, especially with small batches or single documents, are they? I mostly see recommendations to use autoCommit + autoSoftCommit instead of manual commits.

On Tue, Aug 4, 2015 at 1:00 AM, Upayavira u...@odoko.co.uk wrote:

SolrJ is just a SolrClient. In pseudocode, you say:

SolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun shines");
docs.add(doc);
client.add(docs);
client.commit();

(warning, the above is typed from memory)

So the question is simply how many documents you add to docs before you do client.add(docs), and how often (if at all) you call client.commit(). So when you are told "use SolrJ", really you are being told to write some Java code that happens to use the SolrJ client library for Solr.

Upayavira

On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:

Well, if it is just file names, I'd probably use the SolrJ client, maybe with Java 8. Read the file names, split each name into parts with regular expressions, stuff the parts into different field names, and send them to Solr. Java 8 has FileSystem walkers, etc., to make it easier. You could do it with DIH, but it would be with nested entities, and the inner entity would probably try to parse the file. So, a lot of wasted effort if you just care about the file names. Or I would just do a directory listing in the operating system and use regular expressions to split it into a CSV file, which I would then import into Solr directly. In all of these cases, the question would be which field is the ID of the record, to ensure no duplicates.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote:

@Alexandre No, I don't need the contents of the files. I am repeating my requirement: I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf. I just split out the values from the filename; these values are what I have to index. I am interested in indexing the values, not the file contents. I have tested DIH from a file system and it works fine, but I don't know how I can implement my code in DIH: if my code produces some values, how can I index them using DIH? If I use DIH, how will I do the split operation and get the values from it?

--
Best regards,
Konstantin Gribov
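For reference, a sketch of the solrconfig.xml settings Konstantin refers to; the times are illustrative starting points, not recommendations:

<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit every 60s: flushes segments, truncates the tlog -->
  <openSearcher>false</openSearcher> <!-- don't open a searcher; leave visibility to soft commits -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>            <!-- soft commit every 5s: new documents become searchable -->
</autoSoftCommit>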
DateRangeField Query throws NPE
Hi everyone, I'm running into trouble building a query with DateRangeField. Web-based queries work fine, but this code throws an NPE:

dateRangeQuery = dateRangeField.getRangeQuery(null, SidxS.getSchema().getField("sku_history.date_range"), start_date_str, end_date_str, true, true);

ERROR - 2015-08-03 23:07:10.122; [   instock_dev] org.apache.solr.common.SolrException; null:com.google.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
    at org.apache.solr.schema.AbstractSpatialFieldType.getStrategy(AbstractSpatialFieldType.java:403)
    at org.apache.solr.schema.AbstractSpatialFieldType.getQueryFromSpatialArgs(AbstractSpatialFieldType.java:331)
    at org.apache.solr.schema.DateRangeField.getRangeQuery(DateRangeField.java:184)
    at com.wgsn.ginger.stockStatusQuery.getFilterCollector(stockStatusQuery.java:128)
    at org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1148)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1609)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1485)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:561)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:518)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:255)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
    ...
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.solr.schema.AbstractSpatialPrefixTreeFieldType.newSpatialStrategy(AbstractSpatialPrefixTreeFieldType.java:117)
    at org.apache.solr.schema.AbstractSpatialPrefixTreeFieldType.newSpatialStrategy(AbstractSpatialPrefixTreeFieldType.java:40)
    at org.apache.solr.schema.AbstractSpatialFieldType$2.call(AbstractSpatialFieldType.java:406)
    at org.apache.solr.schema.AbstractSpatialFieldType$2.call(AbstractSpatialFieldType.java:403)
    at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
    ... 38 more

That line 117 seems to be some logging happening in that class, which is probably unnecessary to begin with:

log.info(this.toString() + " strat: " + strat + " maxLevels: " + grid.getMaxLevels()); //TODO output maxDetailKm

No idea which variable is returning null; I'm guessing it's grid.getMaxLevels(), since I don't see grid being initialized by any prior method in this chain. Is this just a bug, or am I using it wrong? I'm trying to wrap this query together with two other queries, but I can't even get the Query object back.

--
Steve
Re: Documentation for: solr.EnglishPossessiveFilterFactory
Seems simple enough that the source answers all the questions: https://github.com/apache/lucene-solr/blob/lucene_solr_4_9/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishPossessiveFilter.java#L66 It just looks for a couple of variants of apostrophe followed by "s" or "S" at the end of a token and strips them.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 17:56, Steven White swhite4...@gmail.com wrote:

Hi Everyone, Does anyone know where I can find docs on <filter class="solr.EnglishPossessiveFilterFactory"/>? The only one I found is the API doc: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html but that's not what I'm looking for; I'm looking for one that describes in detail how this filter works, with examples.

Thanks

Steve
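For a concrete picture, here is a small test showing what the filter actually does to tokens; a sketch against the Lucene 5.x analysis API (earlier releases take a Version argument in these constructors):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.EnglishPossessiveFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PossessiveDemo {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new WhitespaceTokenizer();
                return new TokenStreamComponents(source, new EnglishPossessiveFilter(source));
            }
        };
        try (TokenStream ts = analyzer.tokenStream("f", "John's pen the students' desks")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // prints: John, pen, the, students', desks
                // only a trailing 's is stripped; a bare trailing apostrophe is untouched
                System.out.println(term.toString());
            }
            ts.end();
        }
        analyzer.close();
    }
}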
Documentation for: solr.EnglishPossessiveFilterFactory
Hi Everyone, Does anyone know where I can find docs on <filter class="solr.EnglishPossessiveFilterFactory"/>? The only one I found is the API doc: http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html but that's not what I'm looking for; I'm looking for one that describes in detail how this filter works, with examples.

Thanks

Steve
Re: Can Apache Solr Handle TeraByte Large Data
SolrJ is just a SolrClient. In pseudocode, you say:

SolrClient client = new HttpSolrClient("http://localhost:8983/solr/whatever");
List<SolrInputDocument> docs = new ArrayList<>();
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "abc123");
doc.addField("some-text-field", "I like it when the sun shines");
docs.add(doc);
client.add(docs);
client.commit();

(warning, the above is typed from memory)

So the question is simply how many documents you add to docs before you do client.add(docs), and how often (if at all) you call client.commit(). So when you are told "use SolrJ", really you are being told to write some Java code that happens to use the SolrJ client library for Solr.

Upayavira

On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:

Well, if it is just file names, I'd probably use the SolrJ client, maybe with Java 8. Read the file names, split each name into parts with regular expressions, stuff the parts into different field names, and send them to Solr. Java 8 has FileSystem walkers, etc., to make it easier. You could do it with DIH, but it would be with nested entities, and the inner entity would probably try to parse the file. So, a lot of wasted effort if you just care about the file names. Or I would just do a directory listing in the operating system and use regular expressions to split it into a CSV file, which I would then import into Solr directly. In all of these cases, the question would be which field is the ID of the record, to ensure no duplicates.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote:

@Alexandre No, I don't need the contents of the files. I am repeating my requirement: I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf. I just split out the values from the filename; these values are what I have to index. I am interested in indexing the values, not the file contents. I have tested DIH from a file system and it works fine, but I don't know how I can implement my code in DIH: if my code produces some values, how can I index them using DIH? If I use DIH, how will I do the split operation and get the values from it?
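One common answer to that batching question, as a hedged sketch building on the pseudocode above (makeDocuments() is a hypothetical source of documents; the batch size of 1000 is just a starting point):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;

List<SolrInputDocument> batch = new ArrayList<>();
for (SolrInputDocument doc : makeDocuments()) { // hypothetical document source
    batch.add(doc);
    if (batch.size() >= 1000) { // send in chunks rather than one doc at a time
        client.add(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    client.add(batch); // flush the remainder
}
// No client.commit() here: as noted earlier in the thread, server-side
// autoCommit/autoSoftCommit usually handles commits better than manual ones.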
Re: Closing the IndexSearcher/IndexWriter for a core
This doesn't work in SolrCloud, but this really sounds like the "lots of cores" feature, which is designed to keep the most recent N cores loaded and auto-unload older ones; see: http://wiki.apache.org/solr/LotsOfCores

Best,
Erick

On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:

Is there an easy way for a client to tell Solr to close or release the IndexSearcher and/or IndexWriter for a core? I have a use case where we're creating a lot of cores with not that many documents per zone (a few hundred to maybe tens of thousands). Writes come in batches, and reads also tend to be bursty, if less so than the writes. And we're having problems with RAM usage on the server. Poking around a heap dump, the problem is that every IndexSearcher or IndexWriter being opened takes up a large amount of memory. I've looked at the unload call, and while it is unclear, it seems like it deletes the data on disk as well. I don't want to delete the data on disk; I just want to unload the searcher and writer and free up the memory. So I'm wondering if there is a call I can make, when I know or suspect that the core isn't going to be used in the near future, to release these objects and return the memory? Or a configuration option I can set to do so after, say, being idle for 5 seconds? It's OK for there to be a performance hit the first time I reopen the core.

Thanks,
Brian
Re: Closing the IndexSearcher/IndexWriter for a core
So unloading a core doesn't delete the data? That is good to know.

On Mon, Aug 3, 2015 at 6:22 PM, Erick Erickson erickerick...@gmail.com wrote:

This doesn't work in SolrCloud, but this really sounds like the "lots of cores" feature, which is designed to keep the most recent N cores loaded and auto-unload older ones; see: http://wiki.apache.org/solr/LotsOfCores

Best,
Erick

On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:

Is there an easy way for a client to tell Solr to close or release the IndexSearcher and/or IndexWriter for a core? I have a use case where we're creating a lot of cores with not that many documents per zone (a few hundred to maybe tens of thousands). Writes come in batches, and reads also tend to be bursty, if less so than the writes. And we're having problems with RAM usage on the server. Poking around a heap dump, the problem is that every IndexSearcher or IndexWriter being opened takes up a large amount of memory. I've looked at the unload call, and while it is unclear, it seems like it deletes the data on disk as well. I don't want to delete the data on disk; I just want to unload the searcher and writer and free up the memory. So I'm wondering if there is a call I can make, when I know or suspect that the core isn't going to be used in the near future, to release these objects and return the memory? Or a configuration option I can set to do so after, say, being idle for 5 seconds? It's OK for there to be a performance hit the first time I reopen the core.

Thanks,
Brian
Re: Closing the IndexSearcher/IndexWriter for a core
Some further information: the main things using memory that I see in my heap dump are:

1. Arrays of org.apache.lucene.util.fst.FST$Arc instances, which mainly seem to hold nulls. The ones I've investigated have been held by org.apache.lucene.util.fst.FST objects. I have 38 cores open and over 121,000 of these arrays, taking up over 126M of space.

2. Byte arrays, of which I have 384,000, taking up 106M of space.

When I trace the chain of references up, I've always ended up at an IndexSearcher or IndexWriter, causing me to assume the problem is that I'm simply opening too many cores, but I could be mistaken. This was on a freshly started system without many cores having been touched yet, so the memory usage, while larger than I expected, isn't critical yet. It does become critical as the number of cores increases.

Thanks,
Brian

On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt bhur...@gmail.com wrote:

Is there an easy way for a client to tell Solr to close or release the IndexSearcher and/or IndexWriter for a core? I have a use case where we're creating a lot of cores with not that many documents per zone (a few hundred to maybe tens of thousands). Writes come in batches, and reads also tend to be bursty, if less so than the writes. And we're having problems with RAM usage on the server. Poking around a heap dump, the problem is that every IndexSearcher or IndexWriter being opened takes up a large amount of memory. I've looked at the unload call, and while it is unclear, it seems like it deletes the data on disk as well. I don't want to delete the data on disk; I just want to unload the searcher and writer and free up the memory. So I'm wondering if there is a call I can make, when I know or suspect that the core isn't going to be used in the near future, to release these objects and return the memory? Or a configuration option I can set to do so after, say, being idle for 5 seconds? It's OK for there to be a performance hit the first time I reopen the core.

Thanks,
Brian
Re: Large number of collections in SolrCloud
We have a similar date- and language-based collection setup. We also ran into similar issues, with a huge clusterstate.json file that took an eternity to load. In our case the searches were language-specific, so we moved to multiple Solr clusters, each with a different ZK namespace per language; that is something you might look at.

On 27 Jul 2015 20:47, Olivier olivau...@gmail.com wrote:

Hi, I have a SolrCloud cluster with 3 nodes: 3 shards per node and a replication factor of 3. The number of collections is around 1000. All the collections use the same Zookeeper configuration. So when I create each collection, the configuration is pulled from ZK and the configuration files are stored in the JVM. I thought that if the configuration was the same for each collection, the impact on the JVM would be insignificant, because the configuration should be loaded only once. But that is not the case: for each collection created, the JVM size increases because the configuration is loaded again. Am I correct? If I have a small configuration folder, I have no problem: the folder size is less than 500 KB, so with 1000 collections x 500 KB the JVM impact is 500 MB. But we manage a lot of languages with dictionaries, so our configuration folder is about 6 MB. The JVM impact is very significant now, because it can be more than 6 GB (1000 x 6 MB). So I would like feedback from people who also run a cluster with a large number of collections. Do I have to change some settings to handle this case better? What can I do to optimize this behaviour? For now, we just increased the RAM per node to 16 GB, but we plan to increase the number of collections.

Thanks,
Olivier
Re: Collapsing Query Parser returns one record per shard...was not expecting this...
One thing to keep in mind with Grouping is that if you are relying on an accurate group count (ngroups), then you will also have to collocate documents based on the grouping field. The main advantage of the Collapsing qparser plugin is that it provides fast field collapsing on high-cardinality fields with an accurate group count. If you don't need ngroups, then Grouping is usually just as fast, if not faster.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 10:14 PM, Joel Bernstein joels...@gmail.com wrote:

Your findings are the expected behavior for the Collapsing qparser. The Collapsing qparser requires records with the same collapse field value to be located on the same shard. The typical approach for this is to use composite ID routing to ensure that documents with the same collapse field value land on the same shard. We should make this clear in the documentation.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Aug 3, 2015 at 4:20 PM, Peter Lee peter@proquest.com wrote:

From my reading of the Solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the impression that these two methods (result grouping and the collapsing query parser) can both be used to eliminate duplicates from a result set. (In our case, we have a duplication field that contains a 'signature' identifying duplicates; we use our own signature for a variety of reasons tied to complex business requirements.) In a test environment I scattered 15 duplicate records (along with another 10 unique records) across a test system running SolrCloud (Solr version 5.2.1) with 4 shards and a replication factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates. Result grouping worked as expected; the collapsing query parser did not. My results with the collapsing query parser showed that Solr was in fact including in the result set one of the duplicate records from each shard (that is, I received FOUR duplicate records, and turning on debug showed that each of the four came from a different shard), when I was expecting Solr to do the collapsing on the aggregated result and return only ONE of the duplicated records across ALL shards. It appears that Solr is performing the collapsing query parsing on each individual shard, but then NOT performing the operation on the aggregated results from all shards. I have searched through the forums and checked the documentation as carefully as I can, and I find no documentation or mention of this effect (one record being returned per shard) when using the collapsing query parser. Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter? Am I simply not understanding how this is supposed to work? For reference, I am including below the search URL and the response I received. Any insights would be appreciated.
Query: http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value, and that I have added comments in the output ***comment*** pointing out which shards responses came from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900",   ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900",   ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
RE: Do not match on high frequency terms
Thanks for your response. With TermsComponent I am able to get a list of all terms in a field that have a document frequency under a certain threshold, but I was wondering if I could instead pass in a list of terms and get back only the terms from that list that have a document frequency under a certain threshold in a field. I can't find an easy way to do this; do you know if it is possible?

Thanks,
Steve

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Saturday, August 1, 2015 6:35 AM
To: solr-user solr-user@lucene.apache.org
Subject: Re: Do not match on high frequency terms

It seems like you need to develop a custom query or query parser. Regarding SolrJ: you can try to call http://wiki.apache.org/solr/TermsComponent https://cwiki.apache.org/confluence/display/solr/The+Terms+Component I'm not sure how exactly to call TermsComponent in SolrJ; I just found https://lucene.apache.org/solr/5_2_1/solr-solrj/org/apache/solr/client/solrj/response/TermsResponse.html to read its response.

On Fri, Jul 31, 2015 at 11:31 PM, Swedish, Steve steve.swed...@noblis.org wrote:

Hello, I'm hoping someone might be able to help me out with this, as I do not have much Solr experience. Basically, I am wondering if it is possible to not match on terms that have a document frequency above a certain threshold. In my situation a stop word list would be unrealistic to maintain, so I was wondering if there may be an alternative solution using term document frequency to identify common terms. What would actually be ideal is if I could somehow use CommonTermsQuery. The problem I ran into when looking at this option is that CommonTermsQuery seems to only work for queries on one field at a time (unless I'm mistaken). However, I have a query of the structure q=(field1:(blah) AND (field2:(blah) OR field3:(blah))) OR field1:(blah) OR (field2:(blah) AND field3:(blah)). If there are any ideas on how to use CommonTermsQuery with this query structure, that would be great. If it's possible to extract the document frequency for the terms in my query before the query is run, allowing me to remove the high-frequency terms from the query first, that could also be a valid solution. I'm using SolrJ as well, so a solution that works with SolrJ would be appreciated.

Thanks,
Steve

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
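Building on Mikhail's pointers, a hedged SolrJ sketch: ask the TermsComponent for all terms in the field whose document frequency is at or below a threshold, then intersect the response with your own term list client-side. The handler path, field name, threshold, and terms are illustrative.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
SolrQuery query = new SolrQuery();
query.setRequestHandler("/terms"); // a handler with the TermsComponent configured
query.setTerms(true);
query.addTermsField("field1");
query.setTermsLimit(-1);           // return all matching terms
query.setTermsMaxCount(1000);      // only terms whose docFreq is <= 1000
QueryResponse rsp = client.query(query);

Set<String> myTerms = new HashSet<>(Arrays.asList("blah", "foo", "bar"));
for (TermsResponse.Term t : rsp.getTermsResponse().getTerms("field1")) {
    if (myTerms.contains(t.getTerm())) {
        // a term from my list that is under the frequency threshold
        System.out.println(t.getTerm() + " docFreq=" + t.getFrequency());
    }
}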
Indexing issues after cluster restart.
Hi, using Solr 5.2: after restarting the cluster, I get the exceptions below.

org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.

followed by:

org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection:

So I looked into the ZK tree and saw a load of entries in /overseer/queue, so I went off and cleaned up (removed all entries), restarted the cluster, and am now able to index again. What might be the cause of this?

Regards
Re: solr multicore vs sharding vs 1 big collection
There are two things that are likely to cause the timeouts you are seeing, I'd say. Firstly, your server is overloaded; that can be handled by adding additional replicas. However, it doesn't seem like this is the case, because the second query works fine. Secondly, you are hitting garbage collection issues. This seems more likely to me. You have 40m documents inside a 6GB heap. That seems relatively tight to me. What that means is that Java may well not have enough space to create all the objects it needs inside a single commit cycle, forcing a garbage collection which can cause application pauses, which would fit with what you are seeing. I'd suggest using the jstat -gcutil command (I think I have that right) to watch the number of garbage collections taking place. You will quickly see from that whether garbage collection is your issue. The simplistic remedy would be to allow your JVM a bit more memory. The other concern I have is that Solr (and Lucene) is intended for high-read/low-write scenarios; its index structure is highly tuned for that. If you are doing a lot of writes, you will be creating a lot of index churn, which will require more frequent merges, consuming both CPU and memory in the process. It may be worth looking at *how* you use Solr, and seeing whether, for example, you can separate your documents into slow-moving and fast-moving parts, to better suit the Lucene index structures. Or consider whether a Lucene-based system is best for what you are attempting to achieve. For garbage collection, see here for a good Solr-related write-up: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

Upayavira

On Mon, Aug 3, 2015, at 12:29 AM, Jay Potharaju wrote:

Shawn, thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it will have on the user experience, so I can't increase it. I have looked at the queries that threw errors; the next time I try them, everything seems to work fine. Not sure how to reproduce the error. My concern with increasing the memory to 32GB is what happens when the index size grows over the next few months. One of the other solutions I have been thinking about is to rebuild the index weekly, create a new collection, and use that. Are there any good references for doing that?

Thanks
Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote:

On 8/2/2015 8:29 AM, Jay Potharaju wrote:

The document contains around 30 fields and has stored set to true for almost 15 of them. And these stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs; that has stayed around that percentage and has not come down. I did try optimize, but that was disruptive, as it caused search errors. I have been playing with the merge factor to see if it helps with deleted documents or not; it is currently set to 5. The server has 24 GB of memory, out of which memory consumption is around 23 GB normally, and the JVM is set to 6 GB. I have noticed that the available memory on the server goes down to 100 MB at times during the day. All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about available memory. All modern operating systems will utilize memory that is not explicitly allocated for the OS disk cache.
The disk cache will instantly give up any of the memory it is using for programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage, including the cache.

Every day at least once I see the following error, which results in search errors on the front end of the site:

ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException

From what I have read these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that maybe the high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning. It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection. Client timeouts need to either be completely removed, or set to a value much longer than any request will take. Five minutes is a good starting value. If your client timeout is set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that
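As a concrete illustration of that timeout advice, a hedged SolrJ sketch (the URL and collection name are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
client.setConnectionTimeout(5000);  // ms allowed to establish the TCP connection
client.setSoTimeout(5 * 60 * 1000); // socket read timeout: five minutes, per the advice above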
Re: Multiple boost queries on a specific field
Hello Chris, this totally does the trick. I drastically improved relevancy. Thank you so much for your advice!

- Ben
Trouble getting langid.map.individual setting to work in Solr 5.0.x
I am trying to use the "langid.map.individual" setting to allow field "a" to be detected as, say, English and mapped to "a_en", while in the same document field "b" is detected as, say, German and mapped to "b_de". What happens in my tests is that the global language is detected (for example, German), but BOTH fields are mapped to "_de" as a result. I cannot get individual detection or mapping to work. Am I misunderstanding the purpose of this setting? Here is the resulting document from my test:

{
  "id": "1005!22345",
  "language": [ "de" ],
  "a_de": "A title that should be detected as English with high confidence",
  "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
  "_version_": 1508494723734569000
}

I expected "a_de" to be "a_en", and the "language" multi-valued field to have "en" and "de". Here is my configuration in solrconfig.xml:

<updateRequestProcessorChain name="langid" default="true">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid">true</str>
      <str name="langid.fl">a,b</str>
      <str name="langid.map">true</str>
      <str name="langid.map.individual">true</str>
      <str name="langid.langField">language</str>
      <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

The debug output of langdetect during indexing is as follows:

---
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.451; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.964723182276
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; loaded class org.apache.solr.common.SolrInputField from WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.980402022373
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456;
Re: Large number of collections in SolrCloud
Hi,

Thanks a lot, Erick and Shawn, for your answers. I am aware that it is a very particular issue and not a common use of Solr. I just wondered whether people had a similar business case. For information: we need a very large number of collections with the same configuration for legal reasons. Each collection represents one of our customers, and by contract we have to separate the data of each of them. If we had the choice, we would just have one collection with a 'Customers' field and do filter queries on it, but we can't! Anyway, thanks again for your answers. For now, we finally did not add the different language dictionaries per collection, and it is fine for 1K+ customers with more resources added to the servers.

Best,
Olivier Tavard

2015-07-27 17:53 GMT+02:00 Shawn Heisey apa...@elyograg.org:

On 7/27/2015 9:16 AM, Olivier wrote:

I have a SolrCloud cluster with 3 nodes: 3 shards per node and a replication factor of 3. The number of collections is around 1000. All the collections use the same Zookeeper configuration. So when I create each collection, the configuration is pulled from ZK and the configuration files are stored in the JVM. I thought that if the configuration was the same for each collection, the impact on the JVM would be insignificant, because the configuration should be loaded only once. But that is not the case: for each collection created, the JVM size increases because the configuration is loaded again. Am I correct? If I have a small configuration folder, I have no problem: the folder size is less than 500 KB, so with 1000 collections x 500 KB the JVM impact is 500 MB. But we manage a lot of languages with dictionaries, so our configuration folder is about 6 MB. The JVM impact is very significant now, because it can be more than 6 GB (1000 x 6 MB). So I would like feedback from people who also run a cluster with a large number of collections. Do I have to change some settings to handle this case better? What can I do to optimize this behaviour? For now, we just increased the RAM per node to 16 GB, but we plan to increase the number of collections.

Severe issues were noticed when dealing with many collections, and this was with a simple config and completely empty indexes. A complex config and actual index data would make it run that much more slowly. https://issues.apache.org/jira/browse/SOLR-7191 Memory usage for the config wasn't even considered when I was working on reporting that issue. SolrCloud is highly optimized to work well when there are a relatively small number of collections. I think there is work we can do that will optimize operations to the point where thousands of collections work well, especially if they all share the same config/schema ... but this is likely to be a fair amount of work, which will only benefit the handful of users who are pushing the boundaries of what Solr can do. In the open source world, a problem like that doesn't normally receive a lot of developer attention; we rely much more on help from the community, specifically from knowledgeable users who are having the problem and know enough to try to fix it. FYI -- 16GB of RAM per machine is quite small for Solr, particularly when pushing the envelope. My Solr machines are maxed at 64GB, and I frequently wish I could install more.
https://wiki.apache.org/solr/SolrPerformanceProblems#RAM One possible solution for your dilemma is simply adding more machines and spreading your collections out so each machine's memory requirements go down. Thanks, Shawn
Why is /query needed for Json Facet?
I tried using /select and this query does not work; I cannot understand why.

Passing Parameters via JSON: we can also pass normal request parameters in the JSON body within the "params" block:

$ curl "http://localhost:8983/solr/query?fl=title,author" -d '
{
  "params": {
    "q": "title:hero",
    "rows": 1
  }
}
'

Which is equivalent to:

$ curl "http://localhost:8983/solr/query?fl=title,author&q=title:hero&rows=1"

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Can Apache Solr Handle TeraByte Large Data
I'd go with SolrJ personally. For a terabyte of data that (I'm inferring) is PDF files and the like (aka semi-structured documents), you'll need to have Tika parse out the data you need to index. Doing that through posting or DIH puts all the analysis on the Solr servers, which will work, but not optimally. Here's something to get you started: https://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Mon, Aug 3, 2015 at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:

Hi Alexandre, I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split out the underscore-separated values from each filename, and these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.
You told me the answer is yes; I didn't get in which way you said yes.

Thanks
Re: Can Apache Solr Handle TeraByte Large Data
Most definitely yes, given your criteria below. If you don’t care for the text to be parsed and indexed within the files, a simple file system crawler that just walks the directory listings and posts the file names, split as you’d like, to Solr would suffice, it sounds like.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Aug 3, 2015, at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:

Hi Alexandre, I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split out the underscore-separated values from each filename, and these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.
You told me the answer is yes; I didn't get in which way you said yes.

Thanks
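A hedged sketch of the crawler Erik describes, using SolrJ and Java 8's file-tree walker; the URL, field naming, and batch size are illustrative choices, not part of anyone's posted code:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FilenameCrawler {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
        List<SolrInputDocument> batch = new ArrayList<>();
        for (Path p : (Iterable<Path>) Files.walk(Paths.get("/data/pdfs"))::iterator) {
            if (!Files.isRegularFile(p)) continue;
            String name = p.getFileName().toString().replaceFirst("\\.pdf$", "");
            String[] parts = name.split("_"); // ARIA_SSN10_0007_LOCATION_129 -> 5 values
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", name);         // the file name itself as the unique key
            for (int i = 0; i < parts.length; i++) {
                doc.addField("part" + i + "_s", parts[i]);
            }
            batch.add(doc);
            if (batch.size() >= 1000) {       // send in chunks; 40M docs won't fit in one list
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) client.add(batch);
        client.commit();
        client.close();
    }
}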
Re: Can Apache Solr Handle TeraByte Large Data
Ahhh, listen to Hatcher if you're not indexing the _contents_ of the files, just the filenames.

Erick

On Mon, Aug 3, 2015 at 2:22 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

Most definitely yes, given your criteria below. If you don’t care for the text to be parsed and indexed within the files, a simple file system crawler that just walks the directory listings and posts the file names, split as you’d like, to Solr would suffice, it sounds like.

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Aug 3, 2015, at 1:56 PM, Mugeesh Husain muge...@gmail.com wrote:

Hi Alexandre, I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split out the underscore-separated values from each filename, and these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.
You told me the answer is yes; I didn't get in which way you said yes.

Thanks
Re: Can Apache Solr Handle TeraByte Large Data
Just to reconfirm, are you indexing file content? Because if you are, you need to be aware that most PDFs do not extract well, as they do not have text flow preserved. If you are indexing PDF files, I would run a sample through Tika directly (that's what Solr uses under the covers anyway) and see what the output looks like. Apart from that, either SolrJ or DIH would work. If this is for a production system, I'd use SolrJ with client-side Tika parsing, but you could use DIH for a quick test run.

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 13:56, Mugeesh Husain muge...@gmail.com wrote:

Hi Alexandre, I have 40 million files stored in a file system, with filenames saved like ARIA_SSN10_0007_LOCATION_129.pdf.
1.) I have to split out the underscore-separated values from each filename, and these values have to be indexed into Solr.
2.) I do not need the file contents (text) to be indexed.
You told me the answer is yes; I didn't get in which way you said yes.

Thanks
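A minimal sketch of that Tika check, using Tika's simple facade (the path is a placeholder):

import java.io.File;
import org.apache.tika.Tika;

public class TikaCheck {
    public static void main(String[] args) throws Exception {
        // Parse one sample PDF and dump the extracted text, to judge
        // extraction quality before building the real indexer.
        Tika tika = new Tika();
        System.out.println(tika.parseToString(new File("/path/to/sample.pdf")));
    }
}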
Re: HTTP Error 500 on /admin/ping request
I found the issue. With GET, the legacy code I'm calling into was written like so:

clientResponse = resource.contentType("application/atom+xml").accept("application/atom+xml").get();

This is a bug; it should have been:

clientResponse = resource.accept("application/atom+xml").get();

Googling the issue helped me narrow it down. It looks like others ran into it moving from Solr 5.0 to 5.1 [1] [2]. Steve

[1] http://lucene.472066.n3.nabble.com/Bad-contentType-for-search-handler-text-xml-charset-UTF-8-td4200314.html
[2] https://github.com/solariumphp/solarium/issues/326

On Mon, Aug 3, 2015 at 2:16 PM, Steven White swhite4...@gmail.com wrote: Yes, my application is in Java; no, I cannot switch to SolrJ because I'm working off legacy code that I don't have the luxury to refactor. If my application is sending the wrong Content-Type HTTP header, which part is it, and why does the same header work for the other query paths such as /solr/db/config/requestHandler?wt=xml or /solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml? Steve On Mon, Aug 3, 2015 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote: snip
Re: Why is /query needed for Json Facet?
OK, I figured it out. The documentation is not updated. The default components are as follows; note FacetModule.COMPONENT_NAME = "facet_module". Thus the following is the default list with the new facet_module. We need someone to update the solrconfig.xml and the docs.

<arr name="components">
  <str>query</str>
  <str>facet</str>
  <str>facet_module</str>
  <str>mlt</str>
  <str>highlight</str>
  <str>stats</str>
  <str>debug</str>
  <str>expand</str>
</arr>

protected List<String> getDefaultComponents() {
  ArrayList<String> names = new ArrayList<>(6);
  names.add( QueryComponent.COMPONENT_NAME );
  names.add( FacetComponent.COMPONENT_NAME );
  names.add( FacetModule.COMPONENT_NAME );
  names.add( MoreLikeThisComponent.COMPONENT_NAME );
  names.add( HighlightComponent.COMPONENT_NAME );
  names.add( StatsComponent.COMPONENT_NAME );
  names.add( DebugComponent.COMPONENT_NAME );
  names.add( ExpandComponent.COMPONENT_NAME );
  return names;
}

On Mon, Aug 3, 2015 at 11:31 AM, William Bell billnb...@gmail.com wrote: I tried using /select and this query does not work; I cannot understand why.

Passing Parameters via JSON: we can also pass normal request parameters in the JSON body within the params block:

$ curl "http://localhost:8983/solr/query?fl=title,author" -d '
{
  params: {
    q: "title:hero",
    rows: 1
  }
}'

Which is equivalent to:

$ curl "http://localhost:8983/solr/query?fl=title,author&q=title:hero&rows=1"

-- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: Can Apache Solr Handle TeraByte Large Data
@Erik Hatcher: you mean I have to use SolrJ for indexing, right? Can SolrJ handle the large amount of data I mentioned in my previous post? If I use DIH instead, how would I split the values out of the filename? I want to start my development in the right direction; that is why I am a little confused about which way to start on my requirement. Please tell me: is the Yes for SolrJ, or for DIH? -- View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220550.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Large number of collections in SolrCloud
Hmmm, one thing that will certainly help is the new per-collection state.json that will replace clusterstate.json. That'll reduce a lot of chatter. You might also get a lot of mileage out of breaking the collections into distinct sub-groups, thus reducing the number of collections on each shard. This is totally off the wall, as in I haven't thought about it much. But what about implicit routing? That is, you take control of what shard documents land on and specifically route the docs there. Then, instead of one _collection_ per client you might have one _shard_ per client. Not sure if that meets your legal requirements either, though. And, essentially, since each shard is a core, it might have the exact same issues you have now with bringing up lots and lots and lots of cores... Speaking of which, if you're not sharding, then the Lots of Cores option might make sense, see: http://wiki.apache.org/solr/LotsOfCores But do note that this is specifically _not_ supported in SolrCloud mode. Best, Erick

On Mon, Aug 3, 2015 at 11:06 AM, Olivier olivau...@gmail.com wrote: Hi, Thanks a lot Erick and Shawn for your answers. I am aware that it is a very particular issue and not a common use of Solr; I just wondered if people had a similar business case. For information, we need a very large number of collections with the same configuration because of legal reasons. Indeed, each collection represents one of our customers, and by contract we have to separate the data of each of them. If we had the choice, we would just have one collection with a field named 'Customers' and do filter queries on it, but we can't! Anyway, thanks again for your answers. For now, we finally did not add the different language dictionaries per collection, and it is fine for 1K+ customers with more resources added to the servers. Best, Olivier Tavard

2015-07-27 17:53 GMT+02:00 Shawn Heisey apa...@elyograg.org: On 7/27/2015 9:16 AM, Olivier wrote: I have a SolrCloud cluster with 3 nodes: 3 shards per node and replication factor at 3. The collections number is around 1000. All the collections use the same ZooKeeper configuration. So when I create each collection, the configuration is pulled from ZK and the configuration files are stored in the JVM. I thought that if the configuration was the same for each collection, the impact on the JVM would be insignificant because the configuration should be loaded only once, but that is not the case: for each collection created, the JVM size increases because the configuration is loaded again. Am I correct? If I have a small configuration folder size, I have no problem: the folder size is less than 500 KB, so if we count 1000 collections x 500 KB, the JVM impact is 500 MB. But we manage a lot of languages with some dictionaries, so the configuration folder size is about 6 MB. The JVM impact is very important now because it can be more than 6 GB (1000 x 6 MB). So I would like to have the feedback of people who have a cluster with a large number of collections too. Do I have to change some settings to handle this case better? What can I do to optimize this behaviour? For now, we just increase the RAM size per node to 16 GB, but we plan to increase the collections number. Severe issues were noticed when dealing with many collections, and this was with a simple config and completely empty indexes. A complex config and actual index data would make it run that much more slowly.
https://issues.apache.org/jira/browse/SOLR-7191 Memory usage for the config wasn't even considered when I was working on reporting that issue. SolrCloud is highly optimized to work well when there are a relatively small number of collections. I think there is work that we can do which will optimize operations to the point where thousands of collections will work well, especially if they all share the same config/schema ... but this is likely to be a fair amount of work, which will only benefit a handful of users who are pushing the boundaries of what Solr can do. In the open source world, a problem like that doesn't normally receive a lot of developer attention, and we rely much more on help from the community, specifically from knowledgeable users who are having the problem and know enough to try and fix it. FYI -- 16GB of RAM per machine is quite small for Solr, particularly when pushing the envelope. My Solr machines are maxed at 64GB, and I frequently wish I could install more. https://wiki.apache.org/solr/SolrPerformanceProblems#RAM One possible solution for your dilemma is simply adding more machines and spreading your collections out so each machine's memory requirements go down. Thanks, Shawn
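For what Erick's one-shard-per-client idea might look like in practice, here is a hedged sketch using the implicit router, where the indexing client chooses the shard explicitly (collection, shard, and ZooKeeper addresses are invented; SolrJ 5.x assumed, and the _route_ field convention should be verified against the implicit-router documentation for your version):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ImplicitRoutingSketch {
  public static void main(String[] args) throws Exception {
    // Collection assumed created beforehand with something like:
    // /admin/collections?action=CREATE&name=customers&router.name=implicit&shards=clientA,clientB
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("customers");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "clientA-doc-1");
    doc.addField("_route_", "clientA"); // with the implicit router, this selects the target shard
    client.add(doc);
    client.commit();
    client.close();
  }
}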
Re: HTTP Error 500 on /admin/ping request
On 8/3/2015 11:34 AM, Steven White wrote: Hi Everyone, I cannot figure out why I'm getting HTTP Error 500 off the following code: snip Ping query caused exception: Bad contentType for search handler :application/atom+xml Your application is sending an incorrect Content-Type HTTP header that Solr doesn't know how to handle. If your application is Java, why are you not using SolrJ? You'll likely find that to be a lot easier to use than even a REST client. Thanks, Shawn
Re: How to use BitDocSet within a PostFilter
Yes, that was it. Had no idea this was an issue!

On Monday, August 3, 2015, Roman Chyla roman.ch...@gmail.com wrote: Hi, inStockSkusBitSet.get(currentChildDocNumber) Is that child a lucene id? If yes, does it include offset? Every index segment starts at a different point, but docs are numbered from zero. So to check them against the full index bitset, I'd be doing Bitset.exists(indexBase + docid) Just one thing to check Roman On Aug 3, 2015 1:24 AM, Stephen Weiss steve.we...@wgsn.com wrote: snip

-- Steve
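Applied to the original collector, Roman's diagnosis translates into code roughly like this: a sketch assuming Solr 5.x, where DelegatingCollector's doSetNextReader() keeps a protected docBase field pointing at the current segment's offset (class and field names follow the thread; not compiled against a real schema):

import java.io.IOException;
import org.apache.lucene.util.FixedBitSet;
import org.apache.solr.search.DelegatingCollector;

public class InStockCollector extends DelegatingCollector {
  private final FixedBitSet inStockSkusBitSet; // from BitDocSet.getBits(): index-wide numbering

  public InStockCollector(FixedBitSet inStockSkusBitSet) {
    this.inStockSkusBitSet = inStockSkusBitSet;
  }

  @Override
  public void collect(int doc) throws IOException {
    // 'doc' is segment-relative; add the segment's base to get the global id.
    if (!inStockSkusBitSet.get(docBase + doc)) {
      return; // grandchild not in the in-stock set
    }
    super.collect(doc); // pass matches down the delegate chain
  }
}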
posting html files
Hi everyone, I created a core with the basic config sets and schema. When I use bin/post to post one html file, I get the error:

SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException.. HTTP ERROR 404

When I go to localhost:8983/solr/core/update, I get:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">3</int>
  </lst>
  <lst name="error">
    <str name="msg">missing content stream</str>
    <int name="code">400</int>
  </lst>
</response>

I'm really new to Solr and wondering if anyone knows how to index html files according to my own schema, and how to configure the schema.xml or solrconfig file. Thank you so much! Thanks, Huiying
HTTP Error 500 on /admin/ping request
Hi Everyone, I cannot figure out why I'm getting HTTP Error 500 off the following code:

// Using: org.apache.wink.client
String contentType = "application/atom+xml";
URI uri = new URI("http://localhost:8983" + "/solr/db/admin/ping?wt=xml");
Resource resource = client.resource(uri.toURL().toString());
ClientResponse clientResponse = null;
clientResponse = resource.contentType(contentType).accept(contentType).get();
clientResponse.getStatusCode(); // Gives back: 500

Here is the call stack I get back from the call (it's also the same in solr.log):

ERROR - 2015-08-03 17:30:29.457; [ db] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Bad contentType for search handler :application/atom+xml request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:74)
at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:167)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:140)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:254)
at org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)

INFO - 2015-08-03 17:30:29.459; [ db] org.apache.solr.core.SolrCore; [db] webapp=/solr path=/admin/ping params={wt=xml} status=400 QTime=6

ERROR - 2015-08-03 17:30:29.459; [ db] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Ping query caused exception: Bad contentType for search handler :application/atom+xml request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:263)
at org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at java.lang.Thread.run(Thread.java:853)
Caused by: org.apache.solr.common.SolrException: Bad contentType for search handler :application/atom+xml request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at org.apache.solr.request.json.RequestUtil.processParams(RequestUtil.java:74)
at org.apache.solr.util.SolrPluginUtils.setDefaults(SolrPluginUtils.java:167)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:140)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:254)
... 27 more

INFO - 2015-08-03 17:30:29.461; [ db] org.apache.solr.core.SolrCore; [db] webapp=/solr path=/admin/ping params={wt=xml} status=500 QTime=8

ERROR - 2015-08-03 17:30:29.462; [ db] org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Ping query caused exception: Bad contentType for search handler :application/atom+xml request={wt=xml&q=solrpingquery&echoParams=all&distrib=false}
at org.apache.solr.handler.PingRequestHandler.handlePing(PingRequestHandler.java:263)
at org.apache.solr.handler.PingRequestHandler.handleRequestBody(PingRequestHandler.java:211)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)

If I use a REST client browser plug-in, it works just fine. My Java code works with other paths such as /solr/db/config/requestHandler?wt=xml or /solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml. Yes, I did try other content types; the outcome is the same error. I'm using the default ping handler:

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
</requestHandler>

Any clues / pointers why /admin/ping doesn't work but other query paths do? Thanks Steve
Re: posting html files
Thanks Erik, I'm trying to index some html files that are all in the same format, and I need to index them according to classes and tags. I've tried data_driven_schema_configs, but I can only get the title and id, not the other tags and classes I want. So now I want to edit the schema in basic_configs, but that turned out to produce the error above. Do you have any good ideas for me? Also, I tried using bin/post to post an xml file to the same core and it worked, so I'm wondering why the html file won't. Thank you so much!! Since I don't know much about Solr, it's really good that someone can help! Best, Huiying

On Mon, Aug 3, 2015 at 1:54 PM, Erik Hatcher erik.hatc...@gmail.com wrote: My hunch is that basic_configs is *too* basic for your needs here. basic_configs does not include /update/extract; it's very basic, stripped of all the "extra" components. Try using the default, data_driven_schema_configs, instead. If you're still having issues, please provide full details of what you've tried. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com On Aug 3, 2015, at 1:43 PM, Huiying Ma mahuiying...@gmail.com wrote: snip
Re: HTTP Error 500 on /admin/ping request
Yes, my application is in Java; no, I cannot switch to SolrJ because I'm working off legacy code that I don't have the luxury to refactor. If my application is sending the wrong Content-Type HTTP header, which part is it, and why does the same header work for the other query paths such as /solr/db/config/requestHandler?wt=xml or /solr/db/schema/fieldtypes/?wt=xml or /solr/db/schema/fields/?wt=xml? Steve

On Mon, Aug 3, 2015 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote: On 8/3/2015 11:34 AM, Steven White wrote: Hi Everyone, I cannot figure out why I'm getting HTTP Error 500 off the following code: snip Ping query caused exception: Bad contentType for search handler :application/atom+xml Your application is sending an incorrect Content-Type HTTP header that Solr doesn't know how to handle. If your application is Java, why are you not using SolrJ? You'll likely find that to be a lot easier to use than even a REST client. Thanks, Shawn
Re: posting html files
My recommendation: start with the default configset (data_driven_schema_configs), like this:

# grab an HTML page to use
curl http://lucene.apache.org/solr/index.html > index.html
bin/solr start
bin/solr create -c html_test
bin/post -c html_test index.html

$ curl "http://localhost:8983/solr/html_test/select?q=*:*&wt=csv"
stream_size,stream_content_type,keywords,x_parsed_by,content_encoding,distribution,title,content_type,viewport,_version_,dc_title,id,resourcename,robots
23049,text/html,apache\, apache lucene\, apache solr\, solr\, lucene search\, information retrieval\, spell checking\, faceting\, inverted index\, open source,org.apache.tika.parser.DefaultParser,org.apache.tika.parser.html.HtmlParser,UTF-8,Global,Apache Solr -,text/html; charset=UTF-8,minimal-ui\, initial-scale=1\, maximum-scale=1\, user-scalable=0,1508508085335883776,Apache Solr -,/Users/erikhatcher/dev/trunk/solr/index.html,/Users/erikhatcher/dev/trunk/solr/index.html,index\,follow

If you'd like to enhance the extraction for specific xpaths, see https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika#UploadingDatawithSolrCellusingApacheTika-InputParameters; you can set these parameters on the upload, using -params (see the "Capturing and Mapping" example with -params on the bin/post), or by adjusting the settings of /update/extract in solrconfig.xml. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com

On Aug 3, 2015, at 2:00 PM, Huiying Ma mahuiying...@gmail.com wrote: snip
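Building on that "Capturing and Mapping" pointer, something like the following might pull specific elements into their own fields; the div_txt field name is invented, and the exact parameter behavior should be checked against the Solr Cell documentation linked above:

bin/post -c html_test index.html -params "capture=div&fmap.div=div_txt&captureAttr=true"

capture copies the named XHTML element (as produced by Tika's HTML parser) into its own field, fmap renames it, and captureAttr indexes the element's attributes separately, which is often what class-based extraction needs.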
Re: posting html files
My hunch is that basic_configs is *too* basic for your needs here. basic_configs does not include /update/extract; it's very basic, stripped of all the "extra" components. Try using the default, data_driven_schema_configs, instead. If you're still having issues, please provide full details of what you've tried. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com

On Aug 3, 2015, at 1:43 PM, Huiying Ma mahuiying...@gmail.com wrote: Hi everyone, I created a core with the basic config sets and schema. When I use bin/post to post one html file, I get the error:

SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException.. HTTP ERROR 404

When I go to localhost:8983/solr/core/update, I get:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">3</int>
  </lst>
  <lst name="error">
    <str name="msg">missing content stream</str>
    <int name="code">400</int>
  </lst>
</response>

I'm really new to Solr and wondering if anyone knows how to index html files according to my own schema, and how to configure the schema.xml or solrconfig file. Thank you so much! Thanks, Huiying
Re: Can Apache Solr Handle TeraByte Large Data
Hi Alexandre, I have 40 million files stored in a file system, with filenames such as ARIA_SSN10_0007_LOCATION_129.pdf. 1.) I have to split out the underscore-separated values from each filename, and those values have to be indexed into Solr. 2.) I do not need the file contents (text) indexed. You told me the answer is Yes; I didn't get in which way you said Yes. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220527.html Sent from the Solr - User mailing list archive at Nabble.com.
reload collections timeout
Hi everybody, I have about 1300 collections, 3 shards each, replicationFactor = 3, maxShardsPerNode = 3. I have 3 boxes of 64 GB (32 GB for the JVM). When I want to reload all my collections, I get a timeout error. Is there a way to run the reload asynchronously, the way collection creation can be (async=requestid)? I saw on this issue that it was done, but it did not seem to work: https://issues.apache.org/jira/browse/SOLR-5477 How do I use the async mode to reload collections? Thanks a lot, Olivier Damiot
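For reference, an async Collections API call can be submitted through SolrJ's generic request support as sketched below; whether RELOAD honors async in a given release is precisely what SOLR-5477 tracks, so treat this as something to test rather than a confirmed recipe (collection name and request id are invented):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class AsyncReloadSketch {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zkhost:2181");

    ModifiableSolrParams reload = new ModifiableSolrParams();
    reload.set("action", "RELOAD");
    reload.set("name", "customer_0042");
    reload.set("async", "reload-customer_0042"); // request id to poll later
    QueryRequest reloadReq = new QueryRequest(reload);
    reloadReq.setPath("/admin/collections");
    client.request(reloadReq);

    // Poll the async request's status.
    ModifiableSolrParams status = new ModifiableSolrParams();
    status.set("action", "REQUESTSTATUS");
    status.set("requestid", "reload-customer_0042");
    QueryRequest statusReq = new QueryRequest(status);
    statusReq.setPath("/admin/collections");
    NamedList<Object> rsp = client.request(statusReq);
    System.out.println(rsp);
    client.close();
  }
}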
[JOB] Financial search engine company AlphaSense is looking for Search Engineers
Hi fellow Solr devs / users, I decided to resend the info on this opening, assuming most of you could have been on vacation in July. I don't intend to send it again :)

Company: AlphaSense https://www.alpha-sense.com/
Position: Search Engineer

AlphaSense is a one-stop financial search engine for financial research analysts all around the world. AlphaSense is looking for Search Engineers experienced with Lucene / Solr and search architectures in general. Positions are open in Helsinki (http://www.visitfinland.com/helsinki/).

Daily routine topics for our search team:
1. Sharding
2. Commit vs query performance
3. Performance benchmarking
4. Custom query syntax, lucene / solr grammars
5. Relevancy
6. Query optimization
7. Search system monitoring: cache, RAM, throughput etc
8. Automatic deployment
9. Internal tool development

We have evolved the system through a series of Solr releases, starting from 1.4 up to 4.10, pushing forward all our solr-level customizations.

Requirements:
1. Core Java + web services
2. Understanding of distributed search engine architecture
3. Java concurrency
4. Understanding of performance issues and approaches to tackle them
5. Clean and beautiful code + design patterns

Our search team members are active in the open source search scene; in particular, we support and develop the Luke toolbox (https://github.com/dmitrykey/luke), participate in search / OS conferences (Lucene Revolution, ApacheCon, Berlin Buzzwords), and review books on Solr. Send your CV over and let's have a chat. Please e-mail me if you have any questions. -- Dmitry Kan Luke Toolbox: http://github.com/DmitryKey/luke Blog: http://dmitrykan.blogspot.com Twitter: http://twitter.com/dmitrykan SemanticAnalyzer: www.semanticanalyzer.info
Duplicate Documents
I'm using Solr 4.10.2, with the id field as the unique key; it is passed in with the document when ingesting documents into Solr. When querying, I get duplicate documents with different _version_ values. Out of approx. 25K unique documents ingested into Solr, I see approx. 300 duplicates. It is a 3-node SolrCloud with one shard and 2 replicas. I'm also using nested documents. Thanks in advance for any insights. --Magesh
Re: solr multicore vs sharding vs 1 big collection
Yeah, a separate collection by month or year is good and can really help in this case. Bill Bell Sent from mobile

On Aug 2, 2015, at 5:29 PM, Jay Potharaju jspothar...@gmail.com wrote: Shawn, Thanks for the feedback. I agree that increasing the timeout might alleviate the timeout issue. The main problem with increasing the timeout is the detrimental effect it will have on the user experience, so I can't increase it. I have looked at the queries that threw errors; the next time I try them, everything seems to work fine. Not sure how to reproduce the error. My concern with increasing the memory to 32 GB is what happens when the index size grows over the next few months. One of the other solutions I have been thinking about is to rebuild the index weekly, create a new collection, and use it. Are there any good references for doing that? Thanks Jay

On Sun, Aug 2, 2015 at 10:19 AM, Shawn Heisey apa...@elyograg.org wrote: On 8/2/2015 8:29 AM, Jay Potharaju wrote: The document contains around 30 fields and has stored set to true for almost 15 of them. And these stored fields are queried and updated all the time. You will notice that the deleted documents are almost 30% of the docs, and that percentage has stayed there and has not come down. I did try optimize, but that was disruptive as it caused search errors. I have been playing with the merge factor to see if it helps with deleted documents or not. It is currently set to 5. The server has 24 GB of memory, out of which memory consumption is around 23 GB normally, and the JVM is set to 6 GB. I have noticed that the available memory on the server goes to 100 MB at times during the day. All the updates are run through DIH.

Using all available memory is completely normal operation for ANY operating system. If you hold up Windows as an example of one that doesn't ... it lies to you about available memory. All modern operating systems will utilize memory that is not explicitly allocated for the OS disk cache. The disk cache will instantly give up any of the memory it is using for programs that request it. Linux doesn't try to hide the disk cache from you, but older versions of Windows do. In the newer versions of Windows that have the Resource Monitor, you can go there to see the actual memory usage, including the cache.

Every day at least once I see the following error, which results in search errors on the front end of the site: ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.eclipse.jetty.io.EofException From what I have read, these are mainly due to timeouts; my timeout is set to 30 seconds and I can't set it to a higher number. I was thinking that maybe high memory usage sometimes leads to bad performance/errors.

Although this error can be caused by timeouts, it has a specific meaning. It means that the client disconnected before Solr responded to the request, so when Solr tried to respond (through Jetty), it found a closed TCP connection. Client timeouts need to either be completely removed, or set to a value much longer than any request will take. Five minutes is a good starting value. If your client timeout is set to 30 seconds and you are seeing EofExceptions, that means that your requests are taking longer than 30 seconds, and you likely have some performance issues. It's also possible that some of your client timeouts are set a lot shorter than 30 seconds.

My objective is to stop the errors; adding more memory to the server is not a good scaling strategy.
That is why I was thinking maybe there is an issue with the way things are set up and it needs to be revisited. You're right that adding more memory to the servers is not a good scaling strategy for the general case ... but in this situation, I think it might be prudent. For your index and heap sizes, I would want the company to pay for at least 32GB of RAM. Having said that ... I've seen Solr installs work well with a LOT less memory than the ideal. I don't know that adding more memory is necessary, unless your system (CPU, storage, and memory speeds) is particularly slow. Based on your document count and index size, your documents are quite small, so I think your memory size is probably good -- if the CPU, memory bus, and storage are very fast. If one or more of those subsystems aren't fast, then make up the difference with lots of memory. Some light reading, where you will learn why I think 32GB is an ideal memory size for your system: https://wiki.apache.org/solr/SolrPerformanceProblems It is possible that your 6GB heap is not quite big enough for good performance, or that your GC is not well-tuned. These topics are also discussed on that wiki page. If you increase your heap size, then the likelihood of needing more memory in the system becomes greater, because there will be less memory available for the disk cache. Thanks, Shawn -- Thanks Jay Potharaju
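If the front end reaches Solr through SolrJ, Shawn's five-minute suggestion looks like this with the 5.x HttpSolrClient (URL and values are examples only):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class LongTimeoutClient {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
    client.setConnectionTimeout(5000); // ms allowed to establish the TCP connection
    client.setSoTimeout(300000);       // five minutes before a socket read times out
    client.close();
  }
}

Other HTTP clients have equivalent settings; the point is that the socket read timeout, not the connect timeout, is the one that fires on slow queries.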
Closing the IndexSearcher/IndexWriter for a core
Is there an easy way for a client to tell Solr to close or release the IndexSearcher and/or IndexWriter for a core? I have a use case where we're creating a lot of cores with not that many documents per zone (a few hundred to maybe tens of thousands). Writes come in batches, and reads also tend to be bursty, if less so than the writes. And we're having problems with RAM usage on the server. Poking around a heap dump, the problem is that every IndexSearcher or IndexWriter being opened takes up a large amount of memory. I've looked at the unload call, and while it is unclear, it seems like it deletes the data on disk as well. I don't want to delete the data on disk; I just want to unload the searcher and writer, and free up the memory. So I'm wondering if there is a call I can make, when I know or suspect that the core isn't going to be used in the near future, to release these objects and return the memory? Or a configuration option I can set to do so after, say, being idle for 5 seconds? It's OK for there to be a performance hit the first time I reopen the core. Thanks, Brian
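One possibility, hedged since the thread doesn't confirm it: CoreAdmin UNLOAD leaves the index on disk unless the deleteIndex/deleteDataDir/deleteInstanceDir flags are explicitly set, so unloading an idle core and re-creating it later against the same instanceDir should release the searcher and writer without losing data. A SolrJ sketch (core name and path invented); the LotsOfCores transient-core options mentioned earlier in this digest are the other avenue, though not in SolrCloud mode:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class UnloadIdleCore {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
    // Unload keeps the data directory by default.
    CoreAdminRequest.unloadCore("idle_core", client);
    // Later, re-register the core against the same instanceDir:
    // CoreAdminRequest.createCore("idle_core", "/var/solr/idle_core", client);
    client.close();
  }
}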
Re: Can Apache Solr Handle TeraByte Large Data
Well, if it is just file names, I'd probably use a SolrJ client, maybe with Java 8. Read the file names, split each name into parts with regular expressions, stuff the parts into different fields, and send them to Solr. Java 8 has FileSystem walkers, etc., to make it easier. You could do it with DIH, but it would require nested entities, and the inner entity would probably try to parse the file; a lot of wasted effort if you just care about the file names. Or I would just do a directory listing in the operating system, use regular expressions to split it into a CSV file, and then import that into Solr directly. In all of these cases, the question is which field is the ID of the record, to ensure no duplicates. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 3 August 2015 at 15:34, Mugeesh Husain muge...@gmail.com wrote: @Alexandre: No, I don't need the contents of the files. I am repeating my requirement: I have 40 million files stored in a file system, with filenames such as ARIA_SSN10_0007_LOCATION_129.pdf. I just split the values out of the filename; those values are what I have to index. I am interested in indexing the values into Solr, not the file contents. I have tested DIH from a file system and it works fine, but I don't know how to implement my code in DIH: if my code extracts some values, how can I index them using DIH? If I use DIH, how would I do the split operation and get the values from it?
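A minimal sketch of that filename-only pipeline with Java 8 (field names, paths, and the collection URL are invented; the split follows the ARIA_SSN10_0007_LOCATION_129.pdf example from the thread):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FilenameIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/files");
    try (Stream<Path> paths = Files.walk(Paths.get("/data/pdfs"))) {
      paths.filter(p -> p.toString().endsWith(".pdf")).forEach(p -> {
        // e.g. ARIA_SSN10_0007_LOCATION_129.pdf -> [ARIA, SSN10, 0007, LOCATION, 129]
        String base = p.getFileName().toString().replaceAll("\\.pdf$", "");
        String[] parts = base.split("_");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", p.toString()); // full path as the unique key, per Alex's duplicate concern
        doc.addField("part1_s", parts[0]);
        doc.addField("part2_s", parts[1]);
        doc.addField("part3_s", parts[2]);
        doc.addField("part4_s", parts[3]);
        doc.addField("part5_s", parts[4]);
        try {
          client.add(doc);
        } catch (Exception e) {
          throw new RuntimeException(p.toString(), e);
        }
      });
    }
    client.commit();
    client.close();
  }
}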
Re: Can Apache Solr Handle TeraByte Large Data
@Alexandre: No, I don't need the contents of the files. I am repeating my requirement: I have 40 million files stored in a file system, with filenames such as ARIA_SSN10_0007_LOCATION_129.pdf. I just split the values out of the filename; those values are what I have to index. I am interested in indexing the values into Solr, not the file contents. I have tested DIH from a file system and it works fine, but I don't know how to implement my code in DIH: if my code extracts some values, how can I index them using DIH? If I use DIH, how would I do the split operation and get the values from it? -- View this message in context: http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html Sent from the Solr - User mailing list archive at Nabble.com.
Collapsing Query Parser returns one record per shard...was not expecting this...
From my reading of the Solr docs (e.g. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results and https://cwiki.apache.org/confluence/display/solr/Result+Grouping), I've been under the impression that these two methods (result grouping and the collapsing query parser) can both be used to eliminate duplicates from a result set. (In our case, we have a duplication field that contains a 'signature' that identifies duplicates; we use our own signature for a variety of reasons tied to complex business requirements.) In a test environment I scattered 15 duplicate records (along with another 10 unique records) across a test system running SolrCloud (Solr 5.2.1) with 4 shards and a replication factor of 2. I tried both result grouping and the collapsing query parser to remove duplicates. Result grouping worked as expected; the collapsing query parser did not. My results with the collapsing query parser showed that Solr was in fact including in the result set one of the duplicate records from each shard (that is, I received FOUR duplicate records, and turning on debug showed that each of the four came from a different shard), when I was expecting Solr to do the collapsing on the aggregated result and return only ONE of the duplicated records across ALL shards. It appears that Solr is performing the collapsing query parsing on each individual shard, but then NOT performing the operation on the aggregated results from the shards. I have searched through the forums and checked the documentation as carefully as I can, and I find no documentation or mention of this effect (one record being returned per shard) when using the collapsing query parser. Is this a known behavior? Am I just doing something wrong? Am I missing some search parameter? Am I simply not understanding how this is supposed to work? For reference, I am including below the search URL and the response I received. Any insights would be appreciated.
Query:

http://172.26.250.150:8983/solr/AcaColl/select?q=*%3A*&wt=json&indent=true&rows=1000&fq={!collapse%20field=dupid_s}&debugQuery=true

Response (note that dupid_s = 900 is the duplicate value and that I have added comments in the output ***comment*** pointing out which shard each response came from):

{
  "responseHeader":{
    "status":0,
    "QTime":31,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"*:*",
      "wt":"json",
      "fq":"{!collapse field=dupid_s}",
      "rows":"1000"}},
  "response":{"numFound":14,"start":0,"maxScore":1.0,"docs":[
      {
        "storeid_s":"1002",
        "dupid_s":"900",   ***AcaColl_shard2_replica2***
        "title_pqth":["Dupe Record #2"],
        "_version_":1508241005512491008,
        "indexTime_dt":"2015-07-31T19:25:09.914Z"},
      {
        "storeid_s":"8020",
        "dupid_s":"2005",
        "title_pqth":["Unique Record #5"],
        "_version_":1508241005539753984,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8023",
        "dupid_s":"2008",
        "title_pqth":["Unique Record #8"],
        "_version_":1508241005540802560,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"8024",
        "dupid_s":"2009",
        "title_pqth":["Unique Record #9"],
        "_version_":1508241005541851136,
        "indexTime_dt":"2015-07-31T19:25:09.94Z"},
      {
        "storeid_s":"1007",
        "dupid_s":"900",   ***AcaColl_shard4_replica2***
        "title_pqth":["Dupe Record #7"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8016",
        "dupid_s":"2001",
        "title_pqth":["Unique Record #1"],
        "_version_":1508241005526122496,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"8019",
        "dupid_s":"2004",
        "title_pqth":["Unique Record #4"],
        "_version_":1508241005528219648,
        "indexTime_dt":"2015-07-31T19:25:09.91Z"},
      {
        "storeid_s":"1003",
        "dupid_s":"900",   ***AcaColl_shard1_replica1***
        "title_pqth":["Dupe Record #3"],
        "_version_":1508241005515636736,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8017",
        "dupid_s":"2002",
        "title_pqth":["Unique Record #2"],
        "_version_":1508241005518782464,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"8018",
        "dupid_s":"2003",
        "title_pqth":["Unique Record #3"],
        "_version_":1508241005519831040,
        "indexTime_dt":"2015-07-31T19:25:09.917Z"},
      {
        "storeid_s":"1001",
        "dupid_s":"900",   ***AcaColl_shard3_replica1***
        "title_pqth":["Dupe Record #1"],
        "_version_":1508241005511442432,
        "indexTime_dt":"2015-07-31T19:25:09.912Z"},
      {
        "storeid_s":"8021",
        "dupid_s":"2006",
        "title_pqth":["Unique Record #6"],
        "_version_":1508241005532413952,
        "indexTime_dt":"2015-07-31T19:25:09.929Z"},
      {
        "storeid_s":"8022",
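What the poster observed matches how the collapsing query parser works: it collapses within each shard, and the merged result is not re-collapsed. The Collapse and Expand documentation notes that in SolrCloud all documents sharing a collapse key must live on the same shard, which is usually arranged with composite-id routing. A sketch of indexing that way, using the thread's field names with an invented id scheme (SolrJ 5.x assumed):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RoutedDupIndexer {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zkhost:2181");
    client.setDefaultCollection("AcaColl");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "900!1002"); // "900!" prefix co-locates every dupid 900 doc on one shard
    doc.addField("storeid_s", "1002");
    doc.addField("dupid_s", "900");
    client.add(doc);
    client.commit();
    client.close();
  }
}

With all members of a dupid group on one shard, the per-shard collapse then yields exactly one record per group overall.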