How to update Solr schema from a continuous integration environment
Hi, how do people usually update Solr configuration files from a continuous integration environment like TeamCity or Jenkins? We have multiple development and testing environments and use WebDeploy and AwsDeploy style tools to remotely deploy code multiple times a day. To update Solr I wrote a simple node server which accepts a conf folder over HTTP, updates the specified core's conf folder, and restarts the Solr service. Does a standard tool exist for this use case? I know about the Schema REST API, but I want to update all the files in the conf folder rather than just updating a single file or adding or removing synonyms piecemeal. Here is the link for the node server I mentioned, if anyone is interested: https://github.com/faisalmansoor/UpdateSolrConfig Thanks, Faisal
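A follow-up note: once the CI job has copied the new conf files into place, Solr can also be asked to reload the core instead of restarting the whole service; a reload picks up schema and solrconfig changes in most cases. A minimal SolrJ sketch, assuming a non-SolrCloud setup; the host URL and core name are placeholders:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadAfterConfigPush {
    public static void main(String[] args) throws Exception {
        // Assumes the CI job has already pushed the new conf/ files onto the Solr host
        // (the URL and core name below are placeholders, not part of the original post).
        HttpSolrServer admin = new HttpSolrServer("http://localhost:8983/solr");
        // Ask Solr to re-read the core's configuration without a full service restart.
        CoreAdminRequest.reloadCore("collection1", admin);
        admin.shutdown();
    }
}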
Re: Consul instead of ZooKeeper anyone?
It looks like Consul solves a different problem than Zookeeper. Consul manages what servers are up and starts new ones as needed. Zookeeper doesn’t start servers, but does leader election when they fail. I don’t see any way that Consul could replace Zookeeper, but it could solve another part of the problem. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Oct 31, 2014, at 5:15 PM, Erick Erickson wrote: > Not that I know of, but look before you leap. I took a quick look at > Consul and it really doesn't look like any kind of drop-in replacement. > Also, the Zookeeper usage in SolrCloud isn't really pluggable > AFAIK, so there'll be lots of places in the Solr code that need to be > reworked etc., especially in the realm of collections and sharding. > > The Collections API will be challenging to port over I think. > > Not to mention SolrJ and CloudSolrServer for clients who want to interact > with SolrCloud through Java. > > Not saying it won't work, I just suspect that getting it done would be > a big job, and thereafter keeping those changes in sync with the > changing SolrCloud code base would chew up a lots of time. So if > I were putting my Product Manager hat on I'd ask "is the benefit > worth the effort?". > > All that said, go for it if you've a mind to! > > Best, > Erick > > On Fri, Oct 31, 2014 at 4:08 PM, Greg Solovyev wrote: >> I am investigating a project to make SolrCloud run on Consul instead of >> ZooKeeper. So far, my research revealed no such efforts, but I wanted to >> check with this list to make sure I am not going to be reinventing the >> wheel. Have anyone attempted using Consul instead of ZK to coordinate >> SolrCloud nodes? >> >> Thanks, >> Greg
Re: Consul instead of ZooKeeper anyone?
Not that I know of, but look before you leap. I took a quick look at Consul and it really doesn't look like any kind of drop-in replacement. Also, the Zookeeper usage in SolrCloud isn't really pluggable AFAIK, so there'll be lots of places in the Solr code that need to be reworked etc., especially in the realm of collections and sharding. The Collections API will be challenging to port over I think. Not to mention SolrJ and CloudSolrServer for clients who want to interact with SolrCloud through Java. Not saying it won't work, I just suspect that getting it done would be a big job, and thereafter keeping those changes in sync with the changing SolrCloud code base would chew up a lot of time. So if I were putting my Product Manager hat on I'd ask "is the benefit worth the effort?". All that said, go for it if you've a mind to! Best, Erick On Fri, Oct 31, 2014 at 4:08 PM, Greg Solovyev wrote: > I am investigating a project to make SolrCloud run on Consul instead of > ZooKeeper. So far, my research revealed no such efforts, but I wanted to > check with this list to make sure I am not going to be reinventing the wheel. > Has anyone attempted using Consul instead of ZK to coordinate SolrCloud > nodes? > > Thanks, > Greg
Consul instead of ZooKeeper anyone?
I am investigating a project to make SolrCloud run on Consul instead of ZooKeeper. So far, my research has revealed no such efforts, but I wanted to check with this list to make sure I am not going to be reinventing the wheel. Has anyone attempted using Consul instead of ZK to coordinate SolrCloud nodes? Thanks, Greg
Re: exporting to CSV with solrj
@Will: I can't tell you how many times questions like "Why do you want to use CSV in SolrJ?" have led to solutions different from what the original question might imply. It's a question I frequently ask in almost the exact same way; it's a perfectly legitimate question IMO. Best, Erick On Fri, Oct 31, 2014 at 1:25 PM, Chris Hostetter wrote: > > : "Why do you want to use CSV in SolrJ?" Alexandre are you looking for a > > It's a legitimate question - part of providing good community support is > making sure we understand *why* users are asking how to do something, so > we can give good advice on other solutions people might not even have > thought of -- teach a man to fish, vs give a man a fish, etc... > > https://people.apache.org/~hossman/#xyproblem > > ...if we understand *why* people ask questions, or approach problems in > certain ways, we can not only offer the best possible suggestions, but > also consider how the underlying use case (and other similar use cases like > it) might be better served in the future. > > -Hoss > http://www.lucidworks.com/
Re: prefix length in fuzzy search solr 4.10.1
No, but it is a reasonable request, as a global default, a collection-specific default, a request-specific default, and on an individual fuzzy term. -- Jack Krupansky -Original Message- From: elisabeth benoit Sent: Thursday, October 30, 2014 6:07 AM To: solr-user@lucene.apache.org Subject: prefix length in fuzzy search solr 4.10.1 Hello all, Is there a parameter in the Solr 4.10.1 API allowing the user to set the prefix length in fuzzy search? Best regards, Elisabeth
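For reference, the knob already exists at the Lucene level: FuzzyQuery takes a prefix length directly, so custom query-parser or embedded-Lucene code can use it even though Solr's query syntax does not expose it. A minimal sketch; the field name and term are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

public class FuzzyPrefixExample {
    public static void main(String[] args) {
        // maxEdits = 2, prefixLength = 3: the first three characters must match
        // exactly, which also limits how many terms the fuzzy query expands to.
        FuzzyQuery q = new FuzzyQuery(new Term("name", "elisabeth"), 2, 3);
        System.out.println(q); // prints roughly name:elisabeth~2
    }
}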
[ANNOUNCE] Apache Solr 4.10.2 released
October 2014, Apache Solr™ 4.10.2 available The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.10.2 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Halloween, Mike McCandless http://blog.mikemccandless.com
Re: exporting to CSV with solrj
: "Why do you want to use CSV in SolrJ?" Alexandre are you looking for a It's a legitmate question - part of providing good community support is making sure we understand *why* users are asking how to do something, so we can give good advice on other solutions people might not even have thought of -- teach a man to fish, vs give a man a fish, etc... https://people.apache.org/~hossman/#xyproblem ...if we understand *why* people ask questions, or aproach problems in certain ways, we can not only offer the best possible suggestions, but also consider how the underlying usecase (and other similar use cases like it) might be better served in the future. -Hoss http://www.lucidworks.com/
Re: exporting to CSV with solrj
On 31 October 2014 14:58, will martin wrote: > "Why do you want to use CSV in SolrJ?" Alexandre are you looking for a > design gig? This kind of question really begs nothing but disdain. Nope. Not looking for a design gig. I give that advice away for free: http://www.airpair.com/solr/workshops/discovering-your-inner-search-engine, http://www.bigdatamontreal.org/?p=310 , http://www.solr-start.com/, etc. Though, in all fairness, I do charge for my Solr book: https://www.packtpub.com/big-data-and-business-intelligence/instant-apache-solr-indexing-data-how-instant In this particular case, there might have been two or three ways to answer the question, depending on why Ted wanted to use CSV from SolrJ as opposed to the more common command-line approach, which is the example given in the tutorial and online documentation. Depending on his business-level goals, there might have been different types of help offered. We in the Solr community sometimes call it an X-Y problem. However, if you, Will Martin of USA, take second-hand offence on behalf of another person, I do apologize to you. There certainly was no intent to upset innocent bystanders caught in the cross-fire of determining the right answer to a slightly unusual question. Regards, Alex.
Re: Solr index corrupt question
Erick Erickson wrote > What version of Solr/Lucene? At first it was Lucene/Solr 4.6, but later it was changed to Lucene/Solr 4.8. Later still, the _root_ field and child-document support were added to the schema. A full data re-index was not done after each change. Not long ago, though, I ran an optimize down to one segment, and there were no problems with the optimize. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166908.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Ideas for debugging poor SolrCloud scalability
Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas. On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson wrote: > Internally, the docs are batched up into smaller buckets (10 as I > remember) and forwarded to the correct shard leader. I suspect that's > what you're seeing. > > Erick > > On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan > wrote: > > Regarding batch indexing: > > When I send batches of 1000 docs to a standalone Solr server, the log > file > > reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the > > leader of a replicated index, the leader log file reports much smaller > > numbers, usually "(12 adds)". Why do the batches appear to be broken up? > > > > Peter > > > > On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson < > erickerick...@gmail.com> > > wrote: > > > >> NP, just making sure. > >> > >> I suspect you'll get lots more bang for the buck, and > >> results much more closely matching your expectations if > >> > >> 1> you batch up a bunch of docs at once rather than > >> sending them one at a time. That's probably the easiest > >> thing to try. Sending docs one at a time is something of > >> an anti-pattern. I usually start with batches of 1,000. > >> > >> And just to check.. You're not issuing any commits from the > >> client, right? Performance will be terrible if you issue commits > >> after every doc, that's totally an anti-pattern. Doubly so for > >> optimizes Since you showed us your solrconfig autocommit > >> settings I'm assuming not but want to be sure. > >> > >> 2> use a leader-aware client. I'm totally unfamiliar with Go, > >> so I have no suggestions whatsoever to offer there But you'll > >> want to batch in this case too. > >> > >> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose > wrote: > >> > Hi Erick - > >> > > >> > Thanks for the detailed response and apologies for my confusing > >> > terminology. I should have said "WPS" (writes per second) instead of > QPS > >> > but I didn't want to introduce a weird new acronym since QPS is well > >> > known. Clearly a bad decision on my part. To clarify: I am doing > >> > *only* writes > >> > (document adds). Whenever I wrote "QPS" I was referring to writes. > >> > > >> > It seems clear at this point that I should wrap up the code to do > "smart" > >> > routing rather than choose Solr nodes randomly. And then see if that > >> > changes things. I must admit that although I understand that random > node > >> > selection will impose a performance hit, theoretically it seems to me > >> that > >> > the system should still scale up as you add more nodes (albeit at > lower > >> > absolute level of performance than if you used a smart router). > >> > Nonetheless, I'm just theorycrafting here so the better thing to do is > >> just > >> > try it experimentally. I hope to have that working today - will > report > >> > back on my findings. > >> > > >> > Cheers, > >> > - Ian > >> > > >> > p.s. To clarify why we are rolling our own smart router code, we use > Go > >> > over here rather than Java. Although if we still get bad performance > >> with > >> > our custom Go router I may try a pure Java load client using > >> > CloudSolrServer to eliminate the possibility of bugs in our > >> implementation. 
> >> > > >> > > >> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson < > erickerick...@gmail.com > >> > > >> > wrote: > >> > > >> >> I'm really confused: > >> >> > >> >> bq: I am not issuing any queries, only writes (document inserts) > >> >> > >> >> bq: It's clear that once the load test client has ~40 simulated users > >> >> > >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support > >> >> a higher QPS than 2 shards over 2 Solr nodes, right > >> >> > >> >> QPS is usually used to mean "Queries Per Second", which is different > >> from > >> >> the statement that "I am not issuing any queries". And what do > the > >> >> number of users have to do with inserting documents? > >> >> > >> >> You also state: " In many cases, CPU on the solr servers is quite > low as > >> >> well" > >> >> > >> >> So let's talk about indexing first. Indexing should scale nearly > >> >> linearly as long as > >> >> 1> you are routing your docs to the correct leader, which happens > with > >> >> SolrJ > >> >> and the CloudSolrSever automatically. Rather than rolling your own, I > >> >> strongly > >> >> suggest you try this out. > >> >> 2> you have enough clients feeding the cluster to push CPU > utilization > >> >> on them all. > >> >> Very often "slow indexing", or in your case "lack of scaling" is a > >> >> result of document > >> >> acquisition or, in your case, your doc generator is spending all it's > >> >> time waiting for > >> >> the individual documents to get to Solr and come back. > >> >> > >> >> bq: "chooses a random solr server for e
Re: Ideas for debugging poor SolrCloud scalability
Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing. Erick On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan wrote: > Regarding batch indexing: > When I send batches of 1000 docs to a standalone Solr server, the log file > reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the > leader of a replicated index, the leader log file reports much smaller > numbers, usually "(12 adds)". Why do the batches appear to be broken up? > > Peter > > On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson > wrote: > >> NP, just making sure. >> >> I suspect you'll get lots more bang for the buck, and >> results much more closely matching your expectations if >> >> 1> you batch up a bunch of docs at once rather than >> sending them one at a time. That's probably the easiest >> thing to try. Sending docs one at a time is something of >> an anti-pattern. I usually start with batches of 1,000. >> >> And just to check.. You're not issuing any commits from the >> client, right? Performance will be terrible if you issue commits >> after every doc, that's totally an anti-pattern. Doubly so for >> optimizes Since you showed us your solrconfig autocommit >> settings I'm assuming not but want to be sure. >> >> 2> use a leader-aware client. I'm totally unfamiliar with Go, >> so I have no suggestions whatsoever to offer there But you'll >> want to batch in this case too. >> >> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose wrote: >> > Hi Erick - >> > >> > Thanks for the detailed response and apologies for my confusing >> > terminology. I should have said "WPS" (writes per second) instead of QPS >> > but I didn't want to introduce a weird new acronym since QPS is well >> > known. Clearly a bad decision on my part. To clarify: I am doing >> > *only* writes >> > (document adds). Whenever I wrote "QPS" I was referring to writes. >> > >> > It seems clear at this point that I should wrap up the code to do "smart" >> > routing rather than choose Solr nodes randomly. And then see if that >> > changes things. I must admit that although I understand that random node >> > selection will impose a performance hit, theoretically it seems to me >> that >> > the system should still scale up as you add more nodes (albeit at lower >> > absolute level of performance than if you used a smart router). >> > Nonetheless, I'm just theorycrafting here so the better thing to do is >> just >> > try it experimentally. I hope to have that working today - will report >> > back on my findings. >> > >> > Cheers, >> > - Ian >> > >> > p.s. To clarify why we are rolling our own smart router code, we use Go >> > over here rather than Java. Although if we still get bad performance >> with >> > our custom Go router I may try a pure Java load client using >> > CloudSolrServer to eliminate the possibility of bugs in our >> implementation. >> > >> > >> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson > > >> > wrote: >> > >> >> I'm really confused: >> >> >> >> bq: I am not issuing any queries, only writes (document inserts) >> >> >> >> bq: It's clear that once the load test client has ~40 simulated users >> >> >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support >> >> a higher QPS than 2 shards over 2 Solr nodes, right >> >> >> >> QPS is usually used to mean "Queries Per Second", which is different >> from >> >> the statement that "I am not issuing any queries". And what do the >> >> number of users have to do with inserting documents? 
>> >> >> >> You also state: " In many cases, CPU on the solr servers is quite low as >> >> well" >> >> >> >> So let's talk about indexing first. Indexing should scale nearly >> >> linearly as long as >> >> 1> you are routing your docs to the correct leader, which happens with >> >> SolrJ >> >> and the CloudSolrSever automatically. Rather than rolling your own, I >> >> strongly >> >> suggest you try this out. >> >> 2> you have enough clients feeding the cluster to push CPU utilization >> >> on them all. >> >> Very often "slow indexing", or in your case "lack of scaling" is a >> >> result of document >> >> acquisition or, in your case, your doc generator is spending all it's >> >> time waiting for >> >> the individual documents to get to Solr and come back. >> >> >> >> bq: "chooses a random solr server for each ADD request (with 1 doc per >> add >> >> request)" >> >> >> >> Probably your culprit right there. Each and every document requires that >> >> you >> >> have to cross the network (and forward that doc to the correct leader). >> So >> >> given >> >> that you're not seeing high CPU utilization, I suspect that you're not >> >> sending >> >> enough docs to SolrCloud fast enough to see scaling. You need to batch >> up >> >> multiple docs, I generally send 1,000 docs at a time. >> >> >> >> But even if you do solve this, the inter-node routing will prevent >> >> linear scaling. >> >> When a doc (or a ba
Re: Ideas for debugging poor SolrCloud scalability
Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually "(12 adds)". Why do the batches appear to be broken up? Peter On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson wrote: > NP, just making sure. > > I suspect you'll get lots more bang for the buck, and > results much more closely matching your expectations if > > 1> you batch up a bunch of docs at once rather than > sending them one at a time. That's probably the easiest > thing to try. Sending docs one at a time is something of > an anti-pattern. I usually start with batches of 1,000. > > And just to check.. You're not issuing any commits from the > client, right? Performance will be terrible if you issue commits > after every doc, that's totally an anti-pattern. Doubly so for > optimizes Since you showed us your solrconfig autocommit > settings I'm assuming not but want to be sure. > > 2> use a leader-aware client. I'm totally unfamiliar with Go, > so I have no suggestions whatsoever to offer there But you'll > want to batch in this case too. > > On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose wrote: > > Hi Erick - > > > > Thanks for the detailed response and apologies for my confusing > > terminology. I should have said "WPS" (writes per second) instead of QPS > > but I didn't want to introduce a weird new acronym since QPS is well > > known. Clearly a bad decision on my part. To clarify: I am doing > > *only* writes > > (document adds). Whenever I wrote "QPS" I was referring to writes. > > > > It seems clear at this point that I should wrap up the code to do "smart" > > routing rather than choose Solr nodes randomly. And then see if that > > changes things. I must admit that although I understand that random node > > selection will impose a performance hit, theoretically it seems to me > that > > the system should still scale up as you add more nodes (albeit at lower > > absolute level of performance than if you used a smart router). > > Nonetheless, I'm just theorycrafting here so the better thing to do is > just > > try it experimentally. I hope to have that working today - will report > > back on my findings. > > > > Cheers, > > - Ian > > > > p.s. To clarify why we are rolling our own smart router code, we use Go > > over here rather than Java. Although if we still get bad performance > with > > our custom Go router I may try a pure Java load client using > > CloudSolrServer to eliminate the possibility of bugs in our > implementation. > > > > > > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson > > > wrote: > > > >> I'm really confused: > >> > >> bq: I am not issuing any queries, only writes (document inserts) > >> > >> bq: It's clear that once the load test client has ~40 simulated users > >> > >> bq: A cluster of 3 shards over 3 Solr nodes *should* support > >> a higher QPS than 2 shards over 2 Solr nodes, right > >> > >> QPS is usually used to mean "Queries Per Second", which is different > from > >> the statement that "I am not issuing any queries". And what do the > >> number of users have to do with inserting documents? > >> > >> You also state: " In many cases, CPU on the solr servers is quite low as > >> well" > >> > >> So let's talk about indexing first. 
Indexing should scale nearly > >> linearly as long as > >> 1> you are routing your docs to the correct leader, which happens with > >> SolrJ > >> and the CloudSolrSever automatically. Rather than rolling your own, I > >> strongly > >> suggest you try this out. > >> 2> you have enough clients feeding the cluster to push CPU utilization > >> on them all. > >> Very often "slow indexing", or in your case "lack of scaling" is a > >> result of document > >> acquisition or, in your case, your doc generator is spending all it's > >> time waiting for > >> the individual documents to get to Solr and come back. > >> > >> bq: "chooses a random solr server for each ADD request (with 1 doc per > add > >> request)" > >> > >> Probably your culprit right there. Each and every document requires that > >> you > >> have to cross the network (and forward that doc to the correct leader). > So > >> given > >> that you're not seeing high CPU utilization, I suspect that you're not > >> sending > >> enough docs to SolrCloud fast enough to see scaling. You need to batch > up > >> multiple docs, I generally send 1,000 docs at a time. > >> > >> But even if you do solve this, the inter-node routing will prevent > >> linear scaling. > >> When a doc (or a batch of docs) goes to a random Solr node, here's what > >> happens: > >> 1> the docs are re-packaged into groups based on which shard they're > >> destined for > >> 2> the sub-packets are forwarded to the leader for each shard > >> 3> the responses are gathered back and returned to the client. > >> > >> This set of operations will eventually de
Re: exporting to CSV with solrj
"Why do you want to use CSV in SolrJ?" Alexandre are you looking for a design gig. This kind of question really begs nothing but disdain. Commodity search exists, not matter what Paul Nelson writes and part of that problem is due to advanced users always rewriting the reqs and specs of less experienced users. On Fri, Oct 31, 2014 at 1:05 PM, Alexandre Rafalovitch wrote: > Why do you want to use CSV in SolrJ? You would just have to parse it again. > > You could just trigger that as a URL call from outside with cURL or as > just an HTTP (not SolrJ) call from Java client. > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 31 October 2014 12:34, tedsolr wrote: > > Sure thing, but how do I get the results output in CSV format? > > response.getResults() is a list of SolrDocuments. > > > > > > > > -- > > View this message in context: > http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html > > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: exporting to CSV with solrj
I think I'm getting the idea now. You either use the response writer via an HTTP call, or you write your own exporter. Thanks to everyone for their input. -- View this message in context: http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166889.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Missing Records
Sorry to say this, but I don't think the numDocs/maxDoc numbers are telling you anything. because it looks like you've optimized which purges any data associated with deleted docs, including the internal IDs which are the numDocs/maxDocs figures. So if there were deletions, we can't see any evidence of same. Siih. On Fri, Oct 31, 2014 at 9:56 AM, AJ Lemke wrote: > I have run some more tests so the numbers have changed a bit. > > Index Results done on Node 1: > Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. > (Duration: 31m 47s) > Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993 > > Node 1: > Last Modified: 44 minutes ago > Num Docs: 824216 > Max Doc: 824216 > Heap Memory Usage: -1 > Deleted Docs: 0 > Version: 1051 > Segment Count: 1 > Optimized: checked > Current: checked > > Node 2: > Last Modified: 44 minutes ago > Num Docs: 824216 > Max Doc: 824216 > Heap Memory Usage: -1 > Deleted Docs: 0 > Version: 1051 > Segment Count: 1 > Optimized: checked > Current: checked > > Search results are the same as the doc numbers above. > > Logs only have one instance of an error: > > ERROR - 2014-10-31 10:47:12.867; > org.apache.solr.update.StreamingSolrServers$1; error > org.apache.solr.common.SolrException: Bad Request > > > > request: > http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2 > at > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > Some info that may be of help > This is on my local vm using jetty with the embedded zookeeper. > Commands to start cloud: > > java -DzkRun -jar start.jar > java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar > > sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir > ~/development/configs/inventory/ -confname config_ inventory > sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory > -confname config_ inventory > > curl > "http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4"; > curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name= > inventory " > > AJ > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Friday, October 31, 2014 9:49 AM > To: solr-user@lucene.apache.org > Subject: Re: Missing Records > > OK, that is puzzling. > > bq: If there were duplicates only one of the duplicates should be removed and > I still should be able to search for the ID and find one correct? > > Correct. > > Your bad request error is puzzling, you may be on to something there. > What it looks like is that somehow some of the documents you're sending to > Solr aren't getting indexed, either being dropped through the network or > perhaps have invalid fields, field formats (i.e. a date in the wrong format, > whatever) or some such. When you complete the run, what are the maxDoc and > numDocs numbers on one of the nodes? > > What else do you see in the logs? They're pretty big after that many adds, > but maybe you can grep for ERROR and see something interesting like stack > traces. Or even "org.apache.solr". 
This latter will give you some false hits, > but at least it's better than paging through a huge log file > > Personally, in this kind of situation I sometimes use SolrJ to do my indexing > rather than DIH, I find it easier to debug so that's another possibility. In > the worst case with SolrJ, you can send the docs one at a time > > Best, > Erick > > On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke wrote: >> Hi Erick: >> >> All of the records are coming out of an auto numbered field so the ID's will >> all be unique. >> >> Here is the the test I ran this morning: >> >> Indexing completed. Added/Updated: 903,993 documents. Deleted 0 >> documents. (Duration: 28m) >> Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: >> 903,993 (538/s) >> Started: 33 minutes ago >> >> Last Modified:4 minutes ago >> Num Docs:903829 >> Max Doc:903829 >> Heap Memory Usage:-1 >> Deleted Docs:0 >> Version:1517 >> Segment Count:16 >> Optimized: checked >> Current: checked >> >> If there were duplicates only one of the duplicates should be removed and I >> still should be able to search for the ID and find one correct? >> As it is right now I am missing records that should be in the collection. >> >> I also noticed this: >> >> org.apache.solr.common.SolrException: Bad Request >> >> >> >> request: >> http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.fr
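A minimal sketch of the SolrJ "one document at a time" approach suggested above, useful for isolating whichever record triggers the Bad Request; the URL, field names and row source are placeholders rather than AJ's actual setup:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OneAtATimeIndexer {
    // Placeholder row type and source, standing in for the real DIH query.
    static class Row { String id; String description; }
    static Iterable<Row> fetchRows() { return new java.util.ArrayList<Row>(); }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/inventory");
        for (Row row : fetchRows()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", row.id);                  // field names are placeholders
            doc.addField("description", row.description);
            try {
                solr.add(doc);                           // one document per request
            } catch (Exception e) {
                // A record with a bad date format, missing required field, etc.
                // surfaces here instead of being lost inside a batched update.
                System.err.println("Failed doc id=" + row.id + ": " + e.getMessage());
            }
        }
        solr.commit();
        solr.shutdown();
    }
}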
Re: Only copy string up to certain character symbol?
In addition to Alexandre's comment, your index chain looks suspect: So the pattern replace stuff happens on the grams, not the full input. You might be better off with a solr.PatternReplaceCharFilterFactory which works on the entire input string before even tokenization is done. That said, Alexandre's comment is spot on. If your evidence for not respecting the regex is that the document returns the whole input, it's because the stored="true" stores the raw input and has nothing to do with the analysis chain, the split to store the input happens before any kind of analysis processing. On Fri, Oct 31, 2014 at 9:33 AM, Alexandre Rafalovitch wrote: > copyField can copy only part of the string but it is defined by > character count. If you want to use regular expressions, you may be > better off to do the copy in the UpdateRequestProcessor chain instead: > http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory > > What you are doing (RegEx in the chain) only affects "indexed" > representation of the text. Not the stored content. I suspect that's > not what you want. > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 31 October 2014 11:49, hschillig wrote: >> So I have a title field that is common to look like this: >> >> Personal legal forms simplified : the ultimate guide to personal legal forms >> / Daniel Sitarz. >> >> I made a copyField that is of type "title_only". I want to ONLY copy the >> text "Personal legal forms simplified : the ultimate guide to personal legal >> forms".. so everything before the "/" symbol. I have it like this in my >> schema.xml: >> >> >> >> >> >> > maxGramSize="15" side="front" /> >> > pattern="(\/.+?$)" replacement=""/> >> >> >> >> >> > pattern="(\/.+?$)" replacement=""/> >> >> >> >> My regex seems to be off though as the field still holds the entire value >> when I reindex and restart SolR. Thanks for any help! >> >> >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html >> Sent from the Solr - User mailing list archive at Nabble.com.
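A sketch of the charFilter suggestion in schema.xml; the tokenizer and lowercase filter are assumptions, and the pattern is inferred from the original post - only the charFilter line is the point:

<fieldType name="title_only" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs on the whole raw value before tokenizing or edge n-gramming,
         stripping everything from the "/" onward. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s*/.*$" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

As noted above, this still only changes the indexed terms; the stored value returned with search results is untouched.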
Re: Solr index corrupt question
What version of Solr/Lucene? There have been some instances of index corruption, see the lucene/CHANGES.txt file that might account for it. This is something of a stab in the dark though. Because this is troubling... Best, Erick On Fri, Oct 31, 2014 at 7:57 AM, ku3ia wrote: > Hi, Erick. Thanks for you response. > > I'd checked my index via check index utility, and what I'm got: > > 3 of 41: name=_1ouwn docCount=518333 > codec=Lucene46 > compound=false > numFiles=11 > size (MB)=431.564 > diagnostics = {timestamp=1412166850391, os=Linux, > os.version=3.2.0-68-generic, mergeFactor=10, source=merge, > lucene.version=4.8-SNAPSHOT - root - 2014-09-04 12:30:45, os.arch=amd64, > mergeMaxNumSegments=-1, java.version=1.7.0_67, java.vendor=Oracle > Corporation} > has deletions [delGen=2260] > test: open reader.OK > test: check integrity.FAILED > WARNING: fixIndex() would remove reference to this segment; full > exception: > org.apache.lucene.index.CorruptIndexException: checksum failed (hardware > problem?) : expected=e240ae5a actual=12262037 > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/data/solrcloud/node1/index.bak/_1ouwn_Lucene41_0.pos"))) > at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211) > at > org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268) > at > org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.checkIntegrity(Lucene41PostingsReader.java:1556) > at > org.apache.lucene.codecs.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:3018) > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:243) > at > org.apache.lucene.index.SegmentReader.checkIntegrity(SegmentReader.java:587) > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:561) > at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1967) > > I have 3 dead segments, but there is one interesting thing: I have a backup > of this segment, which I make after an optimize to one segment a month ago, > naturally w/o del-file. So, when I'd replaced it - nothing was changed. > > It is possible, that my HDD is corrupted, but I'd checked it on bads and was > not found anything. > > May be a del-file is corrupted? How I can check it or restore? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166848.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: exporting to CSV with solrj
: Sure thing, but how do I get the results output in CSV format? : response.getResults() is a list of SolrDocuments. Either use something like the NoOpResponseParser which will give you the entire response back as a single string, or implement your own ResponseParser along the lines of...

public class YourRawParser extends ResponseParser {
  public NamedList processResponse(InputStream body, String encoding) {
    // do whatever you want with the data in the InputStream
    // as the data comes over the wire
    doStuff(body);
    // just ignore the result SolrServer gives you
    return new NamedList();
  }
}

-Hoss http://www.lucidworks.com/
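A sketch of the first option, assuming the SolrJ version in use ships NoOpResponseParser with a writer-type setter and returns the raw body under the "response" key (worth verifying against your SolrJ jar):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.NoOpResponseParser;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class RawCsvQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.set("wt", "csv");           // ask for the CSV response writer
        q.setRows(1000);              // page with start/rows (or cursorMark) for big exports

        QueryRequest req = new QueryRequest(q);
        NoOpResponseParser parser = new NoOpResponseParser();
        parser.setWriterType("csv");  // keep the raw CSV instead of parsing it into SolrDocuments
        req.setResponseParser(parser);

        NamedList<Object> result = server.request(req);
        String csv = (String) result.get("response"); // whole CSV body as one string
        System.out.print(csv);
        server.shutdown();
    }
}

For millions of documents, combining this with start/rows or cursorMark paging keeps any single response to a manageable size.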
Re: exporting to CSV with solrj
Why do you want to use CSV in SolrJ? You would just have to parse it again. You could just trigger that as a URL call from outside with cURL or as just an HTTP (not SolrJ) call from Java client. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 31 October 2014 12:34, tedsolr wrote: > Sure thing, but how do I get the results output in CSV format? > response.getResults() is a list of SolrDocuments. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html > Sent from the Solr - User mailing list archive at Nabble.com.
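A sketch of that plain-HTTP option in Java, streaming the CSV straight to a file so nothing has to be parsed or held in memory; the URL, query and output path are placeholders:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        // wt=csv asks Solr's CSVResponseWriter to do the formatting server-side.
        String url = "http://localhost:8983/solr/collection1/select"
                   + "?q=*:*&wt=csv&rows=100000&fl=id,title";
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, Paths.get("export.csv"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}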
RE: Missing Records
I have run some more tests so the numbers have changed a bit. Index Results done on Node 1: Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. (Duration: 31m 47s) Requests: 1 (0/s), Fetched: 903,993 (474/s), Skipped: 0, Processed: 903,993 Node 1: Last Modified: 44 minutes ago Num Docs: 824216 Max Doc: 824216 Heap Memory Usage: -1 Deleted Docs: 0 Version: 1051 Segment Count: 1 Optimized: checked Current: checked Node 2: Last Modified: 44 minutes ago Num Docs: 824216 Max Doc: 824216 Heap Memory Usage: -1 Deleted Docs: 0 Version: 1051 Segment Count: 1 Optimized: checked Current: checked Search results are the same as the doc numbers above. Logs only have one instance of an error: ERROR - 2014-10-31 10:47:12.867; org.apache.solr.update.StreamingSolrServers$1; error org.apache.solr.common.SolrException: Bad Request request: http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2 at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Some info that may be of help This is on my local vm using jetty with the embedded zookeeper. Commands to start cloud: java -DzkRun -jar start.jar java -Djetty.port=7574 -DzkRun -DzkHost=localhost:9983 -jar start.jar sh zkcli.sh -zkhost localhost:9983 -cmd upconfig -confdir ~/development/configs/inventory/ -confname config_ inventory sh zkcli.sh -zkhost localhost:9983 -cmd linkconfig -collection inventory -confname config_ inventory curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4"; curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name= inventory " AJ -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, October 31, 2014 9:49 AM To: solr-user@lucene.apache.org Subject: Re: Missing Records OK, that is puzzling. bq: If there were duplicates only one of the duplicates should be removed and I still should be able to search for the ID and find one correct? Correct. Your bad request error is puzzling, you may be on to something there. What it looks like is that somehow some of the documents you're sending to Solr aren't getting indexed, either being dropped through the network or perhaps have invalid fields, field formats (i.e. a date in the wrong format, whatever) or some such. When you complete the run, what are the maxDoc and numDocs numbers on one of the nodes? What else do you see in the logs? They're pretty big after that many adds, but maybe you can grep for ERROR and see something interesting like stack traces. Or even "org.apache.solr". This latter will give you some false hits, but at least it's better than paging through a huge log file Personally, in this kind of situation I sometimes use SolrJ to do my indexing rather than DIH, I find it easier to debug so that's another possibility. In the worst case with SolrJ, you can send the docs one at a time Best, Erick On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke wrote: > Hi Erick: > > All of the records are coming out of an auto numbered field so the ID's will > all be unique. > > Here is the the test I ran this morning: > > Indexing completed. 
Added/Updated: 903,993 documents. Deleted 0 > documents. (Duration: 28m) > Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: > 903,993 (538/s) > Started: 33 minutes ago > > Last Modified:4 minutes ago > Num Docs:903829 > Max Doc:903829 > Heap Memory Usage:-1 > Deleted Docs:0 > Version:1517 > Segment Count:16 > Optimized: checked > Current: checked > > If there were duplicates only one of the duplicates should be removed and I > still should be able to search for the ID and find one correct? > As it is right now I am missing records that should be in the collection. > > I also noticed this: > > org.apache.solr.common.SolrException: Bad Request > > > > request: > http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2 > at > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > AJ > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Thursday, October 30,
Re: exporting to CSV with solrj
Sure thing, but how do I get the results output in CSV format? response.getResults() is a list of SolrDocuments. -- View this message in context: http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845p4166861.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Only copy string up to certain character symbol?
copyField can copy only part of the string but it is defined by character count. If you want to use regular expressions, you may be better off to do the copy in the UpdateRequestProcessor chain instead: http://www.solr-start.com/info/update-request-processors/#RegexReplaceProcessorFactory What you are doing (RegEx in the chain) only affects "indexed" representation of the text. Not the stored content. I suspect that's not what you want. Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On 31 October 2014 11:49, hschillig wrote: > So I have a title field that is common to look like this: > > Personal legal forms simplified : the ultimate guide to personal legal forms > / Daniel Sitarz. > > I made a copyField that is of type "title_only". I want to ONLY copy the > text "Personal legal forms simplified : the ultimate guide to personal legal > forms".. so everything before the "/" symbol. I have it like this in my > schema.xml: > > > > > > maxGramSize="15" side="front" /> > pattern="(\/.+?$)" replacement=""/> > > > > > pattern="(\/.+?$)" replacement=""/> > > > > My regex seems to be off though as the field still holds the entire value > when I reindex and restart SolR. Thanks for any help! > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html > Sent from the Solr - User mailing list archive at Nabble.com.
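A sketch of that UpdateRequestProcessor approach in solrconfig.xml; the chain name and the exact regex are assumptions - the idea is to clone title into title_only and strip everything from the "/" onward before the document is indexed and stored:

<updateRequestProcessorChain name="trim-title" default="true">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">title</str>
    <str name="dest">title_only</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">title_only</str>
    <str name="pattern">\s*/.*$</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Because this runs before the document reaches the index, both the indexed terms and the stored value of title_only are trimmed, unlike a filter in the analysis chain.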
Only copy string up to certain character symbol?
So I have a title field that commonly looks like this: Personal legal forms simplified : the ultimate guide to personal legal forms / Daniel Sitarz. I made a copyField that is of type "title_only". I want to ONLY copy the text "Personal legal forms simplified : the ultimate guide to personal legal forms".. so everything before the "/" symbol. I have it like this in my schema.xml: My regex seems to be off though, as the field still holds the entire value when I reindex and restart Solr. Thanks for any help! -- View this message in context: http://lucene.472066.n3.nabble.com/Only-copy-string-up-to-certain-character-symbol-tp4166857.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: exporting to CSV with solrj
When you fire a query against Solr with wt=csv, the response coming from Solr is *already* in CSV; the CSVResponseWriter is responsible for translating SolrDocument instances into CSV on the server side, so I don't see any reason to use it yourself - Solr already does the heavy lifting for you. Regards, On Oct 31, 2014, at 10:44 AM, tedsolr wrote: > I am trying to invoke the CSVResponseWriter to create a CSV file of all > stored fields. There are millions of documents so I need to write to the > file iteratively. I saw a snippet of code online that claimed it could > effectively remove the SolrDocumentList wrapper and allow the docs to be > retrieved in the actual format requested in the query. However, I get a null > pointer from the CSVResponseWriter.write() method. > > SolrQuery qry = new SolrQuery("*:*"); > qry.setParam("wt", "csv"); > // set other params > SolrServer server = getSolrServer(); > try { > QueryResponse res = server.query(qry); > > CSVResponseWriter writer = new CSVResponseWriter(); > Writer w = new StringWriter(); > SolrQueryResponse solrResponse = new SolrQueryResponse(); > solrResponse.setAllValues(res.getResponse()); >try { > SolrParams list = new MapSolrParams(new HashMap<String, String>()); > writer.write(w, new LocalSolrQueryRequest(null, list), > solrResponse); >} catch (IOException e) { >throw new RuntimeException(e); >} >System.out.print(w.toString()); > > } catch (SolrServerException e) { > e.printStackTrace(); > } > > NPE snippet: > org.apache.solr.response.CSVWriter.writeResponse(CSVResponseWriter.java:281) > org.apache.solr.response.CSVResponseWriter.write(CSVResponseWriter.java:56) > > Am I on the right track with the approach? I really don't want to roll my > own document to CSV line converter. Thanks! > Solr 4.9 > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr index corrupt question
Hi, Erick. Thanks for your response. I checked my index with the CheckIndex utility, and here is what I got:

3 of 41: name=_1ouwn docCount=518333 codec=Lucene46 compound=false numFiles=11 size (MB)=431.564 diagnostics = {timestamp=1412166850391, os=Linux, os.version=3.2.0-68-generic, mergeFactor=10, source=merge, lucene.version=4.8-SNAPSHOT - root - 2014-09-04 12:30:45, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.7.0_67, java.vendor=Oracle Corporation} has deletions [delGen=2260] test: open reader.OK test: check integrity.FAILED WARNING: fixIndex() would remove reference to this segment; full exception: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=e240ae5a actual=12262037 (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/data/solrcloud/node1/index.bak/_1ouwn_Lucene41_0.pos"))) at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:211) at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:268) at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.checkIntegrity(Lucene41PostingsReader.java:1556) at org.apache.lucene.codecs.BlockTreeTermsReader.checkIntegrity(BlockTreeTermsReader.java:3018) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.checkIntegrity(PerFieldPostingsFormat.java:243) at org.apache.lucene.index.SegmentReader.checkIntegrity(SegmentReader.java:587) at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:561) at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1967)

I have 3 dead segments, but there is one interesting thing: I have a backup of this segment, which I made after an optimize to one segment a month ago, naturally without the del-file. But when I replaced the segment with the backup, nothing changed. It is possible that my HDD is corrupted, but I checked it for bad sectors and did not find any. Maybe the del-file is corrupted? How can I check or restore it? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810p4166848.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Missing Records
OK, that is puzzling. bq: If there were duplicates only one of the duplicates should be removed and I still should be able to search for the ID and find one correct? Correct. Your bad request error is puzzling, you may be on to something there. What it looks like is that somehow some of the documents you're sending to Solr aren't getting indexed, either being dropped through the network or perhaps have invalid fields, field formats (i.e. a date in the wrong format, whatever) or some such. When you complete the run, what are the maxDoc and numDocs numbers on one of the nodes? What else do you see in the logs? They're pretty big after that many adds, but maybe you can grep for ERROR and see something interesting like stack traces. Or even "org.apache.solr". This latter will give you some false hits, but at least it's better than paging through a huge log file Personally, in this kind of situation I sometimes use SolrJ to do my indexing rather than DIH, I find it easier to debug so that's another possibility. In the worst case with SolrJ, you can send the docs one at a time Best, Erick On Fri, Oct 31, 2014 at 7:37 AM, AJ Lemke wrote: > Hi Erick: > > All of the records are coming out of an auto numbered field so the ID's will > all be unique. > > Here is the the test I ran this morning: > > Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. > (Duration: 28m) > Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 > (538/s) > Started: 33 minutes ago > > Last Modified:4 minutes ago > Num Docs:903829 > Max Doc:903829 > Heap Memory Usage:-1 > Deleted Docs:0 > Version:1517 > Segment Count:16 > Optimized: checked > Current: checked > > If there were duplicates only one of the duplicates should be removed and I > still should be able to search for the ID and find one correct? > As it is right now I am missing records that should be in the collection. > > I also noticed this: > > org.apache.solr.common.SolrException: Bad Request > > > > request: > http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2 > at > org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > AJ > > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: Thursday, October 30, 2014 7:08 PM > To: solr-user@lucene.apache.org > Subject: Re: Missing Records > > First question: Is there any possibility that some of the docs have duplicate > IDs (s)? If so, then some of the docs will be replaced, which will > lower your returns. > One way to figuring this out is to go to the admin screen and if numDocs < > maxDoc, then documents have been replaced. > > Also, if numDocs is smaller than 903,993 then you probably have some docs > being replaced. One warning, however. Even if docs are deleted, then this > could still be the case because when segments are merged the deleted docs are > purged. > > Best, > Erick > > On Thu, Oct 30, 2014 at 3:12 PM, S.L wrote: >> I am curious , how many shards do you have and whats the replication >> factor you are using ? 
>> >> On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke wrote: >> >>> Hi All, >>> >>> We have a SOLR cloud instance that has been humming along nicely for >>> months. >>> Last week we started experiencing missing records. >>> >>> Admin DIH Example: >>> Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A >>> *:* search claims that there are only 903,902 this is the first full >>> index. >>> Subsequent full indexes give the following counts for the *:* search >>> 903,805 >>> 903,665 >>> 826,357 >>> >>> All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, >>> Processed: 903,993 (x/s) every time. ---records per second is >>> variable >>> >>> >>> I found an item that should be in the index but is not found in a search. >>> >>> Here are the referenced lines of the log file. >>> >>> DEBUG - 2014-10-30 15:10:51.160; >>> org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE >>> add{,id=750041421} >>> {{params(debug=false&optimize=true&indent=true&commit=true&clean=true >>> &wt=json&command=full-import&entity=ads&verbose=false),defaults(confi >>> g=data-config.xml)}} >>> DEBUG - 2014-10-30 15:10:51.160; >>> org.apache.solr.update.SolrCmdDistributor; sending update to >>> http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 >>> add{,id=750041421} >>> params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.5 >>> 7%3A8983%2Fsolr%2Finventory_shard1_replica1%2F >>> >>> --- there are 746 lines of log between entries ---
exporting to CSV with solrj
I am trying to invoke the CSVResponseWriter to create a CSV file of all stored fields. There are millions of documents so I need to write to the file iteratively. I saw a snippet of code online that claimed it could effectively remove the SolrDocumentList wrapper and allow the docs to be retrieved in the actual format requested in the query. However, I get a null pointer from the CSVResponseWriter.write() method.

SolrQuery qry = new SolrQuery("*:*");
qry.setParam("wt", "csv");
// set other params
SolrServer server = getSolrServer();
try {
  QueryResponse res = server.query(qry);
  CSVResponseWriter writer = new CSVResponseWriter();
  Writer w = new StringWriter();
  SolrQueryResponse solrResponse = new SolrQueryResponse();
  solrResponse.setAllValues(res.getResponse());
  try {
    SolrParams list = new MapSolrParams(new HashMap<String, String>());
    writer.write(w, new LocalSolrQueryRequest(null, list), solrResponse);
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
  System.out.print(w.toString());
} catch (SolrServerException e) {
  e.printStackTrace();
}

NPE snippet: org.apache.solr.response.CSVWriter.writeResponse(CSVResponseWriter.java:281) org.apache.solr.response.CSVResponseWriter.write(CSVResponseWriter.java:56)

Am I on the right track with the approach? I really don't want to roll my own document-to-CSV-line converter. Thanks! Solr 4.9 -- View this message in context: http://lucene.472066.n3.nabble.com/exporting-to-CSV-with-solrj-tp4166845.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Ideas for debugging poor SolrCloud scalability
NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations if 1> you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000. And just to check.. You're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc, that's totally an anti-pattern. Doubly so for optimizes Since you showed us your solrconfig autocommit settings I'm assuming not but want to be sure. 2> use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there But you'll want to batch in this case too. On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose wrote: > Hi Erick - > > Thanks for the detailed response and apologies for my confusing > terminology. I should have said "WPS" (writes per second) instead of QPS > but I didn't want to introduce a weird new acronym since QPS is well > known. Clearly a bad decision on my part. To clarify: I am doing > *only* writes > (document adds). Whenever I wrote "QPS" I was referring to writes. > > It seems clear at this point that I should wrap up the code to do "smart" > routing rather than choose Solr nodes randomly. And then see if that > changes things. I must admit that although I understand that random node > selection will impose a performance hit, theoretically it seems to me that > the system should still scale up as you add more nodes (albeit at lower > absolute level of performance than if you used a smart router). > Nonetheless, I'm just theorycrafting here so the better thing to do is just > try it experimentally. I hope to have that working today - will report > back on my findings. > > Cheers, > - Ian > > p.s. To clarify why we are rolling our own smart router code, we use Go > over here rather than Java. Although if we still get bad performance with > our custom Go router I may try a pure Java load client using > CloudSolrServer to eliminate the possibility of bugs in our implementation. > > > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson > wrote: > >> I'm really confused: >> >> bq: I am not issuing any queries, only writes (document inserts) >> >> bq: It's clear that once the load test client has ~40 simulated users >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support >> a higher QPS than 2 shards over 2 Solr nodes, right >> >> QPS is usually used to mean "Queries Per Second", which is different from >> the statement that "I am not issuing any queries". And what do the >> number of users have to do with inserting documents? >> >> You also state: " In many cases, CPU on the solr servers is quite low as >> well" >> >> So let's talk about indexing first. Indexing should scale nearly >> linearly as long as >> 1> you are routing your docs to the correct leader, which happens with >> SolrJ >> and the CloudSolrSever automatically. Rather than rolling your own, I >> strongly >> suggest you try this out. >> 2> you have enough clients feeding the cluster to push CPU utilization >> on them all. >> Very often "slow indexing", or in your case "lack of scaling" is a >> result of document >> acquisition or, in your case, your doc generator is spending all it's >> time waiting for >> the individual documents to get to Solr and come back. 
>> >> bq: "chooses a random solr server for each ADD request (with 1 doc per add >> request)" >> >> Probably your culprit right there. Each and every document requires that >> you >> have to cross the network (and forward that doc to the correct leader). So >> given >> that you're not seeing high CPU utilization, I suspect that you're not >> sending >> enough docs to SolrCloud fast enough to see scaling. You need to batch up >> multiple docs, I generally send 1,000 docs at a time. >> >> But even if you do solve this, the inter-node routing will prevent >> linear scaling. >> When a doc (or a batch of docs) goes to a random Solr node, here's what >> happens: >> 1> the docs are re-packaged into groups based on which shard they're >> destined for >> 2> the sub-packets are forwarded to the leader for each shard >> 3> the responses are gathered back and returned to the client. >> >> This set of operations will eventually degrade the scaling. >> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support >> a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea >> behind sharding. >> >> If we're talking search requests, the answer is no. Sharding is >> what you do when your collection no longer fits on a single node. >> If it _does_ fit on a single node, then you'll usually get better query >> performance by adding a bunch of replicas to a single shard. When >> the number of docs on each shard grows large enough that you >> no longer get good query performance, _then_ you shard. And >> take th
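To make the batching advice concrete, here is a minimal SolrJ sketch of the pattern described above, using the Solr 4.x CloudSolrServer; the ZooKeeper connect string, collection name, and field names are placeholders, not values from this thread.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_t", "document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {   // send 1,000 docs per request, as suggested above
                server.add(batch);        // CloudSolrServer routes each doc to its shard leader
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        // No explicit commit from the client; rely on autoCommit/autoSoftCommit in solrconfig.xml.
        server.shutdown();
    }
}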
RE: Missing Records
Hi Erick: All of the records are coming out of an auto-numbered field so the IDs will all be unique. Here is the test I ran this morning: Indexing completed. Added/Updated: 903,993 documents. Deleted 0 documents. (Duration: 28m) Requests: 1 (0/s), Fetched: 903,993 (538/s), Skipped: 0, Processed: 903,993 (538/s) Started: 33 minutes ago Last Modified: 4 minutes ago Num Docs: 903829 Max Doc: 903829 Heap Memory Usage: -1 Deleted Docs: 0 Version: 1517 Segment Count: 16 Optimized: checked Current: checked If there were duplicates, only one of the duplicates should be removed and I should still be able to search for the ID and find one, correct? As it is right now I am missing records that should be in the collection. I also noticed this: org.apache.solr.common.SolrException: Bad Request request: http://192.168.20.57:7574/solr/inventory_shard1_replica1/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica2%2F&wt=javabin&version=2 at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:241) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) AJ -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, October 30, 2014 7:08 PM To: solr-user@lucene.apache.org Subject: Re: Missing Records First question: Is there any possibility that some of the docs have duplicate IDs (uniqueKeys)? If so, then some of the docs will be replaced, which will lower your returns. One way to figure this out is to go to the admin screen: if numDocs < maxDoc, then documents have been replaced. Also, if numDocs is smaller than 903,993 then you probably have some docs being replaced. One warning, however: even if numDocs equals maxDoc, docs could still have been replaced, because the deleted docs are purged when segments are merged. Best, Erick On Thu, Oct 30, 2014 at 3:12 PM, S.L wrote: > I am curious , how many shards do you have and whats the replication > factor you are using ? > > On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke wrote: >> Hi All, >> >> We have a SOLR cloud instance that has been humming along nicely for >> months. >> Last week we started experiencing missing records. >> >> Admin DIH Example: >> Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A >> *:* search claims that there are only 903,902 this is the first full >> index. >> Subsequent full indexes give the following counts for the *:* search >> 903,805 >> 903,665 >> 826,357 >> >> All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, >> Processed: 903,993 (x/s) every time. ---records per second is >> variable >> >> >> I found an item that should be in the index but is not found in a search. >> >> Here are the referenced lines of the log file. 
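For a quick sanity check, the Luke handler reports numDocs and maxDoc per core, and a direct lookup on the uniqueKey field shows whether a specific record made it into the index. The host, core name, and ID below are only the ones visible in the log excerpts in this thread, so treat them as illustrative:

http://192.168.20.57:8983/solr/inventory_shard1_replica1/admin/luke?numTerms=0&wt=json
http://192.168.20.57:8983/solr/inventory/select?q=ID:750041421&wt=json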
>> >> DEBUG - 2014-10-30 15:10:51.160; >> org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE >> add{,id=750041421} >> {{params(debug=false&optimize=true&indent=true&commit=true&clean=true >> &wt=json&command=full-import&entity=ads&verbose=false),defaults(confi >> g=data-config.xml)}} >> DEBUG - 2014-10-30 15:10:51.160; >> org.apache.solr.update.SolrCmdDistributor; sending update to >> http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 >> add{,id=750041421} >> params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.5 >> 7%3A8983%2Fsolr%2Finventory_shard1_replica1%2F >> >> --- there are 746 lines of log between entries --- >> >> DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire; >> >> "[0x2][0xc3][0xe0]¶ms[0xa2][0xe0].update.distrib(TOLEADER[0xe0],d >> istrib.from?[0x17] >> http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]&delBy >> Q[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zi >> p%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower >> 'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit >> Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.4 >> 8929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2 >> DivisionName_Lower,recreational[0xe0]&latlon042.4893,-96.3693[0xe0]*P >> hotoCount!8[0xe0](HasVideo[0x2][0xe0]"ID)750041421[0xe0]&Engine >> [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux >> City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162" >> Long Track >> [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0 >> ]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[ >> 0xe0]+Description?VThis Bad boy will pull you through the deepest >> snow!With the 162" track and 1000cc of power you can fly up any >> hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission >> [0xe0]*ModelFacet7Ski-Doo|Summit >> Highmark[0xe0]/DealerNameFacet9C
Re: The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor
Your message looks like it's missing stuff (snapshots?); the e-mail for this list generally strips attachments, so you'll have to put them somewhere else and link to them if you want us to see them. Best, Erick On Fri, Oct 31, 2014 at 5:11 AM, 5ton3 wrote: > Hi! > > Not sure if this is a problem or if I just don't understand the debug > response, but it seems somewhat odd to me. > The "main" entity can have multiple BLOB documents. I'm using the Tika Entity > Processor to retrieve the body (plaintext) from these documents and put the > result in a multivalued field, "filedata". The data-config looks like this: > > > It seems to work properly, but when I debug the data import, it seems that > the query on TABLE2 on the BLOB column ("FILEDATA_BIN") gets executed 1 time > for document #1, which is correct, but 2 times for document #2, 3 times for > document #3, and so on. > I.e. for document #1: > > And for document #2: > > The result seems correct, i.e. it doesn't duplicate the filedata. But why > does it query the DB two times for document #2? Any ideas? Maybe something > wrong in my config? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr index corrupt question
Not quite sure what you mean by "destroy". I can use a delete-by-query with *:* and mark all docs in my index deleted. Search results will return nothing, but it's still a valid index; it just consists of all deleted docs. All the segments may be removed, even in the absence of an optimize, due to segment merging. But it's still a perfectly valid index, it just has nothing in it. Are you seeing a real problem here or are you just wondering why all your segment files disappeared? Best, Erick On Fri, Oct 31, 2014 at 3:33 AM, ku3ia wrote: > Hi folks! > I'm interested to know: can a delete operation corrupt a Solr index if the optimize > command is never performed? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810.html > Sent from the Solr - User mailing list archive at Nabble.com.
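As a tiny SolrJ illustration of the scenario described above (the base URL is a placeholder and this is a sketch, not code from the thread):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DeleteAll {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.deleteByQuery("*:*"); // every doc is marked deleted; the index remains a valid index
        server.commit();             // deletions become visible; later segment merges purge them
        server.shutdown();
    }
}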
RE: Missing Records
I started this collection using this command: http://localhost:8983/solr/admin/collections?action=CREATE&name=inventory&numShards=1&replicationFactor=2&maxShardsPerNode=4 So 1 shard and replicationFactor of 2 AJ -Original Message- From: S.L [mailto:simpleliving...@gmail.com] Sent: Thursday, October 30, 2014 5:12 PM To: solr-user@lucene.apache.org Subject: Re: Missing Records I am curious , how many shards do you have and whats the replication factor you are using ? On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke wrote: > Hi All, > > We have a SOLR cloud instance that has been humming along nicely for > months. > Last week we started experiencing missing records. > > Admin DIH Example: > Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s) A *:* > search claims that there are only 903,902 this is the first full > index. > Subsequent full indexes give the following counts for the *:* search > 903,805 > 903,665 > 826,357 > > All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, > Processed: 903,993 (x/s) every time. ---records per second is variable > > > I found an item that should be in the index but is not found in a search. > > Here are the referenced lines of the log file. > > DEBUG - 2014-10-30 15:10:51.160; > org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE > add{,id=750041421} > {{params(debug=false&optimize=true&indent=true&commit=true&clean=true& > wt=json&command=full-import&entity=ads&verbose=false),defaults(config= > data-config.xml)}} > DEBUG - 2014-10-30 15:10:51.160; > org.apache.solr.update.SolrCmdDistributor; sending update to > http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0 > add{,id=750041421} > params:update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.20.57 > %3A8983%2Fsolr%2Finventory_shard1_replica1%2F > > --- there are 746 lines of log between entries --- > > DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire; >> > "[0x2][0xc3][0xe0]¶ms[0xa2][0xe0].update.distrib(TOLEADER[0xe0],di > strib.from?[0x17] > http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]&delByQ > [0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip% > 51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'sk > i-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit > Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48 > 929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2Di > visionName_Lower,recreational[0xe0]&latlon042.4893,-96.3693[0xe0]*Phot > oCount!8[0xe0](HasVideo[0x2][0xe0]"ID)750041421[0xe0]&Engine > [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux > City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162" > Long Track > [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0] > 1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0x > e0]+Description?VThis Bad boy will pull you through the deepest > snow!With the 162" track and 1000cc of power you can fly up any > hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission > [0xe0]*ModelFacet7Ski-Doo|Summit > Highmark[0xe0]/DealerNameFacet9Certified > Auto, > Inc.|4150[0xe0])StateAbbr"IA[0xe0])ClassName+Snowmobiles[0xe0](DealerI > D$4150[0xe0]&AdCode$DX1Q[0xe0]*DealerName4Certified > Auto, > Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorCol > or+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000 > SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID"12[0xe0].F > 
uelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certif > ied Auto, Inc.|Sioux > City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit > Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber&000105 > [0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit > highmark[\n]" > What could be the issue and how does one fix this issue? > > Thanks so much and if more information is needed I have preserved the > log files. > > AJ >
Re: Ideas for debugging poor SolrCloud scalability
Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said "WPS" (writes per second) instead of QPS but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote "QPS" I was referring to writes. It seems clear at this point that I should wrap up the code to do "smart" routing rather than choose Solr nodes randomly. And then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings. Cheers, - Ian p.s. To clarify why we are rolling our own smart router code, we use Go over here rather than Java. Although if we still get bad performance with our custom Go router I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation. On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson wrote: > I'm really confused: > > bq: I am not issuing any queries, only writes (document inserts) > > bq: It's clear that once the load test client has ~40 simulated users > > bq: A cluster of 3 shards over 3 Solr nodes *should* support > a higher QPS than 2 shards over 2 Solr nodes, right > > QPS is usually used to mean "Queries Per Second", which is different from > the statement that "I am not issuing any queries". And what do the > number of users have to do with inserting documents? > > You also state: "In many cases, CPU on the solr servers is quite low as > well" > > So let's talk about indexing first. Indexing should scale nearly > linearly as long as > 1> you are routing your docs to the correct leader, which happens with > SolrJ > and the CloudSolrServer automatically. Rather than rolling your own, I > strongly > suggest you try this out. > 2> you have enough clients feeding the cluster to push CPU utilization > on them all. > Very often "slow indexing", or in your case "lack of scaling", is a > result of document > acquisition or, in your case, your doc generator is spending all its > time waiting for > the individual documents to get to Solr and come back. > > bq: "chooses a random solr server for each ADD request (with 1 doc per add > request)" > > Probably your culprit right there. Each and every document requires that > you > have to cross the network (and forward that doc to the correct leader). So > given > that you're not seeing high CPU utilization, I suspect that you're not > sending > enough docs to SolrCloud fast enough to see scaling. You need to batch up > multiple docs; I generally send 1,000 docs at a time. > > But even if you do solve this, the inter-node routing will prevent > linear scaling. > When a doc (or a batch of docs) goes to a random Solr node, here's what > happens: > 1> the docs are re-packaged into groups based on which shard they're > destined for > 2> the sub-packets are forwarded to the leader for each shard > 3> the responses are gathered back and returned to the client. > > This set of operations will eventually degrade the scaling. > > bq: A cluster of 3 shards over 3 Solr nodes *should* support > a higher QPS than 2 shards over 2 Solr nodes, right? 
That's the whole idea > behind sharding. > > If we're talking search requests, the answer is no. Sharding is > what you do when your collection no longer fits on a single node. > If it _does_ fit on a single node, then you'll usually get better query > performance by adding a bunch of replicas to a single shard. When > the number of docs on each shard grows large enough that you > no longer get good query performance, _then_ you shard. And > take the query hit. > > If we're talking about inserts, then see above. I suspect your problem is > that you're _not_ "saturating the SolrCloud cluster", you're sending > docs to Solr very inefficiently and waiting on I/O. Batching docs and > sending them to the right leader should scale pretty linearly until you > start saturating your network. > > Best, > Erick > > On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose wrote: > > Thanks for the suggestions so far, all. > > > > 1) We are not using SolrJ on the client (not using Java at all) but I am > > working on writing a "smart" router so that we can always send to the > > correct node. I am certainly curious to see how that changes things. > > Nonetheless even with the overhead of extra routing hops, the observed > > behavior (no increase in performance with more nodes) doesn't make any > > sense to me. > > > > 2) Commits: we are using autoCommit with openSearcher=false > (maxTime=60000) > and autoSoftCommit (maxTime=15000). > >
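For readers following along, the autoCommit settings mentioned at the end of that quote map to a solrconfig.xml fragment roughly like the one below; the values are simply the ones quoted in the thread, the placement inside <updateHandler> is the usual spot, and the rest of the config is omitted.

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>  <!-- hard commits for durability only; no new searcher -->
</autoCommit>
<autoSoftCommit>
  <maxTime>15000</maxTime>            <!-- soft commits make newly added docs searchable -->
</autoSoftCommit>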
The exact same query gets executed n times for the nth row when retrieving body (plaintext) from BLOB column with Tika Entity Processor
Hi! Not sure if this is a problem or if I just don't understand the debug response, but it seems somewhat odd to me. The "main" entity can have multiple BLOB documents. I'm using the Tika Entity Processor to retrieve the body (plaintext) from these documents and put the result in a multivalued field, "filedata". The data-config looks like this: It seems to work properly, but when I debug the data import, it seems that the query on TABLE2 on the BLOB column ("FILEDATA_BIN") gets executed 1 time for document #1, which is correct, but 2 times for document #2, 3 times for document #3, and so on. I.e. for document #1: And for document #2: The result seems correct, i.e. it doesn't duplicate the filedata. But why does it query the DB two times for document #2? Any ideas? Maybe something wrong in my config? -- View this message in context: http://lucene.472066.n3.nabble.com/The-exact-same-query-gets-executed-n-times-for-the-nth-row-when-retrieving-body-plaintext-from-BLOB-r-tp4166822.html Sent from the Solr - User mailing list archive at Nabble.com.
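Since the inline data-config was stripped by the list software, a typical shape for pulling plaintext out of a BLOB column with the TikaEntityProcessor is sketched below. Only TABLE2, FILEDATA_BIN, and the filedata field come from the post above; every other name (tables, columns, JDBC details) is an invented placeholder, not the poster's actual config.

<dataConfig>
  <dataSource name="db" type="JdbcDataSource" driver="com.example.Driver"
              url="jdbc:example://dbhost/db" user="user" password="pass"/>
  <dataSource name="blob" type="FieldStreamDataSource"/>
  <document>
    <entity name="main" dataSource="db" query="SELECT ID, TITLE FROM TABLE1">
      <entity name="files" dataSource="db"
              query="SELECT FILEDATA_BIN FROM TABLE2 WHERE MAIN_ID = '${main.ID}'">
        <entity name="file_text" processor="TikaEntityProcessor" dataSource="blob"
                dataField="files.FILEDATA_BIN" format="text">
          <field column="text" name="filedata"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>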
Solr index corrupt question
Hi folks! I'm interested to know: can a delete operation corrupt a Solr index if the optimize command is never performed? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-index-corrupt-question-tp4166810.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: issue related to blank value in datefield
Thanks Chris With Regards Aman Tandon On Fri, Oct 31, 2014 at 5:45 AM, Chris Hostetter wrote: > > : I was just trying to index the fields returned by my MySQL and I found > this > > If you are importing dates from MySQL where you have 0000-00-00T00:00:00Z > as the default value, you should actually be getting an error last time I > checked, but this explains the right way to tell the MySQL JDBC driver not > to give you those values ... > > > https://wiki.apache.org/solr/DataImportHandlerFaq#Invalid_dates_.28e.g._.22-00-00.22.29_in_my_MySQL_database_cause_my_import_to_abort > > (even if you aren't using DIH to talk to MySQL, the same principle holds > if you are using JDBC; if you are talking to MySQL from some other client > language there should be a similar option) > > : Actually i just want to know why it is getting stored as ' > : 0002-11-30T00:00:00Z' on indexing the value 0000-00-00T00:00:00Z. > > Like I said: bugs. Behavior with "Year 0000" is undefined in a lot of the > underlying date code. As for why that specific date ... no idea. > > > -Hoss > http://www.lucidworks.com/ >
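For reference, the knob the linked FAQ entry describes is the Connector/J zeroDateTimeBehavior parameter on the JDBC URL. A sketch of a DIH dataSource using it (host, database, and credentials here are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?zeroDateTimeBehavior=convertToNull"
            user="user" password="pass"/>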
Re: Design optimal Solr Schema
Oh yes, I want to display the stored data in an HTML file. I have 2 pages: on one page there is the form and I show the results there. Each result is a link (by ID) to the file whose whole conversation is displayed on the second page. And what did you mean by separating each conversation interaction? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632p4166805.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Design optimal Solr Schema
Thanks for your help. OK, I'll try to explain it once more, sorry for my English. I need a few functions in my searching. 1.) I will have a lot of documents, naturally, and I want to find out whether a phrase occurs, for example with the words up to 5 positions apart. I used w:"Good morning"~5 (in the Solr example it works, but I don't know how to do it in my project). 2.) Find some word (phrase) up to a certain time, for example "Good morning" up to time 5.25. 3.) And, if it is possible, the order of the words. I'm using the Solarium client for highlighting, and I want to highlight words in this order: Hello, How, Are you, for example; then in this field the words are *hello* you are *how are you*, and if a searched word is not in order then skip it, but that is not necessary; primarily I have a problem with the first 2 points. How do I make the ideal schema and parse the data from the source file? I've done a demo with basic searching: on one page I have a form and the results are links to files by ID (I have the ID as the filename), and when I click a link I set a query parameter and on the result page I get the data necessary to display the result. The result file is a table with the whole transcribed interview with highlighted results. Thanks for help. -- View this message in context: http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632p4166793.html Sent from the Solr - User mailing list archive at Nabble.com.
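For what it's worth, point 1 is a standard phrase proximity query, and it can be combined with highlighting in the same request. The sketch below reuses only the w field and the example phrase from the post; the host and core name are placeholders, and the time filter in the second request assumes a numeric start-time field (here called time) exists in the schema, which is one common way to handle point 2 if each utterance is indexed as its own document:

http://localhost:8983/solr/collection1/select?q=w:%22good+morning%22~5&hl=true&hl.fl=w
http://localhost:8983/solr/collection1/select?q=w:%22good+morning%22~5&fq=time:[0+TO+5.25]&hl=true&hl.fl=w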