Query Performance
Any recommended tool to test the query performance would be of great help. Thanks
Migrating junit tests from Solr 4.5.1 to Solr 5.2.1
I am migrating from Solr 4.5.1 to Solr 5.2.1 on a Windows platform. I am using multi-core, but not Solr cloud. I am having issues with my suite of junit tests. My tests currently use code I found in SOLR-4502. I was wondering whether anyone could point me at best-practice examples of multi-core junit tests for Solr 5.2.1? Thanks Rich
Re: Use REST API URL to update field
Ok. Thanks for your advice. Regards, Edwin On 21 July 2015 at 15:37, Upayavira u...@odoko.co.uk wrote: curl is just a command line HTTP client. You can use HTTP POST to send the JSON that you are mentioning below via any means that works for you - the file does not need to exist on disk - it just needs to be added to the body of the POST request. I'd say review how to do HTTP POST requests from your chosen programming language and you should see how to do this. Upayavira On Tue, Jul 21, 2015, at 04:12 AM, Zheng Lin Edwin Yeo wrote: Hi Shawn, So it means that if the following is in a text file called update.txt, {"id":"testing_0001", "popularity":{"inc":1}} this text file must still exist if I use the URL? Or can this information in the text file be put directly onto the URL? Regards, Edwin On 20 July 2015 at 22:04, Shawn Heisey apa...@elyograg.org wrote: On 7/20/2015 2:06 AM, Zheng Lin Edwin Yeo wrote: I'm using Solr 5.2.1, and I would like to check, is there a way to update a certain field by using a REST API URL directly instead of using curl? For example, I would like to increase the popularity field in my index each time a user clicks on the record. Currently, it works with the curl command by having this in my text file to be read by curl (the id is hard-coded here for example purposes): {"id":"testing_0001", "popularity":{"inc":1}} Is there a REST API URL that I can call to achieve the same purpose? The URL that you would use with curl *IS* the URL that you would use for a REST-like call. Thanks, Shawn
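For readers who want to send the same atomic update from code rather than curl, a minimal SolrJ sketch is below. It assumes a core named collection1 on localhost:8983 and a popularity field that supports atomic updates; the class name is invented for illustration and is not from the thread.

import java.util.Collections;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementPopularity {
    public static void main(String[] args) throws Exception {
        // HttpSolrClient is the plain HTTP client in SolrJ 5.2.x.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        // Atomic update: only the id and the field operation are sent;
        // Solr leaves the rest of the document untouched.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "testing_0001");
        doc.addField("popularity", Collections.singletonMap("inc", 1));

        client.add(doc);
        client.commit();   // or rely on autoCommit / commitWithin instead
        client.close();
    }
}

The same request body can of course be sent as raw JSON from any HTTP library, exactly as Upayavira describes; SolrJ just handles the serialization.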
Re: Query Performance
I tried using SolrMeter but for some reason it does not detect my URL and throws a Solr server exception. Sent from my iPhone On 21-Jul-2015, at 10:58 am, Alessandro Benedetti benedetti.ale...@gmail.com wrote: SolrMeter mate, http://code.google.com/p/solrmeter/ Take a look, it will help you a lot! Cheers 2015-07-21 16:49 GMT+01:00 Nagasharath sharathrayap...@gmail.com: Any recommended tool to test the query performance would be of great help. Thanks
Re: Solr Cloud: Duplicate documents in multiple shards
Hi Mese, let me try to answer your 2 questions: 1. What happens if a shard (both leader and replica) goes down. If the document on the dead shard is updated, will it forward the document to the new shard. If so, when the dead shard comes up again, will this not be considered for the same hash key range? I see some confusion here. First of all you need a smart client that will load balance the docs to index. Let's say the CloudSolrClient. A Solr document update is always a deletion and a re-insertion. This means that you get the document from the index (the stored fields), and you add the document again. If the document is on a dead shard, you have lost it; you cannot retrieve it until that shard goes up again. Possibly it's still in the transaction log. If you are re-indexing the doc, the doc will be re-indexed. When the shard is up again, there will be 2 versions of the document, with some different fields but the same id. What do you mean by: will this not be considered for the same hash key range? 2. Is there a way to fix this [removing duplicates across shards]? I assume there is not an easy way. You could re-index the content applying a Deduplication Update Request processor. But it will be costly. Cheers 2015-07-21 15:01 GMT+01:00 Reitzel, Charles charles.reit...@tiaa-cref.org: Also, the function used to generate hashes is org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit value. The ranges of the hash values assigned to each shard are resident in Zookeeper. Since you are using only a single hash component, all 32 bits will be used by the entire ID field value. I.e. I see no routing delimiter (!) in your example ID value: possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30 Which isn't required, but it means that documents (logs?) will be distributed in a round-robin fashion over the shards. Not grouped by host or environment (if I am reading it right). You might consider the following: environment!hostname!UUID E.g. intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30 This way documents from the same host will be grouped together, most likely on the same shard. Further, within the same environment, documents will be grouped on the same subset of shards. This will allow client applications to set _route_=environment! or _route_=environment!hostname! and limit queries to those shards containing relevant data when the corresponding filter queries are applied. If you were using route delimiters, then the default for a 2-part key (1 delimiter) is to use 16 bits for each part. The default for a 3-part key (2 delimiters) is to use 8 bits each for the 1st 2 parts and 16 bits for the 3rd part. In any case, the high-order bytes of the hash dominate the distribution of data. -Original Message- From: Reitzel, Charles Sent: Tuesday, July 21, 2015 9:55 AM To: solr-user@lucene.apache.org Subject: RE: Solr Cloud: Duplicate documents in multiple shards When are you generating the UUID exactly? If you set the unique ID field on an update, and it contains a new UUID, you have effectively created a new document. Just a thought. -Original Message- From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] Sent: Tuesday, July 21, 2015 4:11 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud: Duplicate documents in multiple shards Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. 
Is there a way we can see the generated hash key and mapping them to the specific shard?
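Below is a small sketch of the compositeId routing Charles describes, using SolrJ. The collection name, ZooKeeper hosts, and the "message" field are made up for illustration; the point is only the environment!hostname!uuid key shape and the _route_ parameter.

import java.util.UUID;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeRouting {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zkhost1:2181,zkhost2:2181");
        client.setDefaultCollection("logs");

        // 3-part compositeId key: environment!hostname!uuid
        String id = "intl-staging!possting.mongo-v2.services.com!" + UUID.randomUUID();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("message", "example log line");
        client.add(doc);
        client.commit();

        // Restrict a query to the shard(s) holding that environment's documents.
        SolrQuery q = new SolrQuery("*:*");
        q.set("_route_", "intl-staging!");
        System.out.println(client.query(q).getResults().getNumFound());

        client.close();
    }
}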
Re: Data Import Handler Stays Idle
Okay. I'm going to run the index again with specifications that you recommended. This could take a few hours but I will post the entire trace on that error when it pops up again and I will let you guys know the results of increasing the heap size.
Re: Data Import Handler Stays Idle
Hey Shawn, when I use the -m 2g option in my script I get the error 'cannot open [path]/server/logs/solr.log for reading: No such file or directory'. I do not see how this would affect that.
Re: Query Performance
SolrMeter mate, http://code.google.com/p/solrmeter/ Take a look, it will help you a lot ! Cheers 2015-07-21 16:49 GMT+01:00 Nagasharath sharathrayap...@gmail.com: Any recommended tool to test the query performance would be of great help. Thanks
Re: SOLR nrt read writes
Could this be due to caching? I have tried to disable all in my solrconfig. If you mean Solr caches ? NO . Solr caches live the life of the searcher. So new searcher, new caches ( possibly warmed with updated results) . If you mean your application caching or browser caching, you should verify, i assume you have control on that. Cheers 2015-07-21 6:02 GMT+01:00 Bhawna Asnani bhawna.asn...@gmail.com: Thanks, I tried turning off auto softCommits but that didn't help much. Still seeing stale results every now and then. Also load on the server very light. We are running this just on a test server with one or two users. I don't see any warning in logs whole doing softCommits and it says it successfully opened new searcher and registered it as main searcher. Could this be due to caching? I have tried to disable all in my solrconfig. Sent from my iPhone On Jul 20, 2015, at 12:16 PM, Shawn Heisey apa...@elyograg.org wrote: On 7/20/2015 9:29 AM, Bhawna Asnani wrote: Thanks for your suggestions. The requirement is still the same , to be able to make a change to some solr documents and be able to see it on subsequent search/facet calls. I am using softCommit with waitSearcher=true. Also I am sending reads/writes to a single solr node only. I have tried disabling caches and warmup time in logs is '0' but every once in a while I do get the document just updated with stale data. I went through lucene documentation and it seems opening the IndexReader with the IndexWriter should make the changes visible to the reader. I checked solr logs no errors. I see this in logs each time 'Registered new searcher Searcher@x' even before searches that had the stale document. I have attached my solrconfig.xml for reference. Your attachment made it through the mailing list processing. Most don't, I'm surprised. Some thoughts: maxBooleanClauses has been set to 40. This is a lot. If you actually need a setting that high, then you are sending some MASSIVE queries, which probably means that your Solr install is exceptionally busy running those queries. If the server is fairly busy, then you should increase maxTime on autoCommit. I use a value of five minutes (30) ... and my server is NOT very busy most of the time. A commit with openSearcher set to false is relatively fast, but it still has somewhat heavy CPU, memory, and disk I/O resource requirements. You have autoSoftCommit set to happen after five seconds. If updates happen frequently or run for very long, this is potentially a LOT of committing and opening new searchers. I guess it's better than trying for one second, but anything more frequent than once a minute is likely to get you into trouble unless the system load is extremely light ... but as already discussed, your system load is probably not light. For the kind of Near Real Time setup you have mentioned, where you want to do one or more updates, commit, and then query for the changes, you probably should completely remove autoSoftCommit from the config and *only* open new searchers with explicit soft commits. Let autoCommit (with a maxTime of 1 to 5 minutes) handle durability concerns. A lot of pieces in your config file are set to depend on java system properties just like the example does, but since we do not know what system properties have been set, we can't tell for sure what those parts of the config are doing. 
Thanks, Shawn
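If you follow Shawn's suggestion (drop autoSoftCommit and only open new searchers with explicit soft commits after a batch of updates), the SolrJ call looks roughly like the sketch below. The core URL, id, and status field are placeholders.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UpdateThenSoftCommit {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("status", "updated");
        client.add(doc);

        // commit(waitFlush, waitSearcher, softCommit):
        // softCommit=true opens a new searcher without flushing segments to disk,
        // waitSearcher=true blocks until that searcher is registered, so a query
        // issued afterwards against the same node should see the change.
        client.commit(true, true, true);

        client.close();
    }
}

Durability is then left to autoCommit (openSearcher=false) in solrconfig.xml, as described above.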
Re: solr blocking and client timeout issue
I did find a dark corner of our application that a dev had left some experimental code in that snuck past QA, because it was rarely used. A client discovered and was using it heavily over the past week. It was generating multiple consecutive update/commit requests. Its been disabled and the long GC pauses have nearly stopped (so far). We did see one at about 4am for about 5 minutes. is there a way to try to mitigate these longer GC, if/when they do happen. (FYI, we are upgrading to OpenJDK 1.8 tonight. its been working great in dev/QA, so hopefully it will make enough of a difference) On 07/20/2015 09:31 PM, Erick Erickson wrote: bq: the config is set up per the NRT suggestions in the docs. autoSoftCommit every 2 seconds and autoCommit every 10 minutes. 2 second soft commit is very aggressive, no matter what the NRT suggestions are. My first question is whether that's really needed. The soft commits should be as long as you can stand. And don't listen to your product manager who says 2 seconds is required, push back and answer whether that's really necessary. Most people won't notice the difference. bq: ...we are noticing a lot higher number of hard commits than usual. Is a client somewhere issuing a hard commit? This is rarely recommended... And is openSearcher true or false? False is a relatively cheap operation, true is quite expensive. More than you want to know about hard and soft commits: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best, Erick Best, Erick On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft jashcr...@edgate.com wrote: heap is already at 5GB On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote: no swapping that I'm seeing, although we are noticing a lot higher number of hard commits than usual. the config is set up per the NRT suggestions in the docs. autoSoftCommit every 2 seconds and autoCommit every 10 minutes. there have been 463 updates in the past 2 hours, all followed by hard commits INFO - 2015-07-20 12:26:20.979; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} INFO - 2015-07-20 12:26:21.021; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696} commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/solr/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697} INFO - 2015-07-20 12:26:21.022; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 665697 INFO - 2015-07-20 12:26:21.026; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush INFO - 2015-07-20 12:26:21.026; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={omitHeader=falsewt=json} {add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752), 5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328), 816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0 50 could that be causing some of the problems? 
From: Shawn Heisey apa...@elyograg.org Sent: Monday, July 20, 2015 11:44 AM To: solr-user@lucene.apache.org Subject: Re: solr blocking and client timeout issue On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote: I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i can get production upgraded tonight. still getting the big GC pauses this morning, even after applying the GC tuning options. Everything was fine throughout the weekend. My biggest concern is that this instance had been running with no issues for almost 2 years, but these GC issues started just last week. It's very possible that you're simply going to need a larger heap than you have needed in the past, either because your index has grown, or because your query patterns have changed and now your queries need more memory. It could even be both of these. At your current index size, assuming that there's nothing else on this machine, you should have enough memory to raise your heap to 5GB. If there ARE other software pieces on this machine, then the long GC pauses (along with other performance issues) could be explained by too much memory allocation out of the 8GB total memory, resulting in swapping at the OS level. Thanks, Shawn -- *jeremy ashcraft* development manager EdGate Correlation Services http://correlation.edgate.com /253.853.7133 x228/ -- *jeremy ashcraft* development manager EdGate Correlation Services http://correlation.edgate.com /253.853.7133 x228/
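To make Erick's point concrete: an indexing client should normally just add documents and let the autoCommit/autoSoftCommit settings in solrconfig.xml decide when commits happen, rather than issuing a hard commit after every update. A minimal sketch under that assumption (URL and fields are placeholders, not from the thread):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithoutClientCommits {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title", "document " + i);
            batch.add(doc);
        }

        // Send the batch and stop there: no client.commit().
        // Durability and searcher reopening are handled by autoCommit /
        // autoSoftCommit on the server, so the client never forces a hard
        // commit per request.
        client.add(batch);
        client.close();
    }
}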
upgrade clusterstate.json from 4.10.4 to split state.json in 5.2.1
Hi, How can I upgrade the clusterstate.json to be split by collection? I read this issue https://issues.apache.org/jira/browse/SOLR-5473. In theory there is a param “stateFormat” that, when configured to 2, says to use the /collections/<collection>/state.json format. Where can I configure this? —Yago Riveiro
Parsing and indexing parts of the input file paths
Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
issue with query boost using qf and edismax
Hi, I am implementing searching using SOLR 5.0 and facing very strange problem. I am having 4 fields Name and address, city and state in the document apart from a unique ID. My requirement is that it should give me those results first where there is a match in name , then address, then state, city Scenerio 1 : When searching *louis* My query params is something like below q: person_full_name:*louis* OR address1:*louis* OR city:*louis* OR state_code:*louis* qf: person_full_name^5.0 address1^0.8 city^0.7 state_code^1.0 defType: edismax This is not giving results as per boost mentioned in qf param. This is giving me result where city is getting matched first. Score is coming as below: explain: { 11470307: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*), product of:\n1.0 = boost\n0.0015872642 = queryNorm\n 0.09090909 = coord(1/11)\n, 11470282: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*), product of:\n1.0 = boost\n0.0015872642 = queryNorm\n 0.09090909 = coord(1/11)\n, 11470291: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(city:*louis*), product of:\n 1.0 = boost\n0.0015872642 = queryNorm\n0.09090909 = coord(1/11)\n, 11470261: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*), product of:\n1.0 = boost\n0.0015872642 = queryNorm\n 0.09090909 = coord(1/11)\n, 11470328: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*), product of:\n1.0 = boost\n0.0015872642 = queryNorm\n 0.09090909 = coord(1/11)\n, 11470331: \n1.4429675E-4 = (MATCH) sum of:\n 1.4429675E-4 = (MATCH) product of:\n0.0015872642 = (MATCH) sum of:\n 0.0015872642 = (MATCH) ConstantScore(person_full_name:*louis*), product of:\n1.0 = boost\n0.0015872642 = queryNorm\n 0.09090909 = coord(1/11)\n }, Scenerio 2: But when I am matching 2 keywords. 
*louis cen* explain: { 11470286: \n0.9805807 = (MATCH) product of:\n 1.9611614 = (MATCH) sum of:\n0.49029034 = (MATCH) max of:\n 0.49029034 = (MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n 5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH) max of:\n 0.49029034 = (MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n 5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH) max of:\n 0.49029034 = (MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n 5.0 = boost\n0.09805807 = queryNorm\n0.49029034 = (MATCH) max of:\n 0.49029034 = (MATCH) ConstantScore(person_full_name:*cen*^5.0)^5.0, product of:\n 5.0 = boost\n0.09805807 = queryNorm\n 0.5 = coord(4/8)\n, 11470284: \n0.15689291 = (MATCH) product of:\n 0.31378582 = (MATCH) sum of:\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n 0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n 0.5 = coord(4/8)\n, 11470232: \n0.15689291 = (MATCH) product of:\n 0.31378582 = (MATCH) sum of:\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n 0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH) ConstantScore(address1:*cen*^0.8)^0.8, product of:\n0.8 = boost\n0.09805807 = queryNorm\n 0.5 = coord(4/8)\n, 11469707: \n0.15689291 = (MATCH) product of:\n 0.31378582 = (MATCH) sum of:\n0.078446455 = (MATCH) max of:\n 0.078446455 = (MATCH)
Running SolrJ from Solr's REST API
Hi, Would like to check, as I've created a SolrJ program and exported it as a runnable JAR, how do I integrate it together with Solr so that I can call this JAR directly from Solr's REST API? Currently I can only run it from the command prompt using the command java -jar solrj.jar. I'm using Solr 5.2.1. Regards, Edwin
Re: Performance of facet contain search in 5.2.1
contains has to basically examine each and every term to see if it matches. Say my facet.contains=bbb. A matching term could be aaabbbxyz or zzzbbbxyz, so there's no way to _know_ when you've found them all without examining every last one. So I'd try to redefine the problem to not require that. If it's absolutely required, you can do some interesting things but it's going to inflate your index. For instance, rotate words (assuming word boundaries here). So, for instance, you have a text field with my dog has fleas. Index things like
my dog has fleas|my dog has fleas
dog has fleas my|my dog has fleas
has fleas my dog|my dog has fleas
fleas my dog has|my dog has fleas
Literally with the pipe followed by the original text. Now all your contains clauses are simple prefix facets, and you can have the UI split the token on the pipe and display the original. Best, Erick On Tue, Jul 21, 2015 at 1:16 AM, Lo Dave dav...@hotmail.com wrote: I found that facet contains search takes a much longer time than facet prefix search. Does anyone have an idea how to make the contains search faster? org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+care&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852} hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+care&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true} hits=1856 status=0 QTime=10951 As shown above, the prefix search took 5 ms but the contains search took 10951 ms. Thanks.
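A sketch of the client-side rotation Erick describes, to turn facet.contains lookups into facet.prefix lookups. This only shows the token-generation step; the assumption is that each rotated string is then indexed as a separate value of the facet field (e.g. a multivalued string field), which is not shown here.

import java.util.ArrayList;
import java.util.List;

public class RotatedTokens {

    // "my dog has fleas" ->
    //   "my dog has fleas|my dog has fleas"
    //   "dog has fleas my|my dog has fleas"
    //   "has fleas my dog|my dog has fleas"
    //   "fleas my dog has|my dog has fleas"
    static List<String> rotations(String original) {
        String[] words = original.split("\\s+");
        List<String> out = new ArrayList<String>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder rotated = new StringBuilder();
            for (int k = 0; k < words.length; k++) {
                if (k > 0) rotated.append(' ');
                rotated.append(words[(start + k) % words.length]);
            }
            out.add(rotated + "|" + original);   // pipe followed by the original text
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : rotations("my dog has fleas")) {
            System.out.println(token);
        }
    }
}

A facet.contains=fleas request then becomes facet.prefix=fleas against this field, and the UI splits each returned term on the pipe to display the original text.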
Re: Tips for faster indexing
Hi, thank you Erick for your inputs. I tried creating batches of 1000 objects and indexing them to Solr. The performance is way better than before, but I find that the number of indexed documents shown in the dashboard is less than the number of documents that I actually indexed through SolrJ. My code is as follows:

private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException, SolrServerException, ParseException {
    File file = new File(JSON_FILE_PATH);
    Scanner scn = new Scanner(file, "UTF-8");
    JSONObject object;
    int i = 0;
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (scn.hasNext()) {
        object = (JSONObject) parser.parse(scn.nextLine());
        SolrInputDocument doc = indexJSON(object);
        batch.add(doc);
        if (i % 1000 == 0) {
            System.out.println("Indexed " + (i + 1) + " objects.");
            solr.add(batch);
            batch = new ArrayList<SolrInputDocument>();
        }
        i++;
    }
    solr.add(batch);
    solr.commit();
    System.out.println("Indexed " + (i + 1) + " objects.");
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException {
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    SolrInputDocument mainEvent = new SolrInputDocument();
    mainEvent.addField("id", generateID());
    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));

    Object obj = parser.parse(jsonOBJ.get("User").toString());
    JSONObject userObj = (JSONObject) obj;
    SolrInputDocument childUserEvent = new SolrInputDocument();
    childUserEvent.addField("id", generateID());
    childUserEvent.addField("User", userObj.get("User"));

    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
    JSONObject eventdescriptionObj = (JSONObject) obj;
    SolrInputDocument childEventDescEvent = new SolrInputDocument();
    childEventDescEvent.addField("id", generateID());
    childEventDescEvent.addField("EventApplicationName", eventdescriptionObj.get("EventApplicationName"));
    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));

    obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
    JSONArray informationArray = (JSONArray) obj;
    for (int i = 0; i < informationArray.size(); i++) {
        JSONObject domain = (JSONObject) informationArray.get(i);
        SolrInputDocument domainDoc = new SolrInputDocument();
        domainDoc.addField("id", generateID());
        domainDoc.addField("domainName", domain.get("domainName"));
        String s = domain.get("columns").toString();
        obj = JSONValue.parse(s);
        JSONArray ColumnsArray = (JSONArray) obj;
        SolrInputDocument columnsDoc = new SolrInputDocument();
        columnsDoc.addField("id", generateID());
        for (int j = 0; j < ColumnsArray.size(); j++) {
            JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j);
            SolrInputDocument columnDoc = new SolrInputDocument();
            columnDoc.addField("id", generateID());
            columnDoc.addField("movieName", ColumnsObj.get("movieName"));
            columnsDoc.addChildDocument(columnDoc);
        }
        domainDoc.addChildDocument(columnsDoc);
        childEventDescEvent.addChildDocument(domainDoc);
    }
    mainEvent.addChildDocument(childEventDescEvent);
    mainEvent.addChildDocument(childUserEvent);
    return mainEvent;
}

I would be grateful if you could let me know what I am missing. On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson erickerick...@gmail.com wrote: First thing is it looks like you're only sending one document at a time, perhaps with child objects. This is not optimal at all. I usually batch my docs up in groups of 1,000, and there is anecdotal evidence that there may (depending on the docs) be some gains above that number. Gotta balance the batch size off against how big the docs are of course. Assuming that you really are calling this method for one doc (and children) at a time, the far bigger problem other than calling server.add for each parent/children is that you're then calling solr.commit() every time. This is an anti-pattern. Generally, let the autoCommit setting in solrconfig.xml handle the intermediate commits while the indexing program is running and only issue a commit at the
IntelliJ setup
I followed the instructions here https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant idea`, but I'm still not getting the links in solr classes and methods; do I need to add libraries, or am I missing something else? Thanks!
Re: Parsing and indexing parts of the input file paths
Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
Re: Parsing and indexing parts of the input file paths
I'm not sure, it's a remote team but will get more info. For now, assuming that a certain directory is specified, like /user/andrew/, and a regex is applied to capture anything two directories below matching */*/*.pdf. Would there be a way to capture the wild-carded values and index them as fields? On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote: Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
Re: Parsing and indexing parts of the input file paths
Which can only happen if I post it to a web service, and won't happen if I do it through config? On Tue, Jul 21, 2015 at 2:19 PM, Upayavira u...@odoko.co.uk wrote: yes, unless it has been added consciously as a separate field. On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote: Thanks, so by the time we would get to an Analyzer the file path is forgotten? https://cwiki.apache.org/confluence/display/solr/Analyzers On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote: Solr generally does not interact with the file system in that way (with the exception of the DIH). It is the job of the code that pushes a file to Solr to process the filename and send that along with the request. See here for more info: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika You could provide literal.filename=blah/blah Upayavira On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote: I'm not sure, it's a remote team but will get more info. For now, assuming that a certain directory is specified, like /user/andrew/, and a regex is applied to capture anything two directories below matching */*/*.pdf. Would there be a way to capture the wild-carded values and index them as fields? On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote: Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
Re: Parsing and indexing parts of the input file paths
Solr generally does not interact with the file system in that way (with the exception of the DIH). It is the job of the code that pushes a file to Solr to process the filename and send that along with the request. See here for more info: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika You could provide literal.filename=blah/blah Upayavira On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote: I'm not sure, it's a remote team but will get more info. For now, assuming that a certain directory is specified, like /user/andrew/, and a regex is applied to capture anything two directories below matching */*/*.pdf. Would there be a way to capture the wild-carded values and index them as fields? On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote: Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
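For completeness, here is roughly what the pushing side can look like with SolrJ's ContentStreamUpdateRequest against /update/extract, passing the path fragment as a literal field. The field name path_token and the core URL are invented for the example; the client code is what parses /user/andrew/1234/1234/file.pdf, since Solr itself never sees the directory.

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithPathField {
    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");

        String path = "/user/andrew/1234/1234/file.pdf";
        // The client, not Solr, pulls the interesting token out of the path.
        String token = path.split("/")[3];   // "1234" in this example

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File(path), "application/pdf");
        req.setParam("literal.id", path);
        req.setParam("literal.filename", path);
        req.setParam("literal.path_token", token);   // hypothetical extra field
        req.process(client);

        client.commit();
        client.close();
    }
}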
Re: Tips for faster indexing
Are you making sure that every document has a unique ID? Index into an empty Solr, then look at your maxdocs vs numdocs. If they are different (maxdocs is higher) then some of your documents have been deleted, meaning some were overwritten. That might be a place to look. Upayavira On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote: I can confirm this behavior, seen when sending json docs in batch, never happens when sending one by one, but sporadic when sending batches. Like if sole/jetty drops couple of documents out of the batch. Regards On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote: Hi, Thank You Erick for your inputs. I tried creating batches of 1000 objects and indexing it to solr. The performance is way better than before but I find that number of indexed documents that is shown in the dashboard is lesser than the number of documents that I had actually indexed through solrj. My code is as follows: private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore ; private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json; private static JSONParser parser = new JSONParser(); private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL); public static void main(String[] args) throws IOException, SolrServerException, ParseException { File file = new File(JSON_FILE_PATH); Scanner scn=new Scanner(file,UTF-8); JSONObject object; int i = 0; CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); while(scn.hasNext()){ object= (JSONObject) parser.parse(scn.nextLine()); SolrInputDocument doc = indexJSON(object); batch.add(doc); if(i%1000==0){ System.out.println(Indexed + (i+1) + objects. ); solr.add(batch); batch = new ArrayListSolrInputDocument(); } i++; } solr.add(batch); solr.commit(); System.out.println(Indexed + (i+1) + objects. 
); } public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException { CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); SolrInputDocument mainEvent = new SolrInputDocument(); mainEvent.addField(id, generateID()); mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage)); mainEvent.addField(EventUid, jsonOBJ.get(EventUid)); mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector)); mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType)); mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent)); mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC)); Object obj = parser.parse(jsonOBJ.get(User).toString()); JSONObject userObj = (JSONObject) obj; SolrInputDocument childUserEvent = new SolrInputDocument(); childUserEvent.addField(id, generateID()); childUserEvent.addField(User, userObj.get(User)); obj = parser.parse(jsonOBJ.get(EventDescription).toString()); JSONObject eventdescriptionObj = (JSONObject) obj; SolrInputDocument childEventDescEvent = new SolrInputDocument(); childEventDescEvent.addField(id, generateID()); childEventDescEvent.addField(EventApplicationName, eventdescriptionObj.get(EventApplicationName)); childEventDescEvent.addField(Query, eventdescriptionObj.get(Query)); obj= JSONValue.parse(eventdescriptionObj.get(Information).toString()); JSONArray informationArray = (JSONArray) obj; for(int i = 0; iinformationArray.size(); i++){ JSONObject domain = (JSONObject) informationArray.get(i); SolrInputDocument domainDoc = new SolrInputDocument(); domainDoc.addField(id, generateID()); domainDoc.addField(domainName, domain.get(domainName)); String s = domain.get(columns).toString(); obj= JSONValue.parse(s); JSONArray ColumnsArray = (JSONArray) obj; SolrInputDocument columnsDoc = new SolrInputDocument(); columnsDoc.addField(id, generateID()); for(int j = 0; jColumnsArray.size(); j++){ JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j); SolrInputDocument columnDoc = new SolrInputDocument(); columnDoc.addField(id, generateID()); columnDoc.addField(movieName, ColumnsObj.get(movieName)); columnsDoc.addChildDocument(columnDoc); } domainDoc.addChildDocument(columnsDoc); childEventDescEvent.addChildDocument(domainDoc); } mainEvent.addChildDocument(childEventDescEvent); mainEvent.addChildDocument(childUserEvent); return mainEvent; } I would be grateful if you could let me know what I am missing. On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson
Re: Parsing and indexing parts of the input file paths
Thanks, so by the time we would get to an Analyzer the file path is forgotten? https://cwiki.apache.org/confluence/display/solr/Analyzers On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote: Solr generally does not interact with the file system in that way (with the exception of the DIH). It is the job of the code that pushes a file to Solr to process the filename and send that along with the request. See here for more info: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika You could provide literal.filename=blah/blah Upayavira On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote: I'm not sure, it's a remote team but will get more info. For now, assuming that a certain directory is specified, like /user/andrew/, and a regex is applied to capture anything two directories below matching */*/*.pdf. Would there be a way to capture the wild-carded values and index them as fields? On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote: Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
Re: Tips for faster indexing
I can confirm this behavior, seen when sending json docs in batch, never happens when sending one by one, but sporadic when sending batches. Like if sole/jetty drops couple of documents out of the batch. Regards On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote: Hi, Thank You Erick for your inputs. I tried creating batches of 1000 objects and indexing it to solr. The performance is way better than before but I find that number of indexed documents that is shown in the dashboard is lesser than the number of documents that I had actually indexed through solrj. My code is as follows: private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore ; private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json; private static JSONParser parser = new JSONParser(); private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL); public static void main(String[] args) throws IOException, SolrServerException, ParseException { File file = new File(JSON_FILE_PATH); Scanner scn=new Scanner(file,UTF-8); JSONObject object; int i = 0; CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); while(scn.hasNext()){ object= (JSONObject) parser.parse(scn.nextLine()); SolrInputDocument doc = indexJSON(object); batch.add(doc); if(i%1000==0){ System.out.println(Indexed + (i+1) + objects. ); solr.add(batch); batch = new ArrayListSolrInputDocument(); } i++; } solr.add(batch); solr.commit(); System.out.println(Indexed + (i+1) + objects. ); } public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException { CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); SolrInputDocument mainEvent = new SolrInputDocument(); mainEvent.addField(id, generateID()); mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage)); mainEvent.addField(EventUid, jsonOBJ.get(EventUid)); mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector)); mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType)); mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent)); mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC)); Object obj = parser.parse(jsonOBJ.get(User).toString()); JSONObject userObj = (JSONObject) obj; SolrInputDocument childUserEvent = new SolrInputDocument(); childUserEvent.addField(id, generateID()); childUserEvent.addField(User, userObj.get(User)); obj = parser.parse(jsonOBJ.get(EventDescription).toString()); JSONObject eventdescriptionObj = (JSONObject) obj; SolrInputDocument childEventDescEvent = new SolrInputDocument(); childEventDescEvent.addField(id, generateID()); childEventDescEvent.addField(EventApplicationName, eventdescriptionObj.get(EventApplicationName)); childEventDescEvent.addField(Query, eventdescriptionObj.get(Query)); obj= JSONValue.parse(eventdescriptionObj.get(Information).toString()); JSONArray informationArray = (JSONArray) obj; for(int i = 0; iinformationArray.size(); i++){ JSONObject domain = (JSONObject) informationArray.get(i); SolrInputDocument domainDoc = new SolrInputDocument(); domainDoc.addField(id, generateID()); domainDoc.addField(domainName, domain.get(domainName)); String s = domain.get(columns).toString(); obj= JSONValue.parse(s); JSONArray ColumnsArray = (JSONArray) obj; SolrInputDocument columnsDoc = new SolrInputDocument(); columnsDoc.addField(id, generateID()); for(int j = 0; jColumnsArray.size(); j++){ JSONObject ColumnsObj = (JSONObject) ColumnsArray.get(j); SolrInputDocument columnDoc = new SolrInputDocument(); 
columnDoc.addField(id, generateID()); columnDoc.addField(movieName, ColumnsObj.get(movieName)); columnsDoc.addChildDocument(columnDoc); } domainDoc.addChildDocument(columnsDoc); childEventDescEvent.addChildDocument(domainDoc); } mainEvent.addChildDocument(childEventDescEvent); mainEvent.addChildDocument(childUserEvent); return mainEvent; } I would be grateful if you could let me know what I am missing. On Sun, Jul 19, 2015 at 2:16 PM, Erick Erickson erickerick...@gmail.com wrote: First thing is it looks like you're only sending one document at a time, perhaps with child objects. This is not optimal at all. I usually batch my docs up in groups of 1,000, and there is anecdotal evidence that there may (depending on the docs) be some gains above that number. Gotta balance the batch size off against how bug the docs are of course. Assuming that you really are calling this method for one doc (and
Re: Issue with using createNodeSet in Solr Cloud
Ah, nice tip, thanks! This could also make scripts more portable too. Cheers, Savvas On 21 July 2015 at 08:40, Upayavira u...@odoko.co.uk wrote: Note, when you start up the instances, you can pass in a hostname to use instead of the IP address. If you are using bin/solr (which you should be!!) then you can use bin/solr -h my-host-name and that'll be used in place of the IP. Upayavira On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote: Glad you found a solution Best, Erick On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Erick, spot on! The nodes had been registered in zookeeper under my network interface's IP address...after specifying those the command worked just fine. It was indeed the thing I thought was true that wasn't... :) Many thanks, Savvas On 18 July 2015 at 20:47, Erick Erickson erickerick...@gmail.com wrote: P.S. It ain't the things ya don't know that'll kill ya, it's the things ya _do_ know that ain't so... On Sat, Jul 18, 2015 at 12:46 PM, Erick Erickson erickerick...@gmail.com wrote: Could you post your clusterstate.json? Or at least the live nodes section of your ZK config? (adminUIcloudtreelive_nodes. The addresses of my nodes are things like 192.168.1.201:8983_solr. I'm wondering if you're taking your node names from the information ZK records or assuming it's 127.0.0.1 On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Thanks Eric, The strange thing is that although I have set the log level to ALL I see no error messages in the logs (apart from the line saying that the response is a 400 one). I'm quite confident the configset does exist as the collection gets created fine if I don't specify the createNodeSet param. Complete mystery..! I'll keep on troubleshooting and report back with my findings. Cheers, Savvas On 17 July 2015 at 02:14, Erick Erickson erickerick...@gmail.com wrote: There were a couple of cases where the no live servers was being returned when the error was something completely different. Does the Solr log show something more useful? And are you sure you have a configset named collection_A? 'cause this works (admittedly on 5.x) fine for me, and I'm quite sure there are bunches of automated tests that would be failing so I suspect it's just a misleading error being returned. Best, Erick On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Hello There, I am trying to use the createNodeSet parameter when creating a new collection but I'm getting an error when doing so. More specifically, I have four Solr instances running locally in separate JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985, 127.0.0.1:8986 ) and a standalone Zookeeper instance which all Solr instances point to. The four Solr instances have no collections added to them and are all up and running (I can access the admin page in all of them). 
Now, I want to create a collections in only two of these four instances ( 127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance with the following URL: http://localhost:8983/solr/admin/collections?action=CREATEname=collection_AnumShards=1replicationFactor=2maxShardsPerNode=1createNodeSet=127.0.0.1:8983_solr,127.0.0.1:8984_solrcollection.configName=collection_A I am getting the following response: response lst name=responseHeader int name=status400/int int name=QTime3503/int /lst str name=Operation createcollection caused exception: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection collection_A. No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str lst name=exception str name=msg Cannot create collection collection_A. No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str int name=rspCode400/int /lst lst name=error str name=msg Cannot create collection collection_A. No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str int name=code400/int /lst /response The instances are definitely up and running (at least the admin console can be accessed as mentioned) and if I remove the createNodeSet parameter the collection is created as expected. Am I missing something obvious or is this a bug?
Re: Parsing and indexing parts of the input file paths
yes, unless it has been added consciously as a separate field. On Tue, Jul 21, 2015, at 09:40 PM, Andrew Musselman wrote: Thanks, so by the time we would get to an Analyzer the file path is forgotten? https://cwiki.apache.org/confluence/display/solr/Analyzers On Tue, Jul 21, 2015 at 1:27 PM, Upayavira u...@odoko.co.uk wrote: Solr generally does not interact with the file system in that way (with the exception of the DIH). It is the job of the code that pushes a file to Solr to process the filename and send that along with the request. See here for more info: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika You could provide literal.filename=blah/blah Upayavira On Tue, Jul 21, 2015, at 07:37 PM, Andrew Musselman wrote: I'm not sure, it's a remote team but will get more info. For now, assuming that a certain directory is specified, like /user/andrew/, and a regex is applied to capture anything two directories below matching */*/*.pdf. Would there be a way to capture the wild-carded values and index them as fields? On Tue, Jul 21, 2015 at 11:20 AM, Upayavira u...@odoko.co.uk wrote: Keeping to the user list (the right place for this question). More information is needed here - how are you getting these documents into Solr? Are you posting them to /update/extract? Or using DIH, or? Upayavira On Tue, Jul 21, 2015, at 06:31 PM, Andrew Musselman wrote: Dear user and dev lists, We are loading files from a directory and would like to index a portion of each file path as a field as well as the text inside the file. E.g., on HDFS we have this file path: /user/andrew/1234/1234/file.pdf And we would like the 1234 token parsed from the file path and indexed as an additional field that can be searched on. From my initial searches I can't see how to do this easily, so would I need to write some custom code, or a plugin? Thanks!
Re: Tips for faster indexing
Hi Upayavira, I guess that is the problem. I am currently using a function for generating an ID. It takes the current date and time to milliseconds and generates the id. This is the function. public static String generateID(){ Date dNow = new Date(); SimpleDateFormat ft = new SimpleDateFormat(yyMMddhhmmssMs); String datetime = ft.format(dNow); return datetime; } I believe that despite having a millisecond precision in the id generation, multiple objects are being assigned the same ID. Can you suggest a better way to generate the ID? Regards, Vineeth On Tue, Jul 21, 2015 at 1:29 PM, Upayavira u...@odoko.co.uk wrote: Are you making sure that every document has a unique ID? Index into an empty Solr, then look at your maxdocs vs numdocs. If they are different (maxdocs is higher) then some of your documents have been deleted, meaning some were overwritten. That might be a place to look. Upayavira On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote: I can confirm this behavior, seen when sending json docs in batch, never happens when sending one by one, but sporadic when sending batches. Like if sole/jetty drops couple of documents out of the batch. Regards On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote: Hi, Thank You Erick for your inputs. I tried creating batches of 1000 objects and indexing it to solr. The performance is way better than before but I find that number of indexed documents that is shown in the dashboard is lesser than the number of documents that I had actually indexed through solrj. My code is as follows: private static String SOLR_SERVER_URL = http://localhost:8983/solr/newcore ; private static String JSON_FILE_PATH = /home/vineeth/week1_fixed.json; private static JSONParser parser = new JSONParser(); private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL); public static void main(String[] args) throws IOException, SolrServerException, ParseException { File file = new File(JSON_FILE_PATH); Scanner scn=new Scanner(file,UTF-8); JSONObject object; int i = 0; CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); while(scn.hasNext()){ object= (JSONObject) parser.parse(scn.nextLine()); SolrInputDocument doc = indexJSON(object); batch.add(doc); if(i%1000==0){ System.out.println(Indexed + (i+1) + objects. ); solr.add(batch); batch = new ArrayListSolrInputDocument(); } i++; } solr.add(batch); solr.commit(); System.out.println(Indexed + (i+1) + objects. 
); } public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException { CollectionSolrInputDocument batch = new ArrayListSolrInputDocument(); SolrInputDocument mainEvent = new SolrInputDocument(); mainEvent.addField(id, generateID()); mainEvent.addField(RawEventMessage, jsonOBJ.get(RawEventMessage)); mainEvent.addField(EventUid, jsonOBJ.get(EventUid)); mainEvent.addField(EventCollector, jsonOBJ.get(EventCollector)); mainEvent.addField(EventMessageType, jsonOBJ.get(EventMessageType)); mainEvent.addField(TimeOfEvent, jsonOBJ.get(TimeOfEvent)); mainEvent.addField(TimeOfEventUTC, jsonOBJ.get(TimeOfEventUTC)); Object obj = parser.parse(jsonOBJ.get(User).toString()); JSONObject userObj = (JSONObject) obj; SolrInputDocument childUserEvent = new SolrInputDocument(); childUserEvent.addField(id, generateID()); childUserEvent.addField(User, userObj.get(User)); obj = parser.parse(jsonOBJ.get(EventDescription).toString()); JSONObject eventdescriptionObj = (JSONObject) obj; SolrInputDocument childEventDescEvent = new SolrInputDocument(); childEventDescEvent.addField(id, generateID()); childEventDescEvent.addField(EventApplicationName, eventdescriptionObj.get(EventApplicationName)); childEventDescEvent.addField(Query, eventdescriptionObj.get(Query)); obj= JSONValue.parse(eventdescriptionObj.get(Information).toString()); JSONArray informationArray = (JSONArray) obj; for(int i = 0; iinformationArray.size(); i++){ JSONObject domain = (JSONObject) informationArray.get(i); SolrInputDocument domainDoc = new SolrInputDocument(); domainDoc.addField(id, generateID()); domainDoc.addField(domainName, domain.get(domainName)); String s = domain.get(columns).toString(); obj= JSONValue.parse(s); JSONArray ColumnsArray = (JSONArray) obj; SolrInputDocument columnsDoc = new SolrInputDocument(); columnsDoc.addField(id,
Re: Tips for faster indexing
In Java: UUID.randomUUID(); That is what I'm using. Regards On 21 Jul 2015, at 22:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote: Hi Upayavira, I guess that is the problem. I am currently using a function for generating an ID. It takes the current date and time to milliseconds and generates the id. This is the function.

public static String generateID(){
    Date dNow = new Date();
    SimpleDateFormat ft = new SimpleDateFormat("yyMMddhhmmssMs");
    String datetime = ft.format(dNow);
    return datetime;
}

I believe that despite having a millisecond precision in the id generation, multiple objects are being assigned the same ID. Can you suggest a better way to generate the ID? Regards, Vineeth On Tue, Jul 21, 2015 at 1:29 PM, Upayavira u...@odoko.co.uk wrote: Are you making sure that every document has a unique ID? Index into an empty Solr, then look at your maxdocs vs numdocs. If they are different (maxdocs is higher) then some of your documents have been deleted, meaning some were overwritten. That might be a place to look. Upayavira On Tue, Jul 21, 2015, at 09:24 PM, solr.user.1...@gmail.com wrote: I can confirm this behavior, seen when sending json docs in batch, never happens when sending one by one, but sporadic when sending batches. Like if Solr/Jetty drops a couple of documents out of the batch. Regards On 21 Jul 2015, at 21:38, Vineeth Dasaraju vineeth.ii...@gmail.com wrote: Hi, Thank you Erick for your inputs. I tried creating batches of 1000 objects and indexing them to Solr. The performance is way better than before, but I find that the number of indexed documents shown in the dashboard is lower than the number of documents that I had actually indexed through SolrJ. My code is as follows:

private static String SOLR_SERVER_URL = "http://localhost:8983/solr/newcore";
private static String JSON_FILE_PATH = "/home/vineeth/week1_fixed.json";
private static JSONParser parser = new JSONParser();
private static SolrClient solr = new HttpSolrClient(SOLR_SERVER_URL);

public static void main(String[] args) throws IOException, SolrServerException, ParseException {
    File file = new File(JSON_FILE_PATH);
    Scanner scn = new Scanner(file, "UTF-8");
    JSONObject object;
    int i = 0;
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (scn.hasNext()) {
        object = (JSONObject) parser.parse(scn.nextLine());
        SolrInputDocument doc = indexJSON(object);
        batch.add(doc);
        if (i % 1000 == 0) {
            System.out.println("Indexed " + (i+1) + " objects.");
            solr.add(batch);
            batch = new ArrayList<SolrInputDocument>();
        }
        i++;
    }
    solr.add(batch);
    solr.commit();
    System.out.println("Indexed " + (i+1) + " objects.");
}

public static SolrInputDocument indexJSON(JSONObject jsonOBJ) throws ParseException, IOException, SolrServerException {
    Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    SolrInputDocument mainEvent = new SolrInputDocument();
    mainEvent.addField("id", generateID());
    mainEvent.addField("RawEventMessage", jsonOBJ.get("RawEventMessage"));
    mainEvent.addField("EventUid", jsonOBJ.get("EventUid"));
    mainEvent.addField("EventCollector", jsonOBJ.get("EventCollector"));
    mainEvent.addField("EventMessageType", jsonOBJ.get("EventMessageType"));
    mainEvent.addField("TimeOfEvent", jsonOBJ.get("TimeOfEvent"));
    mainEvent.addField("TimeOfEventUTC", jsonOBJ.get("TimeOfEventUTC"));

    Object obj = parser.parse(jsonOBJ.get("User").toString());
    JSONObject userObj = (JSONObject) obj;
    SolrInputDocument childUserEvent = new SolrInputDocument();
    childUserEvent.addField("id", generateID());
    childUserEvent.addField("User", userObj.get("User"));

    obj = parser.parse(jsonOBJ.get("EventDescription").toString());
    JSONObject eventdescriptionObj = (JSONObject) obj;
    SolrInputDocument childEventDescEvent = new SolrInputDocument();
    childEventDescEvent.addField("id", generateID());
    childEventDescEvent.addField("EventApplicationName", eventdescriptionObj.get("EventApplicationName"));
    childEventDescEvent.addField("Query", eventdescriptionObj.get("Query"));

    obj = JSONValue.parse(eventdescriptionObj.get("Information").toString());
    JSONArray informationArray = (JSONArray) obj;
    for (int i = 0; i < informationArray.size(); i++) {
        JSONObject domain = (JSONObject) informationArray.get(i);
        SolrInputDocument domainDoc = new SolrInputDocument();
        domainDoc.addField("id", generateID());
        domainDoc.addField("domainName", domain.get("domainName"));
        String s = domain.get("columns").toString();
        obj = JSONValue.parse(s);
        JSONArray ColumnsArray = (JSONArray) obj;
        SolrInputDocument columnsDoc = new SolrInputDocument();
        columnsDoc.addField("id", generateID());
        for (int j = 0;
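For reference, a minimal SolrJ sketch of the two fixes discussed in this thread: generate IDs with UUID.randomUUID() instead of a timestamp, and flush on batch size rather than on the loop index. The field names and URL are placeholders, not the poster's actual schema.

import java.util.ArrayList;
import java.util.Collection;
import java.util.UUID;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UuidBatchIndexer {

    // UUIDs are effectively collision-free, unlike a millisecond timestamp
    public static String generateID() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/newcore");
        Collection<SolrInputDocument> batch = new ArrayList<>();

        for (int i = 0; i < 10000; i++) {           // stand-in for the JSON file loop
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", generateID());
            doc.addField("RawEventMessage", "event " + i);
            batch.add(doc);

            if (batch.size() == 1000) {             // flush on batch size, not on loop index
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
        solr.close();
    }
}

After indexing into an empty core with unique IDs, maxDocs and numDocs should match, which is the check Upayavira suggests above.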
Re: IntelliJ setup
Try Invalidate Caches / Restart in IDEA and remove the .idea directory in the lucene-solr dir. After that, run ant idea and re-open the project. Also, you have to at least close the project, run ant idea and re-open it if switching between branches that have diverged too much (e.g., 4.10 and 5_x). Tue, 21 Jul 2015 at 21:53, Andrew Musselman andrew.mussel...@gmail.com: I followed the instructions here https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant idea`, but I'm still not getting the links in solr classes and methods; do I need to add libraries, or am I missing something else? Thanks! -- Best regards, Konstantin Gribov
Re: IntelliJ setup
Bingo, thanks! On Tue, Jul 21, 2015 at 4:12 PM, Konstantin Gribov gros...@gmail.com wrote: Try Invalidate Caches / Restart in IDEA and remove the .idea directory in the lucene-solr dir. After that, run ant idea and re-open the project. Also, you have to at least close the project, run ant idea and re-open it if switching between branches that have diverged too much (e.g., 4.10 and 5_x). Tue, 21 Jul 2015 at 21:53, Andrew Musselman andrew.mussel...@gmail.com: I followed the instructions here https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ, including `ant idea`, but I'm still not getting the links in solr classes and methods; do I need to add libraries, or am I missing something else? Thanks! -- Best regards, Konstantin Gribov
Re: WordDelimiterFilter Leading & Trailing Special Character
Upayavira, thanks for the helpful suggestion, that works. I was looking for an option to turn off/circumvent that particular WordDelimiterFilter's behavior completely. Since our indexes are hundred's of Terabytes, every time we find a term that needs to be added, it will be a cumbersome process to reload all the cores. thanks On Tue, Jul 21, 2015 at 12:57 AM, Upayavira u...@odoko.co.uk wrote: Looking at the javadoc for the WordDelimiterFilterFactory, it suggests this config: fieldType name=text_wd class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory protected=protectedword.txt preserveOriginal=0 splitOnNumerics=1 splitOnCaseChange=1 catenateWords=0 catenateNumbers=0 catenateAll=0 generateWordParts=1 generateNumberParts=1 stemEnglishPossessive=1 types=wdfftypes.txt / /analyzer /fieldType Note the protected=x attribute. I suspect if you put Yahoo! into a file referenced by that attribute, it may survive analysis. I'd be curious to hear whether it works. Upayavira On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote: Question about WordDelimiterFilter. The search behavior that we experience with WordDelimiterFilter satisfies well, except for the case where there is a special character either at the leading or trailing end of the term. For instance: *‘db’ * — Works as expected. Finds all docs with ‘db’. *‘p!nk’* — Works fine as above. But on cases when, there is a special character towards the trailing end of the term, like ‘Yahoo!’ *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the special character *‘!’* stripped out. This WordDelimiterFilter behavior is documented http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html What I would like to have is, the search performed without stripping out the leading trailing special character. Is there a way to achieve this behavior with WordDelimiterFilter. This is current config that we have for the field: fieldType name=text_wdf class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer /fieldType thanks
Re: WordDelimiterFilter Leading & Trailing Special Character
You can also use the types attribute to change the type of specific characters, such as to treat the ! or as an ALPHA. -- Jack Krupansky On Tue, Jul 21, 2015 at 7:43 PM, Sathiya N Sundararajan ausat...@gmail.com wrote: Upayavira, thanks for the helpful suggestion, that works. I was looking for an option to turn off/circumvent that particular WordDelimiterFilter's behavior completely. Since our indexes are hundred's of Terabytes, every time we find a term that needs to be added, it will be a cumbersome process to reload all the cores. thanks On Tue, Jul 21, 2015 at 12:57 AM, Upayavira u...@odoko.co.uk wrote: Looking at the javadoc for the WordDelimiterFilterFactory, it suggests this config: fieldType name=text_wd class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory protected=protectedword.txt preserveOriginal=0 splitOnNumerics=1 splitOnCaseChange=1 catenateWords=0 catenateNumbers=0 catenateAll=0 generateWordParts=1 generateNumberParts=1 stemEnglishPossessive=1 types=wdfftypes.txt / /analyzer /fieldType Note the protected=x attribute. I suspect if you put Yahoo! into a file referenced by that attribute, it may survive analysis. I'd be curious to hear whether it works. Upayavira On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote: Question about WordDelimiterFilter. The search behavior that we experience with WordDelimiterFilter satisfies well, except for the case where there is a special character either at the leading or trailing end of the term. For instance: *‘db’ * — Works as expected. Finds all docs with ‘db’. *‘p!nk’* — Works fine as above. But on cases when, there is a special character towards the trailing end of the term, like ‘Yahoo!’ *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the special character *‘!’* stripped out. This WordDelimiterFilter behavior is documented http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html What I would like to have is, the search performed without stripping out the leading trailing special character. Is there a way to achieve this behavior with WordDelimiterFilter. This is current config that we have for the field: fieldType name=text_wdf class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer /fieldType thanks
Re: Issue with using createNodeSet in Solr Cloud
Note, when you start up the instances, you can pass in a hostname to use instead of the IP address. If you are using bin/solr (which you should be!!) then you can use bin/solr -h my-host-name and that'll be used in place of the IP. Upayavira On Tue, Jul 21, 2015, at 05:45 AM, Erick Erickson wrote: Glad you found a solution Best, Erick On Mon, Jul 20, 2015 at 3:21 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Erick, spot on! The nodes had been registered in zookeeper under my network interface's IP address...after specifying those the command worked just fine. It was indeed the thing I thought was true that wasn't... :) Many thanks, Savvas On 18 July 2015 at 20:47, Erick Erickson erickerick...@gmail.com wrote: P.S. It ain't the things ya don't know that'll kill ya, it's the things ya _do_ know that ain't so... On Sat, Jul 18, 2015 at 12:46 PM, Erick Erickson erickerick...@gmail.com wrote: Could you post your clusterstate.json? Or at least the live nodes section of your ZK config? (adminUIcloudtreelive_nodes. The addresses of my nodes are things like 192.168.1.201:8983_solr. I'm wondering if you're taking your node names from the information ZK records or assuming it's 127.0.0.1 On Sat, Jul 18, 2015 at 8:56 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Thanks Eric, The strange thing is that although I have set the log level to ALL I see no error messages in the logs (apart from the line saying that the response is a 400 one). I'm quite confident the configset does exist as the collection gets created fine if I don't specify the createNodeSet param. Complete mystery..! I'll keep on troubleshooting and report back with my findings. Cheers, Savvas On 17 July 2015 at 02:14, Erick Erickson erickerick...@gmail.com wrote: There were a couple of cases where the no live servers was being returned when the error was something completely different. Does the Solr log show something more useful? And are you sure you have a configset named collection_A? 'cause this works (admittedly on 5.x) fine for me, and I'm quite sure there are bunches of automated tests that would be failing so I suspect it's just a misleading error being returned. Best, Erick On Thu, Jul 16, 2015 at 2:22 AM, Savvas Andreas Moysidis savvas.andreas.moysi...@gmail.com wrote: Hello There, I am trying to use the createNodeSet parameter when creating a new collection but I'm getting an error when doing so. More specifically, I have four Solr instances running locally in separate JVMs (127.0.0.1:8983, 127.0.0.1:8984, 127.0.0.1:8985, 127.0.0.1:8986 ) and a standalone Zookeeper instance which all Solr instances point to. The four Solr instances have no collections added to them and are all up and running (I can access the admin page in all of them). Now, I want to create a collections in only two of these four instances ( 127.0.0.1:8983, 127.0.0.1:8984) but when I hit one instance with the following URL: http://localhost:8983/solr/admin/collections?action=CREATEname=collection_AnumShards=1replicationFactor=2maxShardsPerNode=1createNodeSet=127.0.0.1:8983_solr,127.0.0.1:8984_solrcollection.configName=collection_A I am getting the following response: response lst name=responseHeader int name=status400/int int name=QTime3503/int /lst str name=Operation createcollection caused exception: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection collection_A. 
No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str lst name=exception str name=msg Cannot create collection collection_A. No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str int name=rspCode400/int /lst lst name=error str name=msg Cannot create collection collection_A. No live Solr-instances among Solr-instances specified in createNodeSet:127.0.0.1:8983_solr, 127.0.0.1:8984 _solr /str int name=code400/int /lst /response The instances are definitely up and running (at least the admin console can be accessed as mentioned) and if I remove the createNodeSet parameter the collection is created as expected. Am I missing something obvious or is this a bug? The exact Solr version I'm using is 4.9.1. Any pointers would be much appreciated. Thanks, Savvas
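Following up on the resolution of this thread (the node names in createNodeSet must match what is registered under live_nodes, not 127.0.0.1), here is a bare-bones Java sketch that issues the same Collections API call. The host names and ports are assumptions; replace them with the entries shown in the admin UI under Cloud > Tree > /live_nodes.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class CreateCollectionOnNodes {
    public static void main(String[] args) throws Exception {
        // Copy these exactly from the admin UI (Cloud > Tree > /live_nodes),
        // e.g. "192.168.1.201:8983_solr" rather than "127.0.0.1:8983_solr".
        String nodeSet = URLEncoder.encode(
                "192.168.1.201:8983_solr,192.168.1.201:8984_solr", "UTF-8");

        String url = "http://localhost:8983/solr/admin/collections?action=CREATE"
                + "&name=collection_A"
                + "&numShards=1"
                + "&replicationFactor=2"
                + "&maxShardsPerNode=1"
                + "&collection.configName=collection_A"
                + "&createNodeSet=" + nodeSet;

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            System.out.println(s.hasNext() ? s.next() : "");   // raw response from Solr
        }
    }
}

Starting the nodes with bin/solr -h my-host-name, as Upayavira notes, keeps those registered names predictable.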
Re: SOLR nrt read writes
Bhawna, I think you need to reconcile yourself to the fact that what you want to achieve is not going to be possible. Solr (and Lucene underneath it) is HEAVILY optimised for high read/low write situations, and that leads to some latency in content reaching the index. If you wanted to change this, you'd have to get into some heavy Java/Lucene coding, as I believe Twitter have done on Lucene itself. I'd say, rather than attempting to change this, I'd say you need to work out a way in your UI to handle this situation. E.g. have a refresh on stale results button, or not seeing your data, try here. Or, if a user submits data, then wants to search for it in the same session, have your UI enforce a minimum 10s delay before it sends a request to Solr, or something like that. Efforts to solve this at the Solr end, without spending substantial sums and effort on it, will be futile as it isn't what Solr/Lucene are designed for. Upayavira On Tue, Jul 21, 2015, at 06:02 AM, Bhawna Asnani wrote: Thanks, I tried turning off auto softCommits but that didn't help much. Still seeing stale results every now and then. Also load on the server very light. We are running this just on a test server with one or two users. I don't see any warning in logs whole doing softCommits and it says it successfully opened new searcher and registered it as main searcher. Could this be due to caching? I have tried to disable all in my solrconfig. Sent from my iPhone On Jul 20, 2015, at 12:16 PM, Shawn Heisey apa...@elyograg.org wrote: On 7/20/2015 9:29 AM, Bhawna Asnani wrote: Thanks for your suggestions. The requirement is still the same , to be able to make a change to some solr documents and be able to see it on subsequent search/facet calls. I am using softCommit with waitSearcher=true. Also I am sending reads/writes to a single solr node only. I have tried disabling caches and warmup time in logs is '0' but every once in a while I do get the document just updated with stale data. I went through lucene documentation and it seems opening the IndexReader with the IndexWriter should make the changes visible to the reader. I checked solr logs no errors. I see this in logs each time 'Registered new searcher Searcher@x' even before searches that had the stale document. I have attached my solrconfig.xml for reference. Your attachment made it through the mailing list processing. Most don't, I'm surprised. Some thoughts: maxBooleanClauses has been set to 40. This is a lot. If you actually need a setting that high, then you are sending some MASSIVE queries, which probably means that your Solr install is exceptionally busy running those queries. If the server is fairly busy, then you should increase maxTime on autoCommit. I use a value of five minutes (30) ... and my server is NOT very busy most of the time. A commit with openSearcher set to false is relatively fast, but it still has somewhat heavy CPU, memory, and disk I/O resource requirements. You have autoSoftCommit set to happen after five seconds. If updates happen frequently or run for very long, this is potentially a LOT of committing and opening new searchers. I guess it's better than trying for one second, but anything more frequent than once a minute is likely to get you into trouble unless the system load is extremely light ... but as already discussed, your system load is probably not light. 
For the kind of Near Real Time setup you have mentioned, where you want to do one or more updates, commit, and then query for the changes, you probably should completely remove autoSoftCommit from the config and *only* open new searchers with explicit soft commits. Let autoCommit (with a maxTime of 1 to 5 minutes) handle durability concerns. A lot of pieces in your config file are set to depend on java system properties just like the example does, but since we do not know what system properties have been set, we can't tell for sure what those parts of the config are doing. Thanks, Shawn
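A small sketch of what Shawn describes: keep autoCommit for durability, drop autoSoftCommit, and let the client open a searcher explicitly only when it needs its own update to be visible. The URL, core and field names are illustrative.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ExplicitSoftCommitExample {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "freshly updated");
        solr.add(doc);

        // waitFlush=true, waitSearcher=true, softCommit=true:
        // opens a new searcher (and blocks until it is registered) without a hard
        // commit, so the very next query sees the update. Durability still comes
        // from the periodic autoCommit configured in solrconfig.xml.
        solr.commit(true, true, true);

        System.out.println(solr.query(new SolrQuery("id:doc-1")).getResults().getNumFound());
        solr.close();
    }
}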
Re: WordDelimiterFilter Leading & Trailing Special Character
Looking at the javadoc for the WordDelimiterFilterFactory, it suggests this config: fieldType name=text_wd class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory protected=protectedword.txt preserveOriginal=0 splitOnNumerics=1 splitOnCaseChange=1 catenateWords=0 catenateNumbers=0 catenateAll=0 generateWordParts=1 generateNumberParts=1 stemEnglishPossessive=1 types=wdfftypes.txt / /analyzer /fieldType Note the protected=x attribute. I suspect if you put Yahoo! into a file referenced by that attribute, it may survive analysis. I'd be curious to hear whether it works. Upayavira On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote: Question about WordDelimiterFilter. The search behavior that we experience with WordDelimiterFilter satisfies well, except for the case where there is a special character either at the leading or trailing end of the term. For instance: *‘db’ * — Works as expected. Finds all docs with ‘db’. *‘p!nk’* — Works fine as above. But on cases when, there is a special character towards the trailing end of the term, like ‘Yahoo!’ *‘yahoo!’* — Turns out to be a search for just *‘yahoo’* with the special character *‘!’* stripped out. This WordDelimiterFilter behavior is documented http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html What I would like to have is, the search performed without stripping out the leading trailing special character. Is there a way to achieve this behavior with WordDelimiterFilter. This is current config that we have for the field: fieldType name=text_wdf class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory / filter class=solr.WordDelimiterFilterFactory splitOnCaseChange=0 generateWordParts=0 generateNumberParts=0 catenateWords=0 catenateNumbers=0 catenateAll=0 preserveOriginal=1 types=specialchartypes.txt/ filter class=solr.LowerCaseFilterFactory / /analyzer /fieldType thanks
Re: Solr Cloud: Duplicate documents in multiple shards
I suspect you can delete a document from the wrong shard by using update?distrib=false. I also suspect there are people here who would like to help you debug this, because it has been reported before, but we haven't yet been able to see whether it occurred due to human or software error. Upayavira On Tue, Jul 21, 2015, at 05:51 AM, mesenthil1 wrote: Thanks Erick for clarifying .. We are not explicitly setting the compositeId. We are using numShards=5 alone as part of the server start up. We are using uuid as unique field. One sample id is : possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30 Not sure how it would have gone to multiple shards. Do you have any suggestion for fixing this. Or we need to completely rebuild the index. When the routing key is compositeId, should we explicitly set ! with shard key? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218296.html Sent from the Solr - User mailing list archive at Nabble.com.
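A hedged sketch of Upayavira's suggestion: point a client directly at the core that holds the stray copy and send the delete with distrib=false so Solr does not route it away to the "correct" shard. The host and core name below are assumptions; the id is the example from this thread.

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;

public class DeleteFromWrongShard {
    public static void main(String[] args) throws Exception {
        // Point at the specific core of the shard holding the unwanted copy
        // (host and core name are illustrative).
        HttpSolrClient shardCore =
                new HttpSolrClient("http://shard3-host:8983/solr/collection1_shard3_replica1");

        UpdateRequest del = new UpdateRequest();
        del.deleteById("possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30");
        del.setParam("distrib", "false");   // keep the delete on this core only
        del.process(shardCore);

        shardCore.commit();
        shardCore.close();
    }
}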
Re: solr blocking and client timeout issue
We have a similar situation: production runs Java 7u10 (yes, we know its old!), and has custom GC options (G1 works well for us), and a 40Gb heap. We are a heavy user of NRT (sub-second soft-commits!), so that may be the common factor here. Every time we have tried a later Java 7 or Java 8, the heap blows up in no time at all. We are still investigating the root cause (we do need to migrate to Java 8), but I'm thinking that very high commit rates seem to be the common link here (and its not a common Solr use case I admit). I don't have any silver bullet answers to offer yet, but my suspicion/conjecture (no real evidence yet, I admit) is that the frequent commits are leaving temporary objects around (which they are entitled to do), and something has changed in the GC in later Java 7/8 which means they are slower to get rid of those, hence the overall heap usage is higher under this use case. @Jeremy, you don't have a lot of head room, but try a higher heap size? Could you go to 6Gb and see if that at least delays the issue? Erick is correct though, if you can reduce the commit rate, I'm sure that would alleviate the issue. On 21 July 2015 at 05:31, Erick Erickson erickerick...@gmail.com wrote: bq: the config is set up per the NRT suggestions in the docs. autoSoftCommit every 2 seconds and autoCommit every 10 minutes. 2 second soft commit is very aggressive, no matter what the NRT suggestions are. My first question is whether that's really needed. The soft commits should be as long as you can stand. And don't listen to your product manager who says 2 seconds is required, push back and answer whether that's really necessary. Most people won't notice the difference. bq: ...we are noticing a lot higher number of hard commits than usual. Is a client somewhere issuing a hard commit? This is rarely recommended... And is openSearcher true or false? False is a relatively cheap operation, true is quite expensive. More than you want to know about hard and soft commits: https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best, Erick Best, Erick On Mon, Jul 20, 2015 at 12:48 PM, Jeremy Ashcraft jashcr...@edgate.com wrote: heap is already at 5GB On 07/20/2015 12:29 PM, Jeremy Ashcraft wrote: no swapping that I'm seeing, although we are noticing a lot higher number of hard commits than usual. the config is set up per the NRT suggestions in the docs. autoSoftCommit every 2 seconds and autoCommit every 10 minutes. 
there have been 463 updates in the past 2 hours, all followed by hard commits INFO - 2015-07-20 12:26:20.979; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} INFO - 2015-07-20 12:26:21.021; org.apache.solr.core.SolrDeletionPolicy; SolrDeletionPolicy.onCommit: commits: num=2 commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@ /opt/solr/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nk,generation=665696} commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@ /opt/solr/solr/collection1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@524b89bd; maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_e9nl,generation=665697} INFO - 2015-07-20 12:26:21.022; org.apache.solr.core.SolrDeletionPolicy; newest commit generation = 665697 INFO - 2015-07-20 12:26:21.026; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush INFO - 2015-07-20 12:26:21.026; org.apache.solr.update.processor.LogUpdateProcessor; [collection1] webapp=/solr path=/update params={omitHeader=falsewt=json} {add=[8653ea29-a327-4a54-9b00-8468241f2d7c (1507244513403338752), 5cf034a9-d93a-4307-a367-02cb21fa8e35 (1507244513404387328), 816e3a04-9d0e-4587-a3ee-9f9e7b0c7d74 (1507244513405435904)],commit=} 0 50 could that be causing some of the problems? From: Shawn Heisey apa...@elyograg.org Sent: Monday, July 20, 2015 11:44 AM To: solr-user@lucene.apache.org Subject: Re: solr blocking and client timeout issue On 7/20/2015 11:54 AM, Jeremy Ashcraft wrote: I'm ugrading to the 1.8 JDK on our dev VM now and testing. Hopefully i can get production upgraded tonight. still getting the big GC pauses this morning, even after applying the GC tuning options. Everything was fine throughout the weekend. My biggest concern is that this instance had been running with no issues for almost 2 years, but these GC issues started just last week. It's very possible that you're simply going to need a larger heap than you have needed in the past, either because your index has grown, or because your query patterns have changed and now your queries need more memory.
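If a client really is issuing those hard commits, the usual alternative is to stop committing from the client and request visibility with commitWithin, leaving hard commits to autoCommit. A rough SolrJ sketch (URL and fields are placeholders; the id is taken from the log above):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "8653ea29-a327-4a54-9b00-8468241f2d7c");
        doc.addField("title", "example update");

        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setCommitWithin(30000);   // visible within ~30s; no explicit commit from the client
        req.process(solr);

        solr.close();
    }
}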
Performance of facet contain search in 5.2.1
I found that a facet.contains search takes much longer than a facet.prefix search. Does anyone have an idea how to make the contains search faster? org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+care&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852} hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+care&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true} hits=1856 status=0 QTime=10951 As shown above, the prefix search takes 5 ms but the contains search takes 10951 ms. Thanks.
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Erick, I found another thing, I did check the number of unique terms for this field using schema browser, It reported 1683404 number of terms! Does it exceed the maximum number of unique terms for fcs facet method? I read somewhere it should be more than 16m does it true?! Best regards. On Tue, Jul 21, 2015 at 10:00 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear Erick, Actually faceting on this field is not a user wanted application. I did that for the purpose of testing the customized normalizer and charfilter which I used. Therefore it just used for the purpose of testing. Anyway I did some googling on this error and It seems that changing facet method to enum works in other similar cases too. I dont know the differences between fcs and enum methods on calculating facet behind the scene, but it seems that enum works better in my case. Best regards. On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson erickerick...@gmail.com wrote: This really seems like an XY problem. _Why_ are you faceting on a tokenized field? What are you really trying to accomplish? Because faceting on a generalized content field that's an analyzed field is often A Bad Thing. Try going into the admin UI Schema Browser for that field, and you'll see how many unique terms you have in that field. Faceting on that many unique terms is rarely useful to the end user, so my suspicion is that you're not doing what you think you are. Or you have an unusual use-case. Either way, we need to understand what use-case you're trying to support in order to respond helpfully. You say that using facet.enum works, this is very surprising. That method uses the filterCache to create a bitset for each unique term. Which is totally incompatible with the uninverted field error you're reporting, so I clearly don't understand something about your setup. Are you _sure_? Best, Erick On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian alinazem...@gmail.com wrote: Dear Toke and Davidphilip, Hi, The fieldtype text_fa has some custom language specific normalizer and charfilter, here is the schema.xml value related for this field: fieldType name=text_fa class=solr.TextField positionIncrementGap=100 analyzer type=index charFilter class=com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory/ tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_fa.txt / /analyzer analyzer type=query charFilter class=com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory/ tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=lang/stopwords_fa.txt / /analyzer /fieldType I did try the facet.method=enum and it works fine. Did you mean that actually applying facet on analyzed field is wrong? Best regards. On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Ali Nazemian alinazem...@gmail.com wrote: I have a collection of 1.6m documents in Solr 5.2.1. [...] Caused by: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content [...] field name=content type=text_fa stored=true indexed=true default=noval termVectors=true termPositions=true termOffsets=true/ You are hitting an internal limit in Solr. 
As davidphilip tells you, the solution is docValues, but they cannot be enabled for text fields. You need String fields, but the name of your field suggests that you need analysis & tokenization, which cannot be done on String fields. Would you please help me to solve this problem? With the information we have, it does not seem to be easy to solve: It seems like you want to facet on all terms in your index. As they need to be String (to use docValues), you would have to do all the splitting on white space, normalization etc. outside of Solr. - Toke Eskildsen -- A.Nazemian -- A.Nazemian -- A.Nazemian
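To make Toke's last point concrete: since a docValues facet field must be a String field, the splitting and normalization has to happen in the indexing client. A rough sketch, assuming a multiValued string field named content_terms with docValues="true" exists in schema.xml; the field name and the trivial whitespace tokenization are my assumptions, and whatever the Farsi char filter and normalizer do would have to be reproduced client-side.

import java.util.Locale;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ClientSideTokenizer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        String content = "Some body text to be faceted ...";

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("content", content);   // the analyzed text_fa field, unchanged

        // Crude stand-in for the analysis chain: lower-case and split on whitespace.
        // A String docValues field is not analyzed by Solr, so all normalization
        // must be done here before indexing.
        for (String term : content.toLowerCase(Locale.ROOT).split("\\s+")) {
            doc.addField("content_terms", term);   // multiValued string field, docValues="true"
        }

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}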
Re: Performance of facet contain search in 5.2.1
Hi Dave, generally giving terms in a dictionary, it's much more efficient to run prefix queries than contain queries. Talking about using docValues, if I remember well when they are loaded in memory they are skipList, so you can use two operators on them : - next() that simply gives you ht next field value for the field doc values loaded - advance ( ByteRef term) which jump to the term of the greatest term if the one searched is missing. Using the facet prefix we can jump to the point we want and basically iterate the values that are matching. To verify the contains, it is simply used on each term in the docValues, term by term, using the StringUtil.contains() . How many different unique terms do you have in the index for that field ? So the difference in performance could make sense ( we are basically moving to logarithmic to linear to simplify) . I read the name of the field as facet.field=autocomplete, it's legit to ask you if you are using faceting to obtain infix auto completion ? In the case, can you help us, better identifying the problem and maybe provide you with a better solution ? Cheers 2015-07-21 9:16 GMT+01:00 Lo Dave dav...@hotmail.com: I found that facet contain search take much longer time than facet prefix search. Do anyone have idea how to make contain search faster? org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.prefix=duty+of+carerows=1wt=jsonfacet=true_=1437462916852} hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select params={q=sentence:duty+of+carefacet.field=autocompleteindent=truefacet.contains=duty+of+carerows=1wt=jsonfacet=truefacet.contains.ignoreCase=true} hits=1856 status=0 QTime=10951 As show above, prefix search take 5 but contain search take 10951 Thanks. -- -- Benedetti Alessandro Visiting card - http://about.me/alessandro_benedetti Blog - http://alexbenedetti.blogspot.co.uk Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
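To illustrate the prefix-vs-contains difference from SolrJ, a small sketch using the field and core names from the log above. The phrase quoting of the query is my assumption, and facet.contains has no typed setter in SolrQuery as far as I know, so it is set as a raw parameter.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetPrefixVsContains {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/concordance");

        // Fast path: facet.prefix can seek straight to "duty of care" in the sorted term dictionary.
        SolrQuery prefix = new SolrQuery("sentence:\"duty of care\"");
        prefix.setFacet(true);
        prefix.addFacetField("autocomplete");
        prefix.setFacetPrefix("duty of care");
        prefix.setRows(1);
        QueryResponse rsp = solr.query(prefix);
        System.out.println(rsp.getFacetField("autocomplete").getValues());

        // Slow path for comparison: contains must test every term in the field, one by one.
        SolrQuery contains = new SolrQuery("sentence:\"duty of care\"");
        contains.setFacet(true);
        contains.addFacetField("autocomplete");
        contains.set("facet.contains", "duty of care");
        contains.set("facet.contains.ignoreCase", "true");
        contains.setRows(1);
        System.out.println(solr.query(contains).getFacetField("autocomplete").getValues());

        solr.close();
    }
}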
RE: Programmatically find out if node is overseer
Hello - this approach not only solves the problem but also allows me to run different processing threads on other nodes. Thanks! Markus -Original message- From:Chris Hostetter hossman_luc...@fucit.org Sent: Saturday 18th July 2015 1:00 To: solr-user solr-user@lucene.apache.org Subject: Re: Programmatically find out if node is overseer : Hello - i need to run a thread on a single instance of a cloud so need : to find out if current node is the overseer. I know we can already : programmatically find out if this replica is the leader of a shard via : isLeader(). I have looked everywhere but i cannot find an isOverseer. I At one point, i worked up a utility method to give internal plugins access to an isOverseer() type utility method... https://issues.apache.org/jira/browse/SOLR-5823 ...but ultimately i abandoned this because i was completely forgetting (until much much too late) that there's really no reason to assume that any/all collections will have a single shard on the same node as the overseer -- so having a plugin that only does stuff if it's running on the overseer node is a really bad idea, because it might not run at all. (even if it's configured in every collection) what i ultimately wound up doing (see SOLR-5795) is implementing a solution where every core (of each collection configured to want this functionality) has a thread running (a TimedExecutor) which would do nothing unless... * my slice is active? (ie: not in the process of being shut down) * my slice is 'first' in a sorted list of slices? * i am currently the leader of my slice? ...that way when the timer goes off every X minutes, at *most* one thread fires (we might sporadically get no events triggered if/when there is leader election in progress for the slice that matters) the choice of first slice name alphabetically is purely because it's something cheap to compute and guaranteed to be unique. If you truly want exactly one thread for the entire cluster, regardless of collection, you could do the same basic idea by just adding a "my collection is 'first' in a sorted list of collection names?" check. -Hoss http://www.lucidworks.com/
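A sketch of the check Hoss describes, as it might look inside such a per-core timer task. The ZkStateReader/ClusterState accessors are from the Solr API; how you obtain the local collection, shard and coreNodeName depends on your plugin, so they are passed in as parameters here, and the "slice is active" check is omitted for brevity.

import java.util.TreeMap;

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;

public class SingletonWorkGate {

    /**
     * Returns true only on the replica that is currently the leader of the
     * alphabetically first slice of the given collection, so at most one
     * thread in the cluster does the periodic work.
     */
    public static boolean shouldRunHere(ZkStateReader zkStateReader,
                                        String collection,
                                        String myShard,
                                        String myCoreNodeName) {
        ClusterState clusterState = zkStateReader.getClusterState();
        DocCollection coll = clusterState.getCollection(collection);

        // Sort slice names and take the first one; cheap to compute and unique.
        TreeMap<String, Slice> slices = new TreeMap<>(coll.getSlicesMap());
        String firstSlice = slices.firstKey();
        if (!firstSlice.equals(myShard)) {
            return false;                      // not the designated slice
        }

        Replica leader = slices.get(firstSlice).getLeader();
        // During leader election there may briefly be no leader: skip this tick.
        return leader != null && myCoreNodeName.equals(leader.getName());
    }
}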
Re: Use REST API URL to update field
curl is just a command line HTTP client. You can use HTTP POST to send the JSON that you are mentioning below via any means that works for you - the file does not need to exist on disk - it just needs to be added to the body of the POST request. I'd say review how to do HTTP POST requests from your chosen programming language and you should see how to do this. Upayavira On Tue, Jul 21, 2015, at 04:12 AM, Zheng Lin Edwin Yeo wrote: Hi Shawn, So it means that if my following is in a text file called update.txt, {id:testing_0001, popularity:{inc:1} This text file must still exist if I use the URL? Or can this information in the text file be put directly onto the URL? Regards, Edwin On 20 July 2015 at 22:04, Shawn Heisey apa...@elyograg.org wrote: On 7/20/2015 2:06 AM, Zheng Lin Edwin Yeo wrote: I'm using Solr 5.2.1, and I would like to check, is there a way to update certain field by using REST API URL directly instead of using curl? For example, I would like to increase the popularity field in my index each time a user click on the record. Currently, it can work with the curl command by having this in my text file to be read by curl (the id is hard-coded here for example purpose) {id:testing_0001, popularity:{inc:1} Is there a REST API URL that I can call to achieve the same purpose? The URL that you would use with curl *IS* the URL that you would use for a REST-like call. Thanks, Shawn
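For completeness, the same atomic "inc" update done from Java rather than curl; with SolrJ a nested map plays the role of the {"id":"testing_0001", "popularity":{"inc":1}} JSON body. The URL and core name are assumptions.

import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IncrementPopularity {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");

        // Equivalent of POSTing the atomic-update JSON to /update
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "testing_0001");
        doc.addField("popularity", Collections.singletonMap("inc", 1));  // atomic increment

        solr.add(doc);
        solr.commit();   // or rely on commitWithin / autoCommit instead
        solr.close();
    }
}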
Re: Solr Cloud: Duplicate documents in multiple shards
Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. Is there a way we can see the generated hash key and mapping them to the specific shard? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr Cloud: Duplicate documents in multiple shards
When are you generating the UUID exactly? If you set the unique ID field on an update, and it contains a new UUID, you have effectively created a new document. Just a thought. -Original Message- From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] Sent: Tuesday, July 21, 2015 4:11 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud: Duplicate documents in multiple shards Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. Is there a way we can see the generated hash key and mapping them to the specific shard? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html Sent from the Solr - User mailing list archive at Nabble.com. * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
RE: Solr Cloud: Duplicate documents in multiple shards
Also, the function used to generate hashes is org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit value. The range of the hash values assigned to each shard are resident in Zookeeper. Since you are using only a single hash component, all 32-bits will be used by the entire ID field value. I.e. I see no routing delimiter (!) in your example ID value: possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30 Which isn't required, but it means that documents (logs?) will be distributed in a round-robin fashion over the shards. Not grouped by host or environment (if I am reading it right). You might consider the following: environment!hostname!UUID E.g. intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30 This way documents from the same host will be grouped together, most likely on the same shard. Further, within the same environment, documents will be grouped on the same subset of shards. This will allow client applications to set _route_=environment! or _route_=environment!hostname! and limit queries to those shards containing relevant data when the corresponding filter queries are applied. If you were using route delimiters, then the default for a 2-part key (1 delimiter) is to use 16 bits for each part. The default for a 3-part key (2 delimiters) is to use 8-bits each for the 1st 2 parts and 16 bits for the 3rd part. In any case, the high-order bytes of the hash dominate the distribution of data. -Original Message- From: Reitzel, Charles Sent: Tuesday, July 21, 2015 9:55 AM To: solr-user@lucene.apache.org Subject: RE: Solr Cloud: Duplicate documents in multiple shards When are you generating the UUID exactly? If you set the unique ID field on an update, and it contains a new UUID, you have effectively created a new document. Just a thought. -Original Message- From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] Sent: Tuesday, July 21, 2015 4:11 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud: Duplicate documents in multiple shards Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. Is there a way we can see the generated hash key and mapping them to the specific shard? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html Sent from the Solr - User mailing list archive at Nabble.com. * This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA-CREF *
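To answer the "can we see the generated hash key" question, a small sketch that recomputes the routing hash the way I understand CompositeIdRouter handles an id with no "!" separator (the seed of 0 is my assumption to verify); the result can then be compared by hand against each shard's hash range in clusterstate.json.

import org.apache.solr.common.util.Hash;

public class WhichShard {
    public static void main(String[] args) {
        String id = "possting.mongo-v2.services.com-intl-staging-"
                  + "c2d2a376-5e4a-11e2-8963-0026b9414f30";

        // murmurhash3_x86_32 over the whole id, as used when there is no route key.
        int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);

        // Compare this value against the "range" of each shard in clusterstate.json
        // (hex values such as 80000000-b332ffff, read as signed 32-bit integers).
        System.out.printf("id=%s%nhash=%08x%n", id, hash);
    }
}

Note this only tells you which shard a given id should route to today; it will not by itself explain how a second copy ended up on another shard.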
Re: Installing Banana on Solr 5.2.1
On Tue, Jul 21, 2015, at 02:00 AM, Shawn Heisey wrote: On 7/20/2015 5:45 PM, Vineeth Dasaraju wrote: I am trying to install Banana on top of solr but haven't been able to do so. All the procedures that I get are for an earlier version of solr. Since the directory structure has changed in the new version, inspite of me placing the banana folder under the server/solr-webapp/webapp folder, I am not able to access it using the url localhost:8983/banana/src/index.html#/dashboard. I would appreciate it if someone can throw some more light into how I can do it. I think you would also need an xml file in server/contexts that tells Jetty how to load the application. I cloned the git repository for banana, and I see jetty-contexts/banana-context.xml there. I would imagine that copying this xml file into server/contexts and copying the banana.war generated by ant build-war into server/webapps would be enough to install it. If what I have said here is not enough to help you, then your best bet for help with this is to talk to Lucidworks. They know Solr REALLY well. I just tried it with the latest Solr. I downloaded v1.5.0.tgz and unpacked it. I moved the contents of the src directory into server/solr-webapp/webapp/banana then visited http://localhost:8983/solr/banana/index.html and it loaded up. I then needed to click the cog in the top right and change the collection it was accessing from collection1 to something that was actually there. From there, I assume the rest of it will work fine - my test system didn't have any data in it for me to confirm that. Upayavira
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
On Tue, Jul 21, 2015 at 3:09 AM, Ali Nazemian alinazem...@gmail.com wrote: Dear Erick, I found another thing, I did check the number of unique terms for this field using schema browser, It reported 1683404 number of terms! Does it exceed the maximum number of unique terms for fcs facet method? The real limit is not simple since the data is not stored in a simple way (it's compressed). I read somewhere it should be more than 16m does it true?! More like 16MB of delta-coded terms per block of documents (the index is split up into 256 blocks for this purpose) See DocTermOrds.java if you want more details than that. -Yonik
Re: Data Import Handler Stays Idle
There are some zip files inside the directory that are referenced in the database. I'm thinking those are the ones it's jumping right over. They are not the issue. At least I'm 95% sure. And Shawn, if you're still watching, I'm sorry: I'm using solr-5.1.0. -- View this message in context: http://lucene.472066.n3.nabble.com/Data-Import-Handler-Stays-Idle-tp4218250p4218371.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data Import Handler Stays Idle
On 7/21/2015 8:17 AM, Paden wrote: There are some zip files inside the directory and have been addressed to in the database. I'm thinking those are the one's it's jumping right over. They are not the issue. At least I'm 95% sure. And Shawn if you're still watching I'm sorry I'm using solr-5.1.0. Have you started Solr with a larger heap than the default 512MB in Solr 5.x? Tika can require a lot of memory. I would have expected there to be OutOfMemoryError exceptions in the log if that were the problem, though. You may need to use the -m option on the startup scripts to increase the max heap. Starting with -m 2g would be a good idea. Also, seeing the entire multi-line IOException from the log (which may be dozens of lines) could be important. Thanks, Shawn