backport Heliosearch features to Solr
As many of you know, I've been doing some work in the experimental heliosearch fork of Solr over the past year. I think it's time to bring some more of those changes back. So here's a poll: Which Heliosearch features do you think should be brought back to Apache Solr? http://bit.ly/1E7wi1Q (link to google form) -Yonik
Using HDFS with Solr
Hello. I have a question about using HDFS with Solr. I noticed that when one of the shard nodes goes down, another node takes over its shard, as shown in this graph from the admin console (10.62.65.46 is gone):

collection-hdfs ---+- shard 1 - 10.62.65.48 (active)
                   +- shard 2 - 10.62.65.47 (active)
                   +- shard 3 - 10.62.65.48 (active)

When 10.62.65.46 is restarted, shard 1 is still assigned to the 10.62.65.48 node. Is that right? I expected shard 1 to be assigned back to the 10.62.65.46 node instead of the 10.62.65.48 node. Please advise. Thanks. -- - BLOG : http://www.codingstar.net -
solr5 - where does solr5 look for schema files?
I am running the out-of-the-box solr5 as instructed in the tutorial. The Solr documentation says nothing useful about the schema file argument to create core. I have a schema.xml that I was using for a Solr 4 installation by manually editing the core directories as root. When playing with solr5, I have tried a number of things without success.

a) Copied my custom schema.xml to server/solr/configsets/basic_configs/conf/custom_schema.xml. When I typed custom_schema.xml into the schema: field in the create core dialog, a core was created but the new schema wasn't used. Making custom_schema.xml into invalid XML doesn't break anything.

b) Put custom_schema.xml in an accessible location on my server and entered the full path into the schema field. In this case I got an error message: Error CREATEing SolrCore 'xxx': Unable to create core ... Invalid path string /configs/gettingstarted//.../custom_schema.xml

There is no configs directory in the Solr installation. There is no gettingstarted directory either, though there are gettingstarted_shard1_replica1 etc. directories. The only meaningful schema.xml seems to be server/solr/configsets/basic_configs/conf/schema.xml. The cores are created in example/cloud/node*/solr. There is no directory structure in the installation matching that described in the 500 page PDF. The files screen in the admin console does not mention schema.xml, and there doesn't seem to be any place naming or showing schema.xml in the admin interface. So how in the world is one to install a custom schema? Thanks, Gulliver
Re: solr5 - where does solr5 look for schema files?
You haven't stated it explicitly, but I think you're running SolrCloud, right? In which case the configs are all stored in ZooKeeper, and you don't edit them there. The startup scripts automate the upconfig step that pushes your configs to ZooKeeper. Thereafter, they are read from ZooKeeper by each Solr node on startup, but not stored locally on each node. Otherwise, keeping all the nodes coordinated would be difficult. You can see the uploaded configs in the Solr admin UI under Cloud > Tree > /configs. So you keep your configs somewhere (some kind of VCS is recommended) and, when you make changes to them, push the results to ZooKeeper and either restart or reload your collection. Did you see the documentation at https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files?

And assuming I'm right and you're using SolrCloud, I _strongly_ suggest you try to think in terms of replicas rather than cores. In particular, avoid the old, familiar core admin API and instead use the collections API (see the ref guide). You can do pretty much anything with the collections API that you used to do with the core admin API, and with a lot less chance of getting something wrong. The collections API makes use of the individual core admin API calls to carry out the instructed tasks as necessary. All that said, the new way of doing things is a bit of a shock to the system if you're an old Solr hand, especially in SolrCloud. Best, Erick

On Sun, Mar 1, 2015 at 4:58 PM, Gulliver Smith gulliver.m.sm...@gmail.com wrote: I am running the out-of-the-box solr5 as instructed in the tutorial. The solr documentation has no useful documentation about the schema file argument to create core. snip
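Erick's workflow (keep configs in version control, push them to ZooKeeper, then reload) can be sketched with the scripts that ship with Solr 5. This is only a sketch; the ZK host/port, config directory, config name, and collection name below are assumptions, not values from this thread:

```
# Push an edited config set from local disk into ZooKeeper
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 \
  -cmd upconfig -confdir /path/to/myconfigs/conf -confname myconf

# Then reload the collection so it picks up the new config
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'
```

After the reload, the updated files are visible in the admin UI under Cloud > Tree > /configs/myconf.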
Re: Getting started with Solr
OK, got it, works now. Maybe you can advise on something more general? I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to crawl a list of webpages built according to a certain template, and analyze certain fields in their HTML (identified by a span class and consisting of a number), then output results as CSV to generate a list with the website's domain and the sum of the numbers in all the specified fields. How should I set up the flow? Should I configure Nutch to only pull the relevant fields from each page, then use Solr to add the integers in those fields and output to a CSV? Or should I use Nutch to pull in everything from the relevant page and then use Solr to strip out the relevant fields and process them as above? Can I do the processing strictly in Solr, using the stuff found here https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations, or should I use PHP through Solarium or something along those lines? Your advice would be appreciated; I don't want to reinvent the wheel. Sincerely, Baruch Kogan Marketing Manager Seller Panda http://sellerpanda.com +972(58)441-3829 baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan bar...@sellerpanda.com wrote: Thanks for bearing with me. I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:

Welcome to the SolrCloud example!
This interactive session will help you launch a SolrCloud cluster on your local workstation.
To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
Please enter the port for node1 [8983] 8983
Please enter the port for node2 [7574] 7574
Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1 into /home/ubuntu/crawler/solr/example/cloud/node2
Starting up SolrCloud node1 on port 8983 using command:
solr start -cloud -s example/cloud/node1/solr -p 8983

I then go to http://localhost:8983/solr/admin/cores and get the following: "This XML file does not appear to have any style information associated with it. The document tree is shown below."

<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
  <lst name="initFailures"/>
  <lst name="status">
    <lst name="testCollection_shard1_replica1">
      <str name="name">testCollection_shard1_replica1</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.296Z</date>
      <long name="uptime">46380</long>
      <lst name="index">
        <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long><long name="version">1</long>
        <int name="segmentCount">0</int><bool name="current">true</bool><bool name="hasDeletions">false</bool>
        <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
        <lst name="userData"/>
        <long name="sizeInBytes">71</long><str name="size">71 bytes</str>
      </lst>
    </lst>
    <lst name="testCollection_shard1_replica2">
      <str name="name">testCollection_shard1_replica2</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.751Z</date>
      <long name="uptime">45926</long>
      <lst name="index">
        <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long><long name="version">1</long>
        <int name="segmentCount">0</int><bool name="current">true</bool><bool name="hasDeletions">false</bool>
        <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
        <lst name="userData"/>
        <long name="sizeInBytes">71</long><str name="size">71 bytes</str>
      </lst>
    </lst>
    <lst name="testCollection_shard2_replica1">
      <str name="name">testCollection_shard2_replica1</str>
      <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
      <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
      <str name="config">solrconfig.xml</str>
      <str name="schema">schema.xml</str>
      <date name="startTime">2015-03-01T06:59:12.596Z</date>
      <long name="uptime">46081</long>
      <lst name="index">
        <int name="numDocs">0</int><int name="maxDoc">0</int><int name="deletedDocs">0</int>
        <long name="indexHeapUsageBytes">0</long>
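For the extract-and-sum step Baruch describes (numbers inside spans of a known class, summed per page), here is a minimal stdlib sketch of that processing, independent of whether it ultimately lives in Nutch, Solr, or PHP. The class name "price" and the sample HTML are made-up assumptions for illustration:

```python
from html.parser import HTMLParser

class SpanNumberSummer(HTMLParser):
    """Collects the numeric text of every <span class="..."> matching target_class."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.in_target = False
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        # Enter "collecting" mode only for spans with the wanted class attribute
        if tag == "span" and dict(attrs).get("class") == self.target_class:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target and data.strip():
            self.numbers.append(int(data.strip()))

def sum_span_numbers(html, target_class):
    parser = SpanNumberSummer(target_class)
    parser.feed(html)
    return sum(parser.numbers)

html = '<div><span class="price">3</span><span class="other">9</span><span class="price">4</span></div>'
print(sum_span_numbers(html, "price"))  # 7
```

The same logic could be pushed into Solr at index time (e.g. summing the extracted field values) or done as a post-processing step over query results.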
Correct connection methodology for Zookeeper/SolrCloud?
Hi. I'm really after best practice guidelines for making queries to an index on a Solr cluster. I'm not calling from Java. I have Solr 4.10.2 up and running, seems stable. I have about 6 indexes/collections. I am running SolrCloud with two Solr instances (both currently running on the same dev box, just one shard each) and standalone ZooKeeper with 3 instances. All seems fine. I can do queries against either instance, and perform index updates, and replication works fine. I'm not using Java to talk to Solr; the web pages are built with PHP (or something similar; happy to call zk/Solr from C). So I need to call Solr from the web page code. Clearly I need resilience and so don't want to specifically call one of the Solr instances directly. I could just set up a load balancer on the two Solr instances and let client query requests use the load balancer to find a working instance. From what I have read, though, I am supposed to make a call to ZooKeeper to ask which Solr instances are running up-to-date and working replicas of the collection that I need. Is that right? Should I do that every time I need to make a query? There seems to be a ZooKeeper client library in the zk dist, in zookeeper-3.4.6/src/c/; can I use that? It looks like I can pass in a list of potential zk host:port pairs and it will find a working zk for me; is that right? Then I need to ask the working zk which Solr instance I should connect to for the given index/collection. How do I do that? Is that held in clusterstate.json? So the steps to make a Solr query against my cluster would be:

a) call the zk client library with a list of zk host/ports
b) ask zk for clusterstate.json
c) pick an active server (at random) for the relevant collection (is there some load balancing option in there?)
d) call the Solr server returned by (c)

Is that best practice, or am I missing something? -- Cheers Jules.
Conditional invocation of HTMLStripCharFactory
Is it possible to make a conditional invocation of a HTMLStripCharFactory? I want to decide when to enable or disable it according to the value of a specific field in my document. E.g. when the value of field A is true, then enable a filter on field B, or disable it otherwise. -- View this message in context: http://lucene.472066.n3.nabble.com/Conditional-invocation-of-HTMLStripCharFactory-tp4190010.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: backport Heliosearch features to Solr
Hi Yonik, Now that you joined Cloudera, why not everything? Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley ysee...@gmail.com wrote: As many of you know, I've been doing some work in the experimental heliosearch fork of Solr over the past year. I think it's time to bring some more of those changes back. So here's a poll: Which Heliosearch features do you think should be brought back to Apache Solr? http://bit.ly/1E7wi1Q (link to google form) -Yonik
Re: backport Heliosearch features to Solr
On Sun, Mar 1, 2015 at 7:18 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Yonik, Now that you joined Cloudera, why not everything? Everything is on the table, but from a practical point of view I wanted to verify areas of user interest/support before doing the work to get things back. Even when there is user support, some things may be blocked anyway (part of the reason why I did things under a fork in the first place). I'll do what I can though. -Yonik Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley ysee...@gmail.com wrote: As many of you know, I've been doing some work in the experimental heliosearch fork of Solr over the past year. I think it's time to bring some more of those changes back. So here's a poll: Which Heliosearch features do you think should be brought back to Apache Solr? http://bit.ly/1E7wi1Q (link to google form) -Yonik
SOLR Backup and Restore - Solr 3.6.1
Hello, we have Solr 3.6.1 in our environment and are trying to analyse backup and recovery solutions for it. Is there a way to compress the backup taken? We have explored the replicationHandler with the backup command, but as our index is in the hundreds of GBs, we would like a solution that provides compression to reduce storage overhead. Thanks in advance. Regards, Abhishek
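For reference, the ReplicationHandler backup command writes an uncompressed snapshot directory; Solr itself does not compress it, so compression has to happen as a separate step afterwards. A hedged sketch; the host, core name, snapshot timestamp, and paths below are assumptions:

```
# Trigger a backup via the ReplicationHandler; a snapshot.<timestamp>
# directory is written alongside the index data (uncompressed)
curl 'http://localhost:8983/solr/core0/replication?command=backup'

# Compress the finished snapshot to cut storage overhead
tar -czf /backups/snapshot-20150302.tar.gz -C /path/to/solr/data snapshot.20150302120000
```

Make sure the snapshot is complete before compressing; the backup runs asynchronously after the curl call returns.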
Re: solr cloud does not start with many collections
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000 collections from scratch and then attempted to stop/start the cloud.

node1:
WARN - 2015-03-02 18:09:02.371; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3219 after 30 seconds; our state says http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says http://host:8000/solr/DD-3219_shard1_replica2/

node2:
WARN - 2015-03-02 18:09:01.871; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN - 2015-03-02 18:17:04.458; org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered, but Solr cannot talk to ZK

(stop/start)

WARN - 2015-03-02 18:53:12.725; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-3581 after 30 seconds; our state says http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says http://host:8002/solr/DD-3581_shard1_replica1/

node3:
WARN - 2015-03-02 18:09:03.022; org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed out waiting to see all nodes published as DOWN in our cluster state.
WARN - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-2707 after 30 seconds; our state says http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says http://host:8000/solr/DD-2707_shard1_replica1/

On 27 February 2015 at 17:48, Shawn Heisey apa...@elyograg.org wrote: On 2/26/2015 11:14 PM, Damien Kamerman wrote: I've run into an issue with starting my solr cloud with many collections. My setup is: 3 nodes (solr 4.10.3; 64GB RAM each; jdk1.8.0_25) running on a single server (256GB RAM). 5,000 collections (1 x shard; 2 x replica) = 10,000 cores. 1 x Zookeeper 3.4.6. Java arg -Djute.maxbuffer=67108864 added to solr and ZK. Then I stop all nodes, then start all nodes. All replicas are in the down state, some have no leader. At times I have seen some (12 or so) leaders in the active state. In the solr logs I see lots of: org.apache.solr.cloud.ZkController; Still seeing conflicting information about the leader of shard shard1 for collection DD-4351 after 30 seconds; our state says http://ftea1:8001/solr/DD-4351_shard1_replica1/, but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/ snip I've tried staggering the starts (1min) but it does not help. I've reproduced it with zero documents. Restarts are OK up to around 3,000 cores. Should this work?

This is going to push SolrCloud beyond its limits. Is this just an exercise to see how far you can push Solr, or are you looking at setting up a production install with several thousand collections? In Solr 4.x, the clusterstate is one giant JSON structure containing the state of the entire cloud. With 5000 collections, the entire thing would need to be downloaded and uploaded at least 5000 times during the course of a successful full system startup ... and I think with replicationFactor set to 2, that might actually be 10,000 times.
The best-case scenario is that it would take a VERY long time, the worst-case scenario is that concurrency problems would lead to a deadlock. A deadlock might be what is happening here. In Solr 5.x, the clusterstate is broken up so there's a separate state structure for each collection. This setup allows for faster and safer multi-threading and far less data transfer. Assuming I understand the implications correctly, there might not be any need to increase jute.maxbuffer with 5.x ... although I have to assume that I might be wrong about that. I would very much recommend that you set your scenario up from scratch in Solr 5.0.0, to see if the new clusterstate format can eliminate the problem you're seeing. If it doesn't, then we can pursue it as a likely bug in the 5.x branch and you can file an issue in Jira. Thanks, Shawn -- Damien Kamerman
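Shawn's contrast between the 4.x single clusterstate.json and the 5.x per-collection state.json can be made concrete with back-of-envelope arithmetic. This is only a sketch under assumptions I am inventing (a fixed per-collection state size, one whole-state read-write per core brought up):

```python
def startup_state_traffic(n_collections, replicas_per_collection=2, per_collection_bytes=2048):
    """Rough bytes moved through ZooKeeper during a full cluster startup."""
    n_cores = n_collections * replicas_per_collection
    # Solr 4.x: every core state change rewrites the whole clusterstate.json,
    # whose size grows with the number of collections -> quadratic traffic.
    whole_state_bytes = n_collections * per_collection_bytes
    solr4 = n_cores * whole_state_bytes
    # Solr 5.x: each collection has its own state.json, so each core only
    # touches its own collection's slice -> linear traffic.
    solr5 = n_cores * per_collection_bytes
    return solr4, solr5

s4, s5 = startup_state_traffic(5000)
print(s4 // 10**9, "GB vs", s5 // 10**6, "MB")  # 102 GB vs 20 MB
```

However rough the constants, the ratio between the two schemes grows with the number of collections, which is why the 5.x split matters precisely in the many-collections scenario described here.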
filtering tfq() function query to specific part of collection not the whole documents
Hi, I was wondering, is it possible to filter the tfq() function query to a specific selection of the collection? Suppose I want to count all occurrences of the term "test" in documents with fq=category:2; how can I handle such a query with the tfq() function query? Applying fq=category:2 in a select query does not seem to affect tfq(); no matter what the other part of my query is, tfq() always returns the total term frequency for the specified field over the whole collection. So what is the solution for this case? Best regards. -- A.Nazemian
Re: Is it possible to use multiple index data directory in Apache Solr?
On 1 March 2015 at 01:03, Shawn Heisey apa...@elyograg.org wrote: How exactly does ES split the index files when multiple paths are configured? I am very curious about exactly how this works. Google is not helping me figure it out. I even grabbed the ES master branch and wasn't able to trace how path.data is used after it makes it into the environment.

Elasticsearch creates indexes and shards automatically, so multiple directories are simply used to distribute the shards' indexes among them: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-dir-layout.html When a new shard is created, one of the directories is chosen, either randomly or based on usage. So, to me, the question is not about matching the implementation but about what the OP is trying to achieve with it: replication? more even disk utilization? something else? Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/
RE: Is it possible to use multiple index data directory in Apache Solr?
Under the Solr example folder you will find a multicore folder, under which you can create multiple core/index directories and edit solr.xml to specify each new core/directory. When you start Solr under the example directory, use a command line like the one below to load Solr, and then you should be able to see the multiple cores in the Solr admin and index data in each core's data directory.

java -Dsolr.solr.home=multicore -jar start.jar

Thnx

-----Original Message-----
From: Jou Sung-Shik [mailto:lik...@gmail.com]
Sent: February 28, 2015 10:03 PM
To: solr-user@lucene.apache.org
Subject: Is it possible to use multiple index data directory in Apache Solr?

I'm new to Apache Lucene/Solr. I am trying to move from Elasticsearch to Apache Solr, so I have a question about the following index data location configuration.

In Elasticsearch:

# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

In Apache Solr:

<dataDir>/var/data/solr/</dataDir>

I want to configure multiple index data directories in Apache Solr like in Elasticsearch. Is it possible? How can I reach the goal? -- - BLOG : http://www.codingstar.net -
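The multicore setup described above is declared in solr.xml. A minimal sketch of the legacy-format file shipped with the example; the core names and instanceDir values are illustrative, not taken from the thread:

```
<!-- multicore/solr.xml (legacy pre-5.0 format) -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```

Each core then keeps its index under its own instanceDir (e.g. multicore/core0/data), which gives you multiple index data directories, though not the RAID-0-style striping of a single index that Elasticsearch's path.data list provides.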
Re: [ANNOUNCE] Luke 4.10.3 released
Hi Tomoko, I have just created the pivot branch off of the current master. Let's move our discussion there: https://github.com/DmitryKey/luke/tree/pivot-luke Thanks, Dmitry

On Fri, Feb 27, 2015 at 7:53 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi Dmitry, In my environment, I cannot reproduce this Pivot error in HotSpot VM 1.7.0, please give me some time... Or, I'll try to make pull requests to https://github.com/DmitryKey/luke for the Pivot version. At any rate, it would be best to manage both the (current) Thinlet and Pivot versions in the same place, as you suggested. Thanks, Tomoko

2015-02-26 22:15 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Sure, it is:

java version "1.7.0_76"
Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)

On Thu, Feb 26, 2015 at 2:39 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Sorry, I'm afraid I have not encountered such errors at launch. Seems something is wrong around Pivot, but I have no idea about it. Would you tell me the java version you're using? Tomoko

2015-02-26 21:15 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Thanks, Tomoko, it compiles ok! Now launching produces some errors:

$ java -cp dist/* org.apache.lucene.luke.ui.LukeApplication
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.lucene.luke.ui.LukeApplication.main(Unknown Source)
Caused by: java.lang.NumberFormatException: For input string: "3 1644336"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:492)
        at java.lang.Byte.parseByte(Byte.java:148)
        at java.lang.Byte.parseByte(Byte.java:174)
        at org.apache.pivot.util.Version.decode(Version.java:156)
        at org.apache.pivot.wtk.ApplicationContext.<clinit>(ApplicationContext.java:1704)
        ... 1 more

On Thu, Feb 26, 2015 at 1:48 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Thank you for checking it out!
Sorry, I forgot to note some important information... an ivy jar is needed to compile. The packaging process needs to be organized, but for now I'm borrowing it from lucene's tools/lib. In my environment, Fedora 20 and OpenJDK 1.7.0_71, it can be compiled and run as follows. If there are any problems, please let me know.

$ svn co http://svn.apache.org/repos/asf/lucene/sandbox/luke/
$ cd luke/
// copy ivy jar to lib/tools
$ cp /path/to/lucene_solr_4_10_3/lucene/tools/lib/ivy-2.3.0.jar lib/tools/
$ ls lib/tools/
ivy-2.3.0.jar
$ java -version
java version "1.7.0_71"
OpenJDK Runtime Environment (fedora-2.5.3.3.fc20-x86_64 u71-b14)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
$ ant ivy-resolve
...
BUILD SUCCESSFUL
// compile and make jars and run
$ ant dist
...
BUILD SUCCESSFUL
$ java -cp dist/* org.apache.lucene.luke.ui.LukeApplication
...

Thanks, Tomoko

2015-02-26 16:39 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Hi Tomoko, Thanks for the link. Do you have build instructions somewhere? When I executed ant with no params, I get:

BUILD FAILED
/home/dmitry/projects/svn/luke/build.xml:40: /home/dmitry/projects/svn/luke/lib-ivy does not exist.

On Thu, Feb 26, 2015 at 2:27 AM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Thanks! Would you announce at LUCENE-2562 to me and all watchers interested in this issue, when the branch is ready? :) As you know, the current Pivot version (that supports Lucene 4.10.3) is here: http://svn.apache.org/repos/asf/lucene/sandbox/luke/ Regards, Tomoko

2015-02-25 18:37 GMT+09:00 Dmitry Kan solrexp...@gmail.com: Ok, sure. The plan is to make the pivot branch in the current github repo and update its structure accordingly. Once it is there, I'll let you know. Thank you, Dmitry

On Tue, Feb 24, 2015 at 5:26 PM, Tomoko Uchida tomoko.uchida.1...@gmail.com wrote: Hi Dmitry, Thank you for the detailed clarification! Recently, I've created a few patches for the Pivot version (LUCENE-2562), so I'd like to do some more work and keep it up to date.
If you would like to work on the Pivot version, may I suggest you to fork the github's version? The ultimate goal is to donate this to Apache,
Re: About solr recovery
Several. One is if your network has trouble and ZooKeeper times out a Solr node. Can you describe your problem, though? Or is this just an informational question? Because I'm not quite sure how to respond helpfully here. Best, Erick

On Fri, Feb 27, 2015 at 10:37 PM, 龚俊衡 junheng.g...@icloud.com wrote: HI, Our production Solr's replica went offline at some point, but both ZooKeeper and the network were OK, and the Solr JVM was normal. My question: are there any other reasons that will put a Solr replica into the recovering state?
Re: Correct connection methodology for Zookeeper/SolrCloud?
bq: I could just set up a load balancer on the two Solr instances and let client query requests use the load balancer to find a working instance.

That's all you need to do. The client shouldn't really even have to be aware that ZooKeeper exists; there's no need to query ZK and route your requests yourself. The _Solr_ instances query ZK, know about each other's state, and are notified of any problems, i.e. nodes going up/down etc. Once a request hits any running Solr node, it'll be routed around any problems. In the setup you describe, i.e. not using SolrJ, your client really shouldn't even need to be aware ZK exists. Your load balancer should know what nodes are up and route your requests around any hosed machines. If you _do_ decide to use SolrJ sometime, CloudSolrServer (renamed CloudSolrClient in 5x) _does_ take the ZK ensemble and do some smart routing on the client side, including simple load balancing, and responds to any Solr nodes going up/down for you. Putting a load balancer in front, or some other type of connection, will accomplish much the same thing if Java isn't an option. The SolrJ stuff is more sophisticated, though. Best, Erick

On Sun, Mar 1, 2015 at 3:51 AM, Julian Perry ju...@limitless.co.uk wrote: Hi. I'm really after best practice guidelines for making queries to an index on a Solr cluster. I'm not calling from Java. snip
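For anyone who nevertheless wants to do the client-side routing Jules sketches in steps a) through d) without SolrJ, the selection step is just parsing clusterstate.json and picking a random active replica. A minimal sketch; the JSON sample is a made-up, heavily simplified shape of a 4.x clusterstate, not a verbatim one:

```python
import json
import random

def pick_active_base_url(clusterstate_json, collection):
    """Step (c): choose one active replica's base_url for the given collection."""
    state = json.loads(clusterstate_json)
    active = []
    for shard in state[collection]["shards"].values():
        for replica in shard["replicas"].values():
            if replica.get("state") == "active":
                active.append(replica["base_url"])
    if not active:
        raise RuntimeError("no active replica for collection " + collection)
    # Random choice gives crude client-side load balancing
    return random.choice(active)

sample = json.dumps({
    "products": {"shards": {"shard1": {"replicas": {
        "core_node1": {"state": "active", "base_url": "http://solr1:8983/solr"},
        "core_node2": {"state": "down", "base_url": "http://solr2:8983/solr"},
    }}}}
})
print(pick_active_base_url(sample, "products"))  # http://solr1:8983/solr
```

As Erick notes, a plain load balancer in front of the nodes makes all of this unnecessary for most non-Java clients, since any live Solr node will route the request correctly.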
Integrating Solr with Nutch
Hi, guys, I'm working through the tutorial here: http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch. I've run a crawl on a list of webpages. Now I'm trying to index them into Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns queries. I've edited the Nutch schema as per instructions. Now I hit a wall:

- Save the file and restart Solr under ${APACHE_SOLR_HOME}/example: java -jar start.jar

On my install (the latest Solr) there is no such file, but there is a solr.sh file in the /bin which I can start. So I pasted it into solr/example/ and ran it from there. Solr cranks over. Now I need to:

- run the Solr Index command from ${NUTCH_RUNTIME_HOME}: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/

and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication
Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

What am I doing wrong? Sincerely, Baruch Kogan Marketing Manager Seller Panda http://sellerpanda.com +972(58)441-3829 baruch.kogan at Skype
RE: Integrating Solr with Nutch
Hello Baruch! You are pointing to a directory of segments, not a specific segment. You must either point to the directory with the -dir option:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
or point to a single segment:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/YOUR_SEGMENT
Cheers
-----Original message-----
From: Baruch Kogan bar...@sellerpanda.com
Sent: Sunday 1st March 2015 18:57
To: solr-user@lucene.apache.org
Subject: Integrating Solr with Nutch
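If you go the one-segment-at-a-time route, note that the entries under crawl/segments/ are timestamped directories, so you can generate one solrindex command per segment. A small sketch (the segment timestamps and paths are hypothetical examples; a temp directory stands in for crawl/segments so the snippet is self-contained):

```python
import os
import tempfile

# Stand-in for crawl/segments/ with two fake timestamped segment dirs.
seg_root = tempfile.mkdtemp()
for name in ("20150301185701", "20150301190422"):
    os.mkdir(os.path.join(seg_root, name))

# Build one "bin/nutch solrindex ... crawl/segments/<SEGMENT>" command
# per segment directory, in timestamp order.
commands = [
    "bin/nutch solrindex http://127.0.0.1:8983/solr/ "
    "crawl/crawldb -linkdb crawl/linkdb crawl/segments/" + seg
    for seg in sorted(os.listdir(seg_root))
]
for cmd in commands:
    print(cmd)
```

In practice you would point seg_root at your real crawl/segments/ directory and feed each printed command to the shell (or just use the -dir option and let Nutch walk the directory itself).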