Solr startup script in version 4.10.3
Hi,

In release 4.10.3, the following lines were removed from the Solr start script (bin/solr):

  # TODO: see SOLR-3619, need to support server or example
  # depending on the version of Solr
  if [ -e $SOLR_TIP/server/start.jar ]; then
    DEFAULT_SERVER_DIR=$SOLR_TIP/server
  else
    DEFAULT_SERVER_DIR=$SOLR_TIP/example
  fi

However, the usage message still says:

  -d dir  Specify the Solr server directory; defaults to server

Either the usage message has to be fixed or the removed lines put back into the script. Personally, I prefer defaulting to the server directory. My installation process, in order to get a clean, empty Solr instance, is to copy example into server and remove directories like example-DIH, example-schemaless, multicore and solr/collection1. The Solr server (or node) can then be started without the -d parameter.

If this makes sense, a Jira issue could be opened.

Dominique
http://www.eolya.fr/
Re: FOSDEM Open source search devroom
On 02/01/2015 08:37, Bram Van Dam wrote:
> Hi folks, There will be an Open source search devroom [1] at this year's FOSDEM in Brussels, 31st of January / 1st of February. I don't know if there will be a Lucene/Solr presence (there's no schedule for the devroom yet), but this seems like a good place to meet up and talk shop. I'll be there, and I hope some of you will as well.
> - Bram
> [1] https://fosdem.org/2015/schedule/track/open_source_search/

Sadly I won't, but my colleague (and committer) Alan Woodward will, talking about text search for stream processing.

C

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Vertical search Engine
hello, I want to create a vertical search engine like trovit.com. I have installed Solr and Solarium. What else do I need? Can you recommend a suitable crawler, and how should I structure my data to be indexed?

--
View this message in context: http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Vertical search Engine
Hi,

You should estimate the size of the data you will index before you decide on a crawler. Crawlers are out of scope for this mailing list; if you are going to crawl a large amount of data, you can check the Apache Nutch user list.

Furkan KAMACI

2015-01-06 10:39 GMT+02:00 klunwebale klunweb...@gmail.com:
> hello, I want to create a vertical search engine like trovit.com. I have installed Solr and Solarium. What else do I need? Can you recommend a suitable crawler, and how should I structure my data to be indexed?
edismax with multiple words for keyword tokenizer splitting on space
Hi,

I came across this weird behaviour in Solr, and I'm not sure why it is desired. I have asked about it on Stack Overflow; please check:
http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space

Thanks,
Sankalp Gupta
Re: Running Multiple Solr Instances
On 1/5/2015 9:31 PM, Nishanth S wrote:
> I am running multiple Solr instances (Solr 4.10.3 on Tomcat 8). There are 3 physical machines and I have 4 Solr instances running on each machine, on ports 8080, 8081, 8082 and 8083. The setup works well up to this point. Now I want to point each of these instances to a different index directory. The drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. If I define /d/1 as the solr home, all Solr index directories are created in /d/1, whereas the other drives remain unused. So how do I configure Solr to make use of all the drives so that I can get maximum storage for Solr? I would really appreciate any help in this regard.

You should only run one Solr instance per machine. One instance can handle as many indexes as you want to run. Running multiple instances will waste a fair amount of system resources, and will also make the entire setup a lot more complicated than it needs to be.

If you don't plan on setting up RAID (which would probably be a lot easier to manage), here's an idea: Set up the solr home somewhere on the root filesystem, then create symlinks under that which will be the instance directories, pointed to various directories under your other mount points. When Solr starts, it should begin core detection at the solr home and follow those symlinks into the other locations. I'm not aware of any problems with using symlinks in this way.

If you're running SolrCloud, that can be a little more complicated, because creating a new collection from scratch will create the cores under the solr home ... but you can move them and symlink them after they're created, then either reload the collection or restart Solr. Just be sure that no indexing is happening when you begin the move.

Thanks,
Shawn
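Shawn's symlink layout could be sketched like this. A hedged example: /tmp stand-ins are used here instead of the real /d/1../d/4 mount points so the commands are safe to try, and the directory names are placeholders.

```shell
# Build a solr home whose instance directories are symlinks onto
# separate drives. /tmp/d1../tmp/d4 stand in for the real mount
# points /d/1../d/4 -- adjust the paths for a real installation.
SOLR_HOME=/tmp/solr-home          # what -Dsolr.solr.home would point at
mkdir -p "$SOLR_HOME"
for i in 1 2 3 4; do
  mkdir -p "/tmp/d$i/cores"                      # core data lives on drive $i
  ln -sfn "/tmp/d$i/cores" "$SOLR_HOME/drive$i"  # instance dir inside solr home
done
ls -l "$SOLR_HOME"   # each driveN entry points at its own drive
```

Solr's core discovery follows the symlinks, so the indexes end up spread across the drives while solr home remains a single directory.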
Re: Running Multiple Solr Instances
I would do one of either:

1. Set a different Solr home for each instance. I'd use the -Dsolr.solr.home=/d/2 command-line switch when launching Solr to do so.
2. RAID 10 the drives. If you expect the Solr instances to get uneven traffic, pooling the drives will allow a given Solr instance to share the capacity of all of them.

On 1/5/15 23:31, Nishanth S wrote:
> Hi folks, I am running multiple Solr instances (Solr 4.10.3 on Tomcat 8). There are 3 physical machines and I have 4 Solr instances running on each machine, on ports 8080, 8081, 8082 and 8083. The setup works well up to this point. Now I want to point each of these instances to a different index directory. The drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. If I define /d/1 as the solr home, all Solr index directories are created in /d/1, whereas the other drives remain unused. So how do I configure Solr to make use of all the drives so that I can get maximum storage for Solr? I would really appreciate any help in this regard.
> Thanks, Nishanth
Re: edismax with multiple words for keyword tokenizer splitting on space
You need to escape the space in your query (using backslash or quotes around the term) - the query parser doesn't parse based on the analyzer/tokenizer for each field. -- Jack Krupansky On Tue, Jan 6, 2015 at 4:05 AM, Sankalp Gupta sankalp.gu...@snapdeal.com wrote: Hi I come across this weird behaviour in solr. I'm not sure that why this is desired in solr. I have filed this on stackoverflow. Please check http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space Thanks Sankalp Gupta
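Jack's two escaping options would look like this (a hypothetical field `city` of a keyword-tokenized type; either form keeps the query parser from splitting on the space before analysis):

```
q=city:New\ York
q=city:"New York"
```

Remember to URL-encode the backslash and quotes when sending the request over HTTP.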
Re: Vertical search Engine
Consider the Fusion product from LucidWorks: http://lucidworks.com/product/fusion/

Structuring of your data should be driven by your queries and access patterns - what are the most common queries and what are the most extreme and complex queries that you expect to handle, both in terms of how the queries are expressed and the results being returned.

-- Jack Krupansky

On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote:
> hello, I want to create a vertical search engine like trovit.com. I have installed Solr and Solarium. What else do I need? Can you recommend a suitable crawler, and how should I structure my data to be indexed?
Re: How to limit the number of result sets of the 'export' handler
Export was specifically designed to get everything, which is very expensive otherwise. If you just want a subset, you might be better off with normal queries and/or with deep paging (cursors).

Regards,
Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 6 January 2015 at 00:30, Sandy Ding sandy.ding...@gmail.com wrote:
> Using rows=xxx doesn't seem to work. Is there a way to do this?
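A sketch of the cursor-based deep paging Alex mentions (available since Solr 4.7; the sort must include the uniqueKey field, here assumed to be `id`):

```
/select?q=*:*&rows=100&sort=id+asc&cursorMark=*
```

Each response contains a `nextCursorMark` value; pass it as `cursorMark` on the next request, and stop when the returned mark no longer changes.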
Re: Vertical search Engine
Hi,

http://manifoldcf.apache.org is another option to consider. It is useful for crawling protected pages.

Free resources:
http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf
https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/

Ahmet

On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky jack.krupan...@gmail.com wrote:
> Consider the Fusion product from LucidWorks: http://lucidworks.com/product/fusion/ Structuring of your data should be driven by your queries and access patterns - what are the most common queries and what are the most extreme and complex queries that you expect to handle, both in terms of how the queries are expressed and the results being returned.
> -- Jack Krupansky

On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote:
> hello, I want to create a vertical search engine like trovit.com. I have installed Solr and Solarium. What else do I need? Can you recommend a suitable crawler, and how should I structure my data to be indexed?
IstvanKulcsar - Wiki Solr
Hi,

I would like to suggest some pages which use Solr and were developed by my company. Please add them to this page: http://wiki.apache.org/solr/PublicServers

http://www.odrportal.hu/kereso/
http://idea.unideb.hu/idealista/
http://www.jobmonitor.hu
http://www.profession.hu/
http://webicina.com/
http://www.cylex.hu/

Thanks for your answer.
STeve
RE: Frequent deletions
Well, we are doing the same thing (in a way). We have to do frequent deletions in mass; at a time we are deleting around 20M+ documents. All I am doing is, after the deletion, firing the command below on each of our Solr nodes, and having some patience, as it takes quite a long time:

  curl -vvv "http://node1.solr.x.com/collection1/update?optimize=true&distrib=false" > /tmp/__solr_clener_log

After finishing the optimisation, curl returns the XML below:

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">10268995</int></lst>
  </response>

Regards,
Amey

Date: Wed, 31 Dec 2014 02:32:37 -0700
From: inna.gel...@elbitsystems.com
To: solr-user@lucene.apache.org
Subject: Frequent deletions
> Hello, We perform frequent deletions from our index, which greatly increases the index size. How can we perform an optimization in order to reduce the size? Please advise. Thanks.
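A lighter-weight alternative worth testing (an assumption on my part, not something mentioned in the thread): instead of a full optimize, a commit with expungeDeletes=true merges only the segments that contain deleted documents, e.g.:

```
curl "http://node1.solr.x.com/collection1/update?commit=true&expungeDeletes=true"
```

It typically reclaims less space than a full optimize, but is far cheaper in I/O and time.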
RE: .htaccess / password
Craig,

1. What is the .htaccess file meant for?
2. What are the contents of this file?
3. How will you, or how will Solr, know that it needs to look for this file to bring the needed security to this (which?) area?
4. What event is causing you to re-index the engine every night?

Please share.

Thanks,
G

-----Original Message-----
From: Craig Hoffman [mailto:choff...@eclimb.net]
Sent: Tuesday, January 06, 2015 12:29 PM
To: Apache Solr
Subject: .htaccess / password

Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/, will Solr continue to function properly? One thing to note: I will have a cron job that runs nightly and re-indexes the engine. In a nutshell, I'm looking for a way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman
facet.contains
https://issues.apache.org/jira/browse/SOLR-1387 contains a patch to support facet.contains and facet.contains.ignoreCase, making it possible to easily filter facet results without the facet.prefix limitations. I know that it is possible to approximate this using two fields, searching against one and displaying values from the other. However, this also has its limitations, and doesn't work properly for multi-valued fields. Are there any other suggestions for being able to display facet values and counts based on a user-supplied string?

Thanks,
Will
Re: htaccess
Hi,

Your message seems quite confused (even the URL is not right for most normal Solr setups), and it is not clear what you mean by "function properly". Solr is a search engine, and has no idea about .htaccess files. Are you asking whether Solr respects directives in .htaccess files? I am pretty sure that cannot be the case.

With regard to Solr security, it is again normally not a Solr concern. Please start from https://wiki.apache.org/solr/SolrSecurity

No offence, but it seems that your real concerns might lie elsewhere. Please take a look at http://people.apache.org/~hossman/#xyproblem and do follow up on this list if your questions have not been addressed.

Regards,
Gora

On 6 January 2015 at 23:28, Craig Hoffman choff...@eclimb.net wrote:
> Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/, will Solr continue to function properly? One thing to note: I will have a cron job that runs nightly and re-indexes the engine. In a nutshell, I'm looking for a way to secure this area.
> Thanks, Craig
RE: Solr Memory Usage - How to reduce memory footprint for solr
Abhishek Sharma [abhishe...@unbxd.com] wrote:
> *Q* - I am forced to set Java Xmx as high as 3.5g for my Solr app. If I keep this low, my CPU hits 100% and response time for indexing increases a lot, and I have hit an OOM error as well when this value is low. [...]
> 2. Index Size - 2 g
> 3. num. of Search Hits per sec - 10 [*IMP* - All search queries have faceting.]

Faceting is often the reason for high memory usage. If you are not already doing so, do enable DocValues for the fields you are faceting on. If you have a lot of unique values in your facets (millions), you might also consider limiting the number of concurrent searches.

Still, a 3.5GB heap seems like quite a bit for a 2GB index. How many documents do you have?

- Toke Eskildsen
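Enabling DocValues, as Toke suggests, is a schema.xml change; a sketch with a hypothetical field name (the field must be re-indexed afterwards for docValues to take effect):

```xml
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
```

With docValues enabled, faceting reads column-oriented on-disk structures instead of building the in-heap FieldCache, which usually cuts heap usage substantially.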
RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question
Still looking for an answer on schema.xml and solrconfig.xml:

1. Do I need to tell Solr, to extract Title from a PDF, to go look for the word "Title", extract the entire line after the tag, collect all such occurrences from hundreds of PDFs, and build the Title column data and index it?
2. How do I define my own schema to Solr?
3. Say I defined my fields Title, Ticket_number, Submitter, Client and so on. How can I verify the respective data is extracted into specific columns in Solr and indexed? Any suggestion on the Analyzer, Tokenizer and Filter, and which one will help for this purpose?

1. I do not want to dump the entire 4 GB PDF contents into one searchable field (ATTR_CONTENT) in Solr.
2. Even if the entire PDF contents are extracted into the above field as a default, I still want to extract specific searchable column data into their respective fields.
3. Rather, I want to configure Solr to have column-wise searchable contents such as Title, Number, and so on.

Any suggestions on performance? The PDF database is 80 GB; will it be fast enough? Do I need to divide it over multiple cores and multiple machines? And multiple web apps? And clustering? I should have mentioned that my PDFs are from a ticketing system like Jira which was retired from production long ago, and all I have is the ticketing system's PDF database.

4. My system will be used internally by just a select few people.
5. They can wait for a 4 GB PDF to get loaded.
6. I agree many matches will be found in one large PDF, depending on the search criteria.
7. To make searches faster I want Solr to create more columns and column-based indexes.
8. Solr underneath uses Tika, which extracts contents and gets rid of all the rich-content formatting characters present in the PDF document.
9. I believe the resulting extraction size is 1/5th of the original PDF - just a random guess based on one sample extraction.

From: Jürgen Wagner (DVT) [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

> Hello, no matter which search platform you use, this will pose two challenges:
> - The size of the documents will render search less and less useful, as the likelihood of matches increases with document size. So, without proper semantic extraction (e.g., using decent NER or relationship extraction with a commercial text-mining product), I doubt you will get the required precision to make this overly useful.
> - PDFs can have their own character sets based on the characters actually used. Such file-specific character sets are almost impossible to parse, i.e., if your PDFs happen to use this feature of the PDF format, you won't be lucky getting any meaningful text out of them.
> My suggestion is to use the Jira REST API to collect all necessary documents and index the resulting XML or attachment formats. As the REST API provides filtering capabilities, you could easily create incremental feeds to avoid humongous indexing every time there's new information in Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way of handling this.
> Best regards,
> --Jürgen

On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
> Hello Solr users and developers, can you please suggest:
> 1. What should I do to index PDF content information column-wise?
> 2. Do I need to extract the contents using one of the Analyzer, Tokenizer and Filter combinations and then add it to the index? How can I test the results on the command prompt? I do not know the selection of specific Analyzer, Tokenizer and Filter for this purpose.
> 3. How can I verify that the needed column info is extracted out of the PDF and is indexed?
> 4. So for example, how do I verify the ticket number is extracted into the Ticket_number tag and is indexed?
> 5. Is it OK to post 4 GB worth of PDF to be imported and indexed by Solr? I think I saw some posts complaining about how large a size can be posted.
> 6. What will enable Solr to search in any PDF out of many, with different words such as "Runtime Error", and return a link to the PDF? My PDFs are nothing but a Jira ticket system. Each PDF has info on Ticket Number, Desc, Client, Status, Submitter, and so on.
> 1. I imported a PDF document into Solr and it does the necessary searching, and I can test some of it using the browse client interface provided.
> 2. I have 80 GB worth of PDFs.
> 3. The total number of PDFs is about 200.
> 4. Many PDFs are of size 4 GB.
> 5. What do you suggest for importing such large PDFs? What tools can you suggest to extract PDF contents first into some XML format and later post that XML to be indexed by Solr?
> Your early response is much appreciated.
> Thanks, G
Re: How large is your solr index?
Have you considered pre-supposing SolrCloud and using the SPLITSHARD API command? Even after that's done, the sub-shard needs to be physically moved to another machine (probably), but that too could be scripted. May not be desirable, but I thought I'd mention it. Best, Erick On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge peter.stu...@gmail.com wrote: Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even performs reasonably well on commodity hardware with lots of faceting and concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but it works. ++1 for the automagic shard creator. We've been looking into doing this sort of thing internally - i.e. when a shard reaches a certain size/num docs, it creates 'sub-shards' to which new commits are sent and queries to the 'parent' shard are included. The concept works, as long as you don't try any non-dist stuff - it's one reason why all our fields are always single valued. There are also other implications like cleanup, deletes and security to take into account, to name a few. A cool side-effect of sub-sharding (for lack of a snappy term) is that the parent shard then stops suffering from auto-warming latency due to commits (we do a fair amount of committing). In theory, you could carry on sub-sharding until your hardware starts gasping for air. On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam bram.van...@intix.eu wrote: On 01/04/2015 02:22 AM, Jack Krupansky wrote: The reality doesn't seem to be there today. 50 to 100 million documents, yes, but beyond that takes some kind of heroic effort, whether a much beefier box, very careful and limited data modeling or limiting of query capabilities or tolerance of higher latency, expert tuning, etc. I disagree. On the scale, at least. Up until 500M Solr performs well (read: well enough considering the scale) in a single shard on a single box of commodity hardware. Without any tuning or heroic efforts. 
Sure, some queries aren't as snappy as you'd like, and sure, indexing and querying at the same time will be somewhat unpleasant, but it will work, and it will work well enough. Will it work for thousands of concurrent users? Of course not. Anyone who is after that sort of thing won't find themselves in this scenario -- they will throw hardware at the problem. There is something to be said for making sharding less painful. It would be nice if, for instance, Solr would automagically create a new shard once some magic number was reached (2B at the latest, I guess). But then that'll break some query features ... :-( The reason we're using single large instances (sometimes on beefy hardware) is that SolrCloud is a pain. Not just from an administrative point of view (though that seems to be getting better, kudos for that!), but mostly because some queries cannot be executed with distributed=true. Our users, at least, prefer a slow query over an impossible query. Actually, this 2B limit is a good thing. It'll help me convince $management to donate some of our time to Solr :-) - Bram
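Erick's SPLITSHARD suggestion at the top of this thread is a Collections API call; a sketch with placeholder collection/shard names:

```
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1
```

The parent shard's documents are divided into two sub-shards that become active once the split completes; the parent is marked inactive and can then be removed (and, as Erick notes, the sub-shards moved to other machines).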
Re: PDF search functionality using Solr
Seconding Jürgen's comment. 4G docs are almost, but not quite, totally useless to search. How many JIRAs each? That's _one_ document unless you do some fancy dancing.

Pulling the data directly using the JIRA API sounds far superior. If you _must_ use the JIRA-PDF-Solr option, consider the following: use Tika on the client to parse the doc, taking control of the mapping of the metadata and, probably, breaking things up into individual documents, one Solr document per JIRA ticket. That'll give you a chance to deal with charset issues and the like. Here's an example: https://lucidworks.com/blog/indexing-with-solrj/ That one has both Tika and database connectivity but should be pretty straightforward to adapt; just pull the database junk out.

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, Jürgen Wagner (DVT) juergen.wag...@devoteam.com wrote:
> Hello, no matter which search platform you use, this will pose two challenges:
> - The size of the documents will render search less and less useful, as the likelihood of matches increases with document size. So, without proper semantic extraction (e.g., using decent NER or relationship extraction with a commercial text-mining product), I doubt you will get the required precision to make this overly useful.
> - PDFs can have their own character sets based on the characters actually used. Such file-specific character sets are almost impossible to parse, i.e., if your PDFs happen to use this feature of the PDF format, you won't be lucky getting any meaningful text out of them.
> My suggestion is to use the Jira REST API to collect all necessary documents and index the resulting XML or attachment formats. As the REST API provides filtering capabilities, you could easily create incremental feeds to avoid humongous indexing every time there's new information in Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way of handling this.
Best regards, --Jürgen On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote: Hello Solr-users and developers, Can you please suggest, 1. What I should do to index PDF content information column wise? 2. Do I need to extract the contents using one of the Analyzer, Tokenize and Filter combination and then add it to Index? How can test the results on command prompt? I do not know the selection of specific Analyzer, Tokenizer and Filter for this purpose 3. How can I verify that the needed column info is extracted out of PDF and is indexed? 4. So for example How to verify Ticket number is extracted in Ticket_number tag and is indexed? 5. Is it ok to post 4 GB worth of PDF to be imported and indexed by Solr? I think I saw some posts complaining on how large size that can be posted ? 6. What will enable Solr to search in any PDF out of many, with different words such as Runtime Error and result will provide the link to the PDF My PDFs are nothing but Jira ticket system. PDF has info on Ticket Number: Desc: Client: Status: Submitter: And so on: 1. I imported PDF document in Solr and it does the necessary searching and I can test some of it using the browse client interface provided. 2. I have 80 GB worth of PDFs. 3. Total number of PDFs are about 200 4. Many PDFs are of size 4 GB 5. What do you suggest me to import such a large PDFs? What tools can you suggest to extract PDF contents first in some XML format and later Post that XML to be indexed by Solr.? Your early response is much appreciated. Thanks G -- Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением i.A. Jürgen Wagner Head of Competence Center Intelligence Senior Cloud Consultant Devoteam GmbH, Industriestr. 
3, 70565 Stuttgart, Germany Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543 E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de Managing Board: Jürgen Hatzipantelis (CEO) Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
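Erick's client-side Tika approach could be sketched roughly as follows. This is a hedged illustration against the SolrJ/Tika 4.x-era APIs; the Solr URL and field names are assumptions, and real code would split one PDF into one Solr document per JIRA ticket rather than indexing it whole:

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
  public static void main(String[] args) throws Exception {
    // hypothetical core URL -- adjust for your installation
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    AutoDetectParser parser = new AutoDetectParser();

    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = new FileInputStream(args[0])) {
      // Tika extracts plain text and metadata client-side,
      // so you control the field mapping and charset handling.
      parser.parse(in, handler, metadata, new ParseContext());
    }

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", args[0]);                   // hypothetical fields
    doc.addField("title", metadata.get("title"));  // map Tika metadata yourself
    doc.addField("content", handler.toString());   // extracted plain text
    solr.add(doc);
    solr.commit();
  }
}
```

Splitting the extracted text into per-ticket documents would happen between the parse and the add, e.g. by scanning for "Ticket Number:" markers in the text.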
Solr Memory Usage - How to reduce memory footprint for solr
*Q* - I am forced to set Java Xmx as high as 3.5g for my Solr app. If I keep this low, my CPU hits 100% and the response time for indexing increases a lot, and I have hit an OOM error as well when this value is low. Is this too high? If so, how can I reduce it?

*Machine Details*
4 G RAM, SSD

*Solr App Details* (standalone Solr app, no shards)
1. num. of Solr cores = 5
2. Index size - 2 g
3. num. of search hits per sec - 10 [*IMP* - all search queries have faceting]
4. num. of times re-indexing per hour per core - 10 (it may happen at the same moment for all 5 cores)
5. Query result cache, document cache and filter cache are all default size - 4 kb

*top* stats:
  VIRT     RES     SHR    S  %CPU  %MEM
  6446600  3.478g  18308  S  11.3  94.6

*iotop* stats:
  DISK READ: 0-1200 K/s   DISK WRITE: 0-100 K/s   SWAPIN: 0   IO: 0-5%
SOLR - any open source framework
I am new to SOLR and was able to configure it and run the samples, as well as index data using DIH (from a database). Just wondering if there are open source frameworks to query and display/visualize the results.

Regards
Re: SOLR - any open source framework
We've compared several projects before starting - AngularJS was among them. It is great for stuff where you can find (already prepared) components, but writing custom components was easier in other frameworks (you need to take this statement with a grain of salt: it was specific to our situation). But that was one year ago...

On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop vishal@gmail.com wrote:
> Thanks Roman... I will check it... Maybe it's off topic, but how about Angular...

On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:
> Hi Vishal, Alexandre, here is another one, using Backbone, just released v1.0.16: https://github.com/adsabs/bumblebee - you can see it in action at http://ui.adslabs.org/ While it primarily serves our own needs, I tried to architect it to be extendible (within reasonable limits of code and man power). Roman

On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
> That's a very general question. So, the following are three random ideas just to get you started thinking about options:
> *) spring.io (Spring Data Solr) + Vaadin
> *) http://gethue.com/ (it's primarily Hadoop, but has a Solr UI builder too)
> *) http://projectblacklight.org/
> Regards, Alex.

On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
> I am new to SOLR and was able to configure it and run the samples, as well as index data using DIH (from a database). Just wondering if there are open source frameworks to query and display/visualize the results. Regards
Re: solrcloud without faceting, i.e. for failover only
The downsides that come to mind:

1. Every write gets amplified by the number of nodes in the cloud. 1000 write requests end up creating 1000*N HTTP calls, as the leader forwards those writes individually to all of the followers in the cloud. Contrast that with classical replication, where only changed index segments get replicated, asynchronously.
2. Slightly more complicated infrastructure, in terms of having to run a ZooKeeper cluster.

#1 is a trade-off against being possibly more available to writes in the case of a single down node. In the cloud case, you're still open for business. In the classical replication case, you're no longer available for writes if the downed node is the master.

My two cents.

On 1/6/15 16:30, Will Milspec wrote:
> Hi all, We have a smallish index that performs well for searches and are considering using SolrCloud -- but just for high availability/redundancy, i.e. without any sharding. The indexes would be replicated, but not distributed. I know that there are no stupid questions, only stupid people... but here goes:
> - is SolrCloud w/o sharding done? (i.e. "it's just not done!!")
> - any downside (aside from the lack of horizontal scalability)?
> will
Re: SOLR - any open source framework
That's a very general question. So, the following are three random ideas just to get you started thinking about options:
*) spring.io (Spring Data Solr) + Vaadin
*) http://gethue.com/ (it's primarily Hadoop, but has a Solr UI builder too)
*) http://projectblacklight.org/

Regards,
Alex.
Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
> I am new to SOLR and was able to configure it and run the samples, as well as index data using DIH (from a database). Just wondering if there are open source frameworks to query and display/visualize the results. Regards
Re: SOLR - any open source framework
Thanks a lot... We are in the process of analyzing what to use with SOLR...

On Jan 6, 2015 5:30 PM, Roman Chyla roman.ch...@gmail.com wrote:
> We've compared several projects before starting - AngularJS was among them. It is great for stuff where you can find (already prepared) components, but writing custom components was easier in other frameworks (you need to take this statement with a grain of salt: it was specific to our situation). But that was one year ago...

On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop vishal@gmail.com wrote:
> Thanks Roman... I will check it... Maybe it's off topic, but how about Angular...

On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote:
> Hi Vishal, Alexandre, here is another one, using Backbone, just released v1.0.16: https://github.com/adsabs/bumblebee - you can see it in action at http://ui.adslabs.org/ While it primarily serves our own needs, I tried to architect it to be extendible (within reasonable limits of code and man power). Roman

On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:
> That's a very general question. So, the following are three random ideas just to get you started thinking about options:
> *) spring.io (Spring Data Solr) + Vaadin
> *) http://gethue.com/ (it's primarily Hadoop, but has a Solr UI builder too)
> *) http://projectblacklight.org/
> Regards, Alex.

On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote:
> I am new to SOLR and was able to configure it and run the samples, as well as index data using DIH (from a database). Just wondering if there are open source frameworks to query and display/visualize the results. Regards
Re: Vertical search Engine
Hi, You can have a look at www.crawl-anywhere.com A web crawler on top of Solr. Used for the following vertical search engines: http://www.hurisearch.org/ http://www.searchamnesty.org/ Regards Dominique 2015-01-06 15:22 GMT+01:00 Ahmet Arslan iori...@yahoo.com.invalid: Hi, http://manifoldcf.apache.org is another option to consider. It is useful to crawl protected pages. Free resources : http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/ Ahmet On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky jack.krupan...@gmail.com wrote: Consider the Fusion product from LucidWorks: http://lucidworks.com/product/fusion/ Structuring of your data should be driven by your queries and access patterns - what are the most common queries and what are the most extreme and complex queries that you expect to handle, both in terms of how the queries are expressed and the results being returned. -- Jack Krupansky On Tue, Jan 6, 2015 at 3:39 AM, klunwebale klunweb...@gmail.com wrote: hello i want to create a vertical search engine like trovit.com. I have installed solr and solarium. What else do i need? can you recommend a suitable crawler and how to structure my data to be indexed -- View this message in context: http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: cloudsolrserver
To get started, the ref guide should be helpful. https://cwiki.apache.org/confluence/display/solr/Using+SolrJ You just need to pass the ZooKeeper host string to the constructor and then use the server. Also, what do you mean by *connect to CloudSolrServer*? You mean connect using it, right? On Tue, Jan 6, 2015 at 2:58 PM, tharpa 7kavsn...@sneakemail.com wrote: We are switching from a direct HTTP connection to use CloudSolrServer. I have looked and failed to find an example of code for connecting with CloudSolrServer. Are there any tutorials or code examples? -- View this message in context: http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html Sent from the Solr - User mailing list archive at Nabble.com. -- Anshum Gupta http://about.me/anshumgupta
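A minimal SolrJ 4.x sketch of what Anshum describes; the ZooKeeper host string and collection name below are placeholders, and this assumes a running SolrCloud cluster, so it is not runnable standalone:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudClientExample {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble address(es) -- placeholder values.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        // Run a simple match-all query against the cluster.
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("Found " + rsp.getResults().getNumFound() + " docs");

        server.shutdown();
    }
}
```

The client discovers the live Solr nodes through ZooKeeper, so no Solr host needs to be hard-coded.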
Re: cloudsolrserver
Thanks Anshum. If you say that connect using CloudSolrServer is more correct than saying, connect to CloudSolrServer, I believe you. -- View this message in context: http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724p4177728.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR - any open source framework
Thanks Roman... I will check it... Maybe it's off topic but how about Angular... On Jan 6, 2015 5:17 PM, Roman Chyla roman.ch...@gmail.com wrote: Hi Vishal, Alexandre, Here is another one, using Backbone, just released v1.0.16 https://github.com/adsabs/bumblebee you can see it in action: http://ui.adslabs.org/ While it primarily serves our own needs, I tried to architect it to be extendible (within reasonable limits of code, man power) Roman On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's very general question. So, the following are three random ideas just to get you started to think of options. *) spring.io (Spring Data Solr) + Vaadin *) http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too) *) http://projectblacklight.org/ Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote: I am new to SOLR and was able to configure, run samples as well as able to index data using DIH (from database). Just wondering if there are open source framework to query and display/visualize. Regards
Re: SOLR - any open source framework
Hi Vishal, Alexandre, Here is another one, using Backbone, just released v1.0.16 https://github.com/adsabs/bumblebee you can see it in action: http://ui.adslabs.org/ While it primarily serves our own needs, I tried to architect it to be extendible (within reasonable limits of code, man power) Roman On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's very general question. So, the following are three random ideas just to get you started to think of options. *) spring.io (Spring Data Solr) + Vaadin *) http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too) *) http://projectblacklight.org/ Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote: I am new to SOLR and was able to configure, run samples as well as able to index data using DIH (from database). Just wondering if there are open source framework to query and display/visualize. Regards
Re: .htaccess / password
Thanks Otis. Do you think a .htaccess / .passwd file in the Solr admin dir would interfere with its operation? -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Craig, If you want to protect Solr, put it behind something like Apache / Nginx / HAProxy and put .htaccess at that level, in front of Solr. Or try something like http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote: Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/ will Solr continue to function properly? One thing to note, I will have a CRON job that runs nightly that re-indexes the engine. In a nutshell I’m looking for a way to secure this area. Thanks, Craig -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman
Re: .htaccess / password
The Jetty servlet container that Solr uses doesn't understand those files. It would not use them to determine access, and would likely make them accessible to web requests in plain text. On 1/6/15 16:01, Craig Hoffman wrote: Thanks Otis. Do think a .htaccess / .passwd file in the Solr admin dir would interfere with its operation? -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Craig, If you want to protect Solr, put it behind something like Apache / Nginx / HAProxy and put .htaccess at that level, in front of Solr. Or try something like http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote: Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/ will Solr continue to function properly? One thing to note, I will have a CRON job that runs nightly that re-indexes the engine. In a nutshell I’m looking for a way to secure this area. Thanks, Craig -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman
Re: solrcloud without faceting, i.e. for failover only
: #1 is a trade off against being possibly more available to writes in the case : of a single down node. In the cloud case, you're still open for business. In : the classical replication case, you're no longer available for writes if the : downed node is the master. or to put it another way: classic replication lets you use N nodes for high availability reads, but you have a single point of failure for writes. solr cloud gives you high availability for reads and writes -- including NRT support -- at the expense of more network overhead when writes happen. : -is solrcloud w/o sharding done?( I.e. it's just not done!! ) : -any downside (i.e. aside from the lack of horizontal scalability ) it is certainly done -- specifically it is a matter of creating a collection with numShards=1 and replicationFactor=N. -Hoss http://www.lucidworks.com/
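Hoss's setup can be created through the Collections API; the host, port, collection name, and config name below are placeholders, and this assumes a running SolrCloud cluster with a config set already uploaded to ZooKeeper:

```shell
# Create an unsharded but replicated collection: one shard, three copies.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=3&collection.configName=myconf"
```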
Re: SOLR - any open source framework
There's also the VelocityResponseWriter that comes with Solr. It takes some effort to modify, but not a lot. It's useful for very fast iterations. Best, Erick On Tue, Jan 6, 2015 at 1:58 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's very general question. So, the following are three random ideas just to get you started to think of options. *) spring.io (Spring Data Solr) + Vaadin *) http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too) *) http://projectblacklight.org/ Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote: I am new to SOLR and was able to configure, run samples as well as able to index data using DIH (from database). Just wondering if there are open source framework to query and display/visualize. Regards
Re: .htaccess / password
Hi Craig, If you want to protect Solr, put it behind something like Apache / Nginx / HAProxy and put .htaccess at that level, in front of Solr. Or try something like http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote: Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/ will Solr continue to function properly? One thing to note, I will have a CRON job that runs nightly that re-indexes the engine. In a nutshell I’m looking for a way to secure this area. Thanks, Craig -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman
solrcloud without faceting, i.e. for failover only
Hi all, We have a smallish index that performs well for searches and are considering using solrcloud -- but just for high availability/redundancy, i.e. without any sharding. The indexes would be replicated, but not distributed. I know that there are no stupid questions... only stupid people... but here goes: -is solrcloud w/o sharding done? ( I.e. it's just not done!! ) -any downside (i.e. aside from the lack of horizontal scalability ) will
Re: SOLR - any open source framework
Great... Thanks for the inputs... I explored the Velocity response writer; some posts suggest it is good for prototyping but not for production... On Jan 6, 2015 4:59 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: That's very general question. So, the following are three random ideas just to get you started to think of options. *) spring.io (Spring Data Solr) + Vaadin *) http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too) *) http://projectblacklight.org/ Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 6 January 2015 at 16:35, Vishal Swaroop vishal@gmail.com wrote: I am new to SOLR and was able to configure, run samples as well as able to index data using DIH (from database). Just wondering if there are open source framework to query and display/visualize. Regards
cloudsolrserver
We are switching from a direct HTTP connection to use cloudsolrserver. I have looked and failed for an example of code for connecting to cloudsolrserver. Are there any tutorials or code examples? -- View this message in context: http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to limit the number of result sets of the 'export' handler
Thanks Alexandre. I actually need the whole result set. But it is large (perhaps 10m-100m docs) and I find select is slow. How does export differ from select, except that select will make distributed requests and do the merge? Will select with ‘distrib=false’ have comparable performance to export? 2015-01-06 20:55 GMT+08:00 Alexandre Rafalovitch arafa...@gmail.com: Export was specifically designed to get everything, which is very expensive otherwise. If you just want a subset, you might be better off with normal queries and/or with deep paging (cursor). Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 6 January 2015 at 00:30, Sandy Ding sandy.ding...@gmail.com wrote: Using rows=xxx doesn't seem to work. Is there a way to do this?
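The cursor-based deep paging Alexandre mentions looks roughly like this (available from Solr 4.7); the host, collection, and field names are placeholders, and the sort must include the uniqueKey field:

```shell
# First request: start the cursor with cursorMark=* and a stable sort
# on the uniqueKey field (here assumed to be "id").
curl "http://localhost:8983/solr/collection1/select?q=*:*&sort=id+asc&rows=1000&cursorMark=*&wt=json"

# Each response includes a nextCursorMark value; pass it back as the
# cursorMark of the next request, and stop when it no longer changes.
```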
PDF search functionality using Solr
Hello Solr-users and developers, Can you please suggest, 1. What I should do to index PDF content information column wise? 2. Do I need to extract the contents using one of the Analyzer, Tokenizer and Filter combinations and then add it to the index? How can I test the results on the command prompt? I do not know the selection of a specific Analyzer, Tokenizer and Filter for this purpose 3. How can I verify that the needed column info is extracted out of the PDF and is indexed? 4. So for example, how to verify the Ticket number is extracted into a Ticket_number tag and is indexed? 5. Is it ok to post 4 GB worth of PDF to be imported and indexed by Solr? I think I saw some posts complaining about how large the posted size can be? 6. What will enable Solr to search in any PDF out of many, with different words such as Runtime Error, and provide the link to the PDF in the result? My PDFs are nothing but a Jira ticket system. Each PDF has info on Ticket Number: Desc: Client: Status: Submitter: And so on: 1. I imported a PDF document in Solr and it does the necessary searching and I can test some of it using the browse client interface provided. 2. I have 80 GB worth of PDFs. 3. The total number of PDFs is about 200 4. Many PDFs are of size 4 GB 5. How do you suggest I import such large PDFs? What tools can you suggest to extract PDF contents first in some XML format and later post that XML to be indexed by Solr? Your early response is much appreciated. Thanks G
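For questions like these, Solr's ExtractingRequestHandler (backed by Tika) can pull the text out of a PDF while known metadata is supplied as literal fields. A hedged sketch; the URL, handler path, and field names (id, ticket_number) are assumptions for illustration and would have to exist in the schema:

```shell
# Send one PDF to the extract handler; literal.* sets fields directly,
# while the extracted body text lands in the handler's content field.
curl "http://localhost:8983/solr/update/extract?literal.id=ticket-1001&literal.ticket_number=JIRA-1001&commit=true" \
     -F "myfile=@ticket-1001.pdf"
```

Note that Tika only yields a flat stream of text; splitting it into per-field values (Ticket Number, Desc, Client, ...) would still need custom parsing before or during indexing.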
htaccess
Quick question: If I put a .htaccess file in www.mydomain.com:8983/solr/#/ will Solr continue to function properly? One thing to note, I will have a CRON job that runs nightly that re-indexes the engine. In a nutshell I’m looking for a way to secure this area. Thanks, Craig -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman
Re: Solr on HDFS in a Hadoop cluster
Hi Charles, See http://search-lucene.com/?q=solr+hdfs and https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr wrote: I am considering using *Solr* to extend *Hortonworks Data Platform* capabilities to search. - I found tutorials to index documents into a Solr instance from *HDFS*, but I guess this solution would require a Solr cluster distinct to the Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop cluster instead? - *With the index stored in HDFS?* - Where would the processing take place (could it be handed down to Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to integrate with *Yarn*? - What about *SolrCloud*: what does it bring regarding Hadoop based use-cases? Does it stand for a Solr-only cluster? - Well, if that could lead to something working with a roles-based authorization-compliant *Banana*, it would be Christmass again! Thanks a lot for any help! Charles Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à l'intention exclusive des destinataires et les informations qui y figurent sont strictement confidentielles. Toute utilisation de ce Message non conforme à sa destination, toute diffusion ou toute publication totale ou partielle, est interdite sauf autorisation expresse. Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si vous avez reçu ce Message par erreur, merci de le supprimer de votre système, ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support que ce soit. Nous vous remercions également d'en avertir immédiatement l'expéditeur par retour du message. 
Il est impossible de garantir que les communications par messagerie électronique arrivent en temps utile, sont sécurisées ou dénuées de toute erreur ou virus. This message and any attachments (the 'Message') are intended solely for the addressees. The information contained in this Message is confidential. Any use of information contained in this Message not in accord with its purpose, any dissemination or disclosure, either whole or partial, is prohibited except formal approval. If you are not the addressee, you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete it and all copies from your system and notify the sender immediately by return message. E-mail communication cannot be guaranteed to be timely secure, error or virus-free.
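As a taste of what the "Running Solr on HDFS" reference page describes, a standalone Solr 4.x example instance can be pointed at HDFS with system properties like these; the namenode host, port, and path are placeholders:

```shell
# Store the index and update log directly on HDFS instead of local disk.
java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lock.type=hdfs \
     -Dsolr.data.dir=hdfs://namenode:8020/solr \
     -Dsolr.updatelog=hdfs://namenode:8020/solr \
     -jar start.jar
```

This only covers index storage; query processing still runs in the Solr JVMs, not in Hadoop.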
Re: Running Multiple Solr Instances
Thanks a lot guys. As a beginner these are very helpful for me. Thanks, Nishanth On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I would do one of either: 1. Set a different Solr home for each instance. I'd use the -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so. 2. RAID 10 the drives. If you expect the Solr instances to get uneven traffic, pooling the drives will allow a given Solr instance to share the capacity of all of them. On 1/5/15 23:31, Nishanth S wrote: Hi folks, I am running multiple solr instances (Solr 4.10.3 on tomcat 8). There are 3 physical machines and I have 4 solr instances running on each machine on ports 8080, 8081, 8082 and 8083. The set up is well up to this point. Now I want to point each of these instances to different index directories. The drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. Now if I define /d/1 as the solr home, all solr index directories are created in /d/1 whereas the other drives remain unused. So how do I configure solr to make use of all the drives so that I can get maximum storage for solr? I would really appreciate any help in this regard. Thanks, Nishanth
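Michael's first option might look like this for the four instances described in the thread (ports and mount points taken from the original message; since the poster runs Tomcat 8, the -Dsolr.solr.home property would instead go into each Tomcat instance's JAVA_OPTS):

```shell
# One Jetty-based Solr instance per drive, each with its own solr home.
java -Djetty.port=8080 -Dsolr.solr.home=/d/1 -jar start.jar &
java -Djetty.port=8081 -Dsolr.solr.home=/d/2 -jar start.jar &
java -Djetty.port=8082 -Dsolr.solr.home=/d/3 -jar start.jar &
java -Djetty.port=8083 -Dsolr.solr.home=/d/4 -jar start.jar &
```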
Solr on HDFS in a Hadoop cluster
I am considering using Solr to extend Hortonworks Data Platform capabilities to search. - I found tutorials to index documents into a Solr instance from HDFS, but I guess this solution would require a Solr cluster distinct from the Hadoop cluster. Is it possible to have Solr integrated into the Hadoop cluster instead? - With the index stored in HDFS? - Where would the processing take place (could it be handed down to Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to integrate with Yarn? - What about SolrCloud: what does it bring regarding Hadoop-based use-cases? Does it stand for a Solr-only cluster? - Well, if that could lead to something working with a roles-based authorization-compliant Banana, it would be Christmas again! Thanks a lot for any help! Charles
Re: How large is your solr index?
Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even performs reasonably well on commodity hardware with lots of faceting and concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but it works. ++1 for the automagic shard creator. We've been looking into doing this sort of thing internally - i.e. when a shard reaches a certain size/num docs, it creates 'sub-shards' to which new commits are sent and queries to the 'parent' shard are included. The concept works, as long as you don't try any non-dist stuff - it's one reason why all our fields are always single valued. There are also other implications like cleanup, deletes and security to take into account, to name a few. A cool side-effect of sub-sharding (for lack of a snappy term) is that the parent shard then stops suffering from auto-warming latency due to commits (we do a fair amount of committing). In theory, you could carry on sub-sharding until your hardware starts gasping for air. On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam bram.van...@intix.eu wrote: On 01/04/2015 02:22 AM, Jack Krupansky wrote: The reality doesn't seem to be there today. 50 to 100 million documents, yes, but beyond that takes some kind of heroic effort, whether a much beefier box, very careful and limited data modeling or limiting of query capabilities or tolerance of higher latency, expert tuning, etc. I disagree. On the scale, at least. Up until 500M Solr performs well (read: well enough considering the scale) in a single shard on a single box of commodity hardware. Without any tuning or heroic efforts. Sure, some queries aren't as snappy as you'd like, and sure, indexing and querying at the same time will be somewhat unpleasant, but it will work, and it will work well enough. Will it work for thousands of concurrent users? Of course not. Anyone who is after that sort of thing won't find themselves in this scenario -- they will throw hardware at the problem. 
There is something to be said for making sharding less painful. It would be nice if, for instance, Solr would automagically create a new shard once some magic number was reached (2B at the latest, I guess). But then that'll break some query features ... :-( The reason we're using single large instances (sometimes on beefy hardware) is that SolrCloud is a pain. Not just from an administrative point of view (though that seems to be getting better, kudos for that!), but mostly because some queries cannot be executed with distributed=true. Our users, at least, prefer a slow query over an impossible query. Actually, this 2B limit is a good thing. It'll help me convince $management to donate some of our time to Solr :-) - Bram
.htaccess / password
Quick question: If I put a .htaccess file in www.mydomain.com:8983/solr/#/ will Solr continue to function properly? One thing to note, I will have a CRON job that runs nightly that re-indexes the engine. In a nutshell I’m looking for a way to secure this area. Thanks, Craig -- Craig Hoffman w: http://www.craighoffmanphotography.com FB: www.facebook.com/CraigHoffmanPhotography TW: https://twitter.com/craiglhoffman
Re: Solr on HDFS in a Hadoop cluster
Oh, and https://issues.apache.org/jira/browse/SOLR-6743 Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi Charles, See http://search-lucene.com/?q=solr+hdfs and https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr wrote: I am considering using *Solr* to extend *Hortonworks Data Platform* capabilities to search. - I found tutorials to index documents into a Solr instance from *HDFS*, but I guess this solution would require a Solr cluster distinct to the Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop cluster instead? - *With the index stored in HDFS?* - Where would the processing take place (could it be handed down to Hadoop)? Is there a way to garantee a level of service (CPU, RAM) - to integrate with *Yarn*? - What about *SolrCloud*: what does it bring regarding Hadoop based use-cases? Does it stand for a Solr-only cluster? - Well, if that could lead to something working with a roles-based authorization-compliant *Banana*, it would be Christmass again! Thanks a lot for any help! Charles
RE: Running Multiple Solr Instances
Nishanth, 1. I understand you are implementing clustering for the web apps, i.e. running the same application on multiple different instances on one or more machines. 2. If each of your web apps starts pointing to a different index directory, how will it switch to the next web app with a different index if the search term is not found in the first index directory? 3. Or will the web app collect the results sequentially from all the index directories and present the resulting collection to the user? Please share your thoughts Thanks G -Original Message- From: Nishanth S [mailto:nishanth.2...@gmail.com] Sent: Tuesday, January 06, 2015 12:17 PM To: solr-user@lucene.apache.org Subject: Re: Running Multiple Solr Instances Thanks a lot guys. As a beginner these are very helpful for me. Thanks, Nishanth On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta michael.della.bi...@appinions.commailto:michael.della.bi...@appinions.com wrote: I would do one of either: 1. Set a different Solr home for each instance. I'd use the -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so. 2. RAID 10 the drives. If you expect the Solr instances to get uneven traffic, pooling the drives will allow a given Solr instance to share the capacity of all of them. On 1/5/15 23:31, Nishanth S wrote: Hi folks, I am running multiple solr instances (Solr 4.10.3 on tomcat 8). There are 3 physical machines and I have 4 solr instances running on each machine on ports 8080, 8081, 8082 and 8083. The set up is well up to this point. Now I want to point each of these instances to different index directories. The drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. Now if I define /d/1 as the solr home, all solr index directories are created in /d/1 whereas the other drives remain unused. So how do I configure solr to make use of all the drives so that I can get maximum storage for solr? I would really appreciate any help in this regard. Thanks, Nishanth
Re: IstvanKulcsar - Wiki Solr
On 1/6/2015 7:28 AM, ikulc...@precognox.com wrote: I would like suggest pages which use SOLR and developed my company. Please put this page this site: http://wiki.apache.org/solr/PublicServers http://www.odrportal.hu/kereso/ http://idea.unideb.hu/idealista/ http://www.jobmonitor.hu http://www.profession.hu/ http://webicina.com/ http://www.cylex.hu/ Create a user on the Solr wiki and let us know what your username is. We will get your username added to the group that allows you to edit the wiki. Is IstvanKulcsar (first thing in the subject) your username on the wiki? Thanks, Shawn
Re: PDF search functionality using Solr
Hello, no matter which search platform you will use, this will pose two challenges: - The size of the documents will render search less and less useful as the likelihood of matches increases with document size. So, without a proper semantic extraction (e.g., using decent NER or relationship extraction with a commercial text mining product), I doubt you will get the required precision to make this overly useful. - PDFs can have their own character sets based on the characters actually used. Such file-specific character sets are almost impossible to parse, i.e., if your PDFs happen to use this feature of the PDF format, you won't be lucky getting any meaningful text out of them. My suggestion is to use the Jira REST API to collect all necessary documents and index the resulting XML or attachment formats. As the REST API provides filtering capabilities, you could easily create incremental feeds to avoid humongous indexing every time there's new information in Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way of handling this. Best regards, --Jürgen On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote: Hello Solr-users and developers, Can you please suggest, 1. What I should do to index PDF content information column wise? 2. Do I need to extract the contents using one of the Analyzer, Tokenize and Filter combination and then add it to Index? How can test the results on command prompt? I do not know the selection of specific Analyzer, Tokenizer and Filter for this purpose 3. How can I verify that the needed column info is extracted out of PDF and is indexed? 4. So for example How to verify Ticket number is extracted in Ticket_number tag and is indexed? 5. Is it ok to post 4 GB worth of PDF to be imported and indexed by Solr? I think I saw some posts complaining on how large size that can be posted ? 6.
What will enable Solr to search in any PDF out of many, with different words such as Runtime Error and result will provide the link to the PDF My PDFs are nothing but Jira ticket system. PDF has info on Ticket Number: Desc: Client: Status: Submitter: And so on: 1. I imported PDF document in Solr and it does the necessary searching and I can test some of it using the browse client interface provided. 2. I have 80 GB worth of PDFs. 3. Total number of PDFs are about 200 4. Many PDFs are of size 4 GB 5. What do you suggest me to import such a large PDFs? What tools can you suggest to extract PDF contents first in some XML format and later Post that XML to be indexed by Solr.? Your early response is much appreciated. Thanks G -- Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением *i.A. Jürgen Wagner* Head of Competence Center Intelligence Senior Cloud Consultant Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543 E-Mail: juergen.wag...@devoteam.com mailto:juergen.wag...@devoteam.com, URL: www.devoteam.de http://www.devoteam.de/ Managing Board: Jürgen Hatzipantelis (CEO) Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question
Thanks, Jürgen, for your quick reply. I am still looking for answers on schema.xml and solrconfig.xml:

1. Do I need to tell Solr to extract Title from the PDF, i.e., look for the word "Title", extract the entire line after the tag, collect all such occurrences from hundreds of PDFs, and build and index the Title column data?
2. How do I define my own schema in Solr?
3. Say I defined my fields Title, Ticket_number, Submitter, Client, and so on. How can I verify that the respective data is extracted into the specific columns in Solr and indexed? Any suggestions on which Analyzer, Tokenizer, and Filter will help for this purpose?

Some more context:

1. I do not want to dump the entire 4 GB of PDF contents into one searchable field (ATTR_CONTENT) in Solr.
2. Even if the entire PDF contents are extracted into the above field by default, I still want to extract specific searchable column data into respective fields.
3. Rather, I want to configure Solr to have column-wise searchable contents such as Title, Number, and so on. Any suggestions on performance? The PDF database is 80 GB; will it be fast enough? Do I need to divide it into multiple cores on multiple machines? And multiple web apps? And clustering? I should have mentioned that my PDFs are from a ticketing system like Jira, which was retired from production long ago; all I have is the ticketing system's PDF database.
4. My system will be used internally by only a select few people.
5. They can wait for a 4 GB PDF to load.
6. I agree there will be many matches found in one large PDF, depending on the search criteria.
7. To make searches faster, I want Solr to create more columns and column-based indexes.
8. Solr underneath uses Tika, which extracts the contents and strips all the rich-content formatting characters present in the PDF document.
9. I believe the resulting extraction size is about 1/5th of the original PDF (just a rough guess based on one sample extraction).

From: Jürgen Wagner (DVT) [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr
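One way to get the column-wise fields asked about above (Title, Ticket_number, Submitter, etc.), instead of relying on one big ATTR_CONTENT field, is to run the Tika-extracted plain text through a small parser before posting documents to Solr. The sketch below assumes the ticket labels appear at the start of lines in the extracted text; the labels and Solr field names are illustrative, not Jira's actual export format.

```python
# Sketch: turn the plain text extracted from a ticket PDF into a field-wise
# document. Assumes labels like "Ticket Number:" start a line in the text;
# the label set and field names are illustrative placeholders.
import re

# Map the labels seen in the PDFs to Solr field names.
LABELS = {
    "Ticket Number": "ticket_number",
    "Desc": "description",
    "Client": "client",
    "Status": "status",
    "Submitter": "submitter",
}

# One regex alternation over all known labels, anchored to line starts.
LABEL_RE = re.compile(
    r"^(%s):\s*(.*)$" % "|".join(re.escape(l) for l in LABELS),
    re.MULTILINE)

def parse_ticket_text(text):
    """Extract the labelled fields of one ticket from extracted plain text."""
    doc = {}
    for label, value in LABEL_RE.findall(text):
        doc[LABELS[label]] = value.strip()
    return doc
```

The resulting dictionaries can then be posted to Solr as ordinary documents, with each field defined in schema.xml, so queries can target `ticket_number` or `status` directly instead of one huge content field. Verifying extraction then becomes a matter of faceting or querying on the individual fields.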