RE: Using Tesseract OCR to extract PDF files in EML file attachment

2019-10-11 Thread Davis, Daniel (NIH/NLM) [C]
Nuance and ABBYY provide OCR capabilities as well. Looking at higher-level solutions, both indexengines.com and Commvault can do email remediation for legal issues. > -Original Message- > From: Retro > Sent: Friday, October 11, 2019 8:06 AM > To: solr-user@lucene.apache.org > Subject:

RE: Using Solr as a Database?

2019-06-03 Thread Davis, Daniel (NIH/NLM) [C]
I think the sweet spot of Cassandra and Solr should be mentioned in this discussion. Cassandra is more scalable/clusterable than an RDBMS, without losing all of the structure that is desirable in an RDBMS. In contrast, if you use a full document store such as MongoDB, you lose some of the

RE: Load balance writes

2019-02-11 Thread Davis, Daniel (NIH/NLM) [C]
I think that the container orchestration framework takes care of that for you, but I am not an expert. In Kubernetes, NGINX is often the Ingress controller, and as long as the services are running within the Kubernetes cluster, it can also serve as a load balancer, AFAICT. In Kubernetes, a

RE: what are the best client interface ?

2019-01-11 Thread Davis, Daniel (NIH/NLM) [C]
WordPress and Drupal both have ways to interface with Solr through plugins/modules. Not sure that describes your PHP website. I like Ruby on Rails "projectblacklight" for an easy and usable discovery layer. We are a Python/Django shop - we've had good luck with Django-haystack and pysolr. >

RE: [solr-solrcloud] How does DIH work when there are multiple nodes?

2019-01-04 Thread Davis, Daniel (NIH/NLM) [C]
DIH is also not designed to multi-thread very well. One way I've handled this is to have a DIH XML that breaks up a database query into multiple processes by taking the modulo of a row, as follows: This allows me to do sub-queries within the entity, but it is often better to just write
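The modulo split described above can be sketched outside of DIH as well. This is only an illustration of the partitioning idea, not the DIH XML from the original message: the function name, the `MOD()` SQL function (dialect-dependent), and the assumption that the base query has no `WHERE` clause of its own are all hypothetical.

```python
def partition_queries(base_query: str, key_column: str, partitions: int) -> list[str]:
    """Return one SQL query per partition, selecting rows by key modulo.

    Assumes base_query has no WHERE clause and that the database
    supports MOD(); both are assumptions for illustration only.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return [
        f"{base_query} WHERE MOD({key_column}, {partitions}) = {remainder}"
        for remainder in range(partitions)
    ]
```

Each returned query could then be handed to a separate import process, so that every row is picked up by exactly one of them.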

RE: Solr OCR Support

2018-11-02 Thread Davis, Daniel (NIH/NLM) [C]
I think that you also have to process a PDF pretty deeply to decide whether it needs OCR. I have worked on projects where all of the PDFs are really like faxes - images are encoded in JBIG2 black and white or similar, and there is really one image per page, and no text. I have also

RE: Solr cluster tuning

2018-10-24 Thread Davis, Daniel (NIH/NLM) [C]
Usually, responses are due to I/O waits getting the data off of the disk. So, to me, this seems more likely because as you bombard the server with queries, you cause more and more of the data needed to answer the query to be loaded into memory. To verify this, I'd bombard your server with queries to warm

RE: Securying ONLY the web interface console

2018-10-22 Thread Davis, Daniel (NIH/NLM) [C]
I think that it is not really Solr's job to solve this. I'm sure that there are many Java ways to solve this with Jetty configuration of JAAS, but the *safest* ways involve ports and rights. In other words, port 8983 and zookeeper ports are then for Solr nodes to communicate with each

RE: How to restrict solr 7.4 to use TLS 1.2 only?

2018-10-10 Thread Davis, Daniel (NIH/NLM) [C]
Best Option - Put a load balancer/distributor in front of it. Other Option - Edit jetty.xml. Solr uses Jetty, and so the key is in the HTTPConfiguration for jetty. This file, in my installation, is in solr-X.Y.Z/etc/jetty.xml. There is some documentation at

RE: Remove schema.xml in favor of managed-schema

2018-06-19 Thread Davis, Daniel (NIH/NLM) [C]
Elastic allows the mappings to be set all at once, either in the template or as index settings. That is an important feature because it allows the field definitions to be source code artifacts, which can be deployed very easily by an automatic script. Solr's Managed Schema API allows multiple

RE: Some performance questions....

2018-03-16 Thread Davis, Daniel (NIH/NLM) [C]
Deepak, A better test of multi-user support might be to vary the queries and try to simulate a realistic 'working set' of search data. I've made this same performance analysis mistake with the search index of www.indexengines.com, which I developed (in part). Somewhat different from Lucene,

RE: Resend: Authorization on 6.6.0

2018-03-13 Thread Davis, Daniel (NIH/NLM) [C]
I believe that Joe needs to be given some level of access for him to be able to see the collections, and joe should always be required to give his/her/its password to access any collection. -Original Message- From: Terry Steichen [mailto:te...@net-frame.com] Sent: Monday, March 12,

RE: CDCR performance issues

2018-03-09 Thread Davis, Daniel (NIH/NLM) [C]
These are general guidelines; I've done loads of networking, but may be less familiar with SolrCloud and CDCR architecture. However, I know it's all TCP sockets, so general guidelines do apply. Check the round-trip time between the data centers using ping or TCP ping. Throughput tests may
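Where ICMP ping is blocked between data centers, a "TCP ping" can be approximated by timing a TCP connect. A minimal sketch, assuming the target host and port are whatever the remote Solr or ZooKeeper node listens on; the function name is made up for illustration.

```python
import socket
import time


def tcp_rtt_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Time a TCP connect (roughly one network round trip) in milliseconds."""
    start = time.perf_counter()
    # The connection is closed again immediately; only the handshake is timed.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0
```

Running it several times and looking at the median is probably more useful than any single sample, since the first connect may include DNS lookup time.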

RE: SolrCloud: How best to do backups?

2018-02-08 Thread Davis, Daniel (NIH/NLM) [C]
I would suggest you have a separate EBS to save the backup from each server. These EBS volumes would be mounted all the time, but only modified by a backup. Then, you can create an AWS Lambda function that runs on a periodic trigger from CloudWatch, and does the following: - run the

RE: Fusion or DIY w/Solr?

2018-02-06 Thread Davis, Daniel (NIH/NLM) [C]
Norconex filesystem collector should be able to handle XML output by Sovren very flexibly. I am a big fan. You can use a DOMSplitter to split a single large XML document into multiple smaller ones. I started with Norconex because I found Heritrix a bit of a pain to configure, as it is

RE: SolrCloud installation troubles...

2018-01-29 Thread Davis, Daniel (NIH/NLM) [C]
Trying 127.0.0.1 could help. We kind of tend to think localhost is always 127.0.0.1, but I've seen localhost start to resolve to ::1, the IPv6 equivalent of 127.0.0.1. I guess some environments can be strict enough to restrict communication on localhost; seems hard to imagine, but it does
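One quick way to see what localhost actually resolves to on a given machine is to ask the resolver directly. A small sketch using only the standard library; the function name is hypothetical.

```python
import socket


def resolved_addresses(hostname: str) -> list[str]:
    """Return the distinct IP addresses a hostname resolves to, in resolver order."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    addresses = []
    for info in infos:
        address = info[4][0]  # sockaddr tuple starts with the IP string
        if address not in addresses:
            addresses.append(address)
    return addresses
```

If `resolved_addresses("localhost")` lists `::1` ahead of `127.0.0.1`, that would be consistent with the IPv6 behavior described above.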

RE: SolrCloud installation troubles...

2018-01-29 Thread Davis, Daniel (NIH/NLM) [C]
To expand on that answer, you have to wonder what ports are open in the server system's port-based firewall. I have to ask my systems team to open ports for everything I'm using, especially when I move from localhost to outside. You should be able to "fake it out" if you set up your

RE: Profanity

2018-01-08 Thread Davis, Daniel (NIH/NLM) [C]
Fun topic. Same complicated issues as normal search: Multilingual support? Is "Merde" profanity too, or just in French? Multi-word synonyms? Does "God Damn" become "goddamn", or do you treat "Damn" and "God damn" the same because you drop "God"

RE: Anyone have any comments on current solr monitoring favorites?

2017-11-06 Thread Davis, Daniel (NIH/NLM) [C]
I have used Java Melody for this purpose on past Java based servers, but I haven't tried to embed it in Jetty. -Original Message- From: Petersen, Robert (Contr) [mailto:robert.peters...@ftr.com] Sent: Monday, November 06, 2017 4:50 PM To: solr-user@lucene.apache.org Subject: Re: Anyone

RE: AEM SOLR integaration

2017-09-22 Thread Davis, Daniel (NIH/NLM) [C]
Gunalan, I think this depends on your system environment. It is a general "service discovery" issue. On-premise, my organization uses f5 BigIP as a load balancer, and so we merely have f5 LTM direct traffic from one name to any of a number of Solr instances. If they are all SolrCloud, it

RE: Customizing JSON response of a query

2017-09-07 Thread Davis, Daniel (NIH/NLM) [C]
st to be clear I am not having any trouble querying >children/parent document since I have all of this stored using fully >qualified names in each document in the collection. > > > > > >Regards, > >Sarvo > > > >On Wed, Sep 6, 2017 at 3:52 PM, Rick Leir <rl

RE: Customizing JSON response of a query

2017-09-06 Thread Davis, Daniel (NIH/NLM) [C]
It should be possible with a custom response handler. -Original Message- From: Sarvothaman Madhavan [mailto:relad...@gmail.com] Sent: Wednesday, September 06, 2017 10:17 AM To: solr-user@lucene.apache.org Subject: Customizing JSON response of a query Hello all, After a week of research

RE: "What is Solr" in Google search results

2017-08-31 Thread Davis, Daniel (NIH/NLM) [C]
Wikipedia seems to be better now. Thank you, Peaceray. Honestly, though, by the numbers, I think the comment was correct. Elasticsearch has a much smoother on-ramp for IT developers, but it is much harder to customize relevancy and integrate with BigData pipelines. IT developers are

RE: Solr config upgrade tool

2017-08-11 Thread Davis, Daniel (NIH/NLM) [C]
Hrishikesh Gadre, I'm interested in how that might integrate with continuous integration. I briefly worked on a tool to try a configuration out with SolrCloud, e.g. upload the config, create a collection, run some stuff, test some stuff. I got the first two working, but not the "run some

RE: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Davis, Daniel (NIH/NLM) [C]
Muhammad, This sounds like it might be handled better by multiple collections rather than multiple "sub collections". Create a new collection for each date, all using the same common config set, and then create an alias that contains all of these collections. Then, the alias will
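The collection-per-date plus alias approach can be sketched as two Collections API calls. The snippet below only builds the request URLs; the base URL, collection names, and config set name are placeholders, and `numShards=1` is an assumption rather than a recommendation.

```python
from urllib.parse import urlencode


def create_collection_url(solr_base: str, name: str, config_name: str,
                          num_shards: int = 1) -> str:
    """Build a Collections API CREATE request for one dated collection."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "collection.configName": config_name,  # shared config set
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"


def create_alias_url(solr_base: str, alias: str, collections: list[str]) -> str:
    """Build a CREATEALIAS request pointing one alias at several collections."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"
```

Re-issuing the alias request with a longer collection list each time a new dated collection is added should keep queries against the alias covering the whole date range.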

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Davis, Daniel (NIH/NLM) [C]
? Really curious about if this would to return the "interesting Phrases" On Tue, Jun 20, 2017 at 12:01 PM, Davis, Daniel (NIH/NLM) [C] < daniel.da...@nih.gov> wrote: > Joel, > > I think the issue is doing word-breaking according to ICU rules. So, if > you are trying to

RE: How are people using the ICUTokenizer?

2017-06-20 Thread Davis, Daniel (NIH/NLM) [C]
Joel, I think the issue is doing word-breaking according to ICU rules. So, if you are trying to make sure your index breaks words properly on eastern languages, just use ICU Tokenizer. Unless your text is already in an ICU normal form, you should always use the ICUNormalizer character

RE: Solr in NAS or Network Shared Drive

2017-05-19 Thread Davis, Daniel (NIH/NLM) [C]
"/"listen" state to do no writing to the index, but keep referencing the index properties/version files. On Fri, May 19, 2017 at 1:26 PM, Davis, Daniel (NIH/NLM) [C] < daniel.da...@nih.gov> wrote: > Better off to just do Replication to the slave using

RE: Solr in NAS or Network Shared Drive

2017-05-19 Thread Davis, Daniel (NIH/NLM) [C]
Better off to just do Replication to the slave using the replication handler. However, if there is no network connectivity, e.g. this is an offsite cold/warm spare, then here is a solution: The NAS likely supports some Copy-on-write/snapshotting capabilities. If your systems people will

RE: Solr Query Performance benchmarking

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
percentiles are $pct95" echo `date` ": full results are in ${test}" wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 28, 2017, at 12:00 PM, Davis, Daniel (NIH/NLM) [C] > <daniel.da...@nih.gov> wrote: > > Walter,

RE: Solr Query Performance benchmarking

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
Walter, If you can share a pointer to that JMeter add-on, I'd love it. -Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Friday, April 28, 2017 2:53 PM To: solr-user@lucene.apache.org Subject: Re: Solr Query Performance benchmarking I use production logs

RE: Import Handler using shell scripts

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
Attached is a Python script I use, with slight redactions, on several data import jobs. The main points here are: * Watch the job until the import finishes * Always send email whether it succeeds or fails * Put the hostname, and whether it was a success, in the subject for quick removal *
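The attached script is not part of this archive, so the following is only a rough sketch of the first three points, not the original code. It assumes the import handler's status response is a JSON object whose `status` field reads `busy` while an import is running; the function names and the subject format are invented for illustration.

```python
import time
from typing import Callable


def wait_until_idle(fetch_status: Callable[[], dict], poll_seconds: float = 10.0) -> dict:
    """Poll the import status until it is no longer 'busy'; return the last response.

    fetch_status is any callable that returns the parsed status response,
    e.g. a wrapper around an HTTP GET of the handler's status command.
    """
    while True:
        response = fetch_status()
        if response.get("status") != "busy":
            return response
        time.sleep(poll_seconds)


def email_subject(hostname: str, succeeded: bool) -> str:
    """Put the host and the outcome up front so mail is easy to filter."""
    outcome = "SUCCESS" if succeeded else "FAILURE"
    return f"[{hostname}] data import {outcome}"
```

Deciding whether the finished import actually succeeded, and sending the mail itself, are left out here because they depend on the response details and the local mail setup.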

RE: Poll: Master-Slave or SolrCloud?

2017-04-28 Thread Davis, Daniel (NIH/NLM) [C]
I am also very surprised. Even though I am no longer using my solr-config-tool, the main thing I like about SolrCloud is how easy it is to bring up a new collection and set up the schema and fields that you want. I also like that I don't need to manage replication in the solr configuration.

RE: Does DIH queues up requests

2017-01-25 Thread Davis, Daniel (NIH/NLM) [C]
DIH is not multi-threaded, and so the idea of "queueing" up requests is a misnomer. You might be better off using something other than DataImportHandler. LogStash can pull what it calls "events" from a database and then push them into Solr, and you have some of the same row transformation

RE: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-18 Thread Davis, Daniel (NIH/NLM) [C]
Marek, I've wanted to do something like this in the past as well. However, a rewrite that supports the same XML syntax might be better. There are several problems with the design of the Data Import Handler that make it not quite suitable: - Not designed for Multi-threading - Bad

RE: How to stop long running/memory eating query

2016-11-17 Thread Davis, Daniel (NIH/NLM) [C]
Mikhail, If the query is not asynchronous, it would certainly be OK to stop the long-running query if the client socket is disconnected. I know that is a feature of the niche indexer used in the products of www.indexengines.com, because I wrote it. We did not have asynchronous queries, and

RE: Multi word synonyms

2016-11-15 Thread Davis, Daniel (NIH/NLM) [C]
d the rate of indexing you need to maintain. Depending on your experience, you can do this even if you are new to Solr, as you've mentioned. -Original Message----- From: Davis, Daniel (NIH/NLM) [C] Sent: Tuesday, November 15, 2016 10:22 AM To: solr-user@lucene.apache.org Subject: RE: Multi word syn

RE: Multi word synonyms

2016-11-15 Thread Davis, Daniel (NIH/NLM) [C]
I'm not as expert as some on this list, but reading the article suggested, https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/, what you do is this: - Have one field that takes text as normal - Copy that field to another

Measuring the entropy of a field

2016-11-15 Thread Davis, Daniel (NIH/NLM) [C]
Does Lucene/Solr include any tools for measuring the entropy/information of a field? My intuition is that this would only work if the field were a single-value field and the analysis identified characters rather than tokens. Also, Unicode does throw a wrench in it - I suppose such a
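For reference, character-level Shannon entropy of a single field value is straightforward to compute outside of Lucene/Solr. A minimal sketch, assuming "character" means a Unicode code point; combining characters and other multi-code-point graphemes would be counted separately, which is one form of the Unicode wrinkle mentioned above.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character, counting Unicode code points."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = sum over symbols of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())
```

For example, a value made of two equally frequent characters carries one bit per character, while a value that repeats a single character carries none.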

RE: Apache Solr Question

2016-11-03 Thread Davis, Daniel (NIH/NLM) [C]
Case in point - https://collections.nlm.nih.gov/ has one index (core) for documents and another index (core) for pages within the documents. I think LOC (Library of Congress) does something similar from a presentation they gave at Lucene/DC Exchange. -Original Message- From: Doug

RE: PDF writer

2016-10-21 Thread Davis, Daniel (NIH/NLM) [C]
If the PDF report is truly a report, I agree with this. We have a use-case with IBM InfoSphere Watson Explorer where our users want a PDF report on the results for their query to be generated on the fly. They can then save the query and have the report emailed to them :) Not only is Solr

RE: Solr with logstash solr_http output plugin and geoip filter

2016-10-21 Thread Davis, Daniel (NIH/NLM) [C]
Don Tavoletti, I'm not sure you mean "me" by Daniel, despite that being my name. There is a LogStash output plugin to output to Solr: https://www.elastic.co/guide/en/logstash/current/plugins-outputs-solr_http.html For really simple use cases, there is also a LogStash input plugin for JDBC:

RE: Solr and Drupal

2016-08-09 Thread Davis, Daniel (NIH/NLM) [C]
John/Rose, With Drupal 7, the module John pointed to was the module to use. With Drupal 8, I have no idea. -Original Message- From: John Bickerstaff [mailto:j...@johnbickerstaff.com] Sent: Tuesday, August 09, 2016 2:38 PM To: solr-user@lucene.apache.org Subject: Re: Solr and Drupal

RE: Installing Solr with Ivy

2016-08-03 Thread Davis, Daniel (NIH/NLM) [C]
amount of load on the archive servers. I'd still love in theory to find a solution that's a little more future-proof than "build a URL and download from it," but for now, I think this will get me through. Thanks again! - Demian -Original Message----- From: Davis, Daniel (N

RE: Installing Solr with Ivy

2016-08-02 Thread Davis, Daniel (NIH/NLM) [C]
Demian, I've long meant to upload my own "automated installation" - it is ant without ivy, but with checksums. I suppose gpg signatures could also be worked in. It is only semi-automated, because our DevOps group does not have root, but here is a clean version -

RE: Access Solr via Apache's mod_proxy_balancer or mod_jk (AJP)

2016-07-06 Thread Davis, Daniel (NIH/NLM) [C]
Again I have to insert the larger company view: * If your company is largish, you may have load balancer hardware already in use by systems. * If you are using a Cloud system for Solr, then you can probably use a load balancer provided by the cloud provider, and this may be cheaper

RE: deploy solr on cloud providers

2016-07-05 Thread Davis, Daniel (NIH/NLM) [C]
Lorenzo, this probably comes late, but my systems guys just don't want to give me real disk. Although RAID-5 or LVM on-top of JBOD may be better than Amazon EBS, Amazon EBS is still much closer to real disk in terms of IOPS and latency than NFS ;) I even ran a mini test (not an official

RE: Access Solr via Apache's mod_proxy_balancer or mod_jk (AJP)

2016-07-05 Thread Davis, Daniel (NIH/NLM) [C]
Because access to Solr is typically to an API, rather than to webapps having images and static files that can be served directly, I think you can use mod_proxy_http just as well as mod_jk. I would suggest you not pursue trying to get AJP to work. mod_proxy_balancer will work with

RE: Regarding CDCR SOLR 6

2016-06-14 Thread Davis, Daniel (NIH/NLM) [C]
I must chime in to clarify something - in case 2, would the source cluster eventually start a log reader on its own? That is, would the CDCR heal over time, or would manual action be required? -Original Message- From: Renaud Delbru [mailto:renaud@siren.solutions] Sent: Tuesday, June

RE: Help: Lucidwork Fusion documentation

2016-06-02 Thread Davis, Daniel (NIH/NLM) [C]
Is the Solr Reference Guide what you are looking for? https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-6.0.pdf I don't know how to find older versions. From: Aman Tandon [amantandon...@gmail.com] Sent: Thursday, June 02,

RE: Suspicious message with attachment

2016-05-16 Thread Davis, Daniel (NIH/NLM) [C]
I know the sender, he often posts to this list, and I don't download attachments until I've vetted them anyway. -Original Message- From: postmas...@ssww.com [mailto:postmas...@ssww.com] On Behalf Of h...@ssww.com Sent: Monday, May 16, 2016 11:54 AM To: solr-user@lucene.apache.org

RE: Using Ping Request Handler in SolrCloud within a load balancer

2016-05-12 Thread Davis, Daniel (NIH/NLM) [C]
Shawn, that's a great idea for how to integrate f5 with Solr. I'd thought about having Apache httpd in-front of Solr, but I suppose I could just have f5 BigIP on its own. -Original Message- From: Sandy Foley [mailto:sandy.fo...@verndale.com] Sent: Thursday, May 12, 2016 2:38 PM To:

RE: Using updateRequest Processor with DIH

2016-05-02 Thread Davis, Daniel (NIH/NLM) [C]
I don't know whether that works; but you can use the ScriptTransformer with DIH to achieve similar results. I've only used JavaScript (Rhino) scripts, but they worked for me. More recently, I've found that most of my transformations can be accomplished with the TemplateTransformer.

RE: Solr 5.2.1 on Java 8 GC

2016-04-30 Thread Davis, Daniel (NIH/NLM) [C]
Bram, on the subject of brute force - if your script is "clever" and uses binary first search, I'd love to adapt it to my environment. I am trying to build a truly multi-tenant Solr because each of our indexes is tiny, but all together they will eventually be big, and so I'll have to repeat

RE: dataimport db-data-config.xml

2016-04-29 Thread Davis, Daniel (NIH/NLM) [C]
Kishor, Data Import Handler doesn't know how to randomly access rows from the CSV to "JOIN" them to rows from the MySQL table at indexing time. However, both MySQL and Solr know how to JOIN rows/documents from multiple tables/collections/cores. Data Import Handler could read the CSV first, and

Remedial Map-Reduce logic

2016-04-20 Thread Davis, Daniel (NIH/NLM) [C]
Well, it's been a long time since I took any data structures and algorithms course (2000, basically), and after the recent Solr 6 feature chat, I was very curious whether there was real computational goodness behind the move towards a JDBC interface based on Streaming Expressions. This led me

RE: Streaming with facets

2016-04-19 Thread Davis, Daniel (NIH/NLM) [C]
on-the-fly as facet buckets are being streamed). -Yonik On Tue, Apr 19, 2016 at 4:48 PM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: > So, can someone clarify how faceting works with streaming expressions? > > I can see how document search can return documents as it

Streaming with facets

2016-04-19 Thread Davis, Daniel (NIH/NLM) [C]
So, can someone clarify how faceting works with streaming expressions? I can see how document search can return documents as it finds them, using any particular ordering desired - just a parse tree of query operators with priority queues (or something more complicated) within each query

RE: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Davis, Daniel (NIH/NLM) [C]
One thing I like about SolrCloud is that I don't have to configure Master/Slave replication in each "core" the same way to get them to replicate. The other thing I like about SolrCloud, which is largely theoretical at this point, is that I don't need to test changes to a collection's

RE: Can a field be an array of fields?

2016-04-15 Thread Davis, Daniel (NIH/NLM) [C]
Short answer - JOINs, external query outside Solr, Elastic Search ;) Alternatives: * You get back an id for each document when you query on "Nino". You look up the last names in some other system that has the full list. * You index the authors in another collection and use JOINs * You

RE: Multiple data-config.xml in one collection?

2016-04-14 Thread Davis, Daniel (NIH/NLM) [C]
ent multiple dataimporthandler and do a full import, does Solr perform import of all handlers at once or can just specify which handler to import? Thank you Yangrui On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: > If Shawn is correct,

RE: Adding configset in SolrCloud via API

2016-04-08 Thread Davis, Daniel (NIH/NLM) [C]
Anshum, The project https://github.com/danizen/solr-config-tool has code that uploads a configset to ZooKeeper via API. It depends only on SolrJ. You'll want to look at the following Java classes under src/main/java: org.danizen.solrconfig.SolrConfig

RE: BYOPW in security.json

2016-04-06 Thread Davis, Daniel (NIH/NLM) [C]
I'm bordering on development post, but I want to write an Authentication Plugin that uses Proxy Authentication and a White List. So, it will accept a request header such as REMOTE_USER as the username from certain hosts, by default 127.0.0.1, ::1. I also thought about having a whitelist of IPs
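The decision logic for that kind of proxy authentication is small enough to sketch on its own. This is not a Solr authentication plugin (which would be written in Java against Solr's plugin API); it only illustrates the rule "trust the username header only when the request comes from a whitelisted host". The function and constant names are hypothetical.

```python
import ipaddress
from typing import Mapping, Optional, Sequence

# Default whitelist from the description above: IPv4 and IPv6 loopback.
TRUSTED_PROXIES = ("127.0.0.1", "::1")


def proxied_username(remote_addr: str,
                     headers: Mapping[str, str],
                     trusted: Sequence[str] = TRUSTED_PROXIES,
                     header_name: str = "REMOTE_USER") -> Optional[str]:
    """Return the header-supplied username, or None if the caller is not trusted."""
    client = ipaddress.ip_address(remote_addr)
    if not any(client == ipaddress.ip_address(host) for host in trusted):
        return None  # header could be forged by an arbitrary client
    username = headers.get(header_name, "").strip()
    return username or None
```

Comparing parsed addresses rather than raw strings avoids mismatches between different spellings of the same IPv6 address.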

RE: Multiple data-config.xml in one collection?

2016-04-05 Thread Davis, Daniel (NIH/NLM) [C]
Handler. There are many different ways to do it. -Original Message- From: Davis, Daniel (NIH/NLM) [C] Sent: Tuesday, April 05, 2016 5:40 PM To: solr-user@lucene.apache.org Subject: RE: Multiple data-config.xml in one collection? Yangrui, Solr will just do one data import. You can

RE: Multiple data-config.xml in one collection?

2016-04-05 Thread Davis, Daniel (NIH/NLM) [C]
do a full import, does Solr perform import of all handlers at once or can just specify which handler to import? Thank you Yangrui On Tuesday, April 5, 2016, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: > If Shawn is correct, and you are using DIH, then I have done thi

RE: Multiple data-config.xml in one collection?

2016-04-05 Thread Davis, Daniel (NIH/NLM) [C]
You have choices: - Use a separate collection for each data import - Use the same collection for each data import, differentiating them using a field you can query The choice depends on the objects and how they will be used, and I trust others on this list to have better advice on how to

RE: Multiple data-config.xml in one collection?

2016-04-05 Thread Davis, Daniel (NIH/NLM) [C]
If Shawn is correct, and you are using DIH, then I have done this by implementing multiple requestHandlers each of them using Data Import Handler, and have each specify a different XML file for the data config. Instead of using data-config.xml, I've used a large number of files such as:

RE: SolrCloud App Unit Testing

2016-03-20 Thread Davis, Daniel (NIH/NLM) [C]
MiniSolrCloudCluster is intended for building unit tests for cloud commands within Solr itself. What most people do to test applications based on Solr (and their Solr configurations) is to start solr either on their CI server or in the cloud (more likely the latter), and then point their

Indexing both meta-data and full content of HTML

2016-03-19 Thread Davis, Daniel (NIH/NLM) [C]
I have some XML that includes a stylesheet maintained by another organization that renders to HTML. The HTML is pretty good - it is not "structured" in RDFa or schema.org, but has classes and anchors that can be used to find some key data. So, I can probably get all the meta-data I want

RE: Making managed schema unmutable correctly?

2016-03-19 Thread Davis, Daniel (NIH/NLM) [C]
Alexandre, I just made this transition, both to SolrCloud and to managed schema. In QA and Production, you update solrconfig.xml to say the schema is not mutable: true managed-schema My workflow in development is as follows: - Start with gettingstarted configuration and

RE: Indexing both meta-data and full content of HTML

2016-03-19 Thread Davis, Daniel (NIH/NLM) [C]
So, I think I've solved my problem; it basically comes from having only done Data Import Handler with any depth. I'll simply use the extract request processing handler with some literal fields. -Original Message- From: Davis, Daniel (NIH/NLM) [C] Sent: Wednesday, March 16, 2016 11:47 AM

RE: Ping handler in SolrCloud mode

2016-03-19 Thread Davis, Daniel (NIH/NLM) [C]
Shawn Heisey wrote: > On 3/16/2016 10:11 AM, Tom Evans wrote: > > This worked, I would still be interested in a lighter-weight approach > > that doesn't involve joins to see if a given collection has a shard on > > this server. I suspect that might require a custom ping handler plugin > >

RE: Making managed schema unmutable correctly?

2016-03-18 Thread Davis, Daniel (NIH/NLM) [C]
rectly? Daniel, Thank you for the very concrete example. That is helpful. Regards, Alex. Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/ On 17 March 2016 at 08:17, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: > Alexandre,

RE: I have one small question that always intrigue me

2016-02-24 Thread Davis, Daniel (NIH/NLM) [C]
I've wondered about this as well. Recall that the proper architecture for Solr as well as ZooKeeper is as a back-end service, part of a tiered architecture, with web application servers in front. Solr and other search engines should fit in at the same layer as RDBMS and NoSQL, with the

RE: Running solr as a service vs. Running it as a process

2016-02-17 Thread Davis, Daniel (NIH/NLM) [C]
So, running solr as a service also runs it as a process. In typical Linux environments (based on initscripts), a service is a process installed to meet additional considerations: - Putting logs in predictable places where system operators and administrators expect to see logs - /var/log -

RE: Which open-source crawler to use with SolrJ and Postgresql ?

2016-02-16 Thread Davis, Daniel (NIH/NLM) [C]
I'm far, far from an expert on this sort of thing, but my personal experience a year ago was that Nutch-1 was easier to use, and the blog post I link below suggests that the abstraction layer in Nutch-2 really costs some time. I expect that Nutch-2 has matured some since then, but going with

RE: Running Solr on port 80

2016-02-11 Thread Davis, Daniel (NIH/NLM) [C]
You should edit the files installed by install_solr_service.sh - change the init.d script to pass the -p argument to ${SOLRINSTALLDIR}/bin/solr. By the way, my initscript is modified (a) to support the /etc/sysconfig/ convention, and (b) to run solr as a different user than the

RE: Solrj-collection creation

2016-02-10 Thread Davis, Daniel (NIH/NLM) [C]
Generally, creating a collection may also include uploading a zookeeper configuration: import org.apache.solr.common.cloud.SolrZkClient; import org.apache.solr.common.cloud.ZkConfigManager; import org.apache.solr.common.cloud.ZkStateReader; /* ... much later ... */ SolrZkClient zkClient =

CollectionAdminResponse and CollectionAdminRequest.List

2016-02-05 Thread Davis, Daniel (NIH/NLM) [C]
So, this makes sense: CollectionAdminResponse response = new CollectionAdminResponse(); CollectionAdminRequest.Reload request = new CollectionAdminRequest.Reload(); request.setCollectionName(collectionName); response.setResponse(client.request(request)); But for listing the

sorry, no dataimport-handler defined!

2016-02-02 Thread Davis, Daniel (NIH/NLM) [C]
It sounds a bit like you are just exploring Solr for the first time. To use the Data Import Handler, you need to create an XML file that configures it, data-config.xml by default. But before we go into details, what are you trying to accomplish with Solr? -Original Message- From:

RE: Couple of question about Virtualization and Load Balancer

2016-01-21 Thread Davis, Daniel (NIH/NLM) [C]
> The first one is about virtualization, I'd like to know if there are any > official test > on loss of performance in virtualization environment. I think that the loss of > performance is negligible, and quick question on test infrastructure is > confirming > this, but I'd like to know if there

RE: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-15 Thread Davis, Daniel (NIH/NLM) [C]
In the multi-tenant model, SolrCloud shines because the configuration directories need not include any details about the cluster. SolrCloud also shines if the number of documents and/or indexing rate requires sharding. But master-slave with replica configuration is OK if you have just a

RE: Manage schema.xml via Solrj?

2016-01-08 Thread Davis, Daniel (NIH/NLM) [C]
Bob, XY problem means that you are presenting the imagined solution without presenting the problem to solve. In other words, you are presenting X (solve for X), without a full statement of the equation to be solved for X. My guess at your problem is the same as my problem - editing Solr

RE: Testing Solr configuration, schema, and other fields

2015-12-31 Thread Davis, Daniel (NIH/NLM) [C]
r 2015 at 23:37, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: >> What is the next step you are stuck on? >> >> Regards, >>Alex > > I'm not really stuck. My question has been about the best practices. I am > trying to work against "

RE: Testing Solr configuration, schema, and other fields

2015-12-31 Thread Davis, Daniel (NIH/NLM) [C]
t step you are stuck on? Regards, Alex On 31 Dec 2015 3:09 am, "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov> wrote: > At my organization, I want to create a tool that allows users to keep a > solr configuration as a Git repository. Then, I want my Continuo

RE: Testing Solr configuration, schema, and other fields

2015-12-31 Thread Davis, Daniel (NIH/NLM) [C]
ked for orgs like that too :-( On 1 Jan 2016 12:03 am, "Davis, Daniel (NIH/NLM) [C]" <daniel.da...@nih.gov> wrote: > That's incredibly cool. Much easier than the chef/puppet scripts and > stuff I've seen. I'm certain to play with this and get under the hood; > however

RE: Testing Solr configuration, schema, and other fields

2015-12-31 Thread Davis, Daniel (NIH/NLM) [C]
in order to validate it. — Erik Hatcher, Senior Solutions Architect http://www.lucidworks.com <http://www.lucidworks.com/> > On Dec 30, 2015, at 3:09 PM, Davis, Daniel (NIH/NLM) [C] > <daniel.da...@nih.gov> wrote: > > At my organization, I want to create a tool that

Testing Solr configuration, schema, and other fields

2015-12-30 Thread Davis, Daniel (NIH/NLM) [C]
At my organization, I want to create a tool that allows users to keep a solr configuration as a Git repository. Then, I want my Continuous Integration environment to take some branch of the git repository and "publish" it into ZooKeeper/SolrCloud. Working on my own, it is only a very small

RE: Testing Solr configuration, schema, and other fields

2015-12-30 Thread Davis, Daniel (NIH/NLM) [C]
? With a VDI environment, then all you would need to do is build a new VM Node based on original setup. Then you can just roll out the node as one of the zk nodes. Just a thought on that subject. v/r, -Mark H. -Original Message- From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da

RE: Testing Solr configuration, schema, and other fields

2015-12-30 Thread Davis, Daniel (NIH/NLM) [C]
op, either stand alone or SolrCloud, _then_ check it in and/or test it on my cloud setup. So I guess the take-away is that I don't have any very good solution here. Best, Erick On Wed, Dec 30, 2015 at 1:10 PM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote: >

RE: Is DIH going to be removed from Solr future versions?

2015-12-15 Thread Davis, Daniel (NIH/NLM) [C]
I am aware of the problems with the implementation of DIH, but is there any problem with the XML driven data import capability? Could it be rewritten (using modern XPath) to run as a part of SolrJ? I've been interested in that, but I just haven't been able to shake loose the time.

RE: how to secure standalone solr

2015-12-14 Thread Davis, Daniel (NIH/NLM) [C]
Wait a second. There are other sorts of ways to secure Solr that don't work with any sort of role-based security control. What I do is place a reverse-proxy in front of Apache Solr on port 80, and have that reverse proxy use CAS authentication. I also have a list of "valid-users" who may

RE: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-04 Thread Davis, Daniel (NIH/NLM) [C]
So, I actually went to an Elasticsearch one-day conference. One person spoke about having to re-index everything because they had their field mappings wrong. I've also worked on Linked Data, RDF, where the fact that everything is a triple is supposed to make SQL schemas unneeded. The

RE: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-21 Thread Davis, Daniel (NIH/NLM) [C]
Netflix/exhibitor/wiki > > > On 10/20/15, 6:37 AM, "Davis, Daniel (NIH/NLM) [C]" > <daniel.da...@nih.gov> > wrote: > > >Waste of money in my opinion. I would point you towards other tools - > >bash scripts and free configuration managers such as

RE: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-20 Thread Davis, Daniel (NIH/NLM) [C]
Waste of money in my opinion. I would point you towards other tools - bash scripts and free configuration managers such as puppet, chef, salt, or ansible. Depending on what development you are doing, you may want a continuous integration environment. For a small company starting out,

RE: DIH parallel processing

2015-10-15 Thread Davis, Daniel (NIH/NLM) [C]
This is also what I have done, but I agree with the notion of using something external to load the data. -Original Message- From: Dyer, James [mailto:james.d...@ingramcontent.com] Sent: Thursday, October 15, 2015 9:24 AM To: solr-user@lucene.apache.org Subject: RE: DIH parallel

RE: [poll] virtualization platform for SOLR

2015-10-01 Thread Davis, Daniel (NIH/NLM) [C]
Shawn, Same answer as Bernd. We have a big VmWare vCenter setup and Netapp. That's what we have to use. Even in a VM world, some advice persists - "local" disk is faster than network disk even if the "local" disk is virtual. Netapp disk is exported to VmWare vCenter over

RE: Passing Basic Auth info to HttpSolrClient

2015-09-30 Thread Davis, Daniel (NIH/NLM) [C]
HttpSolrClient can accept the Apache Commons HttpClient in its constructor: https://lucene.apache.org/solr/5_3_1/solr-solrj/org/apache/solr/client/solrj/impl/HttpSolrClient.html You can use the HttpClientBuilder
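Whatever client is used, Basic Auth comes down to sending an `Authorization` header with each HTTP request to Solr. The following is a minimal Python sketch of that idea using only the standard library; it is not SolrJ and not the approach described in the message above. The function name `build_solr_request`, the base URL, the collection name, and the credentials are all hypothetical placeholders. The request is built but never sent.

```python
import base64
import urllib.parse
import urllib.request


def build_solr_request(base_url, collection, query, user, password):
    """Build (but do not send) a Solr /select request carrying Basic Auth.

    All arguments are illustrative; nothing here is taken from the thread.
    """
    # Encode the query parameters for the standard /select handler.
    params = urllib.parse.urlencode({"q": query, "wt": "json"})
    url = f"{base_url.rstrip('/')}/{collection}/select?{params}"

    # Basic Auth is "user:password", base64-encoded, in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Hypothetical host, collection, and credentials.
    req = build_solr_request("http://localhost:8983/solr", "demo", "*:*", "solr", "secret")
    print(req.full_url)
    print(req.get_header("Authorization"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Because Basic Auth credentials are only base64-encoded, not encrypted, this is normally done over HTTPS.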

RE: Securing solr 5.2 basic auth permission rules

2015-09-17 Thread Davis, Daniel (NIH/NLM) [C]
I had a similar problem attempting to use JNDI when the Jetty included with Solr does not include jetty-plus... I'd like to second the suggestion to include more of jetty. In my case, there was a better solution - I just wrote a JDBC driver to wrap each driverClass I needed (Oracle, MySQL,

RE: How to secure Admin UI with Basic Auth in Solr 5.3.x

2015-09-11 Thread Davis, Daniel (NIH/NLM) [C]
The authorization plugin is new in Solr 5.3. It is hard to describe a secure Solr 5.2.1 environment simply - the basics are to protect /solr by placing it behind Apache httpd or nginx, and also a port-based firewall. I am most familiar with Apache httpd and Linux/RedHat family. Within the
