RE: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Khare, Kushal (MIND)
Okay. I would do that too ! Thanks for sharing the link, I am already doing that. Basically, I am parsing the documents to handle the content that I need to store in the fields and then index them to my core in the Solr Server. With the code below, I am able to deal with Word & Excel file format

Re: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Jörn Franke
Your question is more suitable for the Tika mailing list - it is better if you ask there. You should share more code on what you are currently doing. Here is the documentation on how to get a different output format: https://tika.apache.org/1.8/examples.html#Parsing_using_the_Auto-Detect_Parser

RE: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Khare, Kushal (MIND)
I had already spent a lot of time reading about this on the internet, and only posted my query here after exhausting all the trials and solutions I could find. I know time zones are different and you people are busy; I totally understand that and highly appreciate your efforts! Regarding my file for

Re: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Jörn Franke
People here are in different time zones and have their normal jobs, for which they are actually paid, alongside answering questions such as the one below. There is also a wide range of resources out on the Internet. It also cannot hurt to read more about the formats that you are processing and

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Noble Paul
There are known perf issues in computing very large clusters. Give it a try with the following rules: "FOO_CUSTOMER":[ { "replica":"0", "sysprop.HELM_CHART":"!FOO_CUSTOMER", "strict":"true"}, { "replica":"<2", "node":"#ANY", "strict":"fals

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Mark Miller
Hook up a profiler to the overseer and see what it's doing, file a JIRA and note the hotspots or what methods appear to be hanging out. On Tue, Sep 3, 2019 at 1:15 PM Andrew Kettmann wrote: > > > You’re going to want to start by having more than 3gb for memory in my > opinion but the rest of you

Re: Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Khare, Kushal (MIND)
Guys, could I get any help? Or is it useless posting queries over here? On Sep 3, 2019 4:00 PM, "Khare, Kushal (MIND)" wrote: Hello, mates! I am extracting content from my documents using Apache Tika. I need to exclude the headers & footers of the documents. I have already done this for Word

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Shawn Heisey
On 9/3/2019 4:46 PM, Russell Bahr wrote: Hi Shawn, Here is a screenshot of one of the master nodes solr4 Screen Shot 2019-09-03 at 3.37.08 PM.png solr8 Screen Shot 2019-09-03 at 3.45.46 PM.png Email attachments do not make it to the list. I cannot see those pictures. You will need to use a

Re: SolrClient from inside processAdd function

2019-09-03 Thread Erick Erickson
This really sounds like an XY problem. What do you need the SolrClient _for_? I suspect there’s an easier way to do this….. Best, Erick > On Sep 3, 2019, at 6:17 PM, Arnold Bronley wrote: > > Hi, > > Is there a way to create SolrClient from inside processAdd function for > custom update proce

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Russell Bahr
Hi Shawn, Here is a screenshot of one of the master nodes solr4 [image: Screen Shot 2019-09-03 at 3.37.08 PM.png] solr8 [image: Screen Shot 2019-09-03 at 3.45.46 PM.png] *Manzama*a MODERN GOVERNANCE company Russell Bahr Lead Infrastructure Engineer USA & CAN Office: +1 (541) 306 3271 USA & CAN

SolrClient from inside processAdd function

2019-09-03 Thread Arnold Bronley
Hi, Is there a way to create SolrClient from inside processAdd function for custom update processor for the same Solr on which it is executing?

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Shawn Heisey
On 9/3/2019 1:22 PM, Russell Bahr wrote: Yes, some of our queries are quite complex due to a lot of very specific positive as well as negative boosts, however, the query that I ran as the base test after we found our queries were taking so long is just " http://solr.obscured.com:8990/solr/content

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Alexandre Rafalovitch
What about combining: 1) KeywordRepeatFilterFactory 2) An existing folding filter (need to check it ignores Keyword marked word) 3) RemoveDuplicatesTokenFilterFactory That may give what you are after without custom coding. Regards, Alex. On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld - audrey
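The three-step chain suggested above can be pictured with a small simulation in plain Python. This is only a sketch under assumptions, not Solr configuration: `fold` stands in for whichever folding filter is chosen, and `fold` and `analyze` are hypothetical names rather than Solr APIs. It shows the intended outcome, where each accented token is indexed at the same position both in its original form and in folded form, and unaccented tokens are not duplicated.

```python
import unicodedata


def fold(token):
    """Strip combining accent marks (a stand-in for a Solr folding filter)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Emit (position, token) pairs: the original token plus its folded copy."""
    output = []
    for position, token in enumerate(text.lower().split()):
        # Step 1 (keyword repeat): keep the original and add a second copy.
        # Step 2 (folding): only the second copy is folded.
        candidates = [token, fold(token)]
        # Step 3 (remove duplicates): identical tokens at one position collapse.
        seen = set()
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                output.append((position, candidate))
    return output
```

With this sketch, `analyze("Café menu")` yields `café` and `cafe` at position 0 and a single `menu` at position 1. Whether a given Solr folding filter really leaves keyword-marked tokens untouched is the open question raised in the message and still needs checking.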

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Thank you! That makes a lot of sense. In other news -- we just had a meeting where we decided to try out a hybrid strategy. I'd love to know what you & everyone else thinks... - Since we are concerned with the overhead created by "double-fielding" all tokens per language (because I'm not

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Russell Bahr
Hi Toke, Also, if it helps, the content on each server is between roughly 6.2 GB and 7.8 GB. Thanks, Russ

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Russell Bahr
Hi Toke, Yes, some of our queries are quite complex due to a lot of very specific positive as well as negative boosts, however, the query that I ran as the base test after we found our queries were taking so long is just " http://solr.obscured.com:8990/solr/content/select?q=*%3A*&wt=json&indent=tr

Re: subscription request

2019-09-03 Thread Shawn Heisey
On 9/3/2019 11:47 AM, dev beautiful wrote: I want to subscribe solr mailing list. When I sent a request, I got the following message. Can you add this email address to the mailing list please? Thank you. Louis Choi --- This is the mail system at host n3.nabble.com. Nabble is a website th

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Do you find that searching over both the original title field and the > normalized title > field increases the time it takes for your search engine to retrieve results? It is not something we have measured as that index is fast enough (which

subscription request

2019-09-03 Thread dev beautiful
Hello, I want to subscribe to the Solr mailing list. When I sent a request, I got the following message. Can you add this email address to the mailing list please? Thank you. Louis Choi --- This is the mail system at host n3.nabble.com. I'm sorry to have to inform you that your message could not

Re: Custom TokenizerFactory that reads configuration file from zookeeper

2019-09-03 Thread ron.buchanan
ResourceLoader worked brilliantly - my brain, on the other hand, not so much

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann
> You’re going to want to start by having more than 3gb for memory in my > opinion but the rest of your set up is more complex than I’ve dealt with. right now the overseer is set to a max heap of 3GB, but is only using ~260MB of heap, so memory doesn't seem to be the issue unless there is a par

Re: upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Toke Eskildsen
Russell Bahr wrote: > approximately 18 million documents > *:* query across 10 times returning > [13234, 18714, 13384, 12966, 12192, 18420, 16592, 15691, 13373, 12458] >vs > [93359, 94263, 86949, 90747, 91171, 91588, 87921, 88632, 88035, 89137] Even the 12-18 seconds for Solr 4 is a long time, so
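To put a number on the two series quoted above, a short calculation helps. This sketch assumes the values are milliseconds (the reply reads the first series as 12-18 seconds) and that the second series is the Solr 8 cluster; the names `solr4_ms`, `solr8_ms` and `slowdown` are made up for illustration.

```python
from statistics import mean

# Response times quoted in the message above, taken to be milliseconds.
solr4_ms = [13234, 18714, 13384, 12966, 12192, 18420, 16592, 15691, 13373, 12458]
# Assumed to be the Solr 8 cluster, the other side of the "vs".
solr8_ms = [93359, 94263, 86949, 90747, 91171, 91588, 87921, 88632, 88035, 89137]


def slowdown(baseline, candidate):
    """Return how many times slower the candidate series is on average."""
    return mean(candidate) / mean(baseline)
```

Under those assumptions the averages come out at about 14.7 seconds and 90.2 seconds, so `slowdown(solr4_ms, solr8_ms)` is roughly 6.1, inside the "4 to 10 times" range in the thread title.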

upgrading from solr4 to solr8 searches taking 4 to 10 times as long to return

2019-09-03 Thread Russell Bahr
Hi, I am trying to replace our solr4 cluster with a solr 8.1.1 cluster and am running into a problem where searches are taking way too long to respond. The clusters are set up with the same number of servers, same number of shards, and same number of replicas. They are indexing the same documents,

Query terms and the match state

2019-09-03 Thread Kumaresh AK
Hello Solr Community! *Problem*: I wish to know whether a result document matched all the terms in the query. The ranking used in Solr works most of the time. In some cases, however, one of the terms is rare and occurs in a couple of fields; such documents trump a document that matches all the terms. Idea

Re: SOLR Cloud: Few cores goes to recovery mode all of a sudden

2019-09-03 Thread Doss
Thanks Erick, the ulimit values on all three nodes are above 65K, including the max process limit. If you look at the timestamps, the core down error happened ahead of the unable to create thread error, and the core down error took place on node1 while the unable to create thread error took place on node3. BTW we are r

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Dave
You’re going to want to start by having more than 3 GB for memory in my opinion, but the rest of your setup is more complex than I’ve dealt with. On Sep 3, 2019, at 1:10 PM, Andrew Kettmann wrote: >> How many zookeepers do you have? How many collections? What is their size? >> How much CPU / m

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann
> How many zookeepers do you have? How many collections? What is there size? > How much CPU / memory do you give per container? How much heap in comparison > to total memory of the container ? 3 Zookeepers. 733 containers/nodes 735 total cores. Each core ranges from ~4-10GB of index. (Autoscaling

Custom TokenizerFactory that reads configuration file from zookeeper

2019-09-03 Thread ron.buchanan
I'm working on a custom tokenizer (Solr 7.3.0) whose Factory needs to read a configuration file. I have been able to run it successfully in my local reading from a local directory. I would like to be able to have the configuration read from zookeeper (similarly to how SynonymGraphFilterFactory re

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Jörn Franke
How many zookeepers do you have? How many collections? What is their size? How much CPU / memory do you give per container? How much heap in comparison to total memory of the container? > On 03.09.2019 at 17:49, Andrew Kettmann wrote: > > Currently our 7.7.2 cluster has ~600 hosts and each co

Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Andrew Kettmann
Currently our 7.7.2 cluster has ~600 hosts and each collection is using an autoscaling policy based on system property. Our goal is a single core per host (container, running on K8S). However as we have rolled more containers/collections into the cluster any creation/move actions are taking a h

Re: Query regarding Solr Cloud Setup

2019-09-03 Thread Jörn Franke
If you have a properly secured cluster, e.g. with Kerberos, then you should not update files in ZK directly. Use the corresponding Solr REST interfaces; then you are also less likely to mess something up. If you want to have HA you should have at least 3 Solr nodes and replicate the collection to all t

Re: Index fetch failed

2019-09-03 Thread Erick Erickson
Shankar: Two things: 1> please do not hijack threads 2> Follow the instructions here: http://lucene.apache.org/solr/community.html#mailing-lists-irc. You must use the _exact_ same e-mail as you used to subscribe. If the initial try doesn't work and following the suggestions at the "problems"

Re: SOLR Cloud: Few cores goes to recovery mode all of a sudden

2019-09-03 Thread Erick Erickson
The “unable to create new thread” is where I’d focus first. It means you’re running out of some system resources and it’s quite possible that your other problems are arising from that root cause. What are your “ulimit” settings? The number of file handles and processes should be set to 65k at le
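One way to check the limits mentioned above from the account that runs Solr is Python's standard `resource` module (Unix only). This is a minimal sketch: the function names are hypothetical, and the default threshold of 65000 is an assumed reading of the "65k" figure in the message.

```python
import resource


def current_limits():
    """Return the (soft, hard) limits for open files and processes."""
    limits = {}
    for label, name in (("open files", "RLIMIT_NOFILE"), ("processes", "RLIMIT_NPROC")):
        # RLIMIT_NPROC is not defined on every platform, so look it up defensively.
        constant = getattr(resource, name, None)
        if constant is not None:
            limits[label] = resource.getrlimit(constant)
    return limits


def below_threshold(limits, threshold=65000):
    """List the limits whose soft value is under the threshold."""
    return [
        label
        for label, (soft, _hard) in limits.items()
        if soft != resource.RLIM_INFINITY and soft < threshold
    ]
```

Calling `below_threshold(current_limits())` as the Solr user lists any limit that sits under the assumed threshold. The limits are per process and inherited, so the check only says something about Solr when it is run in the same environment that starts Solr.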

Re: Query regarding Solr Cloud Setup

2019-09-03 Thread Erick Erickson
Having custom core.properties files is “fraught”. First of all, that file can be re-written. Second, the collections ADDREPLICA command will create a new core.properties file. Third, any mistakes you make when hand-editing the file can have grave consequences. What change exactly do you want to

Re: Query regarding Solr Cloud Setup

2019-09-03 Thread Shawn Heisey
On 9/3/2019 7:22 AM, Porritt, Ian wrote: We have a schema which I have managed to upload to Zookeeper along with the Solrconfig, how do I get the system to recognise both a lib/.jar extension and a custom core.properties file? I bypassed the issue of the core.properties by amending the update.a

SOLR Cloud: Few cores goes to recovery mode all of a sudden

2019-09-03 Thread Doss
Hi, We are using a 3-node SOLR (7.0.1) cloud setup with a 1-node ZooKeeper ensemble. Each system has 16 CPUs, 90GB RAM (14GB heap), 130 cores (3 replicas, NRT) with index sizes ranging from 700MB to 20GB. autoCommit: once every 10 minutes; softCommit: once every 30 seconds. We are facing the following problems in recent ti

Re: Problems with restricting access to users using Basic auth

2019-09-03 Thread Jason Gerlowski
Yeah, it beats me. If you've made sure that the security.json in ZooKeeper is exactly the same as the one I posted but you're still getting different results, then I'm stumped. Maybe someone else here has an idea. Out of curiosity, are you setting your security.json via the authentication/author

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke, Do you find that searching over both the original title field and the normalized title field increases the time it takes for your search engine to retrieve results? -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf

Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Languages are the best. Thank you all so much! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 4:09 PM, "Walter Underwood" wrote: The right transliteration for accents is language-dependen

Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Erick! -- Audrey Lorberfeld Data Scientist, w3 Search Digital Workplace Engineering CIO, Finance and Operations IBM audrey.lorberf...@ibm.com On 8/30/19, 3:49 PM, "Erick Erickson" wrote: It Depends (tm). In this case on how sophisticated/precise your users are. If your users

Query regarding Solr Cloud Setup

2019-09-03 Thread Porritt, Ian
Hi, I am relatively new to Solr, especially Solr Cloud, and have been using it for a few days now. I think I have set up Solr Cloud correctly; however, I would like some guidance to ensure I am doing it correctly. I ideally want to be able to process 40 million documents on production via Solr Cloud.

Re: Idle Timeout while DIH indexing and implicit sharding in 7.4

2019-09-03 Thread Mikhail Khludnev
Tracked https://issues.apache.org/jira/browse/SOLR-13735 patches are welcome. On Mon, Sep 2, 2019 at 12:39 PM Vadim Ivanov < vadim.iva...@spb.ntk-intourist.ru> wrote: > Timeout causes DIH to finish with error message. So, If I check DIH > response to be sure > that DIH session have finished wit

Hackday in DC next Tuesday

2019-09-03 Thread Charlie Hull
Hi all, If you're in town for Activate next week, we're running another free Lucene Hackday on Tuesday: https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/263993681/ - do come along if you can! It's only a block and a half from the Activate venue. Cheers Charlie -- Charlie

Re: Problems with restricting access to users using Basic auth

2019-09-03 Thread Salmaan Rashid Syed
Hi Jason, Apologies for the late reply. My laptop was broken and I got it back from the service centre today. I am still having issues with solr-user being able to view the Collections list, as follows. Testing permissions for user [solr] Request [/admin/collections?action=LIST] returned status [200] Req

Re: example for "parse-date" processor

2019-09-03 Thread Arturas Mazeika
Hi Jörn, I am not supplying the name in the update chain. I am not sure pysolr supports it yet: def __init__( self, url, decoder=None, timeout=60, results_cls=Results, search_handler="select", use_qt_param=False, always_commit=False, auth=None, verify=True, ): How can I define it as default? Ch

Re: example for "parse-date" processor

2019-09-03 Thread Jörn Franke
How do you send the request? You need to specify the update.chain parameter with the name of the update chain, or define the chain as the default. > On 03.09.2019 at 12:14, Arturas Mazeika wrote: > > Hi Solr Fans, > > I am trying to figure out how to use the parse-date processor for pdates. > > I am abl
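As a rough illustration of the first option, the `update.chain` request parameter can be appended to the update URL. The sketch below only builds the URL with the standard library; the helper name `update_url`, the host `solr.example.org:8983`, the core name `mycore` and the chain name `parse-date-chain` are all placeholders, not values from this thread.

```python
from urllib.parse import urlencode


def update_url(core_url, chain, commit=True):
    """Build an update URL that names the update request processor chain."""
    params = {"update.chain": chain}
    if commit:
        # Ask Solr to commit as part of this update request.
        params["commit"] = "true"
    return f"{core_url.rstrip('/')}/update?{urlencode(params)}"
```

For example, `update_url("http://solr.example.org:8983/solr/mycore", "parse-date-chain")` gives `http://solr.example.org:8983/solr/mycore/update?update.chain=parse-date-chain&commit=true`. The documents would then be sent to that URL by whatever HTTP client is in use.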

Skip Headers & Footers while text extraction using Apache Tika parsing for PPT & PDF formats

2019-09-03 Thread Khare, Kushal (MIND)
Hello, mates! I am extracting content from my documents using Apache Tika. I need to exclude the headers & footers of the documents. I have already done this for Word & Excel formats using OfficeParserConfig, but need to implement the same for PPT & PDF. How can I achieve that?

example for "parse-date" processor

2019-09-03 Thread Arturas Mazeika
Hi Solr Fans, I am trying to figure out how to use the parse-date processor for pdates. I am able to insert data with this python code to a solr collection/core: solr = pysolr.Solr('http://localhost:/solr/core1', timeout=10) solr.add([ { "t": '2017-08-19T21:00:42.043Z', } ])

Re: Index fetch failed

2019-09-03 Thread Shankar Ramalingam
Please remove my email id from this list. On Tue, 3 Sep, 2019, 11:06 AM Akreeti Agarwal, wrote: > Hello, > > Please help me with the solution for below error. > > Memory details of slave server: > total used free sharedbuffers cached > Mem: 15947 15460