RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
Hi Alessandro, The docker image is like a disk image of the entire server, so it includes the operating system, the Solr installation and the data. Because we run in the cloud and our index isn't that big, this is an easy and fast way for us to scale our Solr cluster without having to

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread David Hastings
To piggy back on this, what would be the right scenarios to use docvalues='true'? On Tue, Feb 13, 2018 at 1:10 PM, Chris Hostetter wrote: > > : We are using Solr 7.1.0 to index a database of addresses. We have found > : that our index size increases massively when we

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Chris Hostetter
: We are using Solr 7.1.0 to index a database of addresses. We have found : that our index size increases massively when we add one extra field to : the index, even though that field is stored and not indexed, and doesn’t what about docValues? : When we run an index load without the
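The docValues suggestion above corresponds to a schema declaration along these lines (field name hypothetical; a sketch, not the poster's actual schema). A field that is never searched can still carry docValues for sorting, faceting, or /export without contributing to the inverted index:

```xml
<!-- Hypothetical address field: not indexed (no inverted-index terms),
     but retrievable and usable for sorting/faceting via the columnar
     docValues files. -->
<field name="delivery_point" type="string" indexed="false" stored="true" docValues="true"/>
```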

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Erick Erickson
David: Right, Optimize Is Evil. Well, actually in your case it's not. In your specific case you can optimize every time you build your index and be OK, gory details here: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ But that's just for background. The key


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
Hi David, given the fact that you are actually building a new index from scratch, my shot in the dark didn't hit any target. When you say : "Once the import finishes we save the docker image in the AWS docker repository. We then build our cluster using that image as the base" Do you mean just

RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Howe, David
(it only changes once a fortnight). Once the import finishes we save the docker image in the AWS docker repository. We then build our cluster using that image as the base. So we never re-index an existing index, we just build another one from scratch. We haven't configured anything special

Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-13 Thread Alessandro Benedetti
I assume you re-index in full, right? My shot in the dark is that this increase is temporary. You re-index, so you effectively delete and add all documents (this means that even if the new field is just stored, you re-build the entire index for all the fields), creating new segments, and the old docs

Re: index fields with custom gaps between terms

2017-12-19 Thread Shawn Heisey
On 12/19/2017 4:16 AM, Amin Raeiszadeh wrote: I solved this problem by modifying the DocumentBuilder.toDocument() and DocumentBuilder.addField() functions. I don't use the multiValued feature; in the schema I changed the check for multi-valued fields (skipping it in the code), then in the necessary

Re: index fields with custom gaps between terms

2017-12-19 Thread Amin Raeiszadeh
I solved this problem by modifying the DocumentBuilder.toDocument() and DocumentBuilder.addField() functions. I don't use the multiValued feature; in the schema I changed the check for multi-valued fields (skipping it in the code), then at the necessary position I put my custom gap as a value for the TextField.

Re: index fields with custom gaps between terms

2017-12-18 Thread Amin Raeiszadeh
Shawn, I think your way is good. I will study it more. Thanks, Amin On Tue, Dec 19, 2017 at 9:57 AM, Amin Raeiszadeh wrote: > Erick, in your example if the first entry contains 10 terms then I need to > start the second entry's position from 100, not from 110. > thanks,

Re: index fields with custom gaps between terms

2017-12-18 Thread Amin Raeiszadeh
Erick, in your example if the first entry contains 10 terms then I need to start the second entry's position from 100, not from 110. thanks, Amin On Tue, Dec 19, 2017 at 3:25 AM, Shawn Heisey wrote: > On 12/18/2017 12:29 AM, Amin Raeiszadeh wrote: >> thanks too much Erick and

Re: index fields with custom gaps between terms

2017-12-18 Thread Shawn Heisey
On 12/18/2017 12:29 AM, Amin Raeiszadeh wrote: > Thanks so much, Erick and Mikhail. > I changed the SloppyPhraseScorer class for custom behavior with some fields, > so I need to index some fields with a customized gap between the terms of those fields. > I'm not proficient with Solr and I think with schema.xml

Re: index fields with custom gaps between terms

2017-12-18 Thread Erick Erickson
You probably are aware of this already, but I want to be sure. positionIncrementGap is _only_ applied between the last term of one multiValued entry and the first term of the next. So say I have a text field and the input looks like: some stuff other words and my positionIncrementGap is
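Erick's point about the gap applying only between multiValued entries corresponds to a field type like the following sketch (names are illustrative, not from the thread):

```xml
<!-- A 100-position gap is inserted between the last token of one value
     and the first token of the next, so phrase queries do not match
     across value boundaries. -->
<fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="body" type="text_gap" indexed="true" stored="true" multiValued="true"/>
```

Within a single value, token positions still advance by the analyzer's normal increments; the configured gap only separates one value from the next.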

Re: index fields with custom gaps between terms

2017-12-17 Thread Amin Raeiszadeh
Thanks so much, Erick and Mikhail. I changed the SloppyPhraseScorer class for custom behavior with some fields, so I need to index some fields with a customized gap between the terms of those fields. I'm not proficient with Solr and I think with schema.xml I can only set a fixed gap increment between terms of

Re: index fields with custom gaps between terms

2017-12-17 Thread Mikhail Khludnev
On Sun, Dec 17, 2017 at 11:16 AM, Amin Raeiszadeh wrote: > thanks for your guidance Mikhail. > with multiple values i can only set a static positionIncrementGap but > considering my description i need a dynamic gap between terms and > i don't know how to do it. > There is a

Re: index fields with custom gaps between terms

2017-12-17 Thread Erick Erickson
You might be able to do something with PreAnalyzedField, but I confess I've never really dug into it. Best, Erick On Sun, Dec 17, 2017 at 12:16 AM, Amin Raeiszadeh wrote: > thanks for your guides Mikhail. > in multiple values i can only set static positionIncrementGap

Re: index fields with custom gaps between terms

2017-12-17 Thread Amin Raeiszadeh
Thanks for your guidance, Mikhail. With multiple values I can only set a static positionIncrementGap, but considering my description I need a dynamic gap between terms and I don't know how to do it. I can only pass a String value for fields like this: SolrInputDocument sDoc = new SolrInputDocument();

Re: index fields with custom gaps between terms

2017-12-16 Thread Mikhail Khludnev
You can assign multiple values to text field and leverage positionIncrementGap https://lucene.apache.org/solr/guide/6_6/field-type-definitions-and-properties.html#general-properties And why wouldn't you use your Lucene plugin in Solr? On Sun, Dec 17, 2017 at 8:45 AM, Amin Raeiszadeh

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Natarajan, Rajeswari
Thanks a lot for the response. We did not change schema or config. We simply opened 4.5 indexes with 4.10 libraries. Thank you, Rajeswari On 12/7/17, 3:17 PM, "Shawn Heisey" wrote: On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1

Re: Index size optimization between 4.5.1 and 4.10.4 Solr

2017-12-07 Thread Shawn Heisey
On 12/7/2017 1:27 PM, Natarajan, Rajeswari wrote: > We have upgraded solr from 4.5.1 to 4.10.4 and we see index size reduction. > Trying to see if any optimization done to decrease the index sizes , couldn’t > locate. If anyone knows why please share. Here's a history where you can see the a

Re: Index Content Removing the HTML Tags.

2017-12-04 Thread Erick Erickson
Have you tried: HtmlStripCharFilterFactory? On Mon, Dec 4, 2017 at 12:37 PM, Fiz Newyorker wrote: > Hello Solr Group, > > Good Morning ! > > I am working on Solr 6.5 version and I am trying to Index from Mongo DB > 3.2.5. > > I have content collection in mongodb where there
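For reference, wiring the char filter Erick mentions (the class is solr.HTMLStripCharFilterFactory) into an analyzer chain looks roughly like this sketch (type name hypothetical):

```xml
<!-- The char filter removes HTML/XML markup from the character stream
     before tokenization, so tags are never indexed as terms. -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```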

Re: Index Message-ID from EML file to Solr

2017-11-16 Thread Zheng Lin Edwin Yeo
Hi, Just to check, is this feature available in Solr 6.5.1? Or is it only available in Solr 7? Regards, Edwin On 10 November 2017 at 19:45, Zheng Lin Edwin Yeo wrote: > Hi, > > Can we index the Message-ID that is from the EML file into Solr? > Tika does have the

Re: Index time boosting

2017-11-14 Thread Erick Erickson
Do not use index-time boosting, please. When something is deprecated, the usual process is that the functionality is supported for one major version after deprecation, then the devs are free to remove it. Index-time boosting is not supported in 7.0 even though it is in 6.x; from CHANGES.txt, the

Re: Index time boosting

2017-11-14 Thread Venkateswarlu Bommineni
Thanks for the reply, Amrit. I have the Solr 6.6 source code and I can still see the code that sets the index-level boost value. If the class name is handy for you, could you please tell me where the score of a document is calculated, so that I can just go through the code? Thanks, Venkat. On

Re: Index time boosting

2017-11-14 Thread Amrit Sarkar
Hi Venkat, FYI: Index time boosting has been deprecated from latest versions of Solr: https://issues.apache.org/jira/browse/LUCENE-6819. Not sure which version you are on, but best consider the comments on the JIRA before using it. Amrit Sarkar Search Engineer Lucidworks, Inc. 415-589-9269

Re: Index relational database

2017-08-31 Thread Erick Erickson
To pile on here: When you denormalize you also get some functionality that you do not get with Solr joins, they've been called "pseudo joins" in Solr for a reason. If you just use the simple approach of indexing the two tables then joining across them you can't return fields from both tables in a

Re: Index relational database

2017-08-31 Thread Walter Underwood
There is no way tell which is faster without trying it. Query speed depends on the size of the data (rows), the complexity of the join, which database, what kind of disk, etc. Solr speed depends on the size of the documents, the complexity of your analysis chains, what kind of disk, how much

Re: Index relational database

2017-08-31 Thread David Hastings
When indexing a relational database, it's generally best to denormalize it in a view or in your indexing code. On Thu, Aug 31, 2017 at 3:54 AM, Renuka Srishti wrote: > Thanks Erick, Walter > But I think join query will reduce the performance. Denormalization will

Re: Index relational database

2017-08-31 Thread Renuka Srishti
Thank all for sharing your thoughts :) On Thu, Aug 31, 2017 at 5:28 PM, Susheel Kumar wrote: > Yes, if you can avoid join and work with flat/denormalized structure then > that's the best. > > On Thu, Aug 31, 2017 at 3:54 AM, Renuka Srishti < > renuka.srisht...@gmail.com>

Re: Index relational database

2017-08-31 Thread Susheel Kumar
Yes, if you can avoid join and work with flat/denormalized structure then that's the best. On Thu, Aug 31, 2017 at 3:54 AM, Renuka Srishti wrote: > Thanks Erick, Walter > But I think join query will reduce the performance. Denormalization will be > the better way

Re: Index relational database

2017-08-31 Thread Renuka Srishti
Thanks Erick, Walter. But I think a join query will reduce performance; denormalization will be a better way than a join query, am I right? On Wed, Aug 30, 2017 at 10:18 PM, Walter Underwood wrote: > Think about making a denormalized view, with all the fields needed in

Re: Index relational database

2017-08-30 Thread Walter Underwood
Think about making a denormalized view, with all the fields needed in one table. That view gets sent to Solr. Each row is a Solr document. It could be implemented as a view or as SQL, but that is a useful mental model for people starting from a relational background. wunder Walter Underwood

Re: Index relational database

2017-08-30 Thread Erick Erickson
First, it's often best, by far, to denormalize the data in your solr index, that's what I'd explore first. If you can't do that, the join query parser might work for you. On Aug 30, 2017 4:49 AM, "Renuka Srishti" wrote: > Thanks Susheel for your response. > Here is

Re: Index relational database

2017-08-30 Thread Renuka Srishti
Thanks Susheel for your response. Here is the scenario I am talking about: - Let's suppose there are two documents, doc1 and doc2. - I want to fetch the data from doc2 on the basis of the doc1 fields which are related to doc2. How can I achieve this efficiently? Thanks, Renuka Srishti

Re: Index relational database

2017-08-28 Thread Susheel Kumar
Hello Renuka, I would suggest starting with your use case(s). Maybe start with your first use case and the questions below: a) What is it that you want to search (which fields, like name, desc, city etc.)? b) What is it that you want to show as part of the search result (name, city etc.)? Based on the above two

Re: index version - replicable versus searching

2017-07-25 Thread Erick Erickson
Ronald: Actually, people generally don't search on the master ;). The idea is that the master is configured for heavy indexing and people search on the slaves, which are configured for heavy query loads (e.g. memory, autowarming, whatever may be different). Which is its own problem, since the time

RE: index version - replicable versus searching

2017-07-25 Thread Stanonik, Ronald
Bingo! Right on both counts! openSearcher was false. When I changed it to true, I could see that master(searching) and master(replicable) both changed. And autoCommit maxTime is causing a commit on the master. Who uses master(replicable)? It seems for my simple master/slave
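The openSearcher behavior discussed in this thread is set in solrconfig.xml; a typical master-side sketch (the interval is illustrative) that commits without opening a searcher:

```xml
<!-- Hard commits flush segments to disk (making them replicable) every
     60 seconds, but openSearcher=false means the master itself does not
     expose the newly committed documents to its own searches. -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```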

Re: index version - replicable versus searching

2017-07-24 Thread Erick Erickson
Actually, I'm surprised that the slave returns the new document and I suspect that there's actually a commit on the master, but no new searcher is being opened. On replication, the slave copies all _closed_ segments from the master whether or not they have been opened for searching. Hmmm, a

Re: index new discovered fileds of different types

2017-07-10 Thread Jan Høydahl
I think Thaer's answer clarifies how they do it. So at the time they assemble the full Solr doc to index, there may be a new field name not known in advance, but to my understanding the RDF source contains information on the type (else they could not do the mapping to a dynamic field either), and so

Re: index new discovered fileds of different types

2017-07-10 Thread Thaer Sammar
Hi Rick, yes, the RDF structure has subject, predicate and object. The object data type is not only text; it can be integer or double as well, or other data types. The structure of our Solr document doesn't contain only these three fields: we compose one document per subject and we use all found

Re: index new discovered fileds of different types

2017-07-09 Thread Rick Leir
Jan I hope this is not off-topic, but I am curious: if you do not use the three fields, subject, predicate, and object for indexing RDF then what is your algorithm? Maybe document nesting is appropriate for this? cheers -- Rick On 2017-07-09 05:52 PM, Jan Høydahl wrote: Hi, I have

Re: index new discovered fileds of different types

2017-07-09 Thread Jan Høydahl
Hi, I have personally written a Python script to parse RDF files into an in-memory graph structure and then pull data from that structure to index to Solr. I.e. you may perfectly well have RDF (nt, turtle, whatever) as source but index sub structures in very specific ways. Anyway, as Erick

Re: index new discovered fileds of different types

2017-07-07 Thread Rick Leir
Thaer Whoa, hold everything! You said RDF, meaning Resource Description Framework? If so, you have exactly three fields: subject, predicate, and object. Maybe they are text type, or for exact matches you might want string fields. Add an ID field, which could be automatically generated by Solr,

Re: index new discovered fileds of different types

2017-07-07 Thread Erick Erickson
I'd recommend "managed schema" rather than schemaless. They're related but distinct. The problem is that schemaless makes assumptions based on the first field it finds. So if it finds a field with a "1" in it, it guesses "int". That'll break if the next doc has a 1.0 since it doesn't parse to an

Re: index new discovered fileds of different types

2017-07-07 Thread Thaer Sammar
Hi Jan, thanks! I am exploring the schemaless option based on Furkan's suggestion. I need the flexibility because not all fields are known. We get the data from an RDF database (which changes continuously). To be more specific, we have a database and all changes on it are sent to a Kafka queue.

Re: index new discovered fileds of different types

2017-07-07 Thread Jan Høydahl
If you do not need the flexibility of dynamic fields, don’t use them. Sounds to me that you really want a field “price” to be float and a field “birthdate” to be of type date etc. If so, simply create your schema (either manually, through Schema API or using schemaless) up front and index each
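Jan's suggestion amounts to declaring explicit typed fields up front, e.g. (a sketch; the field names come from his examples, and type names follow recent default configsets — older versions use names like tfloat/tdate):

```xml
<!-- Explicitly typed fields avoid schemaless type-guessing entirely. -->
<field name="price" type="pfloat" indexed="true" stored="true"/>
<field name="birthdate" type="pdate" indexed="true" stored="true"/>
```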

Re: index new discovered fileds of different types

2017-07-05 Thread Erick Erickson
I really have no idea what "to ignore the prefix and check of the type" means. When? How? Can you give an example of inputs and outputs? You might want to review: https://wiki.apache.org/solr/UsingMailingLists And to add to what Furkan mentioned, in addition to schemaless you can use "managed

Re: index new discovered fileds of different types

2017-07-05 Thread Thaer Sammar
Hi Furkan, no; in the schema we also defined some static fields, such as uri and a geo field. On 5 July 2017 at 17:07, Furkan KAMACI wrote: > Hi Thaer, > > Do you use schemaless mode [1] ? > > Kind Regards, > Furkan KAMACI > > [1]

Re: index new discovered fileds of different types

2017-07-05 Thread Furkan KAMACI
Hi Thaer, do you use schemaless mode [1]? Kind Regards, Furkan KAMACI [1] https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode On Wed, Jul 5, 2017 at 4:23 PM, Thaer Sammar wrote: > Hi, > We are trying to index documents of different types. Documents have >

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-27 Thread Joel Bernstein
Ok, I'll take a look. Thanks! Joel Bernstein http://joelsolr.blogspot.com/ On Tue, Jun 27, 2017 at 10:01 AM, Susheel Kumar wrote: > Hi Joel, > > I have submitted a patch to handle this. Please review. > >

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-27 Thread Susheel Kumar
Hi Joel, I have submitted a patch to handle this. Please review. https://issues.apache.org/jira/secure/attachment/12874681/SOLR-10944.patch Thanks, Susheel On Fri, Jun 23, 2017 at 12:32 PM, Susheel Kumar wrote: > Thanks for confirming. Here is the JIRA > >

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-23 Thread Susheel Kumar
Thanks for confirming. Here is the JIRA https://issues.apache.org/jira/browse/SOLR-10944 On Fri, Jun 23, 2017 at 11:20 AM, Joel Bernstein wrote: > yeah, this looks like a bug in the get expression. > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Fri, Jun 23, 2017

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-23 Thread Joel Bernstein
yeah, this looks like a bug in the get expression. Joel Bernstein http://joelsolr.blogspot.com/ On Fri, Jun 23, 2017 at 11:07 AM, Susheel Kumar wrote: > Hi Joel, > > As i am getting deeper, it doesn't look like a problem due to hashJoin etc. > > > Below is a simple let

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-23 Thread Susheel Kumar
Hi Joel, as I am digging deeper, it doesn't look like a problem due to hashJoin etc. Below is a simple let expr where search does not find a match and returns 0 results. In that case, I would expect get(a) to show an EOF tuple, while it instead throws an exception. It looks like something wrong/a bug

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-23 Thread Joel Bernstein
Ok, I hadn't anticipated some of the scenarios that you've been trying out. Particularly reading streams into variables and performing joins etc... The main idea with variables was to use them with the new statistical evaluators. So you perform retrievals (search, random, nodes, knn etc...) set

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-22 Thread Susheel Kumar
Hi Joel, I am able to reproduce this in a simple way. It looks like Let Stream has some issues. The complement function below works fine if I execute it outside let and returns an EOF:true tuple, but if a tuple with EOF:true is assigned to a let variable, it gets changed to the EXCEPTION "Index 0, Size 0"

Re: Index 0, Size 0 - hashJoin Stream function Error

2017-06-22 Thread Susheel Kumar
Sorry for the typo. Facing weird behavior when using hashJoin / innerJoin etc. The below expression displays tuples from variable a, as shown: let(a=fetch(SMS,having(rollup(over=email, count(email), select(search(SMS, q=*:*,

Re: Re-Index is not working

2017-06-08 Thread Erick Erickson
OK - > From: Miller, William K - Norman, OK - Contractor > Sent: Thursday, June 08, 2017 10:12 AM > To: 'solr-user@lucene.apache.org' > Subject: RE: Re-Index is not working > > Sorry I did not give enough information. > > "doesn't work" means that the documents are not getting indexed. I am > using

RE: Re-Index is not working

2017-06-08 Thread Miller, William K - Norman, OK - Contractor
-Original Message- From: Miller, William K - Norman, OK - Contractor Sent: Thursday, June 08, 2017 10:12 AM To: 'solr-user@lucene.apache.org' Subject: RE: Re-Index is not working Sorry I did not give enough information. "doesn't work" does mean that the documents are not getting inde

RE: Re-Index is not working

2017-06-08 Thread Miller, William K - Norman, OK - Contractor
Sorry I did not give enough information. "Doesn't work" means that the documents are not getting indexed. I am using a full import. I did discover that if I use the Linux touch command the document will re-index. I don't have any of the logs as I have been a

Re: Re-Index is not working

2017-06-07 Thread Erick Erickson
> the index using a *:* delete query and a commit. Then I attempt to re-index the same file with the same > configuration in my dataConfig file for the DIH, but it fails to index the > file. If I make a change to the xml file that is being indexed and > re-index, it works.

Re-Index is not working

2017-06-07 Thread Miller, William K - Norman, OK - Contractor
Hello, I am new to this mailing list and I am having a problem with re-indexing. I will run an index on an xml file using the DataImportHandler and it will index the file. Then I delete the index using a *:* delete query and a commit. Then I attempt to re-index the same file with the same

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Shalin Shekhar Mangar
I also opened https://issues.apache.org/jira/browse/SOLR-10532 to fix this annoying and confusing behavior of SuggestComponent. On Thu, Apr 20, 2017 at 8:40 PM, Andrea Gazzarini wrote: > Ah great, many thanks again! > > > > On 20/04/17 17:09, Shalin Shekhar Mangar wrote: >> >>

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Andrea Gazzarini
Ah great, many thanks again! On 20/04/17 17:09, Shalin Shekhar Mangar wrote: Hi Andrea, Looks like I gave you some bad information. I looked at the code and ran a test locally. The suggest.build and suggest.reload params are in fact distributed across all shards, but only to one replica of

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Shalin Shekhar Mangar
Hi Andrea, looks like I gave you some bad information. I looked at the code and ran a test locally. The suggest.build and suggest.reload params are in fact distributed across all shards, but only to one replica of each shard. This is still bad enough and you should use buildOnOptimize as
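The buildOnOptimize advice corresponds to a suggester configuration along these lines (lookup and dictionary implementations as named earlier in the thread; component and field names are illustrative):

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <!-- Rebuild the suggester locally on each replica when its core is
         optimized, instead of relying on distributed suggest.build
         requests that reach only one replica per shard. -->
    <str name="buildOnOptimize">true</str>
  </lst>
</searchComponent>
```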

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Andrea Gazzarini
Perfect, I don't need NRT at this moment so that fits perfectly Thanks, Andrea On 20/04/17 14:37, Shalin Shekhar Mangar wrote: Yeah, if it is just once a day then you can afford to do an optimize. For a more NRT indexing approach, I wouldn't recommend optimize at all. On Thu, Apr 20, 2017 at

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Shalin Shekhar Mangar
Yeah, if it is just once a day then you can afford to do an optimize. For a more NRT indexing approach, I wouldn't recommend optimize at all. On Thu, Apr 20, 2017 at 5:29 PM, Andrea Gazzarini wrote: > Ok, many thanks > > I see / read that it should be better to rely on the

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Andrea Gazzarini
Ok, many thanks. I have seen/read that it is better to rely on background merging instead of issuing explicit optimizes, but I think in this case one optimize a day shouldn't be a problem. Did I get you correctly? Thanks again, Andrea On 20/04/17 13:17, Shalin Shekhar Mangar

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Shalin Shekhar Mangar
On Thu, Apr 20, 2017 at 4:27 PM, Andrea Gazzarini wrote: > Hi Shalin, > many thanks for your response. This is my scenario: > > * I build my index once a day; it could be a delta or a full > re-index. In any case, that takes some time; > * I have an auto-commit (hard, no

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Andrea Gazzarini
Hi Shalin, many thanks for your response. This is my scenario: * I build my index once a day; it could be a delta or a full re-index. In any case, that takes some time; * I have an auto-commit (hard, no soft commits) set to a given period, and during the indexing cycle several hard

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-20 Thread Shalin Shekhar Mangar
Comments inline: On Wed, Apr 19, 2017 at 2:46 PM, Andrea Gazzarini wrote: > Hi, > any help out there? > > BTW I forgot the Solr version: 6.5.0 > > Thanks, > Andrea > > > On 18/04/17 11:45, Andrea Gazzarini wrote: >> >> Hi, >> I have a project, with SolrCloud, where I'm going

Re: Index and query time suggester behavior in a SolrCloud environment

2017-04-19 Thread Andrea Gazzarini
Hi, any help out there? BTW I forgot the Solr version: 6.5.0 Thanks, Andrea On 18/04/17 11:45, Andrea Gazzarini wrote: Hi, I have a project, with SolrCloud, where I'm going to use the Suggester component (BlendedInfixLookupFactory with DocumentDictionaryFactory). Some info: * I will have

Re: Index upgrade time and disk space

2017-04-02 Thread sputul
Thanks, Shawn, for getting back with a detailed explanation. I will run tests upfront with a large index and space, and see if a fast disk is needed. - Putul

Re: Index upgrade time and disk space

2017-04-02 Thread Shawn Heisey
On 4/2/2017 8:16 AM, Putul S wrote: > I am migrating a Solr 4 index to Solr 5. The upgrade tool/script works well, > but it ran out of disk space upgrading a 4 GB index. The server had at least 8 GB > free then. On production, the index is about 200 GB. > > How much disk space is needed for indexing?

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
I tried this solution from Tim Allison, and it works. http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 27 March 2017 at 20:07, Allison, Timothy

Re: Index scanned documents

2017-03-27 Thread Zheng Lin Edwin Yeo
> -Original Message- > From: Arian Pasquali [mailto:arianpasqu...@gmail.com] > Sent: Sunday, March 26, 2017 11:44 AM > To: solr-user@lucene.apache.org > Subject: Re: Index scanned documents > > Hi Walled, > > I've never done that with solr, but you would probably need to u

RE: Index scanned documents

2017-03-27 Thread Allison, Timothy B.
-Original Message- From: Arian Pasquali [mailto:arianpasqu...@gmail.com] Sent: Sunday, March 26, 2017 11:44 AM To: solr-user@lucene.apache.org Subject: Re: Index scanned documents Hi Waleed, I've never done that with Solr, but you would probably need to use some OCR preprocessing before indexing

RE: Index scanned documents

2017-03-26 Thread Phil Scadden
While building OCR directly into Solr might be appealing, I would argue that it is best to use OCR software first, outside of Solr, to convert the PDF into "searchable" PDF format. That way, when the document is retrieved it is a lot more useful to the searcher, making it easy to find the text

Re: Index scanned documents

2017-03-26 Thread Arian Pasquali
Hi Waleed, I've never done that with Solr, but you would probably need to use some OCR preprocessing before indexing. The most popular library I know for the job is tesseract-ocr. If you want to do that inside Solr, I've found that Tika has some support for that

Re: Index scanned documents

2017-03-26 Thread Zheng Lin Edwin Yeo
I'm also working on this issue right now, to extract the text in the scanned images in PDF files. From what I know, we can use Tesseract OCR to extract the text in the image through Apache Tika, and it comes together with Solr. By the way, which Solr version are you using? Regards,

Re: Index scanned documents

2017-03-26 Thread Waleed Raza
Hello, I want to ask how we can extract text in Solr from images which are inside PDF and MS Office documents. I found many websites but did not get an answer; please guide me. On Sun, Mar 26, 2017 at 2:57 PM, Waleed Raza wrote: > Hello > I want to

Re: Index corruption with replication

2017-03-16 Thread santosh sidnal
Hi Erick/David, the schema is the same on both live and stage servers; we are using the same schema files on both. - Schema files are included in replication, but they are not being changed whenever we observe the corruption issue. - My guess is that because of

Re: Index and query

2017-03-15 Thread rangeli nepal
Thank you. I like both options, XSLT and mapping rules. Would you please provide some pointers to them, so that I can use them? Thanks again. Regards, rn On Wed, Mar 15, 2017 at 1:37 PM, Alexandre Rafalovitch wrote: > Additionally, > > Solr can index arbitrary XML by applying an

Re: Index and query

2017-03-15 Thread rangeli nepal
Would you please elaborate on option 1> ? I guess you are saying to add an attribute in managed-schema that is stored only, i.e

Re: Index and query

2017-03-15 Thread Alexandre Rafalovitch
Additionally, Solr can index arbitrary XML by applying an XSLT transform to it before indexing. But you still need to write the XSLT transform. Solr can also index arbitrary XML with DataImportHandler by pulling out specific fields. But you need to write mapping rules. I am not sure what

Re: Index and query

2017-03-15 Thread Erick Erickson
bq: How original document X will be returned? Should I store location of X in Tx? I s there a generic way of doing it? A couple of choices here: 1> create a stored-only field (i.e. stored="true" indexed="false" docValues="false") and stuff the original in that. It'll chew up some disk space, but
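Erick's option 1>, as a schema sketch (field name hypothetical):

```xml
<!-- Stored-only: the original XML comes back with the retrieved
     document, but contributes nothing to the inverted index or to
     docValues files. -->
<field name="original_xml" type="string" indexed="false" stored="true" docValues="false"/>
```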

Re: Index and query

2017-03-15 Thread Walter Underwood
Solr does not index XML. Period. Solr uses an XML protocol for indexing. It can also use JSON or binary protocols for indexing. You need to convert your XML document into fields, then send those fields to Solr using one of the indexing protocols. If you need an XML database and search engine,

Re: Index and query

2017-03-15 Thread rangeli nepal
Thank you Erick for such a prompt reply. I am a bit confused. Suppose I have a document X, and I transform it into document Tx. Tx matches the format that you have described. I post Tx and I assume it gets indexed. Now I query. How will the original document X be returned? Should I store the location of X in Tx?

Re: Index corruption with replication

2017-03-15 Thread Erick Erickson
You can specify your replication to include config files, but if the schema has changed you'll have to restart your Solr afterwards. How is it corrupt? What is the symptom? Any error messages in the solr log on the slave? What version of Solr? Details matter. Best, Erick On Wed, Mar 15, 2017 at
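Shipping config files with replication, as Erick mentions, is done via `confFiles` on the master's replication handler in solrconfig.xml. A minimal sketch, with the stock example file names:

```xml
<!-- Master-side replication handler that ships the listed config files
     (schema among them) to slaves along with the index. File names are
     illustrative; note a schema change still requires a slave restart. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>
```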

Re: Index and query

2017-03-15 Thread Erick Erickson
Solr does _not_ index arbitrary XML, it will index XML in a very specific format, i.e. <add><doc><field name="whatever">value</field></doc></add>. So if you're sending arbitrary XML to Solr I'm actually surprised it's indexing. You might be able to do something with sending docs through Tika

Re: Index corruption with replication

2017-03-15 Thread David Hastings
Are you certain the schema is the same on both master and slave? I find that the schema file doesn't always go with the replication, and if a field is different on the slave it will cause problems On Wed, Mar 15, 2017 at 12:08 PM, Santosh Sidnal wrote: > Hi all, > > I

Re: Index Segments not Merging

2017-02-27 Thread Mike Thomsen
Just barely skimmed the documentation, but it looks like the tool generates its own shards and pushes them into the collection by manipulating the configuration of the cluster. https://www.cloudera.com/documentation/enterprise/5-8-x/topics/search_mapreduceindexertool.html If that reading is

Re: Index time sorting and per index mergePolicyFactory

2016-11-28 Thread Erick Erickson
Wait, on the page you referenced there's this, which appears to be exactly what you want:

<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapped.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
</mergePolicyFactory>

And since this is in solrconfig.xml, which is defined per core, you can specify whatever you want for each core. Also see

Re: Index time sorting and per index mergePolicyFactory

2016-11-28 Thread Dorian Hoxha
bump after 11 days On Thu, Nov 17, 2016 at 10:25 AM, Dorian Hoxha wrote: > Hi, > > I know this is done in lucene, but I don't see it in solr (by searching + > docs on collections). > > I see https://cwiki.apache.org/confluence/display/solr/ > IndexConfig+in+SolrConfig

Re: Index and search on PDF text using Solr

2016-11-18 Thread Erick Erickson
see the section in the Solr Reference Guide: "Uploading Data with Solr Cell using Apache Tika" here: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika to get a start. The basic idea is to use Apache Tika to parse the PDF file and then stuff the
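The Solr Cell setup Erick points to boils down to having an extracting request handler registered in solrconfig.xml and posting the PDF to it. A minimal sketch; the defaults shown are illustrative, not taken from the thread:

```xml
<!-- Solr Cell endpoint: Tika parses the uploaded PDF and the extracted
     body text is mapped into the "text" field. With this in place a
     file can be posted to /update/extract (e.g. via bin/post or curl
     with literal.id=doc1&commit=true); those parameters are examples. -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>
```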

Re: index dir of core xxx is already locked.

2016-11-16 Thread Erick Erickson
You really need to go through your Solr logs for the shard(s) in question very carefully. There'll be a lot of information dumped out, including paths used for everything. I suspect you've unknowingly created this situation when trying to set up Solr, HDFS or whatever but I can't really say what

Re: index dir of core xxx is already locked.

2016-11-16 Thread Chetas Joshi
I don't kill the Solr instance forcefully using "kill -9". I checked the core.properties file for that shard. The content is different from the core.properties file for all the other shards. It has the following two lines which are different: config=solrconfig.xml and schema=schema.xml. In other

Re: index dir of core xxx is already locked.

2016-11-16 Thread Erick Erickson
bq: Before restarting, I delete all the write.lock files from the data dir. But every time I restart I get the same exception. First, this shouldn't be necessary. Are you by any chance killing the Solr instances with the equivalent of "kill -9"? Allow them to shut down gracefully. That said,

Re: index and data directories

2016-11-15 Thread Erick Erickson
rote: > > Thanks a lot Erick > > > Regards, > Prateek Jain > > -Original Message- > From: Erick Erickson [mailto:erickerick...@gmail.com] > Sent: 14 November 2016 09:14 PM > To: solr-user <solr-user@lucene.apache.org> > Subject: Re: index and data d

RE: index and data directories

2016-11-15 Thread Prateek Jain J
Thanks a lot Erick Regards, Prateek Jain -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: 14 November 2016 09:14 PM To: solr-user <solr-user@lucene.apache.org> Subject: Re: index and data directories Theoretically, perhaps. And it's quit
