RE: Unusually long data import time?
Thank you everyone for your patience and suggestions. It turns out I was doing something really unreasonable in my schema: I had mistakenly set the EdgeNGram maxGramSize to 512, when I meant to set the LengthFilter max to 512. I brought this down to a more reasonable number, and my estimated import time is now down to 4 hours. Based on the size of my record set, this is more consistent with Walter's observations in his own project.

Thanks again for your help,

Devon Baumgarten
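For anyone hitting the same mix-up: maxGramSize on the edge n-gram filter controls how long each generated gram can be (so 512 makes the indexer emit hundreds of tokens per term), while the length filter's max merely drops over-long tokens. A sketch of the corrected chain — the type name and gram sizes here are illustrative, not taken from the original schema:

```xml
<!-- Hypothetical fieldType illustrating the fix described above. -->
<fieldType name="likeText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Keep edge n-grams short; this is the value mistakenly set to 512 -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    <!-- The 512 belongs here: drop tokens outside these length bounds -->
    <filter class="solr.LengthFilterFactory" min="3" max="512"/>
  </analyzer>
</fieldType>
```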
RE: Unusually long data import time?
Ahmet,

No, I do not. I commented autoCommit out.

Devon Baumgarten

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Wednesday, February 22, 2012 12:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

> Would it be unusual for an import of 160 million documents
> to take 18 hours? Each document is less than 1kb and I
> have the DataImportHandler using the jdbc driver to connect
> to SQL Server 2008. The full-import query calls a stored
> procedure that contains only a select from my target table.
>
> Is there any way I can speed this up? I saw recently someone
> on this list suggested a new user could get all their Solr
> data imported in under an hour. I sure hope that's true!

Do you have autoCommit or autoSoftCommit configured in solrconfig.xml?
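For reference, autoCommit lives in solrconfig.xml; with it commented out, nothing is committed until the import finishes. A sketch of a typical setting for a large bulk load — the specific thresholds are illustrative, not from this thread:

```xml
<!-- solrconfig.xml: hard-commit periodically during large imports so
     uncommitted state does not grow without bound. Values illustrative. -->
<autoCommit>
  <maxDocs>100000</maxDocs>  <!-- commit after this many queued docs -->
  <maxTime>60000</maxTime>   <!-- ...or after this many milliseconds -->
</autoCommit>
```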
RE: Unusually long data import time?
Walter,

Do you mean sub-entities in your database, or something else? The data I am feeding DIH comes from a SELECT * (no joins or WHERE clause) on a table with: int, int, varchar(32), varchar(32), varchar(512) (this one is the Name), varchar(512), datetime. If it matters, the SELECT * happens in a stored procedure.

Devon Baumgarten

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday, February 22, 2012 11:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

In my first try with the DIH, I had several sub-entities and it was making six queries per document. My 20M doc load was going to take many hours, most of a day. I re-wrote it to eliminate those, and now it makes a single query for the whole load and takes 70 minutes. These are small documents, just the metadata for each book.

wunder
Search Guy, Chegg
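Walter's point about sub-entities refers to DIH's data-config: each nested `<entity>` issues its own SQL query for every parent row. A sketch of the two shapes (the table and column names are made up for illustration):

```xml
<!-- data-config.xml sketch; table/column names are hypothetical. -->

<!-- Slow shape: the child entity runs one extra query per parent row -->
<entity name="book" query="SELECT id, title FROM books">
  <entity name="author"
          query="SELECT name FROM authors WHERE book_id='${book.id}'"/>
</entity>

<!-- Fast shape: one flat query (or view/stored procedure) for the whole load -->
<entity name="book"
        query="SELECT b.id, b.title, a.name
               FROM books b JOIN authors a ON a.book_id = b.id"/>
```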
RE: Unusually long data import time?
I changed the heap size (Xmx1582m was as high as I could go). The import is at about 5% now, and from that I now estimate about 13 hours. It's hard to say, though; the estimate keeps creeping up little by little.

If I get approval to use Solr for this project, I'll have them install a 64-bit JVM instead, but is there anything else I can do?

Devon Baumgarten
Application Developer
RE: Unusually long data import time?
Oh sure! As best as I can, anyway.

I have not set the Java heap size, or really configured it at all.

The server running both SQL Server and Solr has:
* 2 Intel Xeon X5660 CPUs (each 2.8 GHz, 6 cores, 12 logical processors)
* 64 GB RAM
* One Solr instance (no shards)

I'm not using faceting.

My schema has these fields:

Custom types:

*LikeText
  PatternReplaceCharFilterFactory ("\W+" => "")
  KeywordTokenizerFactory
  StopFilterFactory (~40 words in stoplist)
  ASCIIFoldingFilterFactory
  LowerCaseFilterFactory
  EdgeNGramFilterFactory
  LengthFilterFactory (min:3, max:512)

*FuzzyText
  PatternReplaceCharFilterFactory ("\W+" => "")
  KeywordTokenizerFactory
  StopFilterFactory (~40 words in stoplist)
  ASCIIFoldingFilterFactory
  LowerCaseFilterFactory
  NGramFilterFactory
  LengthFilterFactory (min:3, max:512)

Devon Baumgarten

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com]
Sent: Wednesday, February 22, 2012 9:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Unusually long data import time?

Import times will depend on:
- hardware (speed of disks, CPU, # of CPUs, amount of memory, etc.)
- Java configuration (heap size, etc.)
- Lucene/Solr configuration (many ...)
- Index configuration - how many fields, indexed how; faceting, etc.
- OS configuration (this usually to a lesser degree; _usually_)
- Network issues if non-local
- DB configuration (driver, etc.)

If you can give more information about the above, people on this list should be able to better indicate whether 18 hours sounds right for your situation.

-Glen Newton
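The *LikeText chain described above would look roughly like the following in schema.xml. This is a sketch reconstructed from the filter list only; the type name, stopwords file, and maxGramSize are assumptions (the archive stripped the actual field definitions):

```xml
<!-- Sketch of the "LikeText" analysis chain; names/paths are assumptions. -->
<fieldType name="likeText" class="solr.TextField">
  <analyzer>
    <!-- strip all non-word characters before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\W+" replacement=""/>
    <!-- treat the whole value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="25"/>
    <filter class="solr.LengthFilterFactory" min="3" max="512"/>
  </analyzer>
</fieldType>
```

The *FuzzyText type would be the same chain with NGramFilterFactory in place of EdgeNGramFilterFactory.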
Unusually long data import time?
Hello,

Would it be unusual for an import of 160 million documents to take 18 hours? Each document is less than 1 KB, and I have the DataImportHandler using the JDBC driver to connect to SQL Server 2008. The full-import query calls a stored procedure that contains only a SELECT from my target table.

Is there any way I can speed this up? I saw recently someone on this list suggested a new user could get all their Solr data imported in under an hour. I sure hope that's true!

Devon Baumgarten
RE: Solr, SQL Server's LIKE
Great suggestion! Thanks for keeping it simple for a complete Solr newbie. I'm going to go try this right now.

Thanks!

Devon Baumgarten

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Monday, January 02, 2012 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results
> for this particular search to be fuzzy. How can I prevent the fuzzy matches
> from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather
> than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis on the index side, but not on the query side. If you do this, you'll likely need a maxGramSize larger than would normally be required (which will make the index larger), and you might need to use the LengthFilter too.

Thanks,
Shawn
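Shawn's suggestion maps to a fieldType with separate index- and query-time analyzers, roughly like this sketch (the type name and gram sizes are assumptions):

```xml
<!-- Sketch: n-grams at index time only, so the query term must equal an
     indexed gram exactly instead of matching fuzzily. Sizes illustrative. -->
<fieldType name="exactish" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- larger maxGramSize so whole query strings can match a single gram -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gramming here -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this shape, a query for "albatross" matches only documents whose indexed grams contain "albatross" verbatim, so "Albert" drops out entirely rather than scoring low.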
RE: Solr, SQL Server's LIKE
Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that I didn't need n-grams to use the wildcard. Your asking me to clarify what I meant made me realize that the n-grams are the source of all my current problems. :)

Thanks!

Devon Baumgarten

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, December 29, 2011 7:00 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr, SQL Server's LIKE

: Thanks. I know I'll be able to utilize some of Solr's free text
: searching capabilities in other search types in this project. The
: product manager wants this particular search to exactly mimic LIKE%.
...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely,
: rather than having a low score.

Please be specific about the types of queries you want, i.e. we need more than one example of the type of input you want to provide, the type of matches you want to see for that input, and the type of matches you want to get back.

In your first message you said you need to match company titles "pretty exactly" but then seem to contradict yourself by saying SQL's LIKE command fits the bill -- even though the SQL LIKE command exists specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything special: don't use ngrams, don't use stemming, don't use fuzzy anything -- just search for "Albatross" and it will match "Albatross" but not "Albert". If you want "Albatross" to match "Albatross Road", use some basic tokenization.

If all you really care about is prefix searching (which seems suggested by your "LIKE%" comment above, which I'm guessing is shorthand for something similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both match "abcdef" and "abcd", then just use prefix queries (i.e. "abcd*") -- they should be plenty efficient for your purposes. You only need to worry about ngrams when you want to efficiently match in the middle of a string (i.e. "TITLE LIKE '%ABC%'").

-Hoss
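Hoss's prefix-query suggestion needs no special analysis at all; a sketch of what the request looks like (the field name and host are made up for illustration):

```
# Standard query parser prefix query against a hypothetical "name" field.
# name:Albatross* matches "Albatross", "Albatross Road", etc., but not "Albert".
http://localhost:8983/solr/select?q=name:Albatross*
```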
RE: Solr, SQL Server's LIKE
Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching capabilities in other search types in this project. The product manager wants this particular search to exactly mimic LIKE%.

N-grams get me pretty great results in general, but I don't want the results for this particular search to be fuzzy. How can I prevent the fuzzy matches from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather than having a low score.

Devon Baumgarten

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, December 29, 2011 3:44 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr, SQL Server's LIKE

SQL's "like" is usually handled with ngrams if you want *stuff* kinds of searches. Wildcards are "interesting" in Solr.

Things Solr handles that aren't easy in SQL: phrases, phrases with slop, stemming, synonyms. And, especially, some kind of relevance ranking. But Solr does NOT do the things SQL is best at, things like joins etc. Each has its sweet spot, and trying to make one do all the functions of the other is fraught with places to go wrong.

Not a lot of help, but free-text searching is what Solr is all about, so if your problem maps into that space, it's a great tool!

Best,
Erick

On Thu, Dec 29, 2011 at 1:06 PM, Shashi Kant wrote:
> For a simple, hackish (albeit inefficient) approach, look up wildcard
> searches, e.g. foo*, *bar
Solr, SQL Server's LIKE
I have been tinkering with Solr for a few weeks, and I am convinced that it could be very helpful in many of my upcoming projects. I am trying to decide whether Solr is appropriate for this one, and I haven't had luck looking for answers on Google.

I need to search a list of names of companies and individuals pretty exactly. T-SQL's LIKE operator does this with decent performance, but I have a feeling there is a way to configure Solr to do this better. I've tried using an edge N-gram tokenizer, but it feels like it might be more complicated than necessary. What would you suggest?

I know this sounds kind of 'Golden Hammer,' but there has been talk of other, more complicated (magic) searches that I don't think SQL Server can handle, since its tokens (as far as I know) can't be smaller than one word.

Thanks,

Devon Baumgarten
RE: Removing whitespace
Thanks Alireza, Steven and Koji for the quick responses! I'll read up on those and give it a shot.

Devon Baumgarten

-----Original Message-----
From: Alireza Salimi [mailto:alireza.sal...@gmail.com]
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

That sounds like a strange requirement, but I think you can use CharFilters instead of implementing your own Tokenizer. Take a look at this section, maybe it helps:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories

--
Alireza Salimi
Java EE Developer
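The CharFilter approach Alireza points to can be sketched as a fieldType that strips whitespace and punctuation before n-gramming, so that a query for "bobdole" can match an indexed "Bob Dole". The type name and gram sizes below are assumptions, not from the thread:

```xml
<!-- Sketch: squash whitespace/punctuation with a CharFilter, then n-gram.
     Type name and gram sizes are illustrative. -->
<fieldType name="squashedName" class="solr.TextField">
  <analyzer>
    <!-- "Bob Dole" becomes "BobDole" before tokenization -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\W+" replacement=""/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

Because CharFilters run before the tokenizer, no custom Tokenizer is needed.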
Removing whitespace
Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can anyone lend some assistance?

Thanks!

Dev B