Re: Importing large datasets
On 3 Jun 2010, at 03:51, Blargy wrote: Would dumping the databases to a local file help at all? I would suspect not, especially with the size of your data. But it would be good to know how long that takes, i.e. if you create a SQL script that just pulls that data out, how long does that take? Also, how many fields are you indexing per document: 10, 50, 100? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Search problem; cannot search the existing word in the index content
Hi Yandong, You are right. It works!!! You are the best. Thanks, Mint 2010/6/3 Zero Yao > Modify all settings in solrconfig.xml and try again, by > default solr will only index the first 1 fields. > > Best Regards, > Yandong > > -Original Message- > From: Mint o_O! [mailto:mint@gmail.com] > Sent: 2010年6月3日 13:58 > To: solr-user@lucene.apache.org > Subject: Re: Solr Search problem; cannot search the existing word in the > index content > > Thanks for you advice. I did as you said and i still cannot search my > content. > > One thing i notice here i can search for only the words within first 100 > rows or maybe bigger than this not sure but not all. So is it the > limitation > of the index it self? When I create another sample content with only small > amount of data. It's working great!!! > My content is around 1.2M. I stored it as the text field as in the > schema.xml sample file. > > Anyone has the same issue with me? > > thanks, > > Mint > > On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > > > backslash*rhode > > \*rhode may work. > > > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > > > wrote: > > > A couple of things: > > > 1> try searching with &debugQuery=on attached to your URL, that'll > > > give you some clues. > > > 2> It's really worthwhile exploring the admin pages for a while, it'll > > also > > > give you a world of information. It takes a while to understand what > the > > > various pages are telling you, but you'll come to rely on them. > > > 3> Are you really searching with leading and trailing wildcards or is > > that > > > just the mail changing bolding? Because this is tricky, very tricky. > > Search > > > the mail archives for "leading wildcard" to see lots of discussion of > > this > > > topic. > > > > > > You might back off a bit and try building up to wildcards if that's > what > > > you're doing > > > > > > HTH > > > Erick > > > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > > > >> Hi, > > >> > > >> I'm working on the index/search project recently and i found solr > which > > is > > >> very fascinating to me. > > >> > > >> I followed the test successful from the tutorial page. Starting up > jetty > > >> and > > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar > post.jar > > >> *.xml*) so far so good at this stage. > > >> > > >> Now i have create my own testing westpac.xml file with real data I > > intend > > >> to > > >> implement, putting in exampledocs and again ran the command > > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > > >> Everything went on very well however when i searched for "*rhode*" > which > > is > > >> in the content. And Index returned nothing. > > >> > > >> Could anyone guide me what I did wrong why i couldn't search for that > > word > > >> even though that word is in my index content. > > >> > > >> thanks, > > >> > > >> Mint > > >> > > > > > > > > > > > -- > > Lance Norskog > > goks...@gmail.com > > >
Re: Importing large datasets
On 3 Jun 2010, at 02:51, Dennis Gearon wrote: Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end giving me 100 times more than than within 2 years. Are there good references/books on using Solr/Lucen/(linux/nginx) for 500 million plus documents? As far as I'm aware there aren't any books yet that cover this for solr. The wiki, this mailing list, nabble are your best sources and there have been some quite indepth conversations on the matter in this list in the past The data is easily shardible geographially, as one given. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Grant Ingersoll wrote: From: Grant Ingersoll Subject: Re: Importing large datasets To: solr-user@lucene.apache.org Date: Wednesday, June 2, 2010, 3:42 AM On Jun 1, 2010, at 9:54 PM, Blargy wrote: We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
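Grant's suggestion quoted above (multiple threads, each sending batches of at least 100 documents) might look roughly like the following SolrJ sketch. This is only a minimal illustration, not anyone's actual setup: the URL, field names, thread count and the fabricated documents are placeholders, and a real client would read its rows from the database instead.

import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // One shared server instance; CommonsHttpSolrServer can be shared across threads.
    final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    final int threads = 4;        // placeholder thread count
    final int batchSize = 100;    // "at least 100" docs per add, as suggested above
    ExecutorService pool = Executors.newFixedThreadPool(threads);

    for (int t = 0; t < threads; t++) {
      final int slice = t;
      pool.submit(new Runnable() {
        public void run() {
          try {
            Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);
            // In a real client this loop would read one "slice" of the item table
            // (e.g. WHERE MOD(item_id, threads) = slice) instead of fabricating docs.
            for (int i = slice; i < 1000; i += threads) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", i);
              doc.addField("description", "placeholder description " + i);
              batch.add(doc);
              if (batch.size() >= batchSize) {
                server.add(batch);   // send a whole batch in one request
                batch.clear();
              }
            }
            if (!batch.isEmpty()) server.add(batch);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    server.commit();   // one commit at the end rather than per batch
  }
}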
Re: Importing large datasets
On 3 Jun 2010, at 02:58, Dennis Gearon wrote: When adding data continuously, that data is available after committing and is indexed, right? Yes If so, how often is reindexing do some good? You should only need to reindex if the data changes or you change your schema. The DIH in solr 1.4 supports delta imports so you should only really be adding of updating (which is actually deleting and adding) items when necessary. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Andrzej Bialecki wrote: From: Andrzej Bialecki Subject: Re: Importing large datasets To: solr-user@lucene.apache.org Date: Wednesday, June 2, 2010, 4:52 AM On 2010-06-02 13:12, Grant Ingersoll wrote: On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: On 2010-06-02 12:42, Grant Ingersoll wrote: On Jun 1, 2010, at 9:54 PM, Blargy wrote: We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. SOLR-1301 is also an option if you are familiar with Hadoop ... If the bottleneck is the DB, will that do much? Nope. But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
RE: Solr Search problem; cannot search the existing word in the index content
Modify all maxFieldLength settings in solrconfig.xml and try again; by default solr will only index the first 10000 tokens of a field. Best Regards, Yandong -Original Message- From: Mint o_O! [mailto:mint@gmail.com] Sent: 2010年6月3日 13:58 To: solr-user@lucene.apache.org Subject: Re: Solr Search problem; cannot search the existing word in the index content Thanks for you advice. I did as you said and i still cannot search my content. One thing i notice here i can search for only the words within first 100 rows or maybe bigger than this not sure but not all. So is it the limitation of the index it self? When I create another sample content with only small amount of data. It's working great!!! My content is around 1.2M. I stored it as the text field as in the schema.xml sample file. Anyone has the same issue with me? thanks, Mint On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > backslash*rhode > \*rhode may work. > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > wrote: > > A couple of things: > > 1> try searching with &debugQuery=on attached to your URL, that'll > > give you some clues. > > 2> It's really worthwhile exploring the admin pages for a while, it'll > also > > give you a world of information. It takes a while to understand what the > > various pages are telling you, but you'll come to rely on them. > > 3> Are you really searching with leading and trailing wildcards or is > that > > just the mail changing bolding? Because this is tricky, very tricky. > Search > > the mail archives for "leading wildcard" to see lots of discussion of > this > > topic. > > > > You might back off a bit and try building up to wildcards if that's what > > you're doing > > > > HTH > > Erick > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > >> Hi, > >> > >> I'm working on the index/search project recently and i found solr which > is > >> very fascinating to me. > >> > >> I followed the test successful from the tutorial page. Starting up jetty > >> and > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar post.jar > >> *.xml*) so far so good at this stage. > >> > >> Now i have create my own testing westpac.xml file with real data I > intend > >> to > >> implement, putting in exampledocs and again ran the command > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > >> Everything went on very well however when i searched for "*rhode*" which > is > >> in the content. And Index returned nothing. > >> > >> Could anyone guide me what I did wrong why i couldn't search for that > word > >> even though that word is in my index content. > >> > >> thanks, > >> > >> Mint > >> > > > > > > -- > Lance Norskog > goks...@gmail.com >
Re: Solr Search problem; cannot search the existing word in the index content
Thanks for you advice. I did as you said and i still cannot search my content. One thing i notice here i can search for only the words within first 100 rows or maybe bigger than this not sure but not all. So is it the limitation of the index it self? When I create another sample content with only small amount of data. It's working great!!! My content is around 1.2M. I stored it as the text field as in the schema.xml sample file. Anyone has the same issue with me? thanks, Mint On Tue, May 18, 2010 at 1:58 PM, Lance Norskog wrote: > backslash*rhode > \*rhode may work. > > On Mon, May 17, 2010 at 7:23 AM, Erick Erickson > wrote: > > A couple of things: > > 1> try searching with &debugQuery=on attached to your URL, that'll > > give you some clues. > > 2> It's really worthwhile exploring the admin pages for a while, it'll > also > > give you a world of information. It takes a while to understand what the > > various pages are telling you, but you'll come to rely on them. > > 3> Are you really searching with leading and trailing wildcards or is > that > > just the mail changing bolding? Because this is tricky, very tricky. > Search > > the mail archives for "leading wildcard" to see lots of discussion of > this > > topic. > > > > You might back off a bit and try building up to wildcards if that's what > > you're doing > > > > HTH > > Erick > > > > On Mon, May 17, 2010 at 1:11 AM, Mint o_O! wrote: > > > >> Hi, > >> > >> I'm working on the index/search project recently and i found solr which > is > >> very fascinating to me. > >> > >> I followed the test successful from the tutorial page. Starting up jetty > >> and > >> run adding new xml (user:~/solr/example/exampledocs$ *java -jar post.jar > >> *.xml*) so far so good at this stage. > >> > >> Now i have create my own testing westpac.xml file with real data I > intend > >> to > >> implement, putting in exampledocs and again ran the command > >> (user:~/solr/example/exampledocs$ *java -jar post.jar westpac.xml*). > >> Everything went on very well however when i searched for "*rhode*" which > is > >> in the content. And Index returned nothing. > >> > >> Could anyone guide me what I did wrong why i couldn't search for that > word > >> even though that word is in my index content. > >> > >> thanks, > >> > >> Mint > >> > > > > > > -- > Lance Norskog > goks...@gmail.com >
Error loading class 'solr.HTMLStripStandardTokenizerFactory'
Hi, I'm trying to use the field collapsing feature. For that I need to take a checkout of the trunk and apply the patch available at https://issues.apache.org/jira/browse/SOLR-236 When I take a checkout and run the example-DIH, I get following error in browser on doing dataimport?command=full-import org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer:Error loading class 'solr.HTMLStripStandardTokenizerFactory' at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:168) at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:904) at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:60) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:445) at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:435) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:480) at org.apache.solr.schema.IndexSchema.(IndexSchema.java:122) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:429) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:286) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:198) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:123) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:662) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1250) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:467) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.mortbay.start.Main.invokeMain(Main.java:194) at org.mortbay.start.Main.start(Main.java:534) at org.mortbay.start.Main.start(Main.java:441) at org.mortbay.start.Main.main(Main.java:119) Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.HTMLStripStandardTokenizerFactory' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:388) at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:403) at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:85) at 
org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:142) ... 37 more Caused by: java.lang.ClassNotFoundException: solr.HTMLStripStandardTokenizerFactory at java.net.URLClassLoader$1.run(URLClassLoader.java:200) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:188) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592) at java.lang.ClassLoader.loadClass(ClassLoader.java:252) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:372) ... 40 more Because of this error I cannot proceed with applying the patch and trying out the field collapsing feature. Appreciate any help. Thanks, Terance. --
Re: Array of arguments in URL?
Ah! Thank you. On Wed, Jun 2, 2010 at 9:52 AM, Chris Hostetter wrote: > > : In the "/spell" declaration in the example solrconfig.xml, we find > : these lines among the default parameters: > > as grant pointed out: these aren't in the default params > > : How does one supply such an array of strings in HTTP parameters? Does > : Solr have a parsing option for this? > > in general, ignoring for a moment hte question of wether you are asking > about changing the component list in a param (you can't) and addressing > just the question of specifing an array of strings in HTTP params: if the > param supports multiple values, then you can specify multiple values just > be repeating hte key... > > q=foo&fq=firstValue&fq=secondValue&fq=thirdValue > > ...this results in a SolrParams instance where the "value" of "fq" is an > array of [firstValue, secondValue] > > > > > -Hoss > > -- Lance Norskog goks...@gmail.com
Re: Importing large datasets
Would dumping the databases to a local file help at all? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866538.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Erik Hatcher-4 wrote: > > One thing that might help indexing speed - create a *single* SQL query > to grab all the data you need without using DIH's sub-entities, at > least the non-cached ones. > > Erik > > On Jun 2, 2010, at 12:21 PM, Blargy wrote: > >> >> >> As a data point, I routinely see clients index 5M items on normal >> hardware >> in approx. 1 hour (give or take 30 minutes). >> >> Also wanted to add that our main entity (item) consists of 5 sub- >> entities >> (ie, joins). 2 of those 5 are fairly small so I am using >> CachedSqlEntityProcessor for them but the other 3 (which includes >> item_description) are normal. >> >> All the entites minus the item_description connect to datasource1. >> They >> currently point to one physical machine although we do have a pool >> of 3 DB's >> that could be used if it helps. The other entity, item_description >> uses a >> datasource2 which has a pool of 2 DB's that could potentially be >> used. Not >> sure if that would help or not. >> >> I might as well that the item description will have indexed, stored >> and term >> vectors set to true. >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html >> Sent from the Solr - User mailing list archive at Nabble.com. > > > I can't find any example of creating a massive sql query. Any out there? Will batching still work with this massive query? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Lance Norskog-2 wrote: > > Wait! You're fetching records from one database and then doing lookups > against another DB? That makes this a completely different problem. > > The DIH does not to my knowledge have the ability to "pool" these > queries. That is, it will not build a batch of 1000 keys from > datasource1 and then do a query against datasource2 with: > select foo where key_field IN (key1, key2,... key1000); > > This is the efficient way to do what you want. You'll have to write > your own client to do this. > > On Wed, Jun 2, 2010 at 12:00 PM, David Stuart > wrote: >> How long does it take to do a grab of all the data via SQL? I found by >> denormalizing the data into a lookup table meant that I was able to index >> about 300k rows of similar data size with dih regex spilting on some >> fields >> in about 8mins I know it's not quite the scale bit with batching... >> >> David Stuar >> >> On 2 Jun 2010, at 17:58, Blargy wrote: >> >>> >>> >>> One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. >>> >>> Not sure how much that would help. As I mentioned that without the item >>> description import the full process takes 4 hours which is bearable. >>> However >>> once I started to import the item description which is located on a >>> separate >>> machine/database the import process exploded to over 24 hours. >>> >>> -- >>> View this message in context: >>> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Lance Norskog > goks...@gmail.com > Whats more efficient a batch size of 1000 or -1 for MySQL? Is this why its so slow because I am using 2 different datasources? Say I am using just one datasource should I still be seing "Creating a connection for entity " for each sub entity in the document or should it just be using one connection? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p866499.html Sent from the Solr - User mailing list archive at Nabble.com.
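Regarding the batchSize question above: as far as I understand, DIH's batchSize="-1" is a special case for MySQL that makes the JDBC driver stream rows one at a time instead of buffering the whole result set in memory. A rough plain-JDBC equivalent, assuming the MySQL Connector/J driver (the URL, credentials and query are placeholders), would be:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingFetch {
  public static void main(String[] args) throws Exception {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost/items", "user", "password");   // placeholder connection
    // Forward-only, read-only statement with fetch size Integer.MIN_VALUE:
    // this is the Connector/J convention for row-by-row streaming, and is
    // roughly what DIH appears to do when batchSize is set to -1.
    Statement stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    ResultSet rs = stmt.executeQuery("SELECT item_id, description FROM item_description");
    while (rs.next()) {
      // hand each row to the indexer here
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}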
Re: Importing large datasets
That's promising!!! That's how I have been desigining my project. It must be all the joins that are causing the problems for him? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, David Stuart wrote: > From: David Stuart > Subject: Re: Importing large datasets > To: "solr-user@lucene.apache.org" > Date: Wednesday, June 2, 2010, 12:00 PM > How long does it take to do a grab of > all the data via SQL? I found by denormalizing the data into > a lookup table meant that I was able to index about 300k > rows of similar data size with dih regex spilting on some > fields in about 8mins I know it's not quite the scale bit > with batching... > > David Stuar > > On 2 Jun 2010, at 17:58, Blargy > wrote: > > > > > > > > >> One thing that might help indexing speed - create > a *single* SQL query > >> to grab all the data you need without using DIH's > sub-entities, at > >> least the non-cached ones. > >> > > > > Not sure how much that would help. As I mentioned that > without the item > > description import the full process takes 4 hours > which is bearable. However > > once I started to import the item description which is > located on a separate > > machine/database the import process exploded to over > 24 hours. > > > > --View this message in context: > > http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html > > Sent from the Solr - User mailing list archive at > Nabble.com. >
Re: Importing large datasets
When adding data continuously, that data is available after committing and is indexed, right? If so, how often is reindexing do some good? Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Andrzej Bialecki wrote: > From: Andrzej Bialecki > Subject: Re: Importing large datasets > To: solr-user@lucene.apache.org > Date: Wednesday, June 2, 2010, 4:52 AM > On 2010-06-02 13:12, Grant Ingersoll > wrote: > > > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > > > >> On 2010-06-02 12:42, Grant Ingersoll wrote: > >>> > >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: > >>> > > We have around 5 million items in our > index and each item has a description > located on a separate physical database. > These item descriptions vary in > size and for the most part are quite > large. Currently we are only indexing > items and not their corresponding > description and a full import takes around > 4 hours. Ideally we want to index both our > items and their descriptions but > after some quick profiling I determined > that a full import would take in > excess of 24 hours. > > - How would I profile the indexing process > to determine if the bottleneck is > Solr or our Database. > >>> > >>> As a data point, I routinely see clients index > 5M items on normal > >>> hardware in approx. 1 hour (give or take 30 > minutes). > >>> > >>> When you say "quite large", what do you > mean? Are we talking books here or maybe a couple > pages of text or just a couple KB of data? > >>> > >>> How long does it take you to get that data out > (and, from the sounds of it, merge it with your item) w/o > going to Solr? > >>> > - In either case, how would one speed up > this process? Is there a way to run > parallel import processes and then merge > them together at the end? Possibly > use some sort of distributed computing? > >>> > >>> DataImportHandler now supports multiple > threads. The absolute fastest way that I know of to > index is via multiple threads sending batches of documents > at a time (at least 100). Often, from DBs one can > split up the table via SQL statements that can then be > fetched separately. You may want to write your own > multithreaded client to index. > >> > >> SOLR-1301 is also an option if you are familiar > with Hadoop ... > >> > > > > If the bottleneck is the DB, will that do much? > > > > Nope. But the workflow could be set up so that during night > hours a DB > export takes place that results in a CSV or SolrXML file > (there you > could measure the time it takes to do this export), and > then indexing > can work from this file. > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ > _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic > Web > ___|||__|| \| || | Embedded Unix, > System Integration > http://www.sigram.com Contact: info at sigram dot > com > >
Re: Importing large datasets
Well, I hope to have around 5 million datasets/documents within 1 year, so this is good info. BUT if I DO have that many, then the market I am aiming at will end giving me 100 times more than than within 2 years. Are there good references/books on using Solr/Lucen/(linux/nginx) for 500 million plus documents? The data is easily shardible geographially, as one given. Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Wed, 6/2/10, Grant Ingersoll wrote: > From: Grant Ingersoll > Subject: Re: Importing large datasets > To: solr-user@lucene.apache.org > Date: Wednesday, June 2, 2010, 3:42 AM > > On Jun 1, 2010, at 9:54 PM, Blargy wrote: > > > > > We have around 5 million items in our index and each > item has a description > > located on a separate physical database. These item > descriptions vary in > > size and for the most part are quite large. Currently > we are only indexing > > items and not their corresponding description and a > full import takes around > > 4 hours. Ideally we want to index both our items and > their descriptions but > > after some quick profiling I determined that a full > import would take in > > excess of 24 hours. > > > > - How would I profile the indexing process to > determine if the bottleneck is > > Solr or our Database. > > As a data point, I routinely see clients index 5M items on > normal > hardware in approx. 1 hour (give or take 30 minutes). > > > When you say "quite large", what do you mean? Are we > talking books here or maybe a couple pages of text or just a > couple KB of data? > > How long does it take you to get that data out (and, from > the sounds of it, merge it with your item) w/o going to > Solr? > > > - In either case, how would one speed up this process? > Is there a way to run > > parallel import processes and then merge them together > at the end? Possibly > > use some sort of distributed computing? > > DataImportHandler now supports multiple threads. The > absolute fastest way that I know of to index is via multiple > threads sending batches of documents at a time (at least > 100). Often, from DBs one can split up the table via > SQL statements that can then be fetched separately. > You may want to write your own multithreaded client to > index. > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > >
Re: Importing large datasets
Wait! You're fetching records from one database and then doing lookups against another DB? That makes this a completely different problem. The DIH does not to my knowledge have the ability to "pool" these queries. That is, it will not build a batch of 1000 keys from datasource1 and then do a query against datasource2 with: select foo where key_field IN (key1, key2,... key1000); This is the efficient way to do what you want. You'll have to write your own client to do this. On Wed, Jun 2, 2010 at 12:00 PM, David Stuart wrote: > How long does it take to do a grab of all the data via SQL? I found by > denormalizing the data into a lookup table meant that I was able to index > about 300k rows of similar data size with dih regex spilting on some fields > in about 8mins I know it's not quite the scale bit with batching... > > David Stuar > > On 2 Jun 2010, at 17:58, Blargy wrote: > >> >> >> >>> One thing that might help indexing speed - create a *single* SQL query >>> to grab all the data you need without using DIH's sub-entities, at >>> least the non-cached ones. >>> >> >> Not sure how much that would help. As I mentioned that without the item >> description import the full process takes 4 hours which is bearable. >> However >> once I started to import the item description which is located on a >> separate >> machine/database the import process exploded to over 24 hours. >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html >> Sent from the Solr - User mailing list archive at Nabble.com. > -- Lance Norskog goks...@gmail.com
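A minimal sketch of the "pooled lookup" Lance describes, assuming JDBC and made-up table and column names (items on datasource1, item_descriptions on datasource2). A real client would merge each returned description into the corresponding Solr document before sending the batch:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DescriptionLookup {
  // Fetch descriptions for one batch of item ids (e.g. 1000 at a time)
  // with a single IN (...) query against the second database.
  public static Map<Long, String> fetchDescriptions(Connection db2, List<Long> itemIds)
      throws Exception {
    StringBuilder sql = new StringBuilder(
        "SELECT item_id, description FROM item_descriptions WHERE item_id IN (");
    for (int i = 0; i < itemIds.size(); i++) {
      sql.append(i == 0 ? "?" : ",?");
    }
    sql.append(")");

    PreparedStatement ps = db2.prepareStatement(sql.toString());
    for (int i = 0; i < itemIds.size(); i++) {
      ps.setLong(i + 1, itemIds.get(i));
    }

    Map<Long, String> descriptions = new HashMap<Long, String>();
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
      descriptions.put(rs.getLong("item_id"), rs.getString("description"));
    }
    rs.close();
    ps.close();
    return descriptions;   // caller merges these into the item documents
  }
}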
Some basics
Hi, I'm new to SOLR and have some basic questions that will hopefully steer me in the right direction. - I want my search to "auto" spell check - that is, if someone types "restarant" I'd like the system to automatically search for restaurant. I've seen the SpellCheckComponent but that doesn't seem to have a simple way to automatically do the "near" type comparison. Is the SpellCheckComponent the wrong one or do I just need to manually handle the situation in my client code? - Also, what is the proper analyzer if I want a search for "thai food" or "thai restaurant" to actually match on Thai? I can't totally ignore words like food and restaurant, but I want to ignore more general terms and look for specific terms first (or I should say score them higher). Any tips on what I should be reading up on will be greatly appreciated. Thanks.
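On the first question, one common pattern is to handle the correction in client code: ask Solr for spellcheck suggestions alongside the normal query and, if the original query returned nothing, re-run the search with the collated suggestion. A rough SolrJ sketch of that idea, assuming a "/spell" request handler with the spellcheck component configured and a SolrJ version that exposes getSpellCheckResponse() (the URL and handler name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class AutoCorrectSearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("restarant");
    query.setQueryType("/spell");              // handler with the spellcheck component
    query.set("spellcheck", "true");
    query.set("spellcheck.collate", "true");   // ask for a whole corrected query

    QueryResponse rsp = server.query(query);
    if (rsp.getResults().getNumFound() == 0) {
      SpellCheckResponse spell = rsp.getSpellCheckResponse();
      if (spell != null && spell.getCollatedResult() != null) {
        // Re-run the search with the corrected query, e.g. "restaurant".
        QueryResponse corrected = server.query(new SolrQuery(spell.getCollatedResult()));
        System.out.println("Corrected hits: " + corrected.getResults().getNumFound());
      }
    }
  }
}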
Re: DataImportHandler and running out of disk space
: I ran through some more failure scenarios (scenarios and results below). The : concerning ones in my deployment are when data does not get updated, but the : DIH's .properties file does. I could only simulate that scenario when I ran : out of disk space (all all disk space issues behaved consistently). Is this : worthy of a JIRA issue? I don't know that it's DIH's responsibility to be specificly aware of disk space issues -- but it definitely sounds like a bug if Exceptions/Errors like running out of space (or file permissions errors) are occuring but DIH is still reporting success (and still updating hte properties file with the lsat updated timestamp) by all means: please open issues for these types of things. : Successful import : : all dates updated in .properties (title date updated, each [entity : name].last_index_time updated to its own update time. last_index_time set to : earliest entity update time) : : : : : Running out of disk space during import (in data directory only, conf : directory still has space) : : no data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during import (in both data directory and conf : directory) : : some data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during commit/optimize (in data directory only, : conf directory still has space) : : no data updated, but dataimport.properties updated as in 1 : : : : : Running out of disk space during commit/optimize (in both data directory and : conf directory) : : no data updated, but dataimport.properties updated as in 1 : : : : : File permissions prevent writing (on index directory) : : data not updated, failure reported, properties file not updated : : : : : File permissions prevent writing (on segment files) : : data updated, failure reported, properties file not updated : : : : : File permissions prevent writing (on .properties file) : : data updated, failure reported, properties file not updated : : : : : Shutting down Solr during import (killing process) : : data not updated, .properties not updated, no result reported : : : : : Shutting down Solr during import (issuing shutdown message) : : Some data updated, .properties not updated, no result reported : : : : : DB connection lost (unplugging network cable) : : data not updated, .properties not updated, failure reported : : : : : Updating single entity fails (first one) : : data not updated, .properties not updated, failure reported : : : : : Updating single entity fails (after another one succeeds) : : data not updated, .properties not updated, failure reported : : : : : : -- : View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-and-running-out-of-disk-space-tp835125p835368.html : Sent from the Solr - User mailing list archive at Nabble.com. : -Hoss
Help in facet query
Hi, Can I restrict faceting to a subset of the result count? Example: A total of 100 documents were fetched for a given query x, and faceting worked on these 100 documents. I want faceting to work only on the first 10 documents fetched from query x. Regards, Sushan Rungta
Re: Luke browser does not show non-String Solr fields?
Thank you Chris. I'm clear now. I'll give Luke's latest version a try when it's out. On Wed, Jun 2, 2010 at 9:47 AM, Chris Hostetter wrote: > > : I see. It's still a little confusing to me but I'm fine as long as > : this is the expected behavior. I also tried the "example" index > : with data that come with the solr distribution and observe the > : same behavior - only String fields are displayed. So Lucene is > : sharing _some_ types with Solr but not all. It's still a bit puzzling > : to me that Lucene is not able to understand the simple types > : such as long. But I'm OK as long as there is a reason. Thanks > : for the explanations! > > The key is that there are *no* types in Lucene ... older > versions of Lucene only supported "Strin" and clinets that wanted to index > other types had to encode those types in some way as needed. More > recently lucene has started moving away from even dealing with Strings, > and towards just indexing/searching raw byte[] ... all concepts of "field > types" in Solr are specific to Solr > > (the caveat being that Lucene has, over the years, added utilities to help > people make smart choices about how to encode some data types -- and in > the case of the Trie numeric fields SOlr uses those utilites. But that > data isn't stored anywhere in the index files themselves, so Luke has no > way of knowing that it should attempt to "decode" the binary data of a > field using the Trie utilities. That said: aparently Andrzej is working > on making it possible to tell Luke "oh BTW, i indexed this field using > this solr fieldType" ... i think he said it was on the Luke trunk) > > > -Hoss
Re: Not able to access Solr Admin
When you access from another machine what message error do you get ? Check your remote access with Telnet to see if the server respond On Wed, Jun 2, 2010 at 10:26 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Thank you so much for the reply. > > I am using Jetty which comes with Solr installation. > > http://localhost:8983/solr/ > > The above URL works fine. > > The below URL does not work: > > http://177.44.9.119:8983/solr/ > > > -Original Message- > From: Abdelhamid ABID [mailto:aeh.a...@gmail.com] > Sent: Wednesday, June 02, 2010 5:07 PM > To: solr-user@lucene.apache.org > Subject: Re: Not able to access Solr Admin > > details... detailseverybody let's say details ! > > Which app server are you using ? > What is the error message that you get when trying to access solr admin > from > another machine ? > > > > On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < > murali.krishna.bond...@hmhpub.com> wrote: > > > Hi, > > > > I installed Solr Server on my machine and able to access with localhost. > I > > tried accessing from a different machine with IP Address but not able to > > access it. What do I need to do to be able to access the Solr instance > from > > any machine within the network? > > > > Thanks, > > Murali > > > > > > -- > Abdelhamid ABID > Software Engineer- J2EE / WEB > -- Abdelhamid ABID Software Engineer- J2EE / WEB
RE: Not able to access Solr Admin
Thank you so much for the reply. I am using Jetty which comes with Solr installation. http://localhost:8983/solr/ The above URL works fine. The below URL does not work: http://177.44.9.119:8983/solr/ -Original Message- From: Abdelhamid ABID [mailto:aeh.a...@gmail.com] Sent: Wednesday, June 02, 2010 5:07 PM To: solr-user@lucene.apache.org Subject: Re: Not able to access Solr Admin details... detailseverybody let's say details ! Which app server are you using ? What is the error message that you get when trying to access solr admin from another machine ? On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Hi, > > I installed Solr Server on my machine and able to access with localhost. I > tried accessing from a different machine with IP Address but not able to > access it. What do I need to do to be able to access the Solr instance from > any machine within the network? > > Thanks, > Murali > -- Abdelhamid ABID Software Engineer- J2EE / WEB
Re: Not able to access Solr Admin
details... details... everybody, let's say details! Which app server are you using? What is the error message that you get when trying to access solr admin from another machine? On Wed, Jun 2, 2010 at 9:39 PM, Bondiga, Murali < murali.krishna.bond...@hmhpub.com> wrote: > Hi, > > I installed Solr Server on my machine and able to access with localhost. I > tried accessing from a different machine with IP Address but not able to > access it. What do I need to do to be able to access the Solr instance from > any machine within the network? > > Thanks, > Murali > -- Abdelhamid ABID Software Engineer- J2EE / WEB
Not able to access Solr Admin
Hi, I installed Solr Server on my machine and able to access with localhost. I tried accessing from a different machine with IP Address but not able to access it. What do I need to do to be able to access the Solr instance from any machine within the network? Thanks, Murali
RE: Auto-suggest internal terms
I was interested in the same thing and stumbled upon this article: http://www.mattweber.org/2009/05/02/solr-autosuggest-with-termscomponent-and-jquery/ I haven't followed through, but it looked promising to me. Tim -Original Message- From: Jay Hill [mailto:jayallenh...@gmail.com] Sent: Wednesday, June 02, 2010 4:02 PM To: solr-user@lucene.apache.org Subject: Auto-suggest internal terms I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
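For reference, the TermsComponent approach from that article can also be driven from SolrJ rather than jQuery. A minimal sketch, assuming a "/terms" request handler is enabled and an indexed field named "suggest" exists (both are assumptions, not part of the original post). Note that this only returns terms beginning with the typed prefix, so it covers simple prefix auto-complete rather than the mid-phrase matching Jay describes:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TermsSuggest {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery();
    query.setQueryType("/terms");          // assumed handler exposing the TermsComponent
    query.set("terms", "true");
    query.set("terms.fl", "suggest");      // field to pull terms from
    query.set("terms.prefix", "wine");     // what the user has typed so far
    query.set("terms.limit", "10");

    QueryResponse rsp = server.query(query);
    // The matching terms and their document counts come back under the
    // "terms" section of the response; print the raw structure here.
    System.out.println(rsp.getResponse().get("terms"));
  }
}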
RE: Auto-suggest internal terms
I'm painfully new to Solr so please be gentle if my suggestion is terrible! Could you use highlighting to do this? Take the first n results from a query and show their highlights, customizing the highlights to show the desired number of words. Just a thought. Patrick -Original Message- From: Jay Hill [mailto:jayallenh...@gmail.com] Sent: Wednesday, June 02, 2010 4:02 PM To: solr-user@lucene.apache.org Subject: Auto-suggest internal terms I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
Auto-suggest internal terms
I've got a situation where I'm looking to build an auto-suggest where any term entered will lead to suggestions. For example, if I type "wine" I want to see suggestions like this: french *wine* classes *wine* book discounts burgundy *wine* etc. I've tried some tricks with shingles, but the only solution that worked was pre-processing my queries into a core in all variations. Anyone know any tricks to accomplish this in Solr without doing any custom work? -Jay
Re: Query related question
: When I query for a word say Tiger woods, and sort results by score... i do : notice that the results are mixed up i.e first 5 results match Tiger woods : the next 2 match either tiger/tigers or wood/woods : the next 2 after that i notice again match tiger woods. : : How do i make sure that when searching for words like above i get all the : results matching whole search term first, followed by individual tokens like : tiger, woods later. for starters, you have to make sense of why exactly those docs are scoring that way -- this is what the param debugQuery=true is for -- look at the score explanations and see why those docs are scoring lower. My guess is that it's because of fieldNorms (ie: longer documents score lower with the same number of matches) but it could also be a term frequency factor (some documents contain "tiger" so many times they score high even w/o "woods") ... you have to understand why your docs score they way they do before you can come up with a general plan for how to change the scoring to better meet your goals. -Hoss
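If it helps, the debugQuery=true check Hoss mentions can also be done from SolrJ, and the per-document score explanations read back programmatically. A minimal sketch, assuming a SolrJ version that exposes getExplainMap() (the URL and query text are placeholders):

import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExplainScores {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("tiger woods");
    query.set("debugQuery", "true");   // same as appending &debugQuery=true to the URL
    query.setRows(10);

    QueryResponse rsp = server.query(query);
    // One explanation string per returned document, keyed by uniqueKey;
    // it shows the fieldNorm, tf and idf factors behind each score.
    Map<String, String> explanations = rsp.getExplainMap();
    for (Map.Entry<String, String> e : explanations.entrySet()) {
      System.out.println(e.getKey() + " => " + e.getValue());
    }
  }
}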
Re: Importing large datasets
How long does it take to do a grab of all the data via SQL? I found that denormalizing the data into a lookup table meant that I was able to index about 300k rows of similar data size, with DIH regex splitting on some fields, in about 8 mins. I know it's not quite the same scale, but with batching... David Stuart On 2 Jun 2010, at 17:58, Blargy wrote: One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. Not sure how much that would help. As I mentioned that without the item description import the full process takes 4 hours which is bearable. However once I started to import the item description which is located on a separate machine/database the import process exploded to over 24 hours. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: minpercentage vs. mincount
thx for your reply! On 02.06.2010, at 20:27, Chris Hostetter wrote: > feel free to file a feature request -- truthfully this is kind of a hard > problem to solve in userland, you'd either have to do two queries (the > first to get the numFound, the second with facet.mincount set as an > integer relative numFound) or you'd have to do a single query but ask for > a "big" value for facet.limit and hope that you get enough to prune your > list. well i would probably implement it by just not setting a limit, and then just reducing the facets based on the numRows before sending the facets to the client (aka browser) > Off the top of my head though: i can't relaly think of a sane way to do > this on the server side that would work with distributed search either -- > but go ahead and open an issue and let's see what the folks who are really > smart about the distributed searching stuff have to say. ok i have created it: https://issues.apache.org/jira/browse/SOLR-1937 regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: minpercentage vs. mincount
: Obviously I could implement this in userland (like mincount for : that matter), but I wonder if anyone else sees use in being able to : define that a facet must match a minimum percentage of all documents in : the result set, rather than a hardcoded value? The idea being that while : I might not be interested in a facet that only covers 3 documents in the : result set if there are let's say 1000 documents in the result set, the : situation would be a lot different if I only have 10 documents in the : result set. typically people deal with this type of situation by using facet.limit to ensure they only get the "top" N constraints back -- and they set facet.mincount to something low just to save bandwidth if all the counts are "too low to care about no matter how few results there are" (ie: 0) : I did not yet see such a feature, would it make sense to file it as a : feature request or should stuff like this rather be done in userland (I : have noticed for example that Solr prefers to have users normalize the : scores in userland too)? feel free to file a feature request -- truthfully this is kind of a hard problem to solve in userland, you'd either have to do two queries (the first to get the numFound, the second with facet.mincount set as an integer relative to numFound) or you'd have to do a single query but ask for a "big" value for facet.limit and hope that you get enough to prune your list. Off the top of my head though: i can't really think of a sane way to do this on the server side that would work with distributed search either -- but go ahead and open an issue and let's see what the folks who are really smart about the distributed searching stuff have to say. -Hoss
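The two-query userland workaround Hoss describes might look roughly like this in SolrJ (the 5% threshold, facet field name and URL are arbitrary placeholders, not anything from the original thread):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MinPercentageFacet {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // First query: only used to find out how many documents match.
    SolrQuery countQuery = new SolrQuery("some query");
    countQuery.setRows(0);
    long numFound = server.query(countQuery).getResults().getNumFound();

    // Second query: translate the desired minimum percentage (here 5%)
    // into an absolute facet.mincount for this particular result set.
    int minCount = (int) Math.ceil(numFound * 0.05);
    SolrQuery facetQuery = new SolrQuery("some query");
    facetQuery.setFacet(true);
    facetQuery.addFacetField("category");       // placeholder facet field
    facetQuery.setFacetMinCount(minCount);
    facetQuery.setFacetLimit(20);

    QueryResponse rsp = server.query(facetQuery);
    System.out.println(rsp.getFacetField("category").getValues());
  }
}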
Re: Importing large datasets
> One thing that might help indexing speed - create a *single* SQL query > to grab all the data you need without using DIH's sub-entities, at > least the non-cached ones. > Not sure how much that would help. As I mentioned that without the item description import the full process takes 4 hours which is bearable. However once I started to import the item description which is located on a separate machine/database the import process exploded to over 24 hours. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Combining index and file spellcheck dictionaries
: Is it possible to combine index and file spellcheck dictionaries? off the top of my head -- i don't think so. however you could add special docs to your index, which only contain the "spell" field you use to build your spellcheck index, based on the contents of your dictionary file. -Hoss
Re: Array of arguments in URL?
: In the "/spell" declaration in the example solrconfig.xml, we find : these lines among the default parameters: as grant pointed out: these aren't in the default params : How does one supply such an array of strings in HTTP parameters? Does : Solr have a parsing option for this? in general, ignoring for a moment hte question of wether you are asking about changing the component list in a param (you can't) and addressing just the question of specifing an array of strings in HTTP params: if the param supports multiple values, then you can specify multiple values just be repeating hte key... q=foo&fq=firstValue&fq=secondValue&fq=thirdValue ...this results in a SolrParams instance where the "value" of "fq" is an array of [firstValue, secondValue] -Hoss
Re: Luke browser does not show non-String Solr fields?
: I see. It's still a little confusing to me but I'm fine as long as : this is the expected behavior. I also tried the "example" index : with data that come with the solr distribution and observe the : same behavior - only String fields are displayed. So Lucene is : sharing _some_ types with Solr but not all. It's still a bit puzzling : to me that Lucene is not able to understand the simple types : such as long. But I'm OK as long as there is a reason. Thanks : for the explanations! The key is that there are *no* types in Lucene ... older versions of Lucene only supported "String" and clients that wanted to index other types had to encode those types in some way as needed. More recently lucene has started moving away from even dealing with Strings, and towards just indexing/searching raw byte[] ... all concepts of "field types" in Solr are specific to Solr (the caveat being that Lucene has, over the years, added utilities to help people make smart choices about how to encode some data types -- and in the case of the Trie numeric fields Solr uses those utilities. But that data isn't stored anywhere in the index files themselves, so Luke has no way of knowing that it should attempt to "decode" the binary data of a field using the Trie utilities. That said: apparently Andrzej is working on making it possible to tell Luke "oh BTW, i indexed this field using this solr fieldType" ... i think he said it was on the Luke trunk) -Hoss
Re: Importing large datasets
One thing that might help indexing speed - create a *single* SQL query to grab all the data you need without using DIH's sub-entities, at least the non-cached ones. Erik On Jun 2, 2010, at 12:21 PM, Blargy wrote: As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). Also wanted to add that our main entity (item) consists of 5 sub- entities (ie, joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them but the other 3 (which includes item_description) are normal. All the entites minus the item_description connect to datasource1. They currently point to one physical machine although we do have a pool of 3 DB's that could be used if it helps. The other entity, item_description uses a datasource2 which has a pool of 2 DB's that could potentially be used. Not sure if that would help or not. I might as well that the item description will have indexed, stored and term vectors set to true. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Luke browser does not show non-String Solr fields?
I see. It's still a little confusing to me but I'm fine as long as this is the expected behavior. I also tried the "example" index with data that come with the solr distribution and observe the same behavior - only String fields are displayed. So Lucene is sharing _some_ types with Solr but not all. It's still a bit puzzling to me that Lucene is not able to understand the simple types such as long. But I'm OK as long as there is a reason. Thanks for the explanations! On Tue, Jun 1, 2010 at 10:38 AM, Chris Hostetter wrote: > > : So it seems like Luke does not understand Solr's long type. This > : is not a native Lucene type? > > No, Lucene has no concept of "types" ... there are utilities to help encode > some data in special ways (particularly numbers) but the underlying lucene > index doesn't keep track of when/how you do this -- so Luke has no way of > knowing what "type" the field is. > > Schema information is specific to Solr. > > > -Hoss > >
Re: Importing large datasets
As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). Also wanted to add that our main entity (item) consists of 5 sub-entities (ie, joins). 2 of those 5 are fairly small so I am using CachedSqlEntityProcessor for them but the other 3 (which includes item_description) are normal. All the entities minus the item_description connect to datasource1. They currently point to one physical machine although we do have a pool of 3 DB's that could be used if it helps. The other entity, item_description, uses datasource2, which has a pool of 2 DB's that could potentially be used. Not sure if that would help or not. I might as well add that the item description will have indexed, stored and term vectors set to true. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865219.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
Andrzej Bialecki wrote: > > On 2010-06-02 12:42, Grant Ingersoll wrote: >> >> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >> >>> >>> We have around 5 million items in our index and each item has a >>> description >>> located on a separate physical database. These item descriptions vary in >>> size and for the most part are quite large. Currently we are only >>> indexing >>> items and not their corresponding description and a full import takes >>> around >>> 4 hours. Ideally we want to index both our items and their descriptions >>> but >>> after some quick profiling I determined that a full import would take in >>> excess of 24 hours. >>> >>> - How would I profile the indexing process to determine if the >>> bottleneck is >>> Solr or our Database. >> >> As a data point, I routinely see clients index 5M items on normal >> hardware in approx. 1 hour (give or take 30 minutes). >> >> When you say "quite large", what do you mean? Are we talking books here >> or maybe a couple pages of text or just a couple KB of data? >> >> How long does it take you to get that data out (and, from the sounds of >> it, merge it with your item) w/o going to Solr? >> >>> - In either case, how would one speed up this process? Is there a way to >>> run >>> parallel import processes and then merge them together at the end? >>> Possibly >>> use some sort of distributed computing? >> >> DataImportHandler now supports multiple threads. The absolute fastest >> way that I know of to index is via multiple threads sending batches of >> documents at a time (at least 100). Often, from DBs one can split up the >> table via SQL statements that can then be fetched separately. You may >> want to write your own multithreaded client to index. > > SOLR-1301 is also an option if you are familiar with Hadoop ... > > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > > I haven't worked with Hadoop before but I'm willing to try anything to cut down this full import time. I see this currently uses the embedded solr server for indexing... would I have to scrap my DIH importing then? -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865103.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Importing large datasets
As a data point, I routinely see clients index 5M items on normal > hardware in approx. 1 hour (give or take 30 minutes). Our master solr machine is running 64-bit RHEL 5.4 on a dedicated machine with 4 cores and 16G ram so I think we are good on the hardware. Our DB is MySQL version 5.0.67 (exact stats I don't know off the top of my head) When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? Our item descriptions are very similar to an ebay listing and can include HTML. We are talking about a couple of pages of text. How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? I'll have to get back to you on that one. DataImportHandler now supports multiple threads. When you say "now", what do you mean? I am running version 1.4. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100) Is there a wiki explaining how this multiple thread process works? Which batch size would work best? I am currently using a -1 batch size. You may want to write your own multithreaded client to index. This sounds like a viable option. Can you point me in the right direction on where to begin (what classes to look at, prior examples, etc)? Here is my field type I am using for the item description. Maybe it's not the best? Here is an overview of my data-config.xml. Thoughts? ... I appreciate the help. -- View this message in context: http://lucene.472066.n3.nabble.com/Importing-large-datasets-tp863447p865091.html Sent from the Solr - User mailing list archive at Nabble.com.
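A minimal sketch of such a multithreaded client using SolrJ 1.4's StreamingUpdateSolrServer, which queues documents and sends them from several background threads (the Solr URL, JDBC connection, SQL and field names are illustrative assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    // Queue size 100, 4 sender threads - tune both for your hardware.
    SolrServer solr = new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
    Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/items", "user", "pass");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM item");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", rs.getString("id"));
      doc.addField("title", rs.getString("title"));
      doc.addField("description", rs.getString("description"));
      batch.add(doc);
      if (batch.size() >= 100) {     // send batches of at least 100 documents
        solr.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) solr.add(batch);
    solr.commit();
    rs.close(); stmt.close(); conn.close();
  }
}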
Re: Many Tomcat Processes on Server ?!?!?
okay you are right. thats all threads and no processes ... but so many ? :D hehe so when all the "processes" are threads i think its okay so ?! i can ignore this ... XD -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p865008.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrException: No such core
Solr is used to manage lists of indexes. We have a database containing documents of different types. Each document type is defined by a list of properties and we want to associate some of these properties with lists of indexes to help users during query. For example: a property that contains a text field "desc" may be associated with a Solr field "desc_en_items". "desc_en_items" is a Solr dynamic field, and so on for each property associated with a Solr field. Each Solr document contains a Solr identifier (stored and indexed) and dynamic fields (only indexed). When adding a document to our database, if needed, we dynamically generate the Solr document and add it to solr. When a document is deleted from our database we systematically delete the corresponding solr document with "deleteById" (the document may not exist in solr). There is only one core (Core0) and the server is embedded. We use a derived lucli/LuceneMethods.java to browse the index. It seems to me, without being sure, that the problem comes when no list is set (solr is started but contains no records) after a few days of operation. A database with parameterized lists has worked for several months without problem. Here are the wrappers around ...solrj.SolrServer:

[code]
public class SolrCoreServer {
  private static Logger log = LoggerFactory.getLogger(SolrCoreServer.class);
  private SolrServer server = null;

  public SolrCoreServer(CoreContainer container, String coreName) {
    server = new EmbeddedSolrServer( container, coreName );
  }

  protected SolrServer getSolrServer() {
    return server;
  }

  public void cleanup() throws SolrServerException, IOException {
    log.debug("cleanup()");
    UpdateResponse rsp = server.deleteByQuery( "*:*" );
    log.debug("cleanup():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("cleanup() failed status=" + rsp.getStatus());
  }

  public void add(SolrInputDocument doc) throws SolrServerException, IOException {
    log.debug("add(" + doc + ")");
    UpdateResponse rsp = server.add(doc);
    log.debug("add():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("add() failed status=" + rsp.getStatus());
  }

  public void add(Collection docs) throws SolrServerException, IOException {
    log.debug("add(" + docs + ")");
    UpdateResponse rsp = server.add(docs);
    log.debug("add():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("add() failed status=" + rsp.getStatus());
  }

  public void deleteById(String docId) throws SolrServerException, IOException {
    log.debug("deleteById(" + docId + ")");
    UpdateResponse rsp = server.deleteById(docId);
    log.debug("deleteById():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("deleteById() failed status=" + rsp.getStatus());
  }

  public void commit() throws SolrServerException, IOException {
    log.debug("commit()");
    UpdateResponse rsp = server.commit();
    log.debug("commit():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("commit() failed status=" + rsp.getStatus());
  }

  public void addAndCommit(Collection docs) throws SolrServerException, IOException {
    log.debug("addAndCommit(" + docs + ")");
    UpdateRequest req = new UpdateRequest();
    req.setAction( UpdateRequest.ACTION.COMMIT, false, false );
    req.add( docs );
    UpdateResponse rsp = req.process( server );
    log.debug("addAndCommit():" + rsp.getStatus());
    if (rsp.getStatus() != 0)
      throw new SolrServerException("addAndCommit() failed status=" + rsp.getStatus());
  }

  public QueryResponse query( SolrQuery query ) throws SolrServerException {
    log.debug("query(" + query + ")");
    QueryResponse qr = server.query( query );
    log.debug("query():" + qr.getStatus());
    return qr;
  }

  public QueryResponse query( String queryString, String sortField, SolrQuery.ORDER order, Integer maxRows ) throws SolrServerException {
    log.debug("query(" + queryString + ")");
    SolrQuery query = new SolrQuery();
    query.setQuery( queryString );
    query.addSortField( sortField, order );
    query.setRows(maxRows);
    QueryResponse qr = server.query( query );
    log.debug("query():" + qr.getStatus());
    return qr;
  }
}
[/code]

the schema [code]
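For reference, a minimal sketch of how the CoreContainer handed to this wrapper can be built with the Solr 1.4-era embedded API (the core name "Core0" is from the message above; the solr home path, field names and values are illustrative assumptions):

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class SolrCoreServerUsage {
  public static void main(String[] args) throws Exception {
    // Point at the solr home directory whose solr.xml defines "Core0".
    System.setProperty("solr.solr.home", "/path/to/solr/home");
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer container = initializer.initialize();

    SolrCoreServer server = new SolrCoreServer(container, "Core0");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");                       // the stored + indexed identifier
    doc.addField("desc_en_items", "some description"); // dynamic, indexed-only field
    server.add(doc);
    server.commit();

    container.shutdown();   // release the embedded core cleanly
  }
}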
Re: nested querries, and LocalParams syntax
Thanks Yonik. I guess the confusing thing is if the lucene query parser (for nested queries) does backslash escaping, and the LocalParams also does backslash escaping when you have a nested query with local params, with quotes at both places... the inner scope needs... double escaping? That gets really confusing fast. [ Yeah, I recognize that using parameter dereferencing can avoid this; I'm trying to see if I can make my code flexible enough to work either way]. Maybe using single vs double quotes is the answer. Let's try one out and see: [Query un-uri escaped for clarity:] _query_:"{!dismax q.alt=' \"a phrase search \" '} \"another phrase search\" " [ Heh, getting that into a ruby string to uri escape it is a pain, but we end up with: ] &q="_query_%3A%7B%21dismax+q.alt%3D%27%5C%22a+phrase+search%5C%22%27%7D+%5C%22another+phrase+search%5C%22 Which, um, I _think_ is working, although the debugQuery=true isn't telling me much, I don't entirely understand it. Have to play around with it more. But it looks like maybe a fine strategy is to use double quotes for the nested query itself, use single quotes for the LocalParam values, and then simply singly escape any single or double quotes inside the LocalParam values. Jonathan Yonik Seeley wrote: Hmmm, well, the lucene query parser does basic backslash escaping, and so does local params within quoted strings. You can also use parameter dereferencing to avoid the need to escape values too. Like you pointed out, using single quotes in some places can also help. But instead of me trying to give you tons of examples that you probably already understand, start from the assumption that things will work, and if you come across something that doesn't make sense (or doesn't work), I can help with that. Or if you give a single real example as a general pattern, perhaps we could help figure out the simplest way to avoid most of the escaping. -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 6:21 PM, Jonathan Rochkind wrote: I am just trying to figure it out mostly, the particular thing I am trying to do is a very general purpose mapper to complex dismax nested queries. I could try to explain it, and we could go back and forth for a while, and maybe I could convince you it makes sense to do what I'm trying to do. But mostly I'm just exploring at this point, so I can get a sense of what is possible. So it would be super helpful if someone can help me figure out escaping stuff and skip the other part, heh. But basically, it's a mapper from a "CQL" query (a structured language for search-engine-style queries) to Solr, where some of the "fields" searched aren't really Solr fields/indexes, but aggregated definitions of dismax query params including multiple solr fields, where exactly what solr fields and other dismax queries will not be hard-coded, but will be configurable. Thus the use of nested queries. So since it ends up so general purpose and abstract, and many of the individual parameters are configurable, thus my interest in figuring out proper escaping. Jonathan Yonik Seeley wrote: It's not clear if you're just trying to figure it all out, or get something specific to work. If you can give a specific example, we might be able to suggest easier ways to achieve it rather than going escape crazy :-) -Yonik http://www.lucidimagination.com On Tue, Jun 1, 2010 at 5:06 PM, Jonathan Rochkind wrote: Thanks, the pointer to that documentation page (which somehow I had missed), as well as Chris's response is very helpful.
The one thing I'm still not sure about, which I might be able to figure out through trial-and-error reverse engineering, is escaping issues when you combine nested queries WITH local params. We potentially have a lot of levels of quotes: q= URIescape(_local_="{!dismax qf=" value that itself contains a \" quote mark"} "phrase query"" ) Whole bunch of quotes going on there. How do I give this to Solr so all my quotes will end up parsed appropriately? Obviously that above example isn't right. We've got the quotes around the _local_ nested query, then we've got quotes around a LocalParam value, then we've got quotes that might be IN the actual literal value of the LocalParam, or quotes that might be in the actual literal value of the nested query. Maybe using single quotes in some places but double quotes in others will help, for certain places that can take single or double quotes? Thanks very much for any advice, I get confused thinking about this. Jonathan Chris Hostetter wrote: In addition to yonik's point about the LocalParams wiki page (and please let us know if you aren't sure of the answers to any of your questions after reading it) I wanted to clear up one thing... : Let's start with that not-nested query example. Can you in fact use it as : above, to force dismax handling of the 'q' even if the qt or request handler Quick side
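One way to sidestep most of that escaping, following Yonik's parameter-dereferencing suggestion, is to keep the nested query itself free of literal quotes and pass the values through separate request parameters (a sketch; the field and parameter names are illustrative):

q=_query_:"{!dismax qf=$item_qf v=$item_q}"
item_qf=title^2 description
item_q="a phrase search"

Each parameter is then URI-escaped on its own, so no quote ever has to be escaped inside another quoted string.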
Re: Many Tomcat Processes on Server ?!?!?
Le 02-juin-10 à 16:57, stockii a écrit : all the process in in htop show, have a own PID. so thats are no threads ? No, you can't say that. In general it is sufficient for the "mother process" to be killed but it can take several attempts. i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? after bin/shutdown.sh it is very common to me that some hanging threads remain... and we crafted my little script snippet (which is kind of specific) to actually prevent this and kill... after a while only. it's not optimal. paul smime.p7s Description: S/MIME cryptographic signature
RE: Many Tomcat Processes on Server ?!?!?
Try shutting tomcat down instead of restarting. If processes remain, then I'd say further investigation is warranted. If no processes remain, then I think it's safe to disregard unless you notice any problems. -Original Message- From: stockii [mailto:st...@shopgate.com] Sent: Wednesday, June 02, 2010 10:57 AM To: solr-user@lucene.apache.org Subject: Re: Many Tomcat Processes on Server ?!?!? all the process in in htop show, have a own PID. so thats are no threads ? i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Many Tomcat Processes on Server ?!?!?
all the process in in htop show, have a own PID. so thats are no threads ? i restart my tomcat via " /etc/init.d/tomcat restart " do you think that after ervery resart the processes arent closed ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864918.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: PHP output at a multiValued AND dynamicField
On 02.06.2010 16:42, Jörg Agatz wrote: > i don't understand what you mean! > Then you should ask more precisely.
Re: Many Tomcat Processes on Server ?!?!?
Am 02.06.2010 16:39, schrieb Paul Libbrecht: > This is impressive, I had this in any Linux I've been using: SuSE, > Ubuntu, Debian, Mandrake, ... > Maybe there's some modern JDK with a modern Linux where it doesn't happen? > It surely is not one process per thread though. I'm not a linux thread expert, but from what I know Linux doesn't know lightweight threads as other systems do. Instead it uses processes for that. But these processes aren't "top level" processes that show up in top/ps. Instead, they're grouped hierarchically (AFAIK). Otherwise you would be able to kill single user threads with their own process id, or kill the main process and let the spawned threads continue. That would be totally crazy. In my configuration, Tomcat doesn't shut down correctly if I call bin/shutdown.sh, so I have to kill the process manually. I don't know why. This might be the reason why stockii has 3 Tomcat processes running.
Re: PHP output at a multiValued AND dynamicField
i don't understand what you mean!
Re: Many Tomcat Processes on Server ?!?!?
This is impressive, I had this in any Linux I've been using: SuSE, Ubuntu, Debian, Mandrake, ... Maybe there's some modern JDK with a modern Linux where it doesn't happen? It surely is not one process per thread though. paul Le 02-juin-10 à 16:29, Michael Kuhlmann a écrit : Am 02.06.2010 16:13, schrieb Paul Libbrecht: Is your server Linux? In this case this is very normal.. any java application spawns many new processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds. smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
oha... "ps aux" shows only 3 processes from tomcat55. but why does htop show 55 ? doesn't the garbage collector close these ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864849.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Many Tomcat Processes on Server ?!?!?
Maybe he was looking at the output from top or htop? -Original Message- From: Michael Kuhlmann [mailto:michael.kuhlm...@zalando.de] Sent: Wednesday, June 02, 2010 10:29 AM To: solr-user@lucene.apache.org Subject: Re: Many Tomcat Processes on Server ?!?!? Am 02.06.2010 16:13, schrieb Paul Libbrecht: > Is your server Linux? > In this case this is very normal.. any java application spawns many new > processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds.
Re: PHP output at a multiValued AND dynamicField
On 02.06.2010 16:15, Jörg Agatz wrote: > yes i done.. but i dont know how i get the information out of the big > Array... They're simply the keys of a single response array.
Re: Many Tomcat Processes on Server ?!?!?
Am 02.06.2010 16:13, schrieb Paul Libbrecht: > Is your server Linux? > In this case this is very normal.. any java application spawns many new > processes on linux... it's not exactly bound to threads unfortunately. Uh, no. New threads in Java typically don't spawn new processes on OS level. I never had more than one tomcat process on any Linux machine. In fact, if there was more than one because a previous Tomcat hadn't shut down correctly, the new process wouldn't respond to HTTP requests. 55 Tomcat processes shouldn't be normal, at least not if that's what "ps aux" responds.
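A quick way to check on Linux whether those htop entries are threads of a single JVM or genuinely separate processes (the grep pattern is illustrative):

ps aux | grep tomcat     # one line per process
ps -eLf | grep tomcat    # one line per thread (LWP column); dozens of lines for one JVM is normal

In htop, pressing H toggles the display of userland threads, which is usually why it appears to list far more "processes" than ps aux does.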
Re: Many Tomcat Processes on Server ?!?!?
You'd need to search explanations for this at generic java forums. It's the same with any java process on Linux. In the Unix family Solaris and MacOSX do it better, fortunately and is probably due to the very old time where the Linux java was a translation of the Solaris java with the special features implemented when it was not found in Linux (e.g. green-threads). paul Le 02-juin-10 à 16:21, stockii a écrit : yes, its a Linux... Debian System. when i running a import. only 2-3 tomcat processes are running. the other doing nothing ... thats what is strange for me .. ^^ -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html Sent from the Solr - User mailing list archive at Nabble.com. smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
yes, its a Linux... Debian System. when i running a import. only 2-3 tomcat processes are running. the other doing nothing ... thats what is strange for me .. ^^ -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864804.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: RIA sample and minimal JARs required to embed Solr
Glad to hear someone looking at Solr not just as web enabled search engine, but as a simpler/more powerful interface to Lucene! When you download the source code, look at the Chapter 8 "Crawler" project, specifically "Indexer.java", it demonstrates how to index into both a traditional separate Solr process and how to fire up an embedded Solr. It is remarkably easy to interact with an embedded Solr! In terms of minimal dependencies, what you need for a standalone Solr (outside of the servlet container like Tomcat/Jetty) is what you need for an embedded Solr. Eric On May 29, 2010, at 9:32 PM, Thomas J. Buhr wrote: > Solr, > > The Solr 1.4 EES book arrived yesterday and I'm very much enjoying it. I was > glad to see that "rich clients" are one case for embedding Solr as this is > the case for my application. Multi Cores will also be important for my RIA. > > The book covers a lot and makes it clear that Solr has extensive abilities. > There is however no clean and simple sample of embedding Solr in a RIA in the > book, only a few alternate language usage samples. Is there a link to a Java > sample that simply embeds Solr for local indexing and searching using Multi > Cores? > > Also, what kind of memory footprint am I looking at for embedding Solr? What > are the minimal dependancies? > > Thom - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
Re: PHP output at a multiValued AND dynamicField
yes i done.. but i dont know how i get the information out of the big Array... All fields like P_VIP_ADR_*
Re: Many Tomcat Processes on Server ?!?!?
Is your server Linux? In this case this is very normal.. any java application spawns many new processes on linux... it's not exactly bound to threads unfortunately. And, of course, they all refer to the same invocation path. paul Le 02-juin-10 à 15:59, stockii a écrit : Hello. Our Server is a 8-Core Server with 12 GB RAM. Solr is running with 4 Cores. 55 Tomcat 5.5 processes are running. ist this normal ??? htop show me a list of these processes of the server. and tomcat have about 55. every process using: /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/ bootstrap.jar. is this normal ? smime.p7s Description: S/MIME cryptographic signature
Re: Many Tomcat Processes on Server ?!?!?
My guess would be that commons-daemon is somehow thinking that Tomcat has gone down and started up multiple copies... You only need one Tomcat process for your 4 core Solr instance! You may have many other WAR applications hosted in Tomcat, I know a lot of places would have 1 tomcat per deployed WAR pattern. On Jun 2, 2010, at 9:59 AM, stockii wrote: > > Hello. > > Our Server is a 8-Core Server with 12 GB RAM. > Solr is running with 4 Cores. > > 55 Tomcat 5.5 processes are running. ist this normal ??? > > htop show me a list of these processes of the server. and tomcat have about > 55. > every process using: > /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar. > > is this normal ? > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html > Sent from the Solr - User mailing list archive at Nabble.com. - Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com Co-Author: Solr 1.4 Enterprise Search Server available from http://www.packtpub.com/solr-1-4-enterprise-search-server Free/Busy: http://tinyurl.com/eric-cal
Re: PHP output at a multiValued AND dynamicField
You probably should try the php or phps response writer - it'll likely make your PHP integration easier. Erik On Jun 2, 2010, at 9:50 AM, Jörg Agatz wrote: Hallo Users... I have a Problem... In my SolR, i have a lot of multiValued, dynamicFields and now i must print ther Fields in php.. But i dont know how... In schema.xml: stored="true"/> stored="true"/> stored="true"/> stored="true"/> stored="true"/> output from Solr: A201005311740560002.xml NO A201005311740560002 2010-05-31 17:40:56 − Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml D Leichlingen 42799 Schlo� Eicherhof ADRESS KYETG201005311740560002 I don now ha is the name of the Fields, so i dont know how i get the name to "printr" it in PHP Maby someone of you has a answer of the problem? King
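As a sketch (the host and query are illustrative), the php and phps response writers are selected with the wt parameter:

http://localhost:8983/solr/select?q=*:*&wt=phps

The phps body can be read directly with PHP's unserialize(), and the field names of each returned document - including dynamic ones like P_VIP_ADR_* - are then simply the array keys.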
Many Tomcat Processes on Server ?!?!?
Hello. Our Server is a 8-Core Server with 12 GB RAM. Solr is running with 4 Cores. 55 Tomcat 5.5 processes are running. ist this normal ??? htop show me a list of these processes of the server. and tomcat have about 55. every process using: /usr/share/java/commons-daemon.jar:/usr/share/tomcat5.5/bin/bootstrap.jar. is this normal ? -- View this message in context: http://lucene.472066.n3.nabble.com/Many-Tomcat-Processes-on-Server-tp864732p864732.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Array of arguments in URL?
You CAN easily turn spellchecking on or off, or set the spellcheck dictionary, in request parameters. So there's really no need, that I can think of, to try to actually add or remove the spellcheck component in request parameters; you could just leave it turned off in your default parameters, but turn it on in request parameters when you want it. With &spellcheck=true&spellcheck.dictionary=whatever. But I suspect you weren't really asking about the spellcheck component, but in general, or perhaps for some other specific purpose? I don't think there's any "general" way to pass an array to request parameters. Request parameters that take list-like data structures tend to use whitespace to separate the elements instead, to allow you to pass them as request parameters. For instance dismax df, pf, etc fields, elements ordinarily separated by newlines when seen in a solrconfig.xml as default params, can also be separated simply by spaces in an actual URL too. (newlines in the URL might work too, never tried it, spaces more convenient for an actual URL). From: Grant Ingersoll [gsi...@gmail.com] On Behalf Of Grant Ingersoll [gsing...@apache.org] Sent: Wednesday, June 02, 2010 6:28 AM To: solr-user@lucene.apache.org Subject: Re: Array of arguments in URL? Those aren't in the default parameters. They are config for the SearchHandler itself. On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote: > In the "/spell" declaration in the example solrconfig.xml, we find > these lines among the default parameters: > > > spellcheck > > > How does one supply such an array of strings in HTTP parameters? Does > Solr have a parsing option for this? > > -- > Lance Norskog > goks...@gmail.com
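As a sketch of the spellcheck case (the dictionary name is illustrative), a request to a handler that has the spellcheck component configured might switch it on and pass a whitespace-separated dismax qf list like this:

/select?q=ipod&spellcheck=true&spellcheck.dictionary=default&defType=dismax&qf=title%5E2+description

(the space between the qf entries is URL-encoded as +, and ^ as %5E).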
PHP output at a multiValued AND dynamicField
Hallo Users... I have a Problem... In my SolR, i have a lot of multiValued, dynamicFields and now i must print ther Fields in php.. But i dont know how... In schema.xml: output from Solr: A201005311740560002.xml NO A201005311740560002 2010-05-31 17:40:56 − Q:\DatenIBP\AADMS\telli_vip\xml\A201005311740560002.xml D Leichlingen 42799 Schlo� Eicherhof ADRESS KYETG201005311740560002 I don now ha is the name of the Fields, so i dont know how i get the name to "printr" it in PHP Maby someone of you has a answer of the problem? King
Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Thanks Geert-Jan. Didn't know about this trick. On Wed, Jun 2, 2010 at 5:39 PM, Geert-Jan Brits wrote: > Hi Ninad, > > SolrQuery q = new SolrQuery(); > q.setQuery("*:*"); > q.setFacet(true); > q.set("facet.date", "pub"); > q.set("facet.date.start", "2000-01-01T00:00:00Z") > ... etc. > > basically you can completely build your entire query with the 'raw' set (and > add) methods. > The specific methods are just helpers. > > So this is the same as above: > > SolrQuery q = new SolrQuery(); > q.set("q","*:*"); > q.set("facet","true"); > q.set("facet.date", "pub"); > q.set("facet.date.start", "2000-01-01T00:00:00Z") > ... etc. > > > Geert-Jan > > 2010/6/2 Ninad Raut > > > Hi, > > > > I want to hit the query given below : > > > > > > > ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR > > > > using SolrJ. I am browsing the net but not getting any clues about how > > should I approach it. How can the SolrJ API be used to create the above mentioned > > Query. > > > > Regards, > > Ninad R > > >
Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Hi Ninad, SolrQuery q = new SolrQuery(); q.setQuery("*:*"); q.setFacet(true); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z") ... etc. basically you can completely build your entire query with the 'raw' set (and add) methods. The specific methods are just helpers. So this is the same as above: SolrQuery q = new SolrQuery(); q.set("q","*:*"); q.set("facet","true"); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z") ... etc. Geert-Jan 2010/6/2 Ninad Raut > Hi, > > I want to hit the query given below : > > > ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR > > using SolrJ. I am browsing the net but not getting any clues about how > should I approach it. How can the SolrJ API be used to create the above mentioned > Query. > > Regards, > Ninad R >
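Put into a small complete program (the server URL is an illustrative assumption; the pub field and date range are from the question), the first variant might look like:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateFacetSketch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery();
    q.setQuery("*:*");
    q.setFacet(true);
    q.set("facet.date", "pub");
    q.set("facet.date.start", "2000-01-01T00:00:00Z");
    q.set("facet.date.end", "2010-01-01T00:00:00Z");
    q.set("facet.date.gap", "+1YEAR");   // SolrJ URL-encodes the + for you
    QueryResponse rsp = server.query(q);
    // Dump the raw facet_counts section of the response.
    System.out.println(rsp.getResponse().get("facet_counts"));
  }
}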
Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Hi, I want to hit the query given below : ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR using SolrJ. I am browsing the net but not getting any clues about how I should approach it. How can the SolrJ API be used to create the above mentioned query? Regards, Ninad R
Re: Importing large datasets
On 2010-06-02 13:12, Grant Ingersoll wrote: > > On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > >> On 2010-06-02 12:42, Grant Ingersoll wrote: >>> >>> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >>> We have around 5 million items in our index and each item has a description located on a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding description and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions but after some quick profiling I determined that a full import would take in excess of 24 hours. - How would I profile the indexing process to determine if the bottleneck is Solr or our Database. >>> >>> As a data point, I routinely see clients index 5M items on normal >>> hardware in approx. 1 hour (give or take 30 minutes). >>> >>> When you say "quite large", what do you mean? Are we talking books here or >>> maybe a couple pages of text or just a couple KB of data? >>> >>> How long does it take you to get that data out (and, from the sounds of it, >>> merge it with your item) w/o going to Solr? >>> - In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing? >>> >>> DataImportHandler now supports multiple threads. The absolute fastest way >>> that I know of to index is via multiple threads sending batches of >>> documents at a time (at least 100). Often, from DBs one can split up the >>> table via SQL statements that can then be fetched separately. You may want >>> to write your own multithreaded client to index. >> >> SOLR-1301 is also an option if you are familiar with Hadoop ... >> > > If the bottleneck is the DB, will that do much? > Nope. But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Importing large datasets
On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote: > On 2010-06-02 12:42, Grant Ingersoll wrote: >> >> On Jun 1, 2010, at 9:54 PM, Blargy wrote: >> >>> >>> We have around 5 million items in our index and each item has a description >>> located on a separate physical database. These item descriptions vary in >>> size and for the most part are quite large. Currently we are only indexing >>> items and not their corresponding description and a full import takes around >>> 4 hours. Ideally we want to index both our items and their descriptions but >>> after some quick profiling I determined that a full import would take in >>> excess of 24 hours. >>> >>> - How would I profile the indexing process to determine if the bottleneck is >>> Solr or our Database. >> >> As a data point, I routinely see clients index 5M items on normal >> hardware in approx. 1 hour (give or take 30 minutes). >> >> When you say "quite large", what do you mean? Are we talking books here or >> maybe a couple pages of text or just a couple KB of data? >> >> How long does it take you to get that data out (and, from the sounds of it, >> merge it with your item) w/o going to Solr? >> >>> - In either case, how would one speed up this process? Is there a way to run >>> parallel import processes and then merge them together at the end? Possibly >>> use some sort of distributed computing? >> >> DataImportHandler now supports multiple threads. The absolute fastest way >> that I know of to index is via multiple threads sending batches of documents >> at a time (at least 100). Often, from DBs one can split up the table via >> SQL statements that can then be fetched separately. You may want to write >> your own multithreaded client to index. > > SOLR-1301 is also an option if you are familiar with Hadoop ... > If the bottleneck is the DB, will that do much?
Re: Importing large datasets
On 2010-06-02 12:42, Grant Ingersoll wrote: > > On Jun 1, 2010, at 9:54 PM, Blargy wrote: > >> >> We have around 5 million items in our index and each item has a description >> located on a separate physical database. These item descriptions vary in >> size and for the most part are quite large. Currently we are only indexing >> items and not their corresponding description and a full import takes around >> 4 hours. Ideally we want to index both our items and their descriptions but >> after some quick profiling I determined that a full import would take in >> excess of 24 hours. >> >> - How would I profile the indexing process to determine if the bottleneck is >> Solr or our Database. > > As a data point, I routinely see clients index 5M items on normal > hardware in approx. 1 hour (give or take 30 minutes). > > When you say "quite large", what do you mean? Are we talking books here or > maybe a couple pages of text or just a couple KB of data? > > How long does it take you to get that data out (and, from the sounds of it, > merge it with your item) w/o going to Solr? > >> - In either case, how would one speed up this process? Is there a way to run >> parallel import processes and then merge them together at the end? Possibly >> use some sort of distributed computing? > > DataImportHandler now supports multiple threads. The absolute fastest way > that I know of to index is via multiple threads sending batches of documents > at a time (at least 100). Often, from DBs one can split up the table via SQL > statements that can then be fetched separately. You may want to write your > own multithreaded client to index. SOLR-1301 is also an option if you are familiar with Hadoop ... -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Importing large datasets
On Jun 1, 2010, at 9:54 PM, Blargy wrote: > > We have around 5 million items in our index and each item has a description > located on a separate physical database. These item descriptions vary in > size and for the most part are quite large. Currently we are only indexing > items and not their corresponding description and a full import takes around > 4 hours. Ideally we want to index both our items and their descriptions but > after some quick profiling I determined that a full import would take in > excess of 24 hours. > > - How would I profile the indexing process to determine if the bottleneck is > Solr or our Database. As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes). When you say "quite large", what do you mean? Are we talking books here or maybe a couple pages of text or just a couple KB of data? How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr? > - In either case, how would one speed up this process? Is there a way to run > parallel import processes and then merge them together at the end? Possibly > use some sort of distributed computing? DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index. -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Array of arguments in URL?
Those aren't in the default parameters. They are config for the SearchHandler itself. On Jun 1, 2010, at 9:00 PM, Lance Norskog wrote: > In the "/spell" declaration in the example solrconfig.xml, we find > these lines among the default parameters: > > > spellcheck > > > How does one supply such an array of strings in HTTP parameters? Does > Solr have a parsing option for this? > > -- > Lance Norskog > goks...@gmail.com
Re: logic for auto-index
You need to schedule your task. Check out the schedulers available in all programming languages. http://www.findbestopensource.com/tagged/job-scheduler Regards Aditya www.findbestopensource.com On Wed, Jun 2, 2010 at 2:39 PM, Jonty Rhods wrote: > Hi Peter, > > actually I want the index process should start automatically. right now I > am > doing mannually. > same thing I want to start indexing when less load on server i.e. late > night. So setting auto will fix my > problem.. > > On Wed, Jun 2, 2010 at 2:00 PM, Peter Karich wrote: > > > Hi Jonty, > > > > what is your specific problem? > > You could use a cronjob or the Java-lib called quartz to automate this > > task. > > Or did you mean replication? > > > > Regards, > > Peter. > > > > > Hi All, > > > > > > I am very new to solr as well as java too. > > > I require to use solrj for indexing also require to index automatically > once > > > in 24 hour. > > > I wrote java code for indexing now I want to do further coding for > > automatic > > > process. > > > Could you suggest or give me sample code for automatic index process.. > > > please help.. > > > > > > with regards > > > Jonty. > > > > > >
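If the indexing code is already written with solrj, one plain-JDK alternative to cron or Quartz is the java.util.concurrent scheduler; a minimal sketch (runFullIndex() is a placeholder for the existing indexing code, and the 6-hour initial delay is just an illustration of starting in a low-traffic window):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NightlyIndexer {
  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    Runnable indexJob = new Runnable() {
      public void run() {
        runFullIndex();   // call the existing SolrJ indexing code here
      }
    };
    // First run after 6 hours, then once every 24 hours.
    scheduler.scheduleAtFixedRate(indexJob, 6, 24, TimeUnit.HOURS);
  }

  static void runFullIndex() {
    // existing indexing logic goes here
  }
}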
RE: DIH, Full-Import, DB and Performance.
my batchSize is -1 and the load is too big for us. why should I increase it ? what is a normal server load ? our server is a fast server. 4 cores, 3 GB RAM, but we don't want a server load of over 2 when an index run starts. -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-Full-Import-DB-and-Performance-tp861068p864297.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query Question
What analyzer are you using to index and search? Check out schema.xml. You are currently using an analyzer which breaks up the words. If you don't want them broken up then you need to use a non-tokenized type such as "string" (or a KeywordTokenizer-based type). Regards Aditya www.findbestopensource.com On Wed, Jun 2, 2010 at 2:41 PM, M.Rizwan wrote: > Hi, > > I have solr 1.4. In schema i have a field called "title" of type "text" > Now problem is, when I search for "Test_Title" it brings all documents with > titles like "Test-Title", "Test_Title", "Test,Title", "Test Title", > "Test.Title" > What to do to avoid this? > > "Test_Title" should only return documents having title "Test_Title" > > Any idea? > > Thanks > > - Riz >
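A sketch of the schema.xml change this implies (the field name is from the question; "string" is the non-tokenized StrField type from the example schema):

<field name="title" type="string" indexed="true" stored="true"/>

With a string field, Test_Title is indexed as a single exact token, so only exact (or wildcard/prefix) matches will find it; the word-level matching that the text type gives is lost.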
Query Question
Hi, I have solr 1.4. In the schema I have a field called "title" of type "text". Now the problem is, when I search for "Test_Title" it brings back all documents with titles like "Test-Title", "Test_Title", "Test,Title", "Test Title", "Test.Title" What to do to avoid this? "Test_Title" should only return documents having title "Test_Title" Any idea? Thanks - Riz
Re: logic for auto-index
Hi Peter, actually I want the index process to start automatically. Right now I am doing it manually. Similarly, I want to start indexing when there is less load on the server, i.e. late at night. So setting it to auto will fix my problem.. On Wed, Jun 2, 2010 at 2:00 PM, Peter Karich wrote: > Hi Jonty, > > what is your specific problem? > You could use a cronjob or the Java-lib called quartz to automate this > task. > Or did you mean replication? > > Regards, > Peter. > > > Hi All, > > > > I am very new to solr as well as java too. > > I require to use solrj for indexing also require to index automatically > once > > in 24 hour. > > I wrote java code for indexing now I want to do further coding for > automatic > > process. > > Could you suggest or give me sample code for automatic index process.. > > please help.. > > > > with regards > > Jonty. > >
Re: logic for auto-index
Hi Jonty, what is your specific problem? You could use a cronjob or the Java-lib called quartz to automate this task. Or did you mean replication? Regards, Peter. > Hi All, > > I am very new to solr as well as java too. > I require to use solrj for indexing also require to index automatically once > in 24 hour. > I wrote java code for indexing now I want to do further coding for automatic > process. > Could you suggest or give me sample code for automatic index process.. > please help.. > > with regards > Jonty. >
logic for auto-index
Hi All, I am very new to solr as well as java too. I require to use solrj for indexing also require to index automatically once in 24 hour. I wrote java code for indexing now I want to do further coding for automatic process. Could you suggest or give me sample code for automatic index process.. please help.. with regards Jonty.