Re: DIH to index the data - 250 millions - Need a best architecture

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:23 AM, Santanu8939967892 wrote: Yes, your assumption is correct. The index size is around 250 GB and we index 20/30 meta data and store around 50. We have plan for a Solr cloud architecture having two nodes one Master and other one is replica of the master

DataImportHandler, BlobTransformer, FieldReaderDataSource and TikaEntityExtractor

2013-07-30 Thread Raymond Wiker
I have a case where I want to index documents and metadata content from a database. The metadata is not a problem, but it does not appear that I can handle the document content (held as BLOBS in the database) with out-of-the-box SOLR 4.4 functionality. I was hoping to be able to solve this by

Re: Solr Cloud - How to balance Batch and Queue indexing?

2013-07-30 Thread Aditya
Hi, Do you want 5 replicas? 1 or 2 is enough. If you already have 100 million records, you don't need to do batch indexing. Push it once, Solr has the capability to soft commit every N docs. Use round robin and send documents to different cores. When you search, search from all the cores. How
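
A minimal sketch of the round-robin distribution suggested above, assuming the client decides which core each document goes to. The core URLs and the helper name `assign_round_robin` are hypothetical and only illustrate the rotation; no request is actually sent.

```python
from itertools import cycle


def assign_round_robin(docs, core_urls):
    """Pair each document with a core URL, cycling through the cores in order."""
    targets = cycle(core_urls)
    return [(next(targets), doc) for doc in docs]


# Hypothetical core update URLs, for illustration only.
CORES = [
    "http://localhost:8983/solr/core0/update",
    "http://localhost:8983/solr/core1/update",
]

plan = assign_round_robin([{"id": "1"}, {"id": "2"}, {"id": "3"}], CORES)
for url, doc in plan:
    # A real client would POST the document to this URL here.
    print(url, doc["id"])
```

With two cores the third document wraps around to the first core again, so the load stays even as long as documents are of similar size.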

Re: Machine memory full

2013-07-30 Thread Ranjith Venkatesan
Thanks for the reply. I think this approach will work only for new collections. Is there any approach to shift some existing cores to a new machine or node?? -- View this message in context: http://lucene.472066.n3.nabble.com/Machine-memory-full-tp4080511p4081235.html Sent from the Solr - User

Re: Auto Correction of Solr Query

2013-07-30 Thread sivaprasad
Thank you for the quick response. I checked the document on spellcheck.collate. It looks like it is going to return the suggestion to the client, and the client needs to make one more request to the server with the suggestion. Is there any way to auto correct at the server end?

Index timestamp of pdf in unix timeformat

2013-07-30 Thread xan
Currently, while using ExtractingRequestHandler to index rich documents like pdfs, docs, etc., Solr automatically indexes the time-created/modified in human-readable time format (Wed May 29 20:38:30 IST 2013). How can I make Solr index the time in unixtime format?
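
One possible client-side workaround, sketched under assumptions: convert the human-readable value to Unix time before (or after) indexing. The abbreviation table `TZ_OFFSETS` and the helper `to_unix_seconds` are hypothetical, and treating `IST` as UTC+05:30 is an assumption, since the abbreviation is ambiguous. This is not a Solr feature.

```python
from datetime import datetime, timedelta, timezone

# Assumed abbreviation-to-offset table; extend it for other zones as needed.
TZ_OFFSETS = {
    "IST": timedelta(hours=5, minutes=30),  # assumed to mean India Standard Time
    "UTC": timedelta(0),
    "GMT": timedelta(0),
}


def to_unix_seconds(text):
    """Convert e.g. 'Wed May 29 20:38:30 IST 2013' to seconds since the epoch."""
    parts = text.split()
    tz_abbrev = parts.pop(4)  # the zone abbreviation is the fifth token
    naive = datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")
    aware = naive.replace(tzinfo=timezone(TZ_OFFSETS[tz_abbrev]))
    return int(aware.timestamp())


print(to_unix_seconds("Wed May 29 20:38:30 IST 2013"))
```

The zone token is removed before parsing because `strptime` does not reliably handle arbitrary zone abbreviations.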

Re: Shows different result with using 'and' and 'AND'

2013-07-30 Thread Payal.Mulani
Hi Raymond Wiker, When we search like this 1) tag:”test” works 2) tag:”TEST” works 3) tag:”test” tag:”other” works to find items with both tags 4) tag:”TEST” tag:”other” *doesn’t work.* Either 2 should fail with true case sensitivity or 4 should work (as the combination of two valid

Re: Synonyms with wildcard search

2013-07-30 Thread Jack Krupansky
Sorry, but Solr synonym processing does not know about wildcards, so it is bypassed when a wildcard is present. Technically, it could probably be enhanced to support them, at least for some common special cases such as yours, but that prospect won't help you right now. Your best bet is to

Re: Shows different result with using 'and' and 'AND'

2013-07-30 Thread Jack Krupansky
#3 and #4 are different queries - the other term is used in different fields. What is your default search field, which will be used for other in #3? Is your tag field a string field type? If so, then it is case sensitive. If you really need it to be case insensitive, make it a text field

Re: Improper shutdown of Solr in Jetty 9

2013-07-30 Thread Artem Karpenko
After some investigation I found that the problem is not with Jetty's version but with usage of the --exec flag. Namely, when --exec is used (to specify JVM args) then shutdown is not graceful; it seems that the Java process is just killed. Not sure how to handle this... Regards, Artem Karpenko.

Re: Improper shutdown of Solr in Jetty 9

2013-07-30 Thread Artem Karpenko
Uh, sorry for spamming, but if anyone is interested there is a way to properly shut down Jetty when it's launched with the --exec flag. You can use JMX to invoke the method stop() on Jetty's Server MBean. This triggers a proper shutdown with all Solr's close() callbacks executed. I wonder why it's not

Re: Improper shutdown of Solr in Jetty 9

2013-07-30 Thread Alexandre Rafalovitch
Thanks for letting us know. See if you can add it to the documentation somewhere. Solr is not using Tomcat 9, but I believe that was primarily because Tomcat 9 requires Java 7 and Solr 4.x is staying with Java 6 as minimum requirement. Regards, Alex. Personal website:

Re: Improper shutdown of Solr in Jetty 9

2013-07-30 Thread Alexandre Rafalovitch
Of course, I meant Jetty (not Tomcat). So apologies for spam and confusion of my own. The rest of the statement stands. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at

Re: SolrCloud commit process is too time consuming, even if documents are light

2013-07-30 Thread Mark Miller
I don't seem to be seeing a significant slowdown over time when I use the old defaults for merge threads and max merges. - Mark On Jul 25, 2013, at 10:17 AM, Mark Miller markrmil...@gmail.com wrote: I'm looking into some possible slow down after long indexing issues when I get back from

Re: Email regular expression.

2013-07-30 Thread Jack Krupansky
Just use the UAX29URLEmailTokenizerFactory, which recognizes email addresses. Any particular reason that you're trying to reinvent the wheel? -- Jack Krupansky -Original Message- From: Luis Cappa Banda Sent: Tuesday, July 30, 2013 10:53 AM To: solr-user@lucene.apache.org Subject:

Re: CachedSqlEntityProcessor not adding fields

2013-07-30 Thread Luis Lebolo
I'm noticing some very odd behavior using dataimport from the Admin UI. Whenever I limit the number of rows to 75 or below, the aliases field never gets populated. As soon as I increase the limit to 76 or more, the aliases field gets populated! What am I not understanding here? On Tue, Jul 30,

Using HP SiteScope to monitor individual Solr shards

2013-07-30 Thread Ali, Saqib
We would like to use HP SiteScope to monitor the availability of the individual Solr shards. Any ideas on how we can do that? Is there a shard based URL that is a sure shot of knowing that the shard is feeling healthy? Thanks! :)

Re: Boost on specific fields

2013-07-30 Thread Chris Hostetter
: coming as part of search results. Here, I am applying boosting on the no of : reviews and the has_image(This will be Y Or N) and I am expecting the : product which has no of reviews count is more and the has_image=Y should : come first. But, in some of the cases , I am not getting what I am :

Re: Shows different result with using 'and' and 'AND'

2013-07-30 Thread Erick Erickson
Try attaching debug=query and see what the parsed query looks like; that can often give you clues as to what's really going on. Of course if tag is a string type then Jack's comment is spot on, it's case sensitive. The admin/analysis page will also help you understand the analysis chains. But also,

Re: Sort top N results in solr after boosting

2013-07-30 Thread Chris Hostetter
: bq: I am also trying to figure out if I can place : extra dimensions to the solr score which takes other attributes into : consideration To re-iterate erick's point, you should definitely look at using things like the {!boost} qparser combined with function queries that take into account

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
Until I get the data refed I there was another field (a date field) that was there and not when the geo field was/was not... i tried that field:* and query times come down to 2.5s .. also just removing that filter brings the query down to 30ms.. so I'm very hopeful that with just a boolean i'll be

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
I am curious why the field:* walks the entire terms list.. could this be discovered from a field cache / docvalues? steve On Tue, Jul 30, 2013 at 2:00 PM, Steven Bower sbo...@alcyon.net wrote: Until I get the data refed I there was another field (a date field) that was there and not when the

Trying to determine the benefit of spellcheck-based suggester vs. using terms component?

2013-07-30 Thread Timothy Potter
Going over the comments in SOLR-1316, I seem to have lost the forest for the trees. What is the benefit of using the spellcheck based suggester over something like the terms component to get suggestions as the user types? Maybe it is faster because it builds the in-memory data structure on

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Aloke Ghoshal
Does adding facet.mincount=2 help? On Tue, Jul 30, 2013 at 11:46 PM, Dotan Cohen dotanco...@gmail.com wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving
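
For reference, the facet request discussed in this thread can be assembled programmatically. This is a minimal sketch: the base URL and the helper name `dupe_facet_url` are assumptions, and the parameters are only the ones quoted in the messages (`q`, `facet`, `facet.field`, `rows`, `facet.mincount`). It builds the URL and does not send it.

```python
from urllib.parse import urlencode


def dupe_facet_url(base, field="id", mincount=2):
    """Build a facet query that only reports values seen at least `mincount` times."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.field", field),
        ("rows", 0),
        ("facet.mincount", mincount),
    ]
    # Keep '*' and ':' readable instead of percent-encoding them.
    return base + "?" + urlencode(params, safe="*:")


# Hypothetical base URL, for illustration only.
url = dupe_facet_url("http://localhost:8983/solr/select")
print(url)
```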

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:16 PM, Dotan Cohen wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: snip Might there be a

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Are you talking about the document's ID field? If so, you can't have duplicates... the latter document would overwrite the earlier. If not, sorry for asking irrelevant questions. :) Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes. -- Dotan Cohen http://gibberish.co.il

Re: Sort top N results in solr after boosting

2013-07-30 Thread Utkarsh Sengar
Thanks guys! Will play around with it function query. Thanks, -Utkarsh On Tue, Jul 30, 2013 at 10:50 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : bq: I am also trying to figure out if I can place : extra dimensions to the solr score which takes other attributes into : consideration

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Are you talking about the document's ID field? If so, you can't have duplicates... the latter document would overwrite the earlier. If not, sorry for asking irrelevant questions. :) In Solr 4.1 we

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Since this is a one-time problem, Have you thought of just dumping all the IDs and looking for dupes using sort and awk or something similar to that? Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. “The Science of Influence Marketing” 18
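
A rough Python equivalent of the "dump the IDs and look for dupes" idea, assuming the IDs have already been exported one per line. The helper name `find_duplicate_ids` and the sample data are hypothetical. Note that this keeps every distinct ID in memory, which a sort-based pipeline avoids.

```python
from collections import Counter


def find_duplicate_ids(lines):
    """Return {id: count} for every ID that appears more than once."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Stand-in for the lines of an exported ID file.
sample = ["a1\n", "b2\n", "a1\n", "c3\n", "b2\n", "a1\n"]
print(find_duplicate_ids(sample))
```

For a real dump, an open file object can be passed directly in place of `sample`, since the function only iterates over lines.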

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Since this is a one-time problem, Have you thought of just dumping all the IDs and looking for dupes using sort and awk or something similar to that? All 100,000,000 of them :) That would take even

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:49 PM, Dotan Cohen wrote: Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no way to judge the amount of memory required for a particular query to run?

poor facet search performance

2013-07-30 Thread Robert Stewart
A little bit of history: We built a solr-like solution on Lucene.NET and C# about 5 years ago, which including faceted search. In order to get really good facet performance, what we did was pre-cache all the facet fields in RAM as efficient compressed data structures (either a variable byte

Re: Email regular expression.

2013-07-30 Thread Luis Cappa Banda
Hello, Jack, Steve, Thank you for your answers. I´ve never used UAX29URLEmailTokenizerFactory, but I´ve read about it before trying RegExp´s queries. As far as I know, UAX29URLEmailTokenizerFactory allows to tokenize an entry text value into patterns that match URLs, E-mails, etc. Reading the

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Mikhail Khludnev
Dotan, Could you please provide more lines of the stack trace? I have no idea why it got worse in 4.3. I know that 4.3 can use facets backed by DocValues, which are modest for the heap. But from what I saw (I may be wrong), it's disabled for numeric facets. Hence, I can suggest to reindex id as

Re: Email regular expression.

2013-07-30 Thread Luis Cappa Banda
Hello guys, Hey, I think I´ve found how to do this just adding a filter. Just for anyone´s curiosity: <fieldType name="emails" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer> <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/> <filter

Re: poor facet search performance

2013-07-30 Thread Mikhail Khludnev
On Tue, Jul 30, 2013 at 11:48 PM, Robert Stewart robert_stew...@epam.com wrote: Also we need to issue frequent commits since we are constantly streaming new content into the system. I'd like to say show me profiler snapshot, but after that note. Solr's filter/field caches are top level

Re: Email regular expression.

2013-07-30 Thread Luis Cappa Banda
I´ve tried this kind of queries in the past but I detected that they have a poor performance and that they are incredibly slow. But it´s just my experience, maybe someone can share with us any other opinion. 2013/7/30 Raymond Wiker rwi...@gmail.com On Jul 30, 2013, at 22:05 , Luis Cappa Banda

Re: Performance question on Spatial Search

2013-07-30 Thread Smiley, David W.
Steve, The FieldCache and DocValues are irrelevant to this problem. Solr's FilterCache is, and Lucene has no counterpart. Perhaps it would be cool if Solr could look for expensive field:* usages when parsing its queries and re-write them to use the FilterCache. That's quite doable, I think. I

Re: Performance question on Spatial Search

2013-07-30 Thread Luis Cappa Banda
Hey, David, I´ve been reading the thread and I think that is one of the most educative mail-threads I´ve read in Solr mailing list. Just for curiosity: internally for Solr, is it the same a query like field:* and field:[* TO *]? I think that it´s expected to receive the same number of numFound

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
Very good read... Already using MMap... verified using pmap and vsz from top.. not sure what you mean by good hit ratio? Here are the stacks... Name Time (ms) Own Time (ms) org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(AtomicReaderContext, Bits) 300879 203478

Re: Performance question on Spatial Search

2013-07-30 Thread Steven Bower
@David I will certainly update when we get the data refed... and if you have things you'd like to investigate or try out please let me know.. I'm happy to eval things at scale here... we will be taking this index from its current 45m records to 6-700m over the next few months as well.. steve On

Re: Performance question on Spatial Search

2013-07-30 Thread Smiley, David W.
Luis, field:* and field:[* TO *] are semantically equivalent -- they have the same effect. But they internally work differently depending on the field type. The field type has the chance to intercept the range query to do something smart (FieldType.getRangeQuery(...)). Numeric/Date (trie)

Re: Performance question on Spatial Search

2013-07-30 Thread Luis Cappa Banda
Thank you very much, David. That was a great explanation! Regards, - Luis Cappa 2013/7/30 Smiley, David W. dsmi...@mitre.org Luis, field:* and field:[* TO *] are semantically equivalent -- they have the same effect. But they internally work differently depending on the field type. The

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Jack Krupansky
You could also try the terms component which provides a very efficient facet-like feature - counting the terms. And you can set a minimum term frequency of 2, so only the dups would come back: curl "http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2" -- Jack Krupansky -Original
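
A sketch of how a client might read the result of such a terms request. It assumes the response was requested as JSON and that the terms for a field arrive as a flat list alternating term and count; that layout, the sample payload, and the helper name `terms_to_counts` are assumptions to verify against the Solr version in use.

```python
import json


def terms_to_counts(response_text, field):
    """Turn a flat [term, count, term, count, ...] list into a {term: count} dict."""
    flat = json.loads(response_text)["terms"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Made-up payload in the assumed layout, for illustration only.
sample = '{"terms": {"id": ["doc-7", 3, "doc-12", 2]}}'
print(terms_to_counts(sample, "id"))
```

With terms.mincount=2 on the request, every key in the resulting dict would be an ID that occurs more than once.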

Re: Solr Cloud Questions

2013-07-30 Thread Timothy Potter
1) Depends on your document routing strategy. It sounds like you could be using the compositeId strategy and if so, there's still a hash range assigned to each shard, so you can split the big shards into smaller shards. 2) Since you're replicating in 2 places, when one of your servers crash,

Ingesting geo data into Solr very slow

2013-07-30 Thread Simonian, Marta M (US SSA)
Hi, We are using Solr 4.4 to ingest geo data and it's really slow. When we don't index the geo it takes seconds to ingest 100,000 records but as soon as we add it takes 2 hours. Also we found that when changing the distErrPct from 0.025 to 0.1, 1000 rows are ingested in 20 sec vs 2 min. But

FieldCollapsing issues in SolrCloud 4.4

2013-07-30 Thread Ali, Saqib
Hello all, Is anyone experiencing issues with the numFound when using group=true in SolrCloud 4.4? Sometimes the results are off for us. I will post more details shortly. Thanks.

Solr rss indexation doubt

2013-07-30 Thread Luís Portela Afonso
Hi, I'm using Apache Solr to index RSS Feeds. I'm successfully getting data (the URL and whether the feed is active for indexing) from a database, and using that as the source of an entity to index the RSS data. I'm trying to reach a result but I don't get it. I will try to explain that with an example. The

Measuring SOLR performance

2013-07-30 Thread Roman Chyla
Hello, I have been wanting some tools for measuring performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it on the problem of garbage collectors (see

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Bill Bell
This seems like a fairly large issue. Can you create a Jira issue ? Bill Bell Sent from mobile On Jul 30, 2013, at 12:34 PM, Dotan Cohen dotanco...@gmail.com wrote: On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal alghos...@gmail.com wrote: Does adding facet.mincount=2 help? In fact,

Re: [ANN] Solr Usability contest started

2013-07-30 Thread Alexandre Rafalovitch
Hello. I wanted to do a follow-up after the contest has been running for a week. It has been going relatively well. There were a lot of visitors last week, then a bit of quiet and then - after some of you re-announced the contest - a second wave of activity. Thanks to everybody contributing and

Re: Measuring SOLR performance

2013-07-30 Thread Shawn Heisey
On 7/30/2013 6:59 PM, Roman Chyla wrote: I have been wanting some tools for measuring performance of SOLR, similar to Mike McCandless' lucene benchmark. So yet another monitor was born; it is described here: http://29min.wordpress.com/2013/07/31/measuring-solr-query-performance/ I tested it

solr4.0 how the log repeat Waiting for client to connect to Zookeeper

2013-07-30 Thread 黄飞鸿
Hi, The solr4.0’s log always show that “Waiting for client to connect to ZooKeeper” and “Client is connected to Zookeeper” , But I look at the code , it only happen when “state == KeeperState.Expired”. We can see the value of state is syncConnected, how did it happen? Can anyone

Re: Ingesting geo data into Solr very slow

2013-07-30 Thread David Smiley (@MITRE.org)
Hi Marta, Presumably you are indexing polygons -- I suspect complex ones. There isn't too much that you can do about this right now other than index them in parallel. I see you are doing this in 2 threads; try 4, or maybe even 6. Also, ensure that maxDistErr is reflective of the smallest

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey s...@elyograg.org wrote: On 7/30/2013 12:49 PM, Dotan Cohen wrote: Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Dotan, Could you please provide more line of the stack trace? Sure, thanks: <response><lst name="error"><str name="msg">java.lang.OutOfMemoryError: Java heap space</str><str name="trace">java.lang.RuntimeException:

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky j...@basetechnology.com wrote: The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... any particular reason you did not use it? See: http://wiki.apache.org/solr/Deduplication and