RE: Facet sort numeric values
Oh brilliant, didn't think of it being possible to configure it that way. I had made my own untokenized type, so I guess it would be better for me to control the datatype this way.

Bonus question (hehe): what if these field values also contain alphanumeric values, e.g. Alpha, Bravo, Omega, ...? How would this affect the sorting? I guess the TrieIntField is not applicable then.

Aleksander Akerø @ Gurusoft AS
Mobil: 944 89 054

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: 14. august 2012 17:45
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values

: I'm having a problem with sorting facets. I am using the facet.sort=index
: parameter and it works fine for most of the values.
	...
: Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
: 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

What field type are you using? If you use one of the Trie___Field types then the facet values should sort exactly as you describe.

  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

-Hoss
Fwd: Solr 3.5 result grouping is failing
Hi,

I'm trying to group (field collapse) my search results on a field called "site". The schema says it is indexed: <field name="site" type="string" stored="false" indexed="true"/>. But when I query with group.field=site&group.limit=100, I see only 1 group of results being returned, and the group value is null.

This seems to work on another Solr instance which only has a few documents indexed; it seems to fail on bigger indexes. Help is appreciated.

Thanks
Chethan

Sent this message again as it seemed to bounce the first time.
Re: offsets issues with multiword synonyms since LUCENE_33
I don't know whether this was discussed previously, but you can tell the SynonymFilter not to break up your synonyms (breaking them up might be the default); otherwise the parts of multi-word synonyms get new word positions. You can use a KeywordTokenizer to avoid that behaviour:

  <filter class="solr.SynonymFilterFactory" synonyms="Synonyms.txt" ignoreCase="true" expand="false" tokenizerFactory="solr.KeywordTokenizerFactory"/>

with regards, konrad.

On 14.08.2012 18:51, Marc Sturlese wrote:

Well, an example would be:

synonyms.txt: huge,big size

Then I have the docs:
1- The huge fox attacks first
2- The big size fox attacks first

Then if I query for "huge", the highlights for each document are:
1- The <strong>huge</strong> <strong>fox</strong> attacks first
2- The <strong>big size</strong> fox attacks first

The analyzer looks like this:

  <fieldType name="sy_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
    </analyzer>
  </fieldType>

This was working with a previous version of Solr (couldn't make it work with 3.6, 4-alpha nor 4-beta).
Re: Query regarding dataimporthandler
There is no way to do it within DataImportHandler, but you can configure autoCommit in solrconfig.xml to automatically commit pending updates by time or number of documents.

On Tue, Aug 14, 2012 at 4:11 PM, ravicv ravichandra...@gmail.com wrote:

Hi,

Is there any way to do intermediate commits while indexing data using DataImportHandler? I am using Solr 1.4.

My problem is: sometimes while indexing huge data (about 4 GB), if any user searches while the commit process is still going on, Solr sometimes throws a heap space error. My data before the commit operation is nearly 8 GB, but after both commit and optimize are done it reduces to 4 GB. I am using the full-import option.

Any ideas?

Thanks,
ravichandra

--
Regards,
Shalin Shekhar Mangar.
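For reference, a minimal autoCommit block in solrconfig.xml looks roughly like this; the thresholds are illustrative, not recommendations:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>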
Re: scanned pdf with solr cell
When I send a scanned pdf to the extraction request handler, the icon below appears in my Dock.

http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6

I found that text-extractable pdf files trigger the weird icon too.

curl "http://localhost:8983/solr/update/extract?literal.id=solr-word&commit=true" -F myfile=@solr-word.pdf

I wrote a standalone Java program using Tika. When extracting text from all kinds of pdf files, that weird icon pops up :) I will ask the tika ML about this.

  AutoDetectParser _autoParser = new AutoDetectParser();
  File file = new File("solr-word.pdf");
  BodyContentHandler textHandler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  ParseContext context = new ParseContext();
  InputStream input = new FileInputStream(file);
  _autoParser.parse(input, textHandler, metadata, context);
  System.out.println("text : " + textHandler.toString());
  input.close();
  while (true) { }  // keep the JVM alive so the Dock icon can be observed
Re: scanned pdf with solr cell
Ahmet,

the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using headless mode, but this is likely to trigger an exception. Same if your user is not UI-logged-in.

hope it helps.

Paul

On 15 August 2012 at 01:30, Ahmet Arslan wrote:

Hi All,

I have a set of rich documents; some of them are scanned pdf files. When I send a scanned pdf to the extraction request handler, the icon below appears in my Dock.

http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6

Does anyone know what this is?

curl "http://localhost:8983/solr/documents/update/extract?literal.ID=ticaret_sicil_gazetesi&literal.URL=ticaret_sicil_gazetesi&commit=true" -F myfile=@ticaret_sicil_gazetesi.pdf

No exception is seen in the Solr logs. The doc is indexed; the content field is:

xmpTPg:NPages 4
Creation-Date 2011-08-24T13:03:16Z
stream_source_info myfile
created Wed Aug 24 16:03:16 EEST 2011
stream_content_type application/octet-stream
stream_size 2302337
producer Image Recognition Integrated Systems, Autoformat5,0,0,229
stream_name ticaret_sicil_gazetesi.pdf
Content-Type application/pdf
creator I.R.I.S.
page page page page

Environment: solr-trunk, Mac OS X Version 10.7.4, Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode), jetty. Same thing happens with Solr 4.0-beta and Tomcat too.

Thanks,
Re: scanned pdf with solr cell
> the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using headless mode, but this is likely to trigger an exception. Same if your user is not UI-logged-in.

Hi Paul, thanks for the explanation. So is it nothing to worry about?
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck. It keeps throwing OOM… can someone please help… as it is critical for my project.

The query I am trying: under a folder there are 1000 files, and I am putting a filter query param too, asking it to group by filename or url, and none of them work… what am I doing wrong here?

http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true

The stack trace:

SEVERE: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Unknown Source)
        at java.lang.String.<init>(Unknown Source)
        at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
        at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
        at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
        at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
        at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
        at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
        at
Re: scanned pdf with solr cell
On 15 August 2012 at 13:03, Ahmet Arslan wrote:

> Hi Paul, thanks for the explanation. So is it nothing to worry about?

It is nothing to worry about, except to remember that you can't run this step in a daemon-like process. (On Linux, I had to set up a VNC server for similar tasks.)

paul
Re: Switch from Sphinx to Solr - some basics please
> Because I have posted this on Stack Overflow, I don't want duplicate questions there. Can you please read this post: http://stackoverflow.com/questions/11956608/sphinx-user-is-switching-to-solr

Your questions require Sphinx knowledge. I suggest you read these book(s):

http://lucene.apache.org/solr/books.html
http://www.manning.com/hatcher3/

> I have in Sphinx: min_word_len ... How to use this in Solr?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/#solr.LengthFilterFactory
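For a rough equivalent of Sphinx's min_word_len, a LengthFilterFactory in the analyzer chain might look like this (the min/max values here are illustrative):

  <filter class="solr.LengthFilterFactory" min="4" max="255"/>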
Re: Switch from Sphinx to Solr - some basics please
Hi iorixxx,

thanks for the reply. Well, you don't need Sphinx knowledge to answer my questions. I have written what I want:

1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr? I have, for example, over 100 different indexes that should be seen as separate data. These indexes should be reindexed at different times, and their data should not be mixed with each other.

You need to understand the following situation: I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

Thanks
Nik
How to design index for related versioned database records
Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables, and in each table the records are versioned - this means that one real record can exist multiple times in a table, each with different validFrom/validUntil dates. Therefore it is possible to query the valid version of a record for a given point in time.

The relations of the data are something like this:
Employee - LinkTable (=Employment) - Employer - LinkTable (=offered services) - Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on the services they offer at a given point in time. Therefore I have built an index of all employees and employers with their services as a subentity. So I have one index entry for every version of every employee/employer, and each version collects the offered services for the given timeframe of the employee/employer version.

Problem: the offered services of an employee/employer can change during its validity period. That means I need to take into account not only the version timespan of the employee/employer but also the version timespans of the services and the link tables.

***Question***
I think I could continue with my strategy of having an index entry of an employee/employer with its services for any given point in time. But there are many more entries than now, since every involved validFrom/validUntil period (if they overlap) produces more entries. I am not sure if this is a good strategy, or if it would be better to index the whole data structure in some other way.

Are there any recommendations for how to handle such a case?

Thanks for any help
Stephan
Re: Switch from Sphinx to Solr - some basics please
> 1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr?

Please see: http://search-lucene.com/m/6rYti2ehFZ82

> I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

http://wiki.apache.org/solr/MultipleIndexes talks about different solutions. One big index with fq is an option too.

> 2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

I don't understand this. What do you mean by "rotate the whole index"?
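For illustration, a multi-core solr.xml along these lines defines independent per-country indexes in one Solr instance (the core names are made up):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="jobs_country_a" instanceDir="jobs_country_a"/>
      <core name="jobs_country_b" instanceDir="jobs_country_b"/>
      <!-- ...one core per country, each with its own conf/ and data/ -->
    </cores>
  </solr>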
Re: RAMDirectoryFactory bug
Hi, Lance,

Thanks for your reply! It seems as if RAMDirectoryFactory is being passed the correct path to the index, as it's being logged correctly. It just doesn't recognize it as an index.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Tue, Aug 14, 2012 at 9:57 PM, Lance Norskog goks...@gmail.com wrote:

I can't remember the property name, but there is a Solr Java property that tells where to hunt for the data/ directory. You might be able to work around this bug using that property.

On Tue, Aug 14, 2012 at 1:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Hi everyone,

It looks like I found a bug with RAMDirectoryFactory (I know, I know...) It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:

WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...

even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.

Should I file a bug?

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

--
Lance Norskog
goks...@gmail.com
Re: How to design index for related versioned database records
The date checking can be implemented using a range query as a filter query, such as:

fq=startDate:[* TO NOW] AND endDate:[NOW TO *]

(You can also use an frange query.)

Then you will have to flatten the database tables. Your Solr schema would have a single merged record type. You will have to decide whether the different record types (tables) will have common fields versus static qualification by adding a prefix or suffix, e.g., "name" vs. "employee_name" and "employer_name". The latter has the advantage that you do not have to separately specify a table-type field, since the fields would be empty for records of other types.

-- Jack Krupansky

-----Original Message-----
From: Stefan Burkard
Sent: Wednesday, August 15, 2012 8:12 AM
To: solr-user@lucene.apache.org
Subject: How to design index for related versioned database records

Hi solr-users

I have a case where I need to build an index from a database.

***Data structure***
The data is spread across multiple tables, and in each table the records are versioned - this means that one real record can exist multiple times in a table, each with different validFrom/validUntil dates. Therefore it is possible to query the valid version of a record for a given point in time.

The relations of the data are something like this:
Employee - LinkTable (=Employment) - Employer - LinkTable (=offered services) - Service

That means I have data across 5 relations, each of them with versioned records.

***Search needs***
Now I need to be able to search for employees and employers based on the services they offer at a given point in time. Therefore I have built an index of all employees and employers with their services as a subentity. So I have one index entry for every version of every employee/employer, and each version collects the offered services for the given timeframe of the employee/employer version.

Problem: the offered services of an employee/employer can change during its validity period. That means I need to take into account not only the version timespan of the employee/employer but also the version timespans of the services and the link tables.

***Question***
I think I could continue with my strategy of having an index entry of an employee/employer with its services for any given point in time. But there are many more entries than now, since every involved validFrom/validUntil period (if they overlap) produces more entries. I am not sure if this is a good strategy, or if it would be better to index the whole data structure in some other way.

Are there any recommendations for how to handle such a case?

Thanks for any help
Stephan
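As a minimal sketch of the flattened approach (the field names and types below are assumptions, not from Stefan's schema), each document is one validity slice of one entity:

  <field name="employee_name" type="string" indexed="true" stored="true"/>
  <field name="employer_name" type="string" indexed="true" stored="true"/>
  <field name="service"       type="string" indexed="true" stored="true" multiValued="true"/>
  <field name="startDate"     type="tdate"  indexed="true" stored="true"/>
  <field name="endDate"       type="tdate"  indexed="true" stored="true"/>

With that layout, fq=startDate:[* TO NOW] AND endDate:[NOW TO *] picks the slice that is valid right now, and substituting a fixed date for NOW queries any other point in time.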
Re: scanned pdf with solr cell
You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects.

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 7:07 AM, Paul Libbrecht p...@hoplahup.net wrote:

On 15 August 2012 at 13:03, Ahmet Arslan wrote:

> Hi Paul, thanks for the explanation. So is it nothing to worry about?

It is nothing to worry about, except to remember that you can't run this step in a daemon-like process. (On Linux, I had to set up a VNC server for similar tasks.)

paul
Re: scanned pdf with solr cell
> You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects.

I started Jetty using 'java -Djava.awt.headless=true -jar start.jar' and successfully indexed two pdf files. That icon didn't appear :)

Thanks!
Re: RAMDirectoryFactory bug
On Aug 14, 2012, at 4:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

> Hi everyone,
>
> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)

Fair warning - RAMDir use in Solr is like a third-class citizen. You probably should be using the mmap dir anyway. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

> It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:
>
> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...
>
> even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so it does sound like a bug perhaps.

> I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.
>
> Should I file a bug?

Sure.

> Michael Della Bitta
> Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
> www.appinions.com
> Where Influence Isn’t a Game

- Mark Miller
lucidimagination.com
Re: RAMDirectoryFactory bug
Yes, moving to mmap was on our roadmap. I'm in the middle of moving our infrastructure from 1.4 to 3.6.1 and didn't want to make too many changes at the same time. However, this bug might push us over the edge to mmap and away from RAM. I'll file a bug regardless. Thanks!

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 9:05 AM, Mark Miller markrmil...@gmail.com wrote:

On Aug 14, 2012, at 4:34 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

> Hi everyone,
>
> It looks like I found a bug with RAMDirectoryFactory (I know, I know...)

Fair warning - RAMDir use in Solr is like a third-class citizen. You probably should be using the mmap dir anyway. See http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

> It doesn't seem to be able to load files off the disk. Every time it starts up, it logs:
>
> WARNING: [] Solr index directory 'solr/./data/index' doesn't exist. Creating new index...
>
> even if that filesystem path exists and there's a valid index there (verified by switching back to StandardDirectoryFactory).

I think it *should* work how you want, so it does sound like a bug perhaps.

> I experienced this first on our infrastructure on AWS, but I confirmed it by downloading the Solr 3.6.1 distribution fresh, indexing the exampledocs, stopping Jetty, reconfiguring for RAMDirectoryFactory, and restarting Jetty. The statement above gets logged, but otherwise the core comes up OK, but empty.
>
> Should I file a bug?

Sure.

- Mark Miller
lucidimagination.com
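For reference, a sketch of that switch in solrconfig.xml, assuming Solr 3.6+ where MMapDirectoryFactory ships in the distribution:

  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>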
RE: Solr 4.0 - Join performance
You would index rectangles of 0 height that have a left edge x of the start time and a right edge x of the end time. You can index a variable number of these per Solr document and then query by either a point or another rectangle to find documents which intersect your query shape. It can't do a completely-within based query, just intersection for now.

I really look forward to seeing this wrapped up in some sort of RangeFieldType so that users don't have to think in spatial terms.

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
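For illustration only - this is not a concrete recipe from David, and the field name "times" is made up - the idea maps a numeric range onto a zero-height rectangle in a non-geo spatial field. A document whose span runs from 100 to 200 would be indexed with the rectangle value

  100 0 200 0

and a query for every document whose span contains the point 150 would look something like

  fq=times:"Intersects(150 0)"

while a rectangle argument such as Intersects(120 0 180 0) would find documents whose spans overlap the range 120-180.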
Re: Index not loading
On Tue, Aug 14, 2012 at 5:37 PM, Jonatan Fournier jonatan.fourn...@gmail.com wrote:

On Tue, Aug 14, 2012 at 10:25 AM, Erick Erickson erickerick...@gmail.com wrote:

> This is quite odd, it really sounds like you're not actually committing. So, some questions.
>
> 1 What happens if you search before you shut down your Tomcat? Do you see docs then? If so, somehow you're doing soft commits and never doing a hard commit.

Yeah, I just realized the behavior is the same as softCommit. Is that the default for commitWithin?

Cheers,
/jonatan

> 2 What happens if, as the last statement in your SolrJ program, you do a commit()?

When using commitWithin, if I introduce server.commit() within the data load process the data gets committed (I didn't reproduce with my 89G of data...). If I shut down my EmbeddedServer, restart it, and send a commit, like on Tomcat, all data gets wiped out too. So I guess there's state loss somewhere.

Cheers,
/jonathan

> 3 While you're indexing, what do you see in your index directory? You should see multiple segments being created, and possibly merged, so the number of files should go up and down. If you only have a single set of files, you're somehow not doing a commit.
>
> 4 Is there something really silly going on, like your restart scripts delete the index directory? Or you're using a VM that restores a blank image?
>
> 5 When you do restart, are there any files at all in your index directory?
>
> I really suspect you've got some configuration problem here
>
> Best
> Erick

On Mon, Aug 13, 2012 at 9:11 AM, Jonatan Fournier jonatan.fourn...@gmail.com wrote:

Hi,

I'm using Solr 4.0.0-ALPHA and the EmbeddedSolrServer. Within my SolrJ application, the documents are added to the server using the commitWithin parameter (in my case 60s). After 1 day my 125 million documents are all added to the server and I can see 89G of index data files.

I stop my SolrJ application and reload my Solr instance in Tomcat. From the Solr admin panel related to my core (collection1) I see this info:

Last Modified:
Num Docs: 0
Max Doc: 0
Version: 1
Segment Count: 0
Optimized: (green check)
Current: (green check)
Master: Version: 0, Gen: 1, Size: 88.14 GB

From the general Core Admin panel I see:

lastModified:
version: 1
numDocs: 0
maxDoc: 0
optimized: (red circle)
current: (green check)
hasDeletions: (red circle)

If I query my index for *:* I get 0 results. If I trigger optimize it wipes ALL my data inside the index and resets it to empty.

I've played around with my EmbeddedServer, initially using autoCommit/softCommit, and it was working fine. Now that I've switched to commitWithin on the document add query, it always does that! I'm never able to reload my index within Tomcat/Solr.

Any idea?

Cheers,
/jonathan
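A minimal sketch of the workaround discussed above - add with commitWithin, but issue an explicit hard commit before shutting the embedded server down. The names follow the SolrJ 4.0 API; the core name and field are illustrative:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Assumes an already-initialized CoreContainer; "collection1" is illustrative.
  SolrServer server = new EmbeddedSolrServer(coreContainer, "collection1");

  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-1");
  server.add(doc, 60000);  // commitWithin 60s

  // Before shutdown, force a hard commit so pending updates are durable on disk.
  server.commit();
  server.shutdown();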
Re: Switch from Sphinx to Solr - some basics please
These do require some Sphinx knowledge. I could answer them on StackOverflow because I converted Chegg from Sphinx to Solr this year.

As I said there, read about Solr cores. They are independent search configurations and indexes within one Solr server: http://wiki.apache.org/solr/CoreAdmin

For your jobs example, I would use filter queries to limit the search to a single country. Filter to country:us or country:de or country:fr and you will only get results from that country.

Solr does not use the term "rotate" for indexes. You can delete with a query, so you could delete all the jobs for one country, reindex those, then commit.

Separate cores are best when you have different kinds of data. At Chegg, we search books and college courses. Those are in different cores and have very different schemas.

wunder

On Aug 15, 2012, at 5:11 AM, nnikolay wrote:

Hi iorixxx,

thanks for the reply. Well, you don't need Sphinx knowledge to answer my questions. I have written what I want:

1. I need to have 2 separate indexes. On Stack Overflow I got the answer that I need to start 2 cores, for example. How many cores can I run in Solr? I have, for example, over 100 different indexes that should be seen as separate data. These indexes should be reindexed at different times, and their data should not be mixed with each other.

You need to understand the following situation: I have, for example, jobs from country A, jobs from country B, and so on up to 100 countries. I need a separate index for each country, because if someone searches for jobs in country A I need to query only the index for country A. How do I solve this problem? Is there a good tutorial? In the Solr wiki it is explained very badly.

2. When I get new data, for example: should I rotate the whole index again, or can I include the new rows and delete the old rows? What is your suggestion?

Thanks
Nik

--
Walter Underwood
wun...@wunderwood.org
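A sketch of the delete-then-reindex cycle Walter describes, for one country (the field name country follows his example; the URL assumes the default update handler):

  curl "http://localhost:8983/solr/update?commit=false" -H "Content-Type: text/xml" \
       --data-binary "<delete><query>country:us</query></delete>"
  # ...re-post the fresh US jobs here...
  curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" \
       --data-binary "<add>...</add>"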
Re: Duplicated facet counts in solr 4 beta: user error
No problem, and thanks for posting the resolution.

If you have the time and energy, anyone can edit the Wiki if you create a logon, so any clarification you'd like to provide to keep others from having this problem would be most welcome!

Best
Erick

On Tue, Aug 14, 2012 at 6:13 PM, Buttler, David buttl...@llnl.gov wrote:

Here are my steps:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cp -r example example2
4) cp -r example exampleB
5) cp -r example example2B
6) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
7) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
8) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
9) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
10) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

14 results returned. This is correct.

Let's try a slightly more circuitous route by running through the solr tutorial first:
1) Download apache-solr-4.0.0-BETA
2) Untar into a directory
3) cd example; java -jar start.jar
4) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml
5) kill jetty server
6) cp -r example example2
7) cp -r example exampleB
8) cp -r example example2B
9) cd example; java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
10) cd example2; java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar
11) cd exampleB; java -Djetty.port=8900 -DzkHost=localhost:9983 -jar start.jar
12) cd example2B; java -Djetty.port=7500 -DzkHost=localhost:9983 -jar start.jar
13) cd example/exampledocs; java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

With the same query as above, 22 results are returned. Looking at this, it is somewhat obvious what is happening: the index was copied over from the tutorial and was not cleaned up before running the cloud examples. Adding the debug=query parameter to the query URL produces the following:

<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>cat:"electronics"</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>cat:electronics</str>
  </arr>
</lst>

So, Erick's diagnosis is correct: pilot error. However, the straightforward path through the tutorial and on to SolrCloud makes it easy to make this mistake. Maybe a small warning in the SolrCloud page would help?

Now, running a delete operation fixes things:

cd example/exampledocs; java -Dcommit=false -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

causes the number of results to be zero. So, let's reload the data:

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar *.xml

Now the number of results for our query

http://localhost:8983/solr/collection1/select?q=*:*&wt=xml&fq=cat:%22electronics%22

is back to the correct 14 results.

Dave

PS apologies for hijacking the thread earlier.
Re: Facet sort numeric values
The problem you're running into is that lexical ordering of numeric data != numeric ordering. If you have mixed alpha and numeric data, you may not care if the alpha stuff is first, i.e.

asdb456
asdf490

sorts fine. Problems happen with

9jsdf
100ukel

where 100ukel comes first. So if you have a mixed alpha and numeric situation, you have to either live with it or normalize the numeric data so that lexical ordering == numeric ordering. The most common way is to left-pad numeric data to a fixed width, i.e. rather than index asb9fg, index asb009fg. Of course you have to know the maximum number of digits for this to work...

Best
Erick

On Wed, Aug 15, 2012 at 12:33 AM, Aleksander Akerø solraleksan...@gmail.com wrote:

Oh brilliant, didn't think of it being possible to configure it that way. I had made my own untokenized type, so I guess it would be better for me to control the datatype this way.

Bonus question (hehe): what if these field values also contain alphanumeric values, e.g. Alpha, Bravo, Omega, ...? How would this affect the sorting? I guess the TrieIntField is not applicable then.

Aleksander Akerø @ Gurusoft AS
Mobil: 944 89 054

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: 14. august 2012 17:45
To: solr-user@lucene.apache.org
Subject: Re: Facet sort numeric values

: I'm having a problem with sorting facets. I am using the facet.sort=index
: parameter and it works fine for most of the values.
	...
: Example, when sorting 15, 6, 23, 7, 10, 90 it sorts like this: 10, 15,
: 23, 6, 7, 90, but what I wanted was 6, 7, 10, 15, 23, 90.

What field type are you using? If you use one of the Trie___Field types then the facet values should sort exactly as you describe.

  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

-Hoss
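As a minimal sketch of the left-padding Erick describes (the class name and width are arbitrary), run the values through something like this before indexing:

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  // Left-pads every run of digits to a fixed width so that lexical order
  // matches numeric order, e.g. "9jsdf" -> "000009jsdf", "100ukel" -> "000100ukel".
  public class NumericPadder {
      private static final Pattern DIGITS = Pattern.compile("\\d+");

      public static String pad(String value, int width) {
          Matcher m = DIGITS.matcher(value);
          StringBuffer sb = new StringBuffer();
          while (m.find()) {
              String padded = String.format("%" + width + "s", m.group()).replace(' ', '0');
              m.appendReplacement(sb, Matcher.quoteReplacement(padded));
          }
          m.appendTail(sb);
          return sb.toString();
      }

      public static void main(String[] args) {
          System.out.println(pad("9jsdf", 6));   // 000009jsdf
          System.out.println(pad("100ukel", 6)); // 000100ukel -- now sorts after 9jsdf
      }
  }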
Re: Solr 3.5 result grouping is failing
Please attach the results of adding debugQuery=on to your query in both the success and failure cases; there's very little information to go on here. You might review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Wed, Aug 15, 2012 at 12:57 AM, chethan chethan.p...@gmail.com wrote:

Hi,

I'm trying to group (field collapse) my search results on a field called "site". The schema says it is indexed: <field name="site" type="string" stored="false" indexed="true"/>. But when I query with group.field=site&group.limit=100, I see only 1 group of results being returned, and the group value is null.

This seems to work on another Solr instance which only has a few documents indexed; it seems to fail on bigger indexes. Help is appreciated.

Thanks
Chethan

Sent this message again as it seemed to bounce the first time.
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
No, sharding into multiple cores on the same machine is still limited by the physical memory available. It's still lots of stuff on a limited box.

But try backing up and re-thinking the problem a bit. Some possibilities off the top of my head:

1 have a new field "current". When you update a doc, reindex the old doc with current=0 and put current=1 in the new doc (boolean field). Getting one and only one is really simple. (A sketch follows at the end of this message.)

2 Use external file fields (EFF) for the same purpose; that won't require you to re-index the doc. The trick here is you use the value in the EFF as a multiplier for the score (that's what function queries do). So older versions of the doc have scores of 0 and just don't show up.

3 Implement a custom collector that replaces older hits with newer hits. Actually I don't particularly like this, because it would potentially replace a higher-scoring document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach for this problem.

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.
Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck. It keeps throwing OOM… can someone please help… as it is critical for my project. The query I am trying is: under a folder there are 1000 files, and I am putting a filtered
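A minimal sketch of Erick's option 1 above in SolrJ - the field names (url, jid, current) follow the schema described in this thread, but the exact update flow is illustrative:

  import org.apache.solr.common.SolrInputDocument;

  // Re-add the newest version flagged current=1...
  SolrInputDocument newDoc = new SolrInputDocument();
  newDoc.addField("url", "C:\\personal\\A1.txt");
  newDoc.addField("jid", 2);
  newDoc.addField("current", 1);
  server.add(newDoc);

  // ...and re-add the previous version flagged current=0.
  // At query time, fq=current:1 then returns exactly one doc per file.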
Re: question(s) re lucene spatial toolkit aka LSP aka spatial4j
Hey solr-user, are you by chance indexing LineStrings? That is something I never tried with this spatial index. Depending on which iteration of LSP you are using, I figure you'd either end up indexing a vast number of points along the line, which would be slow to index and make the index quite big, or you might end up with a geohash granularity that looks like a very blocky (i.e. pixelated) approximation of the line that is much coarser and will thus cause searches near the line to match the line.

I don't have this use-case in my work so I haven't put that much thought into handling lines -- I just do points, polygons, circles, and rects.

~ David

-
Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
Does DataImportHandler do any sanitizing?
I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-
Re: Does DataImportHandler do any sanitizing?
Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest of Solr via XML, so it shouldn't be a problem...

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:

I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-
custom complex field - PolyField
Hi,

I have to index a tuple like ('blah', 'more blah info') in a multivalued field type. I have read about the PolyField type and it seems the best solution so far, but I can't find documentation on how to use or implement a custom field. Any help is appreciated.

--
Leonardo S Souza
solr.xml entries got deleted when powered off
Hello,

I created an index => all the schema.xml & solrconfig.xml files are created with content (I checked that they have contents in the xml files). But if I power off the system & restart again, the contents of the files are gone - they're like 0-byte files. Even the solr.xml file, which got updated when I created a new index (with a core), has 0 bytes & all the previous entries are lost too.

I'm using Solr 4.0. Does anyone have any idea about the scenarios where this might happen?

Thanks.
Re: solr.xml entries got deleted when powered off
Just guessing: disk full?

--
Regards,
Leonardo S Souza

2012/8/15 vempap phani.vemp...@emc.com

Hello,

I created an index => all the schema.xml & solrconfig.xml files are created with content (I checked that they have contents in the xml files). But if I power off the system & restart again, the contents of the files are gone - they're like 0-byte files. Even the solr.xml file, which got updated when I created a new index (with a core), has 0 bytes & all the previous entries are lost too.

I'm using Solr 4.0. Does anyone have any idea about the scenarios where this might happen?

Thanks.
Re: solr.xml entries got deleted when powered off
Nope... there is a good amount of space left on disk.
Re: solr.xml entries got deleted when powered off
It's happening when I'm not doing a clean shutdown. Are there any more scenarios where it might happen?
RE: solr.xml entries got deleted when powered off
You are not putting these files in /tmp, are you? That is sometimes wiped by different OS's on shutdown.

-----Original Message-----
From: vempap [mailto:phani.vemp...@emc.com]
Sent: Wednesday, August 15, 2012 3:31 PM
To: solr-user@lucene.apache.org
Subject: Re: solr.xml entries got deleted when powered off

It's happening when I'm not doing a clean shutdown. Are there any more scenarios where it might happen?
RE: solr.xml entries got deleted when powered off
No, I'm not keeping them in /tmp.
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
: 2 Use external file fields (EFF) for the same purpose, that
: won't require you to re-index the doc. The trick
: here is you use the value in the EFF as a multiplier
: for the score (that's what function queries do). So older
: versions of the doc have scores of 0 and just don't
: show up.

or use it in an fq={!frange ...} to eliminate the older versions completely.

: I will back up and let you know the use case. I am tracking file
: versions. And I want to give an option to browse your system for the
: latest files. So in order to remove dups (same filename) I used
: grouping.

Based on only knowing that sentence, my starting suggestion would be to have two indexes: one where the filename is the unique key, so only the most current versions of files are listed, and one where there is no unique key (or you use whatever key you use today) that lets you do the full historical archive search, and query whichever index makes sense for each user action.

-Hoss
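For illustration, with a current-version flag of 1 in the external file field (the field name version_flag is made up), the frange filter Hoss mentions would look something like:

  fq={!frange l=1 u=1}version_flag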
Re: Atomic Multicore Operations - E.G. Move Docs
Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
On 2012-7-2 at 6:37 PM, Nicholas Ball nicholas.b...@nodelay.com wrote:

> That could work, but then how do you ensure commit is called on the two cores at the exact same time?

That may need something like two-phase commit in a relational DB. Lucene has prepareCommit, but to implement 2PC many things need to be done.

> Also, any way to commit a specific update rather than all the back-logged ones?
>
> Cheers,
> Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
Do you really need this? Distributed transactions are a difficult problem. In 2PC, every node can fail, including the coordinator. Something like leader election is needed to make sure it works; you could try ZooKeeper. But if the transaction is not very, very important (like transferring money in a bank), you can do it like this.

coordinator:

On 2012-8-16 at 7:42 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
Re: Atomic Multicore Operations - E.G. Move Docs
http://zookeeper.apache.org/doc/r3.3.6/recipes.html#sc_recipes_twoPhasedCommit

On Thu, Aug 16, 2012 at 7:41 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Haven't managed to find a good way to do this yet. Does anyone have any ideas on how I could implement this feature? Really need to move docs across from one core to another atomically.

Many thanks,
Nicholas

On Mon, 02 Jul 2012 04:37:12 -0600, Nicholas Ball nicholas.b...@nodelay.com wrote:

That could work, but then how do you ensure commit is called on the two cores at the exact same time?

Cheers,
Nicholas

On Sat, 30 Jun 2012 16:19:31 -0700, Lance Norskog goks...@gmail.com wrote:

Index all documents to both cores, but do not call commit until both report that indexing worked. If one of the cores throws an exception, call rollback on both cores.

On Sat, Jun 30, 2012 at 6:50 AM, Nicholas Ball nicholas.b...@nodelay.com wrote:

Hey all,

Trying to figure out the best way to perform an atomic operation across multiple cores on the same Solr instance, i.e. a multi-core environment. An example would be to move a set of docs from one core onto another core and ensure that a soft commit is done at the exact same time. If one were to fail, so would the other. Obviously this would probably require some customization, but I wanted to know the best way to tackle this and where I should be looking in the source.

Many thanks for the help in advance,
Nicholas a.k.a. incunix
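As a sketch of Lance's simpler commit-both-or-rollback-both approach (no true 2PC; the core names and id field are made up), the coordinating client could look like this in SolrJ:

  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class TwoCoreMove {
      // Best-effort atomic move: commit both cores only if both updates succeed.
      public static void moveDocs(List<SolrInputDocument> docs) throws Exception {
          SolrServer src = new HttpSolrServer("http://localhost:8983/solr/coreA");
          SolrServer dst = new HttpSolrServer("http://localhost:8983/solr/coreB");
          try {
              dst.add(docs);                                       // copy into the destination core
              for (SolrInputDocument d : docs) {
                  src.deleteById((String) d.getFieldValue("id"));  // remove from the source core
              }
              src.commit();   // a failure window remains between these two commits --
              dst.commit();   // true atomicity needs 2PC coordination (see link above)
          } catch (Exception e) {
              src.rollback();  // discard uncommitted updates on both cores
              dst.rollback();
              throw e;
          }
      }
  }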
Re: SOLR3.6:Field Collapsing/Grouping throws OOM
Awesome, thanks a lot. I am already on it with option 1. We need to track deletes to flip the previous one back to being the current.

Erick Erickson erickerick...@gmail.com wrote:

No, sharding into multiple cores on the same machine is still limited by the physical memory available. It's still lots of stuff on a limited box.

But try backing up and re-thinking the problem a bit. Some possibilities off the top of my head:

1 have a new field "current". When you update a doc, reindex the old doc with current=0 and put current=1 in the new doc (boolean field). Getting one and only one is really simple.

2 Use external file fields (EFF) for the same purpose; that won't require you to re-index the doc. The trick here is you use the value in the EFF as a multiplier for the score (that's what function queries do). So older versions of the doc have scores of 0 and just don't show up.

3 Implement a custom collector that replaces older hits with newer hits. Actually I don't particularly like this, because it would potentially replace a higher-scoring document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach for this problem.

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi Erick,

You are so right on the memory calculations. I am happy that I now know that I was doing something wrong. Yes, I am getting confused with SQL.

I will back up and let you know the use case. I am tracking file versions, and I want to give an option to browse your system for the latest files. So in order to remove dups (same filename) I used grouping.

Also, when you say sharding: is it okay if I do multiple cores, and does it mean that each core needs a separate Tomcat? I meant to say, can I use the same machine? 150 million docs have 120 million unique paths too.

One more thing: if I need sharding and need a new box then it won't be great, because this system still has horsepower left which I can use.

Thanks a ton for explaining the issue.

Erick Erickson erickerick...@gmail.com wrote:

You're putting a lot of data on a single box, then asking to group on what I presume is a string field. That's just going to eat up a _bunch_ of memory. Let's say your average file name is 16 bytes long. Each unique value will take up 58 + 32 bytes (58 bytes of overhead, I'm presuming Solr 3.x, and 16*2 bytes for the chars). So we're up to 90 bytes/string * (number of distinct file names). Say you have, for argument's sake, 100M distinct file names. You're up to a 9G memory requirement for sorting alone. Solr's sorting reads all the unique values into memory whether or not they satisfy the query... And grouping can also be expensive.

I don't think you really want to group in this case. I'd simply use a filter query, something like:

fq=filefolder:E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307

Then you're also grouping on conv_sort, which doesn't make much sense -- do you really want individual results returned for _each_ file name? What it looks like to me is you're confusing SQL with Solr search and getting into bad situations...

Also, 150M documents in a single shard is... really a lot. You're probably at a point where you need to shard. Not to mention that your 400G index is trying to be jammed into 12G of memory.

This actually feels like an XY problem. Can you back up and let us know what the use-case you're trying to solve is? Perhaps there are less memory-consumptive solutions possible.

Best
Erick

On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee tchatter...@commvault.com wrote:

Editing the query... remove smb:. I don't know where it came from while I did copy/paste.

Tirthankar Chatterjee tchatter...@commvault.com wrote:

Hi,

I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses SOLR 3.6), 2 Intel Xeon 64-bit processors, 30TB HDD, JDK 1.7.0_03 x64.

Data index dir size: 400GB. Metadata of files is stored in it. I have around 15 schema fields. Total number of items: 150 million approx.

I have a scenario which I will try to explain to the best of my knowledge here. Let us consider the fields I am interested in:

Url: entire path of a file in the Windows file system, including the filename, e.g. C:\Documents\A.txt
mtm: modified time of the file
Jid: job ID
conv_sort: a string field where the filename is stored

I run a job where the following gets inserted. Total items: 2

Url: C:\personal\A1.txt
mtm: 08/14/2012 12:00:00
Jid: 1
Conv_sort: A1.txt
---
Url: C:\personal\B1.txt
mtm: 08/14/2012 12:01:00
Jid: 1
Conv_sort: B1.txt

In the second run only one item changes:

Url: C:\personal\A1.txt
mtm: 08/15/2012 1:00:00
Jid: 2
Conv_sort: A1.txt

When queried, I would like to return the latest A1.txt and B1.txt back to the end user. I am trying to use grouping with no luck.
Re: Does DataImportHandler do any sanitizing?
If you want to sanitize them during indexing, the regular expression tools can do this. You would create a regular expression that matches the bogus elements. There is a regular expression transformer in the DIH, and a regular expression CharFilter in the Lucene text analysis stack.

On Wed, Aug 15, 2012 at 2:10 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote:

Hi, Jon,

As far as I know, DataImportHandler doesn't transfer data to the rest of Solr via XML, so it shouldn't be a problem...

Michael Della Bitta
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game

On Wed, Aug 15, 2012 at 5:03 PM, Jon Drukman jdruk...@gmail.com wrote:

I am pulling some fields from a MySQL database using DataImportHandler, and some of them have invalid XML in them. Does DataImportHandler do any kind of filtering/sanitizing to ensure that it will go in OK, or is it all on me?

Example bad data: orphaned ampersands (Peanut Butter & Jelly), curly quotes (we’re)

-jsd-

--
Lance Norskog
goks...@gmail.com
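For illustration, both variants Lance mentions, with made-up entity/field/column names: a RegexTransformer rule in the DIH data-config.xml that normalizes curly apostrophes, and the equivalent PatternReplaceCharFilterFactory in an analyzer chain:

  <entity name="item" transformer="RegexTransformer"
          query="SELECT id, description FROM items">
    <field column="description" sourceColName="description"
           regex="’" replaceWith="'"/>
  </entity>

  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="’" replacement="'"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>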